Message ID: 20140319225137.GM7528@n2100.arm.linux.org.uk (mailing list archive)
State: New, archived
On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:
> On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:

[...]

> > Speaking of FEC and slightly off-topic, have you ever seen this on your
> > box [1]/[2]/[3]? I wonder if this might be cache-related as well, since
> > I saw a similar issue on MX6 with PCIe-connected ethernet. I cannot put
> > a finger on this though.
>
> I think I've seen something similar once or twice.

Nice, thanks for confirming this.

> Let me just pull up the transmit function. This is the code which
> writes to the descriptors (which frankly is pretty horrid):

Yes.

> 	bdp->cbd_datlen = skb->len;
> 	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> 			skb->len, DMA_TO_DEVICE);	<-- barrier inside
> 	ebdp->cbd_bdu = 0;
> 	ebdp->cbd_esc = BD_ENET_TX_INT;
> 	ebdp->cbd_esc |= BD_ENET_TX_PINS;
> 	bdp->cbd_sc = status;
> 	writel(0, fep->hwp + FEC_X_DES_ACTIVE);	<-- barrier before write
>
> A couple of points here:
>
> 1. The hardware operates in a ring - it can read the next ring entry if it
>    sees the BD_ENET_TX_READY bit set when it has just finished processing
>    the previous descriptor - this can happen before the write to
>    FEC_X_DES_ACTIVE.
>
> 2. The ARM can re-order writes. The writes it can re-order are:
>
> 	bdp->cbd_bufaddr
> 	ebdp->cbd_bdu
> 	ebdp->cbd_esc
> 	ebdp->cbd_sc
>
> Hence, it's entirely possible for the FEC to see the updated descriptor
> status before the rest of the descriptor has been written. What's missing
> is a barrier between the descriptor writes, and the final write to
> bdp->cbd_sc.

I think there might even be another thing at play here, but please correct me
if I am wrong. If you look at fb8ef788680d48523321e5f150b23700a1caf980, this
patch actually has no functional impact on the driver, right? It makes no
functional change at all, yet it changed something for Fugang. This also
makes me believe there is clearly something afoot with the DMA descriptors.
Reading Documentation/DMA-API.txt, I noticed the ring with the descriptors is
allocated with dma_alloc_coherent(). Yet DMA-API.txt, section I, states:

"Consistent memory is memory for which a write by either the device or
the processor can immediately be read by the processor or device
without having to worry about caching effects. (You may however need
to make sure to flush the processor's write buffers before telling
devices to read that memory.)"

So the writes into the descriptor from the CPU might even be stuck in the
CPU's write buffer when this code from fec_main.c is called:

	/* Trigger transmission start */
	writel(0, fep->hwp + FEC_X_DES_ACTIVE);

Does this also make sense, or is my idea completely flawed, please?

> Had I not got distracted by the L2 issues, I'd have posted my FEC patches
> by now... in any case, my current xmit function looks a little different -
> I've organised it such that:
>
> 1. We don't modify anything until we're past the point where things can
>    error out.
>
> 2. Writes to the descriptor are localised in one area.
>
> 3. There's a wmb() barrier between cbd_sc and the previous writes - I
>    discussed this with Will Deacon, which resulted in this documentation
>    for the barrier:
>
> 	/*
> 	 * We need the preceding stores to the descriptor to complete
> 	 * before updating the status field, which hands it over to the
> 	 * hardware. The corresponding rmb() is "in the hardware".
> 	 */
>
> The second thing that causes the transmit timeouts is the horrid way
> NAPI has been added to the driver - it's racy. NAPI itself isn't the
> problem, it's this (compressed a bit to show only the relevant bits):

I fully agree there are multiple race conditions in the driver as it is now.

[...]

> That all said - with the patch below I haven't seen problems since.
> (which is the quickest way I can think of to get you a copy of what
> I'm presently running - I've killed a number of debug bits denoted by
> all the blank lines - and this is against -rc7.) You may notice that
> I added some TX ring dumping code to the driver - always useful in
> these situations. ;-)

I will test this ASAP on multiple MX6es and let you know of the result.

> This patch is of course the consolidated version: individually, this
> would be at least 19 patches with nice commit messages describing
> each change...

Please keep me on CC of those, I'd be very interested to test them.

> As far as the 600Mbps receive - you need the right conditions for that.
> I select the performance cpufreq governor after boot, and let the boot
> quiesce. It doesn't take much for it to drop back to 460Mbps - another
> running process other than iperf -s is sufficient to do that.
>
> Let me know how you get on with this.

Roger that, will do, thank you very much!
On Thursday, March 20, 2014 at 12:05:56 AM, Marek Vasut wrote:
> On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:

[...]

> > As far as the 600Mbps receive - you need the right conditions for that.
> > I select the performance cpufreq governor after boot, and let the boot
> > quiesce. It doesn't take much for it to drop back to 460Mbps - another
> > running process other than iperf -s is sufficient to do that.
> >
> > Let me know how you get on with this.
>
> Roger that, will do, thank you very much!

I did some tests and I observe no dips in the speed - nice work, thanks!

I will try to get feedback from more boards and get back to you with the
results.

Best regards,
Marek Vasut
Hi Russell,

On Thu, Mar 20, 2014 at 1:01 AM, Marek Vasut <marex@denx.de> wrote:
> On Thursday, March 20, 2014 at 12:05:56 AM, Marek Vasut wrote:
>> On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:
>
> [...]
>
>> > As far as the 600Mbps receive - you need the right conditions for that.
>> > I select the performance cpufreq governor after boot, and let the boot
>> > quiesce. It doesn't take much for it to drop back to 460Mbps - another
>> > running process other than iperf -s is sufficient to do that.
>> >
>> > Let me know how you get on with this.
>>
>> Roger that, will do, thank you very much!
>
> I did some tests and I observe no dips in the speed, nice work. Thanks!
>
> I will try to get feedback from more boards and get back to you with the
> results.

Thanks for the patch. Robert Daniels (on Cc) tried it on an mx53 and sent
the feedback below. Thanks.

"I tried the patch and it seems a little better but still has problems.
When I first started the test I saw dropped packets. After about 2 minutes
I got the pause again with the accompanying kernel backtrace. It looks
like I got some more debug information from the backtrace, however (I think
that was part of the patch.)
This is what it looks like now:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at /home/robertd/Development/IC/Dev/BoardSupport/ic-ii/linux-mainline/net/sched/sch_generic.c:264 dev_watchdog+0x288/0x2ac()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc6+ #5
Backtrace:
[<800121bc>] (dump_backtrace) from [<800124a0>] (show_stack+0x18/0x1c)
 r6:8051f29c r5:00000000 r4:808edb9c r3:00000000
[<80012488>] (show_stack) from [<80652098>] (dump_stack+0x84/0x9c)
[<80652014>] (dump_stack) from [<80027d1c>] (warn_slowpath_common+0x70/0x94)
 r5:00000009 r4:808c9d60
[<80027cac>] (warn_slowpath_common) from [<80027d78>] (warn_slowpath_fmt+0x38/0x40)
 r8:ded37b40 r7:808c8000 r6:ded37b00 r5:dec34000 r4:00000000
[<80027d44>] (warn_slowpath_fmt) from [<8051f29c>] (dev_watchdog+0x288/0x2ac)
 r3:dec34000 r2:808335f0
[<8051f014>] (dev_watchdog) from [<80031e68>] (call_timer_fn+0x74/0xf4)
 r10:80031df4 r9:dec34000 r8:8051f014 r7:808c8000 r6:00000100 r5:808c8000
 r4:808c9dd0
[<80031df4>] (call_timer_fn) from [<80032598>] (run_timer_softirq+0x19c/0x234)
 r10:8051f014 r9:dec34000 r8:00200200 r7:00000000 r6:808c9e20 r5:80929fc0
 r4:dec34284
[<800323fc>] (run_timer_softirq) from [<8002c2f4>] (__do_softirq+0x110/0x2b4)
 r10:00000100 r9:00000001 r8:40000001 r7:808c8000 r6:808ca080 r5:808ca084
 r4:00000000
[<8002c1e4>] (__do_softirq) from [<8002c7ac>] (irq_exit+0xb8/0x10c)
 r10:8065c4cc r9:00000001 r8:00000000 r7:00000037 r6:808c8000 r5:808c5fe8
 r4:808c8028
[<8002c6f4>] (irq_exit) from [<8000f2a0>] (handle_IRQ+0x5c/0xbc)
 r5:808c5fe8 r4:808d0d24
[<8000f244>] (handle_IRQ) from [<80008590>] (tzic_handle_irq+0x78/0xa8)
 r8:808c9f10 r7:00000001 r6:00000020 r5:80928fd8 r4:00000000 r3:00000080
[<80008518>] (tzic_handle_irq) from [<800130a4>] (__irq_svc+0x44/0x5c)
Exception stack(0x808c9f10 to 0x808c9f58)
9f00:                                     00000001 00000001 00000000 808d3e70
9f20: 808c8000 808d099c 808d0938 8092837d 00000000 808c8000 8065c4cc 808c9f64
9f40: 808c9f28 808c9f58 800638e0 8000f674 20000013 ffffffff
 r9:808c8000 r8:00000000 r7:808c9f44 r6:ffffffff r5:20000013 r4:8000f674
[<8000f64c>] (arch_cpu_idle) from [<8006e874>] (cpu_startup_entry+0x108/0x160)
[<8006e76c>] (cpu_startup_entry) from [<8064ce8c>] (rest_init+0xb4/0xdc)
 r7:808b8360
[<8064cdd8>] (rest_init) from [<80879b58>] (start_kernel+0x328/0x38c)
 r6:ffffffff r5:808d0880 r4:808d0a30
[<80879830>] (start_kernel) from [<70008074>] (0x70008074)
---[ end trace 18ea3643c6be04df ]---
fec 63fec000.ethernet eth0: TX ring dump
Nr    SC     addr       len  SKB
 0    0x1c00 0xce485000  106 de5db480
 1    0x1c00 0xce485800  106 de5b7540
 2    0x1c00 0xce486000  106 de5b7300
 3    0x1c00 0xce486800  106 de5b73c0
 4    0x1c00 0xce487000  106 de5b7e40
 5    0x1c00 0xce487800  106 de5b7840
 6    0x1c00 0xce500000  106 de5b7b40
 7    0x1c00 0xce500800  106 de5b7600
 8    0x1c00 0xce501000  106 de5b7d80
 9    0x1c00 0xce501800  106 de5b7180
10    0x1c00 0xce502000  106 de5b7a80
11    0x1c00 0xce502800  106 de5b7240
12    0x1c00 0xce503000  106 de5b7000
13 SH 0x1c00 0x00000000   66 (null)
14    0x9c00 0xce504000   66 de5b90c0
15    0x1c00 0xce504800   66 de5b9a80
16    0x1c00 0xce505000   66 de5b9840
17    0x1c00 0xce505800   66 de5b9000
18    0x1c00 0xce506000   66 de5b9480
19    0x1c00 0xce506800   66 de5b9b40
20    0x1c00 0xce507000   66 de5b9600
21    0x1c00 0xce507800   66 de5b9900
22    0x1c00 0xce508000   66 de5b9180
23    0x1c00 0xce508800   66 de5b9240
24    0x1c00 0xce509000   66 de5b93c0
25    0x1c00 0xce509800   66 de5b9540
26    0x1c00 0xce50a000  106 de5b9780
27    0x1c00 0xce50a800   66 de5b9d80
28    0x1c00 0xce50b000  106 de5b9cc0
29    0x1c00 0xce50b800   66 de5b99c0
30    0x1c00 0xce50c000  106 de5b9300
31    0x1c00 0xce50c800  106 de7e9900
32    0x1c00 0xce50d000   66 de7e9e40
33    0x1c00 0xce50d800  106 de7e96c0
34    0x1c00 0xce50e000   66 de7e9c00
35    0x1c00 0xce50e800  106 de7e9d80
36    0x1c00 0xce50f000  106 de7e9cc0
37    0x1c00 0xce50f800   66 de7e9f00
38    0x1c00 0xce510000  106 de7e9a80
39    0x1c00 0xce510800  106 de7e90c0
40    0x1c00 0xce511000   66 de7e9240
41    0x1c00 0xce511800  106 de7e9000
42    0x1c00 0xce512000  106 de7e9780
43    0x1c00 0xce512800   66 de7e9300
44    0x1c00 0xce513000  106 de7e9840
45    0x1c00 0xce513800  106 de5dbd80
46    0x1c00 0xce514000   66 de5dbb40
47    0x1c00 0xce514800  106 de5db780
48    0x1c00 0xce515000  106 de5db9c0
49    0x1c00 0xce515800   66 de5db600
50    0x1c00 0xce516000  106 de5db540
51    0x1c00 0xce516800  106 de5db000
52    0x1c00 0xce517000   66 de5dbc00
53    0x1c00 0xce517800  106 de5db6c0
54    0x1c00 0xce518000   66 de5db3c0
55    0x1c00 0xce518800  106 de5db300
56    0x1c00 0xce519000  106 de5dbf00
57    0x1c00 0xce519800   66 de5db0c0
58    0x1c00 0xce51a000  106 de5db900
59    0x1c00 0xce51a800  106 de5db240
60    0x1c00 0xce51b000   66 de5dbe40
61    0x1c00 0xce51b800  106 de5db840
62    0x1c00 0xce51c000  106 de5dbcc0
63    0x3c00 0xce51c800  106 de5db180
"
On Thu, Mar 20, 2014 at 07:27:43PM -0300, Fabio Estevam wrote:
> "I tried the patch and it seems a little better but still has problems.
> When I first started the test I saw dropped packets. After about 2 minutes
> I got the pause again with the accompanying kernel backtrace. It looks
> like I got some more debug information from the backtrace however (I think
> that was part of the patch.)

Thanks. Now that we can see the ring, we can see what's going on.
Every entry with a non-NULL skb is a packet which has been queued up
for transmission, but which hasn't been reaped. Any entry with a "SC"
value of 0x1c00 has been transmitted.

S marks the ring entry where the next packet to be transmitted would
be placed into the ring. H points at the previous entry which was
reaped - in other words, the next packet which would be reaped if it
were transmitted would be ring entry 14.

However, this hasn't happened, because that packet is still marked for
the FEC to transmit it. The really odd thing is, it has transmitted
the subsequent packets from 15 to 63, and 0 to 12.

The obvious question is: was the packet at 14 actually transmitted,
and has the update to the descriptor been lost somehow - that would
require a log of the last 64-ish transmitted packets, and knowledge
of what those packets should contain to really confirm - just wondering
whether we could transmit an additional byte containing the ring index
on these small packets which we could then capture. We would have to
be certain that the capture itself hasn't dropped any packets.

I'll look into some further patches which would allow that to be
confirmed - I can't do that immediately as I've rather a lot to deal
with at the moment.
> fec 63fec000.ethernet eth0: TX ring dump
> Nr SC addr len SKB

[...]

> 63 0x3c00 0xce51c800 106 de5db180
> "
On 3/20/2014 4:06 PM, Russell King - ARM Linux wrote:
> On Thu, Mar 20, 2014 at 07:27:43PM -0300, Fabio Estevam wrote:
>> "I tried the patch and it seems a little better but still has problems.
>> When I first started the test I saw dropped packets. After about 2 minutes
>> I got the pause again with the accompanying kernel backtrace. It looks
>> like I got some more debug information from the backtrace however (I think
>> that was part of the patch.)
>
> Thanks. Now that we can see the ring, we can see what's going on.
> Every entry with a non-NULL skb is a packet which has been queued up
> for transmission, but which hasn't been reaped. Any entry with a "SC"
> value of 0x1c00 has been transmitted.
>
> S marks the ring entry where the next packet to be transmitted would
> be placed into the ring. H points at the previous entry which was
> reaped - in other words, the next packet which would be reaped if it
> were transmitted would be ring entry 14.
>
> However, this hasn't happened, because that packet is still marked for
> the FEC to transmit it. The really odd thing is, it has transmitted
> the subsequent packets from 15 to 63, and 0 to 12.
>
> The obvious question is: was the packet at 14 actually transmitted,
> and has the update to the descriptor been lost somehow - that would
> require a log of the last 64-ish transmitted packets, and knowledge
> of what those packets should contain to really confirm - just wondering
> whether we could transmit an additional byte containing the ring index
> on these small packets which we could then capture. We would have to
> be certain that the capture itself hasn't dropped any packets.

My question is: on a non-cacheable, bufferable write, can an entire
64-byte line be written? Since our descriptors are only 32 bytes, is there
contention between the CPU queuing the next packet and the ENET completing
the one before?
> 13 SH 0x1c00 0x00000000   66 (null)
> 14    0x9c00 0xce504000   66 de5b90c0

Note that the problematic descriptor (14) is even. That would fit with the
above. In your testing, is it always even?

Troy
On Thu, Mar 20, 2014 at 05:24:07PM -0700, Troy Kisky wrote:
> My question is: on a non-cacheable, bufferable write, can an entire
> 64-byte line be written? Since our descriptors are only 32 bytes, is there
> contention between the CPU queuing the next packet and the ENET completing
> the one before?

Well, 64-byte lines would be covering descriptors 0,1 2,3 4,5 ... 12,13
14,15 etc.

However, it wouldn't really make sense because the store buffers in the
L310 are 32 bytes in size, and there are three of them. Given that with
my patch the descriptor updates go through two drainings of that buffer
per update - once after the address, length and other parameters are
written, then again after the status bits have been written - I think
it's unlikely that what you're suggesting could be a possibility, unless
there's some bug we don't know about.

Remember, DMA memory is "normal, non-cacheable", so there shouldn't be
any cache effects going on here - only store buffer effects which may
delay, merge and/or reorder writes, and that's why we have barriers.

> 13 SH 0x1c00 0x00000000   66 (null)
> 14    0x9c00 0xce504000   66 de5b90c0
>
> Note that the problematic descriptor (14) is even. That would fit with
> the above. In your testing, is it always even?

Since running with my patch on an iMX6Q, I haven't seen any timeouts,
and I've run many iperf sessions in both directions (even multiple
machines hammering the poor iMX6Q, running iperf and/or flood pings
with various packet sizes.)

What may be useful to know is which iMX devices these are still happening
on, with my patch applied. I have both iMX6Q and iMX6S; the 6Q has been
the focus of my hammering so far...
On Thu, Mar 20, 2014 at 10:18 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> Well, 64-byte lines would be covering descriptors 0,1 2,3 4,5 ... 12,13
> 14,15 etc.
>
> However, it wouldn't really make sense because the store buffers in the
> L310 are 32 bytes in size, and there are three of them. Given that with
> my patch the descriptor updates go through two drainings of that buffer
> per update - once after the address, length and other parameters are
> written, then again after the status bits have been written - I think
> it's unlikely that what you're suggesting could be a possibility,
> unless there's some bug we don't know about.
>
> Remember, DMA memory is "normal, non-cacheable", so there shouldn't be
> any cache effects going on here - only store buffer effects which may
> delay, merge and/or reorder writes, and that's why we have barriers.
>
>> 13 SH 0x1c00 0x00000000   66 (null)
>> 14    0x9c00 0xce504000   66 de5b90c0
>>
>> Note that the problematic descriptor (14) is even. That would fit with
>> the above. In your testing, is it always even?
>
> Since running with my patch on an iMX6Q, I haven't seen any timeouts,
> and I've run many iperf sessions in both directions (even multiple
> machines hammering the poor iMX6Q, running iperf and/or flood pings
> with various packet sizes.)
>
> What may be useful to know is which iMX devices these are still happening
> on, with my patch applied. I have both iMX6Q and iMX6S; the 6Q has been
> the focus of my hammering so far...

Robert's tests were made on an mx53 (single Cortex-A9), and its cache
controller is not the L310.

Regards,

Fabio Estevam
On Thu, Mar 20, 2014 at 10:36 PM, Fabio Estevam <festevam@gmail.com> wrote:
> Robert's tests were made on an mx53 (single Cortex-A9), and its cache
> controller is not the L310.

Oops, I meant Cortex-A8.
On Wed, Mar 19, 2014 at 5:51 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:
>> On Tuesday, March 18, 2014 at 06:26:15 PM, Russell King - ARM Linux wrote:
>> > On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
>> > > Setting prefetch enables and early BRESP could all be done
>> > > unconditionally in the core code.
>> >
>> > I think we can do a few things here, if we know that the CPUs we're
>> > connected to are all Cortex-A9:
>> >
>> > 1. Enable BRESP.
>> >
>> > 2. Enable I+D prefetching - but we really need to tune the prefetch
>> >    offset for this to be worthwhile. The value depends on the L3 memory
>> >    system latency, so isn't something that should be specified at the
>> >    SoC level. It may also change with different operating points.
>> >
>> > 3. Full line of zeros - I think this is a difficult one to achieve
>> >    properly. The required sequence:
>> >
>> >    - enable FLZ in L2 cache
>> >    - enable L2 cache
>> >    - enable FLZ in Cortex A9
>> >
>> > I'd also assume that when we turn the L2 cache off, we need the reverse
>> > sequence too. So this sequence can't be done entirely by the boot
>> > loader.
>> >
>> > With (1) enabled and (2) properly tuned, I see a performance increase of
>> > around 60Mbps on transmission, bringing the Cubox-i4 up from 250Mbps to
>> > 315Mbps transmit on its gigabit interface with cpufreq ondemand enabled.
>> > With "performance", this goes up to [323, 323, 321, 325, 322]Mbps. On
>> > receive [446, 603, 605, 605, 601]Mbps, which hasn't really changed
>> > very much (and still impressively exceeds the Freescale stated maximum
>> > total bandwidth of the gigabit interface.)
>>
>> Speaking of FEC and slightly off-topic, have you ever seen this on your
>> box [1]/[2]/[3]? I wonder if this might be cache-related as well, since
>> I saw a similar issue on MX6 with PCIe-connected ethernet. I cannot put
>> a finger on this though.
>
> I think I've seen something similar once or twice.
>
> Let me just pull up the transmit function. This is the code which
> writes to the descriptors (which frankly is pretty horrid):
>
> 	bdp->cbd_datlen = skb->len;
> 	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> 			skb->len, DMA_TO_DEVICE);	<-- barrier inside
> 	ebdp->cbd_bdu = 0;
> 	ebdp->cbd_esc = BD_ENET_TX_INT;
> 	ebdp->cbd_esc |= BD_ENET_TX_PINS;
> 	bdp->cbd_sc = status;
> 	writel(0, fep->hwp + FEC_X_DES_ACTIVE);	<-- barrier before write
>
> A couple of points here:
>
> 1. The hardware operates in a ring - it can read the next ring entry if it
>    sees the BD_ENET_TX_READY bit set when it has just finished processing
>    the previous descriptor - this can happen before the write to
>    FEC_X_DES_ACTIVE.

Correct. The uDMA will start the transfer when it sees the BD_ENET_TX_READY
bit. I think other network MACs work similarly to this one.

> 2. The ARM can re-order writes. The writes it can re-order are:
>
> 	bdp->cbd_bufaddr
> 	ebdp->cbd_bdu
> 	ebdp->cbd_esc
> 	ebdp->cbd_sc
>
> Hence, it's entirely possible for the FEC to see the updated descriptor
> status before the rest of the descriptor has been written. What's missing
> is a barrier between the descriptor writes, and the final write to
> bdp->cbd_sc.

Possible. I also considered this when debugging the other issues. Only
bdp->cbd_bufaddr and bdp->cbd_datlen really matter: the hardware does not
care about bdu, and esc is the same for every BD - once software has gone
through the BD ring once, esc never changes again even though we write it.
But if cbd_sc is written before the buffer address and length, there will
be a problem. Adding a memory barrier before the write to ebdp->cbd_sc is
better. I wanted to add one before, but I have not found a test case to
verify that it is necessary.

> Had I not got distracted by the L2 issues, I'd have posted my FEC patches
> by now... in any case, my current xmit function looks a little different -
> I've organised it such that:
>
> 1. We don't modify anything until we're past the point where things can
>    error out.
>
> 2. Writes to the descriptor are localised in one area.
>
> 3. There's a wmb() barrier between cbd_sc and the previous writes - I
>    discussed this with Will Deacon, which resulted in this documentation
>    for the barrier:
>
> 	/*
> 	 * We need the preceding stores to the descriptor to complete
> 	 * before updating the status field, which hands it over to the
> 	 * hardware. The corresponding rmb() is "in the hardware".
> 	 */

That is better.

> The second thing that causes the transmit timeouts is the horrid way
> NAPI has been added to the driver - it's racy. NAPI itself isn't the
> problem, it's this (compressed a bit to show only the relevant bits):
>
> 	do {
> 		int_events = readl(fep->hwp + FEC_IEVENT);
> 		writel(int_events, fep->hwp + FEC_IEVENT);
>
> 		if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
> 			if (napi_schedule_prep(&fep->napi)) {
> 				writel(FEC_RX_DISABLED_IMASK,
> 				       fep->hwp + FEC_IMASK);
> 				__napi_schedule(&fep->napi);
> 			}
> 		}
>
> 		if (int_events & FEC_ENET_MII) {
> 			complete(&fep->mdio_done);
> 		}
> 	} while (int_events);
>
> Consider what happens here if:
> - we talk on the MII bus and receive a MII interrupt
> - we're just finishing NAPI processing but haven't quite got around to
>   calling napi_complete()
> - the ethernet has sent all packets, and has also raised a transmit
>   interrupt
>
> The result is the handler is entered, FEC_IEVENT contains TXF and MII
> events. Both these events are cleared down, (and thus no longer exist
> as interrupt-causing events.) napi_schedule_prep() returns false as
> the NAPI rx function is still running, and doesn't mark it for a re-run.
> We then do the MII interrupt. Loop again, and int_events is zero,
> we exit.
>
> Meanwhile, the NAPI rx function calls napi_complete() and re-enables
> the receive interrupt. If you're unlucky enough that the RX ring is
> also full... no RXF interrupt. So no further interrupts except maybe
> MII interrupts.
>
> NAPI never gets scheduled. RX ring never gets emptied. TX ring never
> gets reaped.
> The result is a timeout with a completely full TX ring.

Do you see the RX ring full?

> I think I've seen both cases: I've seen the case where the TX ring is
> completely empty, but it hasn't been reaped. I've also seen the case
> where the TX ring contains packets to be transmitted but the hardware
> isn't sending them.
>
> That all said - with the patch below I haven't seen problems since.
> (which is the quickest way I can think of to get you a copy of what
> I'm presently running - I've killed a number of debug bits denoted by
> all the blank lines - and this is against -rc7.) You may notice that
> I added some TX ring dumping code to the driver - always useful in
> these situations. ;-)
>
> This patch is of course the consolidated version: individually, this
> would be at least 19 patches with nice commit messages describing
> each change...
>
> As far as the 600Mbps receive - you need the right conditions for that.
> I select the performance cpufreq governor after boot, and let the boot
> quiesce. It doesn't take much for it to drop back to 460Mbps - another
> running process other than iperf -s is sufficient to do that.
>
> Let me know how you get on with this.
>
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 3b8d6d19ff05..510580eeae4b 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -170,6 +170,11 @@ struct bufdesc_ex {
>  	unsigned short res0[4];
>  };
>
> +union bufdesc_u {
> +	struct bufdesc bd;
> +	struct bufdesc_ex ebd;
> +};
> +
>  /*
>   * The following definitions courtesy of commproc.h, which where
>   * Copyright (c) 1997 Dan Malek (dmalek@jlc.net).
> @@ -240,14 +245,14 @@ struct bufdesc_ex {
>   * the skbuffer directly.
>   */
>
> -#define FEC_ENET_RX_PAGES	8
> +#define FEC_ENET_RX_PAGES	32
>  #define FEC_ENET_RX_FRSIZE	2048
>  #define FEC_ENET_RX_FRPPG	(PAGE_SIZE / FEC_ENET_RX_FRSIZE)
>  #define RX_RING_SIZE		(FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
>  #define FEC_ENET_TX_FRSIZE	2048
>  #define FEC_ENET_TX_FRPPG	(PAGE_SIZE / FEC_ENET_TX_FRSIZE)
> -#define TX_RING_SIZE		16	/* Must be power of two */
> -#define TX_RING_MOD_MASK	15	/*   for this to work */
> +#define TX_RING_SIZE		64	/* Must be power of two */
> +#define TX_RING_MOD_MASK	63	/*   for this to work */
>
>  #define BD_ENET_RX_INT		0x00800000
>  #define BD_ENET_RX_PTP		((ushort)0x0400)
> @@ -289,12 +294,12 @@ struct fec_enet_private {
>  	/* CPM dual port RAM relative addresses */
>  	dma_addr_t	bd_dma;
>  	/* Address of Rx and Tx buffers */
> -	struct bufdesc	*rx_bd_base;
> -	struct bufdesc	*tx_bd_base;
> +	union bufdesc_u	*rx_bd_base;
> +	union bufdesc_u	*tx_bd_base;
>  	/* The next free ring entry */
> -	struct bufdesc	*cur_rx, *cur_tx;
> -	/* The ring entries to be free()ed */
> -	struct bufdesc	*dirty_tx;
> +	unsigned short	tx_next;
> +	unsigned short	tx_dirty;
> +	unsigned short	rx_next;
>
>  	unsigned short tx_ring_size;
>  	unsigned short rx_ring_size;
> @@ -335,6 +340,9 @@ struct fec_enet_private {
>  	struct timer_list time_keep;
>  	struct fec_enet_delayed_work delay_work;
>  	struct regulator *reg_phy;
> +	unsigned long quirks;
> +
> +
>  };
>
>  void fec_ptp_init(struct platform_device *pdev);
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 03a351300013..8105697d5a99 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -101,6 +101,8 @@ static void set_multicast_list(struct net_device *ndev);
>   * ENET_TDAR[TDAR].
>   */
>  #define FEC_QUIRK_ERR006358	(1 << 7)
> +/* Controller has ability to offset rx packets */
> +#define FEC_QUIRK_RX_SHIFT16	(1 << 8)
>
>  static struct platform_device_id fec_devtype[] = {
>  	{
> @@ -120,7 +122,8 @@ static struct platform_device_id fec_devtype[] = {
>  		.name = "imx6q-fec",
>  		.driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT |
>  				FEC_QUIRK_HAS_BUFDESC_EX | FEC_QUIRK_HAS_CSUM |
> -				FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358,
> +				FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358 |
> +				FEC_QUIRK_RX_SHIFT16,
>  	}, {
>  		.name = "mvf600-fec",
>  		.driver_data = FEC_QUIRK_ENET_MAC,
> @@ -200,6 +203,7 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
>  /* FEC receive acceleration */
>  #define FEC_RACC_IPDIS		(1 << 1)
>  #define FEC_RACC_PRODIS		(1 << 2)
> +#define FEC_RACC_SHIFT16	BIT(7)
>  #define FEC_RACC_OPTIONS	(FEC_RACC_IPDIS | FEC_RACC_PRODIS)
>
>  /*
> @@ -233,57 +237,54 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
>
>  static int mii_cnt;
>
> -static inline
> -struct bufdesc *fec_enet_get_nextdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
> +static unsigned copybreak = 200;
> +module_param(copybreak, uint, 0644);
> +MODULE_PARM_DESC(copybreak,
> +		 "Maximum size of packet that is copied to a new buffer on receive");
> +
> +
> +
> +
> +
> +static bool fec_enet_rx_zerocopy(struct fec_enet_private *fep, unsigned pktlen)
>  {
> -	struct bufdesc *new_bd = bdp + 1;
> -	struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp + 1;
> -	struct bufdesc_ex *ex_base;
> -	struct bufdesc *base;
> -	int ring_size;
> -
> -	if (bdp >= fep->tx_bd_base) {
> -		base = fep->tx_bd_base;
> -		ring_size = fep->tx_ring_size;
> -		ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
> -	} else {
> -		base = fep->rx_bd_base;
> -		ring_size = fep->rx_ring_size;
> -		ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
> -	}
> +#ifndef CONFIG_M5272
> +	if (fep->quirks & FEC_QUIRK_RX_SHIFT16 && pktlen >= copybreak)
> +		return true;
> +#endif
> +	return false;
> +}
> +
> +static union bufdesc_u *
> +fec_enet_tx_get(unsigned index, struct fec_enet_private *fep)
> +{
> +	union bufdesc_u *base = fep->tx_bd_base;
> +	union bufdesc_u *bdp;
> +
> +	index &= fep->tx_ring_size - 1;
>
>  	if (fep->bufdesc_ex)
> -		return (struct bufdesc *)((ex_new_bd >= (ex_base + ring_size)) ?
> -			ex_base : ex_new_bd);
> +		bdp = (union bufdesc_u *)(&base->ebd + index);
>  	else
> -		return (new_bd >= (base + ring_size)) ?
> -			base : new_bd;
> +		bdp = (union bufdesc_u *)(&base->bd + index);
> +
> +	return bdp;
>  }
>
> -static inline
> -struct bufdesc *fec_enet_get_prevdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
> +static union bufdesc_u *
> +fec_enet_rx_get(unsigned index, struct fec_enet_private *fep)
>  {
> -	struct bufdesc *new_bd = bdp - 1;
> -	struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp - 1;
> -	struct bufdesc_ex *ex_base;
> -	struct bufdesc *base;
> -	int ring_size;
> -
> -	if (bdp >= fep->tx_bd_base) {
> -		base = fep->tx_bd_base;
> -		ring_size = fep->tx_ring_size;
> -		ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
> -	} else {
> -		base = fep->rx_bd_base;
> -		ring_size = fep->rx_ring_size;
> -		ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
> -	}
> +	union bufdesc_u *base = fep->rx_bd_base;
> +	union bufdesc_u *bdp;
> +
> +	index &= fep->rx_ring_size - 1;
>
>  	if (fep->bufdesc_ex)
> -		return (struct bufdesc *)((ex_new_bd < ex_base) ?
> -			(ex_new_bd + ring_size) : ex_new_bd);
> +		bdp = (union bufdesc_u *)(&base->ebd + index);
>  	else
> -		return (new_bd < base) ? (new_bd + ring_size) : new_bd;
> +		bdp = (union bufdesc_u *)(&base->bd + index);
> +
> +	return bdp;
>  }
>
>  static void *swap_buffer(void *bufaddr, int len)
> @@ -297,6 +298,26 @@ static void *swap_buffer(void *bufaddr, int len)
>  	return bufaddr;
>  }
>
> +static void fec_dump(struct net_device *ndev)
> +{
> +	struct fec_enet_private *fep = netdev_priv(ndev);
> +	unsigned index = 0;
> +
> +	netdev_info(ndev, "TX ring dump\n");
> +	pr_info("Nr     SC     addr       len  SKB\n");
> +
> +	for (index = 0; index < fep->tx_ring_size; index++) {
> +		union bufdesc_u *bdp = fec_enet_tx_get(index, fep);
> +
> +		pr_info("%2u %c%c 0x%04x 0x%08lx %4u %p\n",
> +			index,
> +			index == fep->tx_next ? 'S' : ' ',
> +			index == fep->tx_dirty ? 'H' : ' ',
> +			bdp->bd.cbd_sc, bdp->bd.cbd_bufaddr, bdp->bd.cbd_datlen,
> +			fep->tx_skbuff[index]);
> +	}
> +}
> +
>  static int
>  fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
>  {
> @@ -312,21 +333,42 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
>  	return 0;
>  }
>
> +static void
> +fec_enet_tx_unmap(struct bufdesc *bdp, struct fec_enet_private *fep)
> +{
> +	dma_addr_t addr = bdp->cbd_bufaddr;
> +	unsigned length = bdp->cbd_datlen;
> +
> +	bdp->cbd_bufaddr = 0;
> +
> +	dma_unmap_single(&fep->pdev->dev, addr, length, DMA_TO_DEVICE);
> +}
> +
>  static netdev_tx_t
>  fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  {
>  	struct fec_enet_private *fep = netdev_priv(ndev);
> -	const struct platform_device_id *id_entry =
> -				platform_get_device_id(fep->pdev);
> -	struct bufdesc *bdp, *bdp_pre;
> +	union bufdesc_u *bdp, *bdp_pre;
>  	void *bufaddr;
>  	unsigned short	status;
> -	unsigned int index;
> +	unsigned index;
> +	unsigned length;
> +	dma_addr_t addr;
> +
> +
> +
> +
> +
> +
> +
> +
> +
>
>  	/* Fill in a Tx ring entry */
> -	bdp = fep->cur_tx;
> +	index = fep->tx_next;
>
> -	status = bdp->cbd_sc;
> +	bdp = fec_enet_tx_get(index, fep);
> +	status = bdp->bd.cbd_sc;
>
>  	if (status & BD_ENET_TX_READY) {
>  		/*
Ooops. All transmit buffers are full. Bail out. > @@ -347,21 +389,15 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev) > > /* Set buffer length and buffer pointer */ > bufaddr = skb->data; > - bdp->cbd_datlen = skb->len; > + length = skb->len; > > /* > * On some FEC implementations data must be aligned on > * 4-byte boundaries. Use bounce buffers to copy data > * and get it aligned. Ugh. > */ > - if (fep->bufdesc_ex) > - index = (struct bufdesc_ex *)bdp - > - (struct bufdesc_ex *)fep->tx_bd_base; > - else > - index = bdp - fep->tx_bd_base; > - > if (((unsigned long) bufaddr) & FEC_ALIGNMENT) { > - memcpy(fep->tx_bounce[index], skb->data, skb->len); > + memcpy(fep->tx_bounce[index], skb->data, length); > bufaddr = fep->tx_bounce[index]; > } > > @@ -370,70 +406,72 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev) > * the system that it's running on. As the result, driver has to > * swap every frame going to and coming from the controller. > */ > - if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME) > - swap_buffer(bufaddr, skb->len); > + if (fep->quirks & FEC_QUIRK_SWAP_FRAME) > + swap_buffer(bufaddr, length); > > - /* Save skb pointer */ > - fep->tx_skbuff[index] = skb; > - > - /* Push the data cache so the CPM does not get stale memory > - * data. > - */ > - bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr, > - skb->len, DMA_TO_DEVICE); > - if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) { > - bdp->cbd_bufaddr = 0; > - fep->tx_skbuff[index] = NULL; > + /* Push the data cache so the CPM does not get stale memory data. 
*/ > + addr = dma_map_single(&fep->pdev->dev, bufaddr, length, DMA_TO_DEVICE); > + if (dma_mapping_error(&fep->pdev->dev, addr)) { > dev_kfree_skb_any(skb); > if (net_ratelimit()) > netdev_err(ndev, "Tx DMA memory map failed\n"); > return NETDEV_TX_OK; > } > > - if (fep->bufdesc_ex) { > + /* Save skb pointer */ > + fep->tx_skbuff[index] = skb; > > - struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp; > - ebdp->cbd_bdu = 0; > + bdp->bd.cbd_datlen = length; > + bdp->bd.cbd_bufaddr = addr; > + > + if (fep->bufdesc_ex) { > + bdp->ebd.cbd_bdu = 0; > if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP && > fep->hwts_tx_en)) { > - ebdp->cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT); > + bdp->ebd.cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT); > skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; > } else { > - ebdp->cbd_esc = BD_ENET_TX_INT; > + bdp->ebd.cbd_esc = BD_ENET_TX_INT; > > /* Enable protocol checksum flags > * We do not bother with the IP Checksum bits as they > * are done by the kernel > */ > if (skb->ip_summed == CHECKSUM_PARTIAL) > - ebdp->cbd_esc |= BD_ENET_TX_PINS; > + bdp->ebd.cbd_esc |= BD_ENET_TX_PINS; > } > } > > + /* > + * We need the preceding stores to the descriptor to complete > + * before updating the status field, which hands it over to the > + * hardware. The corresponding rmb() is "in the hardware". > + */ > + wmb(); > + > /* Send it on its way. Tell FEC it's ready, interrupt when done, > * it's the last BD of the frame, and to put the CRC on the end. 
> */ > status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR > | BD_ENET_TX_LAST | BD_ENET_TX_TC); > - bdp->cbd_sc = status; > + bdp->bd.cbd_sc = status; > > - bdp_pre = fec_enet_get_prevdesc(bdp, fep); > - if ((id_entry->driver_data & FEC_QUIRK_ERR006358) && > - !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) { > + bdp_pre = fec_enet_tx_get(index - 1, fep); > + if ((fep->quirks & FEC_QUIRK_ERR006358) && > + !(bdp_pre->bd.cbd_sc & BD_ENET_TX_READY)) { > fep->delay_work.trig_tx = true; > schedule_delayed_work(&(fep->delay_work.delay_work), > msecs_to_jiffies(1)); > } > > - /* If this was the last BD in the ring, start at the beginning again. */ > - bdp = fec_enet_get_nextdesc(bdp, fep); > - > skb_tx_timestamp(skb); > > - fep->cur_tx = bdp; > + fep->tx_next = (index + 1) & (fep->tx_ring_size - 1); > > - if (fep->cur_tx == fep->dirty_tx) > + if (fep->tx_next == fep->tx_dirty) { > + > netif_stop_queue(ndev); > + } > > /* Trigger transmission start */ > writel(0, fep->hwp + FEC_X_DES_ACTIVE); > @@ -446,46 +484,43 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev) > static void fec_enet_bd_init(struct net_device *dev) > { > struct fec_enet_private *fep = netdev_priv(dev); > - struct bufdesc *bdp; > unsigned int i; > > /* Initialize the receive buffer descriptors. */ > - bdp = fep->rx_bd_base; > for (i = 0; i < fep->rx_ring_size; i++) { > + union bufdesc_u *bdp = fec_enet_rx_get(i, fep); > > /* Initialize the BD for every fragment in the page. 
*/ > - if (bdp->cbd_bufaddr) > - bdp->cbd_sc = BD_ENET_RX_EMPTY; > + if (bdp->bd.cbd_bufaddr) > + bdp->bd.cbd_sc = BD_ENET_RX_EMPTY; > else > - bdp->cbd_sc = 0; > - bdp = fec_enet_get_nextdesc(bdp, fep); > + bdp->bd.cbd_sc = 0; > + if (i == fep->rx_ring_size - 1) > + bdp->bd.cbd_sc |= BD_SC_WRAP; > } > > - /* Set the last buffer to wrap */ > - bdp = fec_enet_get_prevdesc(bdp, fep); > - bdp->cbd_sc |= BD_SC_WRAP; > - > - fep->cur_rx = fep->rx_bd_base; > + fep->rx_next = 0; > > /* ...and the same for transmit */ > - bdp = fep->tx_bd_base; > - fep->cur_tx = bdp; > for (i = 0; i < fep->tx_ring_size; i++) { > + union bufdesc_u *bdp = fec_enet_tx_get(i, fep); > > /* Initialize the BD for every fragment in the page. */ > - bdp->cbd_sc = 0; > - if (bdp->cbd_bufaddr && fep->tx_skbuff[i]) { > + /* Set the last buffer to wrap */ > + if (i == fep->tx_ring_size - 1) > + bdp->bd.cbd_sc = BD_SC_WRAP; > + else > + bdp->bd.cbd_sc = 0; > + if (bdp->bd.cbd_bufaddr) > + fec_enet_tx_unmap(&bdp->bd, fep); > + if (fep->tx_skbuff[i]) { > dev_kfree_skb_any(fep->tx_skbuff[i]); > fep->tx_skbuff[i] = NULL; > } > - bdp->cbd_bufaddr = 0; > - bdp = fec_enet_get_nextdesc(bdp, fep); > } > > - /* Set the last buffer to wrap */ > - bdp = fec_enet_get_prevdesc(bdp, fep); > - bdp->cbd_sc |= BD_SC_WRAP; > - fep->dirty_tx = bdp; > + fep->tx_next = 0; > + fep->tx_dirty = fep->tx_ring_size - 1; > } > > /* This function is called to start or restart the FEC during a link > @@ -496,8 +531,6 @@ static void > fec_restart(struct net_device *ndev, int duplex) > { > struct fec_enet_private *fep = netdev_priv(ndev); > - const struct platform_device_id *id_entry = > - platform_get_device_id(fep->pdev); > int i; > u32 val; > u32 temp_mac[2]; > @@ -519,7 +552,7 @@ fec_restart(struct net_device *ndev, int duplex) > * enet-mac reset will reset mac address registers too, > * so need to reconfigure it. 
> */ > - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) { > + if (fep->quirks & FEC_QUIRK_ENET_MAC) { > memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN); > writel(cpu_to_be32(temp_mac[0]), fep->hwp + FEC_ADDR_LOW); > writel(cpu_to_be32(temp_mac[1]), fep->hwp + FEC_ADDR_HIGH); > @@ -568,6 +601,8 @@ fec_restart(struct net_device *ndev, int duplex) > #if !defined(CONFIG_M5272) > /* set RX checksum */ > val = readl(fep->hwp + FEC_RACC); > + if (fep->quirks & FEC_QUIRK_RX_SHIFT16) > + val |= FEC_RACC_SHIFT16; > if (fep->csum_flags & FLAG_RX_CSUM_ENABLED) > val |= FEC_RACC_OPTIONS; > else > @@ -579,7 +614,7 @@ fec_restart(struct net_device *ndev, int duplex) > * The phy interface and speed need to get configured > * differently on enet-mac. > */ > - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) { > + if (fep->quirks & FEC_QUIRK_ENET_MAC) { > /* Enable flow control and length check */ > rcntl |= 0x40000000 | 0x00000020; > > @@ -602,7 +637,7 @@ fec_restart(struct net_device *ndev, int duplex) > } > } else { > #ifdef FEC_MIIGSK_ENR > - if (id_entry->driver_data & FEC_QUIRK_USE_GASKET) { > + if (fep->quirks & FEC_QUIRK_USE_GASKET) { > u32 cfgr; > /* disable the gasket and wait */ > writel(0, fep->hwp + FEC_MIIGSK_ENR); > @@ -655,7 +690,7 @@ fec_restart(struct net_device *ndev, int duplex) > writel(0, fep->hwp + FEC_HASH_TABLE_LOW); > #endif > > - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) { > + if (fep->quirks & FEC_QUIRK_ENET_MAC) { > /* enable ENET endian swap */ > ecntl |= (1 << 8); > /* enable ENET store and forward mode */ > @@ -692,8 +727,6 @@ static void > fec_stop(struct net_device *ndev) > { > struct fec_enet_private *fep = netdev_priv(ndev); > - const struct platform_device_id *id_entry = > - platform_get_device_id(fep->pdev); > u32 rmii_mode = readl(fep->hwp + FEC_R_CNTRL) & (1 << 8); > > /* We cannot expect a graceful transmit stop without link !!! 
*/ > @@ -711,7 +744,7 @@ fec_stop(struct net_device *ndev) > writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK); > > /* We have to keep ENET enabled to have MII interrupt stay working */ > - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) { > + if (fep->quirks & FEC_QUIRK_ENET_MAC) { > writel(2, fep->hwp + FEC_ECNTRL); > writel(rmii_mode, fep->hwp + FEC_R_CNTRL); > } > @@ -723,6 +756,8 @@ fec_timeout(struct net_device *ndev) > { > struct fec_enet_private *fep = netdev_priv(ndev); > > + fec_dump(ndev); > + > ndev->stats.tx_errors++; > > fep->delay_work.timeout = true; > @@ -751,34 +786,28 @@ static void fec_enet_work(struct work_struct *work) > static void > fec_enet_tx(struct net_device *ndev) > { > - struct fec_enet_private *fep; > - struct bufdesc *bdp; > + struct fec_enet_private *fep = netdev_priv(ndev); > + union bufdesc_u *bdp; > unsigned short status; > struct sk_buff *skb; > - int index = 0; > - > - fep = netdev_priv(ndev); > - bdp = fep->dirty_tx; > + unsigned index = fep->tx_dirty; > > - /* get next bdp of dirty_tx */ > - bdp = fec_enet_get_nextdesc(bdp, fep); > + do { > + index = (index + 1) & (fep->tx_ring_size - 1); > + bdp = fec_enet_tx_get(index, fep); > > - while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) { > + status = bdp->bd.cbd_sc; > + if (status & BD_ENET_TX_READY) > + break; > > /* current queue is empty */ > - if (bdp == fep->cur_tx) > + if (index == fep->tx_next) > break; > > - if (fep->bufdesc_ex) > - index = (struct bufdesc_ex *)bdp - > - (struct bufdesc_ex *)fep->tx_bd_base; > - else > - index = bdp - fep->tx_bd_base; > + fec_enet_tx_unmap(&bdp->bd, fep); > > skb = fep->tx_skbuff[index]; > - dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len, > - DMA_TO_DEVICE); > - bdp->cbd_bufaddr = 0; > + fep->tx_skbuff[index] = NULL; > > /* Check for errors. 
*/ > if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC | > @@ -797,19 +826,18 @@ fec_enet_tx(struct net_device *ndev) > ndev->stats.tx_carrier_errors++; > } else { > ndev->stats.tx_packets++; > - ndev->stats.tx_bytes += bdp->cbd_datlen; > + ndev->stats.tx_bytes += bdp->bd.cbd_datlen; > } > > if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) && > fep->bufdesc_ex) { > struct skb_shared_hwtstamps shhwtstamps; > unsigned long flags; > - struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp; > > memset(&shhwtstamps, 0, sizeof(shhwtstamps)); > spin_lock_irqsave(&fep->tmreg_lock, flags); > shhwtstamps.hwtstamp = ns_to_ktime( > - timecounter_cyc2time(&fep->tc, ebdp->ts)); > + timecounter_cyc2time(&fep->tc, bdp->ebd.ts)); > spin_unlock_irqrestore(&fep->tmreg_lock, flags); > skb_tstamp_tx(skb, &shhwtstamps); > } > @@ -825,45 +853,252 @@ fec_enet_tx(struct net_device *ndev) > > /* Free the sk buffer associated with this last transmit */ > dev_kfree_skb_any(skb); > - fep->tx_skbuff[index] = NULL; > - > - fep->dirty_tx = bdp; > - > - /* Update pointer to next buffer descriptor to be transmitted */ > - bdp = fec_enet_get_nextdesc(bdp, fep); > > /* Since we have freed up a buffer, the ring is no longer full > */ > - if (fep->dirty_tx != fep->cur_tx) { > - if (netif_queue_stopped(ndev)) > - netif_wake_queue(ndev); > + if (netif_queue_stopped(ndev)) { > + > + > + > + > + > + netif_wake_queue(ndev); > + > } > - } > + > + fep->tx_dirty = index; > + } while (1);

Where does this while (1) loop break?

> return; > } > > > -/* During a receive, the cur_rx points to the current incoming buffer.
> +static void > +fec_enet_receive(struct sk_buff *skb, union bufdesc_u *bdp, struct net_device *ndev) > +{ > + struct fec_enet_private *fep = netdev_priv(ndev); > + > + skb->protocol = eth_type_trans(skb, ndev); > + > + /* Get receive timestamp from the skb */ > + if (fep->hwts_rx_en && fep->bufdesc_ex) { > + struct skb_shared_hwtstamps *shhwtstamps = > + skb_hwtstamps(skb); > + unsigned long flags; > + > + memset(shhwtstamps, 0, sizeof(*shhwtstamps)); > + > + spin_lock_irqsave(&fep->tmreg_lock, flags); > + shhwtstamps->hwtstamp = ns_to_ktime( > + timecounter_cyc2time(&fep->tc, bdp->ebd.ts)); > + spin_unlock_irqrestore(&fep->tmreg_lock, flags); > + } > + > + if (fep->csum_flags & FLAG_RX_CSUM_ENABLED) { > + if (!(bdp->ebd.cbd_esc & FLAG_RX_CSUM_ERROR)) { > + /* don't check it */ > + skb->ip_summed = CHECKSUM_UNNECESSARY; > + } else { > + skb_checksum_none_assert(skb); > + } > + } > + > + napi_gro_receive(&fep->napi, skb); > +} > + > +static void noinline > +fec_enet_receive_copy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev) > +{ > + struct fec_enet_private *fep = netdev_priv(ndev); > + struct sk_buff *skb; > + unsigned char *data; > + bool vlan_packet_rcvd = false; > + > + /* > + * Detect the presence of the VLAN tag, and adjust > + * the packet length appropriately. > + */ > + if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX && > + bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) { > + pkt_len -= VLAN_HLEN; > + vlan_packet_rcvd = true; > + } > + > + /* This does 16 byte alignment, exactly what we need. 
*/ > + skb = netdev_alloc_skb(ndev, pkt_len + NET_IP_ALIGN); > + if (unlikely(!skb)) { > + ndev->stats.rx_dropped++; > + return; > + } > + > + dma_sync_single_for_cpu(&fep->pdev->dev, bdp->bd.cbd_bufaddr, > + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); > + > + data = fep->rx_skbuff[index]->data; > + > +#ifndef CONFIG_M5272 > + /* > + * If we have enabled this feature, we need to discard > + * the two bytes at the beginning of the packet before > + * copying it. > + */ > + if (fep->quirks & FEC_QUIRK_RX_SHIFT16) { > + pkt_len -= 2; > + data += 2; > + } > +#endif > + > + if (fep->quirks & FEC_QUIRK_SWAP_FRAME) > + swap_buffer(data, pkt_len); > + > + skb_reserve(skb, NET_IP_ALIGN); > + skb_put(skb, pkt_len); /* Make room */ > + > + /* If this is a VLAN packet remove the VLAN Tag */ > + if (vlan_packet_rcvd) { > + struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN); > + > + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), > + ntohs(vlan->h_vlan_TCI)); > + > + /* Extract the frame data without the VLAN header. 
*/ > + skb_copy_to_linear_data(skb, data, 2 * ETH_ALEN); > + skb_copy_to_linear_data_offset(skb, 2 * ETH_ALEN, > + data + 2 * ETH_ALEN + VLAN_HLEN, > + pkt_len - 2 * ETH_ALEN); > + } else { > + skb_copy_to_linear_data(skb, data, pkt_len); > + } > + > + dma_sync_single_for_device(&fep->pdev->dev, bdp->bd.cbd_bufaddr, > + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); > + > + fec_enet_receive(skb, bdp, ndev); > +} > + > +static void noinline > +fec_enet_receive_nocopy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev) > +{ > + struct fec_enet_private *fep = netdev_priv(ndev); > + struct sk_buff *skb, *skb_new; > + unsigned char *data; > + dma_addr_t addr; > + > +#if 0 > + skb_new = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE); > + if (!skb_new) { > + ndev->stats.rx_dropped++; > + return; > + } > + > + addr = dma_map_single(&fep->pdev->dev, skb_new->data, > + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); > + if (dma_mapping_error(&fep->pdev->dev, addr)) { > + dev_kfree_skb(skb_new); > + ndev->stats.rx_dropped++; > + return; > + } > +#else > + skb_new = NULL; > + addr = 0; > +#endif > + > + /* > + * We have the new skb, so proceed to deal with the > + * received data. > + */ > + dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr, > + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); > + > + skb = fep->rx_skbuff[index]; > + > + /* Now subsitute in the new skb */ > + fep->rx_skbuff[index] = skb_new; > + bdp->bd.cbd_bufaddr = addr; > + > + /* > + * Update the skb length according to the raw packet > + * length. Then remove the two bytes of additional > + * padding. > + */ > + skb_put(skb, pkt_len); > + data = skb_pull_inline(skb, 2); > + > + if (fep->quirks & FEC_QUIRK_SWAP_FRAME) > + swap_buffer(data, skb->len); > + > + /* > + * Now juggle things for the VLAN tag - if the hardware > + * flags this as present, we need to read the tag, and > + * then shuffle the ethernet addresses up. 
> + */ > + if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX && > + bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) { > + struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN); > + > + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), > + ntohs(vlan->h_vlan_TCI)); > + > + memmove(data + VLAN_HLEN, data, 2 * ETH_ALEN); > + skb_pull_inline(skb, VLAN_HLEN); > + } > + > + fec_enet_receive(skb, bdp, ndev); > +} > + > +static int > +fec_enet_refill_ring(unsigned first, unsigned last, struct net_device *ndev) > +{ > + struct fec_enet_private *fep = netdev_priv(ndev); > + unsigned i = first; > + > + do { > + union bufdesc_u *bdp = fec_enet_rx_get(i, fep); > + struct sk_buff *skb; > + dma_addr_t addr; > + > + if (!fep->rx_skbuff[i]) { > + skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE); > + if (!skb) > + return -ENOMEM; > + > + addr = dma_map_single(&fep->pdev->dev, skb->data, > + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); > + if (dma_mapping_error(&fep->pdev->dev, addr)) { > + dev_kfree_skb(skb); > + return -ENOMEM; > + } > + > + fep->rx_skbuff[i] = skb; > + bdp->bd.cbd_bufaddr = addr; > + } > + > + bdp->bd.cbd_sc = (bdp->bd.cbd_sc & BD_SC_WRAP) | > + BD_ENET_RX_EMPTY; > + > + if (fep->bufdesc_ex) { > + bdp->ebd.cbd_esc = BD_ENET_RX_INT; > + bdp->ebd.cbd_prot = 0; > + bdp->ebd.cbd_bdu = 0; > + } > + i = (i + 1) & (fep->rx_ring_size - 1); > + } while (i != last); > + > + return 0; > +} > + > +/* During a receive, the rx_next points to the current incoming buffer. > * When we update through the ring, if the next incoming buffer has > * not been given to the system, we just set the empty indicator, > * effectively tossing the packet. 
> */ > -static int > +static int noinline > fec_enet_rx(struct net_device *ndev, int budget) > { > struct fec_enet_private *fep = netdev_priv(ndev); > - const struct platform_device_id *id_entry = > - platform_get_device_id(fep->pdev); > - struct bufdesc *bdp; > unsigned short status; > - struct sk_buff *skb; > - ushort pkt_len; > - __u8 *data; > + unsigned pkt_len; > int pkt_received = 0; > - struct bufdesc_ex *ebdp = NULL; > - bool vlan_packet_rcvd = false; > - u16 vlan_tag; > - int index = 0; > + unsigned index = fep->rx_next; > > #ifdef CONFIG_M532x > flush_cache_all(); > @@ -872,12 +1107,16 @@ fec_enet_rx(struct net_device *ndev, int budget) > /* First, grab all of the stats for the incoming packet. > * These get messed up if we get called due to a busy condition. > */ > - bdp = fep->cur_rx; > + do { > + union bufdesc_u *bdp = fec_enet_rx_get(index, fep); > > - while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) { > + status = bdp->bd.cbd_sc; > + if (status & BD_ENET_RX_EMPTY) > + break; > > if (pkt_received >= budget) > break; > + > pkt_received++; > > /* Since we have allocated space to hold a complete frame, > @@ -917,124 +1156,33 @@ fec_enet_rx(struct net_device *ndev, int budget) > > /* Process the incoming frame. 
*/ > ndev->stats.rx_packets++; > - pkt_len = bdp->cbd_datlen; > - ndev->stats.rx_bytes += pkt_len; > - > - if (fep->bufdesc_ex) > - index = (struct bufdesc_ex *)bdp - > - (struct bufdesc_ex *)fep->rx_bd_base; > - else > - index = bdp - fep->rx_bd_base; > - data = fep->rx_skbuff[index]->data; > - dma_sync_single_for_cpu(&fep->pdev->dev, bdp->cbd_bufaddr, > - FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); > - > - if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME) > - swap_buffer(data, pkt_len); > - > - /* Extract the enhanced buffer descriptor */ > - ebdp = NULL; > - if (fep->bufdesc_ex) > - ebdp = (struct bufdesc_ex *)bdp; > - > - /* If this is a VLAN packet remove the VLAN Tag */ > - vlan_packet_rcvd = false; > - if ((ndev->features & NETIF_F_HW_VLAN_CTAG_RX) && > - fep->bufdesc_ex && (ebdp->cbd_esc & BD_ENET_RX_VLAN)) { > - /* Push and remove the vlan tag */ > - struct vlan_hdr *vlan_header = > - (struct vlan_hdr *) (data + ETH_HLEN); > - vlan_tag = ntohs(vlan_header->h_vlan_TCI); > - pkt_len -= VLAN_HLEN; > - > - vlan_packet_rcvd = true; > - } > > - /* This does 16 byte alignment, exactly what we need. > - * The packet length includes FCS, but we don't want to > - * include that when passing upstream as it messes up > - * bridging applications. > + /* > + * The packet length includes FCS, but we don't want > + * to include that when passing upstream as it messes > + * up bridging applications. > */ > - skb = netdev_alloc_skb(ndev, pkt_len - 4 + NET_IP_ALIGN); > + pkt_len = bdp->bd.cbd_datlen - 4; > + ndev->stats.rx_bytes += pkt_len; > > - if (unlikely(!skb)) { > - ndev->stats.rx_dropped++; > + if (fec_enet_rx_zerocopy(fep, pkt_len)) { > + fec_enet_receive_nocopy(pkt_len, index, bdp, ndev); > } else { > - int payload_offset = (2 * ETH_ALEN); > - skb_reserve(skb, NET_IP_ALIGN); > - skb_put(skb, pkt_len - 4); /* Make room */ > - > - /* Extract the frame data without the VLAN header. 
*/ > - skb_copy_to_linear_data(skb, data, (2 * ETH_ALEN)); > - if (vlan_packet_rcvd) > - payload_offset = (2 * ETH_ALEN) + VLAN_HLEN; > - skb_copy_to_linear_data_offset(skb, (2 * ETH_ALEN), > - data + payload_offset, > - pkt_len - 4 - (2 * ETH_ALEN)); > - > - skb->protocol = eth_type_trans(skb, ndev); > - > - /* Get receive timestamp from the skb */ > - if (fep->hwts_rx_en && fep->bufdesc_ex) { > - struct skb_shared_hwtstamps *shhwtstamps = > - skb_hwtstamps(skb); > - unsigned long flags; > - > - memset(shhwtstamps, 0, sizeof(*shhwtstamps)); > - > - spin_lock_irqsave(&fep->tmreg_lock, flags); > - shhwtstamps->hwtstamp = ns_to_ktime( > - timecounter_cyc2time(&fep->tc, ebdp->ts)); > - spin_unlock_irqrestore(&fep->tmreg_lock, flags); > - } > - > - if (fep->bufdesc_ex && > - (fep->csum_flags & FLAG_RX_CSUM_ENABLED)) { > - if (!(ebdp->cbd_esc & FLAG_RX_CSUM_ERROR)) { > - /* don't check it */ > - skb->ip_summed = CHECKSUM_UNNECESSARY; > - } else { > - skb_checksum_none_assert(skb); > - } > - } > - > - /* Handle received VLAN packets */ > - if (vlan_packet_rcvd) > - __vlan_hwaccel_put_tag(skb, > - htons(ETH_P_8021Q), > - vlan_tag); > - > - napi_gro_receive(&fep->napi, skb); > + fec_enet_receive_copy(pkt_len, index, bdp, ndev); > } > > - dma_sync_single_for_device(&fep->pdev->dev, bdp->cbd_bufaddr, > - FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); > rx_processing_done: > - /* Clear the status flags for this buffer */ > - status &= ~BD_ENET_RX_STATS; > - > - /* Mark the buffer empty */ > - status |= BD_ENET_RX_EMPTY; > - bdp->cbd_sc = status; > - > - if (fep->bufdesc_ex) { > - struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp; > - > - ebdp->cbd_esc = BD_ENET_RX_INT; > - ebdp->cbd_prot = 0; > - ebdp->cbd_bdu = 0; > - } > - > - /* Update BD pointer to next entry */ > - bdp = fec_enet_get_nextdesc(bdp, fep); > + index = (index + 1) & (fep->rx_ring_size - 1); > + if (index == fep->rx_next) > + break; > + } while (1); > > - /* Doing this here will keep the FEC running while we process 
> - * incoming frames. On a heavily loaded network, we should be > - * able to keep up at the expense of system resources. > - */ > + if (pkt_received) { > + fec_enet_refill_ring(fep->rx_next, index, ndev); > writel(0, fep->hwp + FEC_R_DES_ACTIVE); > } > - fep->cur_rx = bdp; > + > + fep->rx_next = index; > > return pkt_received; > } > @@ -1044,29 +1192,25 @@ fec_enet_interrupt(int irq, void *dev_id) > { > struct net_device *ndev = dev_id; > struct fec_enet_private *fep = netdev_priv(ndev); > + const unsigned napi_mask = FEC_ENET_RXF | FEC_ENET_TXF; > uint int_events; > irqreturn_t ret = IRQ_NONE; > > - do { > - int_events = readl(fep->hwp + FEC_IEVENT); > - writel(int_events, fep->hwp + FEC_IEVENT); > + int_events = readl(fep->hwp + FEC_IEVENT); > + writel(int_events & ~napi_mask, fep->hwp + FEC_IEVENT); > > - if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) { > - ret = IRQ_HANDLED; > + if (int_events & napi_mask) { > + ret = IRQ_HANDLED; > > - /* Disable the RX interrupt */ > - if (napi_schedule_prep(&fep->napi)) { > - writel(FEC_RX_DISABLED_IMASK, > - fep->hwp + FEC_IMASK); > - __napi_schedule(&fep->napi); > - } > - } > + /* Disable the NAPI interrupts */ > + writel(FEC_ENET_MII, fep->hwp + FEC_IMASK); > + napi_schedule(&fep->napi); > + } > > - if (int_events & FEC_ENET_MII) { > - ret = IRQ_HANDLED; > - complete(&fep->mdio_done); > - } > - } while (int_events); > + if (int_events & FEC_ENET_MII) { > + ret = IRQ_HANDLED; > + complete(&fep->mdio_done); > + } > > return ret; > } > @@ -1074,10 +1218,24 @@ fec_enet_interrupt(int irq, void *dev_id) > static int fec_enet_rx_napi(struct napi_struct *napi, int budget) > { > struct net_device *ndev = napi->dev; > - int pkts = fec_enet_rx(ndev, budget); > struct fec_enet_private *fep = netdev_priv(ndev); > + unsigned status; > + int pkts = 0; > + > + status = readl(fep->hwp + FEC_IEVENT) & (FEC_ENET_RXF | FEC_ENET_TXF); > + if (status) { > + /* > + * Clear any pending transmit or receive interrupts before > + * processing 
the rings to avoid racing with the hardware.
> +	 */
> +	writel(status, fep->hwp + FEC_IEVENT);
>
> -		fec_enet_tx(ndev);
> +		if (status & FEC_ENET_RXF)
> +			pkts = fec_enet_rx(ndev, budget);
> +
> +		if (status & FEC_ENET_TXF)
> +			fec_enet_tx(ndev);
> +	}
>
>  	if (pkts < budget) {
>  		napi_complete(napi);
> @@ -1263,8 +1421,6 @@ static int fec_enet_mdio_reset(struct mii_bus *bus)
>  static int fec_enet_mii_probe(struct net_device *ndev)
>  {
>  	struct fec_enet_private *fep = netdev_priv(ndev);
> -	const struct platform_device_id *id_entry =
> -				platform_get_device_id(fep->pdev);
>  	struct phy_device *phy_dev = NULL;
>  	char mdio_bus_id[MII_BUS_ID_SIZE];
>  	char phy_name[MII_BUS_ID_SIZE + 3];
> @@ -1302,7 +1458,7 @@ static int fec_enet_mii_probe(struct net_device *ndev)
>  	}
>
>  	/* mask with MAC supported features */
> -	if (id_entry->driver_data & FEC_QUIRK_HAS_GBIT) {
> +	if (fep->quirks & FEC_QUIRK_HAS_GBIT) {
>  		phy_dev->supported &= PHY_GBIT_FEATURES;
>  #if !defined(CONFIG_M5272)
>  		phy_dev->supported |= SUPPORTED_Pause;
> @@ -1329,8 +1485,6 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>  	static struct mii_bus *fec0_mii_bus;
>  	struct net_device *ndev = platform_get_drvdata(pdev);
>  	struct fec_enet_private *fep = netdev_priv(ndev);
> -	const struct platform_device_id *id_entry =
> -				platform_get_device_id(fep->pdev);
>  	int err = -ENXIO, i;
>
>  	/*
> @@ -1349,7 +1503,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>  	 * mdio interface in board design, and need to be configured by
>  	 * fec0 mii_bus.
>  	 */
> -	if ((id_entry->driver_data & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
> +	if ((fep->quirks & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
>  		/* fec1 uses fec0 mii_bus */
>  		if (mii_cnt && fec0_mii_bus) {
>  			fep->mii_bus = fec0_mii_bus;
> @@ -1370,7 +1524,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>  	 * document.
>  	 */
>  	fep->phy_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ahb), 5000000);
> -	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
> +	if (fep->quirks & FEC_QUIRK_ENET_MAC)
>  		fep->phy_speed--;
>  	fep->phy_speed <<= 1;
>  	writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED);
> @@ -1405,7 +1559,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>  	mii_cnt++;
>
>  	/* save fec0 mii_bus */
> -	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
> +	if (fep->quirks & FEC_QUIRK_ENET_MAC)
>  		fec0_mii_bus = fep->mii_bus;
>
>  	return 0;
> @@ -1694,23 +1848,24 @@ static void fec_enet_free_buffers(struct net_device *ndev)
>  	struct fec_enet_private *fep = netdev_priv(ndev);
>  	unsigned int i;
>  	struct sk_buff *skb;
> -	struct bufdesc	*bdp;
>
> -	bdp = fep->rx_bd_base;
>  	for (i = 0; i < fep->rx_ring_size; i++) {
> -		skb = fep->rx_skbuff[i];
> +		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
>
> -		if (bdp->cbd_bufaddr)
> -			dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
> +		skb = fep->rx_skbuff[i];
> +		if (skb) {
> +			dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
>  					 FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> -		if (skb)
>  			dev_kfree_skb(skb);
> -		bdp = fec_enet_get_nextdesc(bdp, fep);
> +		}
>  	}
>
> -	bdp = fep->tx_bd_base;
> -	for (i = 0; i < fep->tx_ring_size; i++)
> +	for (i = 0; i < fep->tx_ring_size; i++) {
> +		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
> +		if (bdp->bd.cbd_bufaddr)
> +			fec_enet_tx_unmap(&bdp->bd, fep);
>  		kfree(fep->tx_bounce[i]);
> +	}
>  }
>
>  static int fec_enet_alloc_buffers(struct net_device *ndev)
> @@ -1718,58 +1873,54 @@ static int fec_enet_alloc_buffers(struct net_device *ndev)
>  	struct fec_enet_private *fep = netdev_priv(ndev);
>  	unsigned int i;
>  	struct sk_buff *skb;
> -	struct bufdesc	*bdp;
>
> -	bdp = fep->rx_bd_base;
>  	for (i = 0; i < fep->rx_ring_size; i++) {
> +		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
> +		dma_addr_t addr;
> +
>  		skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
>  		if (!skb) {
>  			fec_enet_free_buffers(ndev);
>  			return -ENOMEM;
>  		}
> -		fep->rx_skbuff[i] = skb;
>
> -		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, skb->data,
> -				FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> -		if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
> +		addr = dma_map_single(&fep->pdev->dev, skb->data,
> +				FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +		if (dma_mapping_error(&fep->pdev->dev, addr)) {
> +			dev_kfree_skb(skb);
>  			fec_enet_free_buffers(ndev);
>  			if (net_ratelimit())
>  				netdev_err(ndev, "Rx DMA memory map failed\n");
>  			return -ENOMEM;
>  		}
> -		bdp->cbd_sc = BD_ENET_RX_EMPTY;
>
> -		if (fep->bufdesc_ex) {
> -			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -			ebdp->cbd_esc = BD_ENET_RX_INT;
> -		}
> +		fep->rx_skbuff[i] = skb;
>
> -		bdp = fec_enet_get_nextdesc(bdp, fep);
> -	}
> +		bdp->bd.cbd_bufaddr = addr;
> +		bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
> +		/* Set the last buffer to wrap. */
> +		if (i == fep->rx_ring_size - 1)
> +			bdp->bd.cbd_sc |= BD_SC_WRAP;
>
> -	/* Set the last buffer to wrap. */
> -	bdp = fec_enet_get_prevdesc(bdp, fep);
> -	bdp->cbd_sc |= BD_SC_WRAP;
> +		if (fep->bufdesc_ex)
> +			bdp->ebd.cbd_esc = BD_ENET_RX_INT;
> +	}
>
> -	bdp = fep->tx_bd_base;
>  	for (i = 0; i < fep->tx_ring_size; i++) {
> +		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
>  		fep->tx_bounce[i] = kmalloc(FEC_ENET_TX_FRSIZE, GFP_KERNEL);
>
> -		bdp->cbd_sc = 0;
> -		bdp->cbd_bufaddr = 0;
> -
> -		if (fep->bufdesc_ex) {
> -			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -			ebdp->cbd_esc = BD_ENET_TX_INT;
> -		}
> +		/* Set the last buffer to wrap. */
> +		if (i == fep->tx_ring_size - 1)
> +			bdp->bd.cbd_sc = BD_SC_WRAP;
> +		else
> +			bdp->bd.cbd_sc = 0;
> +		bdp->bd.cbd_bufaddr = 0;
>
> -		bdp = fec_enet_get_nextdesc(bdp, fep);
> +		if (fep->bufdesc_ex)
> +			bdp->ebd.cbd_esc = BD_ENET_TX_INT;
>  	}
>
> -	/* Set the last buffer to wrap. */
> -	bdp = fec_enet_get_prevdesc(bdp, fep);
> -	bdp->cbd_sc |= BD_SC_WRAP;
> -
>  	return 0;
>  }
>
> @@ -1990,9 +2141,7 @@ static const struct net_device_ops fec_netdev_ops = {
>  static int fec_enet_init(struct net_device *ndev)
>  {
>  	struct fec_enet_private *fep = netdev_priv(ndev);
> -	const struct platform_device_id *id_entry =
> -				platform_get_device_id(fep->pdev);
> -	struct bufdesc *cbd_base;
> +	union bufdesc_u *cbd_base;
>
>  	/* Allocate memory for buffer descriptors. */
>  	cbd_base = dma_alloc_coherent(NULL, PAGE_SIZE, &fep->bd_dma,
> @@ -2014,10 +2163,11 @@ static int fec_enet_init(struct net_device *ndev)
>  	/* Set receive and transmit descriptor base. */
>  	fep->rx_bd_base = cbd_base;
>  	if (fep->bufdesc_ex)
> -		fep->tx_bd_base = (struct bufdesc *)
> -			(((struct bufdesc_ex *)cbd_base) + fep->rx_ring_size);
> +		fep->tx_bd_base = (union bufdesc_u *)
> +			(&cbd_base->ebd + fep->rx_ring_size);
>  	else
> -		fep->tx_bd_base = cbd_base + fep->rx_ring_size;
> +		fep->tx_bd_base = (union bufdesc_u *)
> +			(&cbd_base->bd + fep->rx_ring_size);
>
>  	/* The FEC Ethernet specific entries in the device structure */
>  	ndev->watchdog_timeo = TX_TIMEOUT;
> @@ -2027,19 +2177,24 @@ static int fec_enet_init(struct net_device *ndev)
>  	writel(FEC_RX_DISABLED_IMASK, fep->hwp + FEC_IMASK);
>  	netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT);
>
> -	if (id_entry->driver_data & FEC_QUIRK_HAS_VLAN) {
> -		/* enable hw VLAN support */
> -		ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
> -		ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX;
> -	}
> +	if (fep->bufdesc_ex) {
> +		/* Features which require the enhanced buffer descriptors */
> +		netdev_features_t features = 0;
>
> -	if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
> -		/* enable hw accelerator */
> -		ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
> -				| NETIF_F_RXCSUM);
> -		ndev->hw_features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
> -				| NETIF_F_RXCSUM);
> -		fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
> +		if (fep->quirks & FEC_QUIRK_HAS_VLAN) {
> +			/* enable hw VLAN support */
> +			features |= NETIF_F_HW_VLAN_CTAG_RX;
> +		}
> +
> +		if (fep->quirks & FEC_QUIRK_HAS_CSUM) {
> +			/* enable hw accelerator */
> +			features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> +				    NETIF_F_RXCSUM;
> +			fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
> +		}
> +
> +		ndev->hw_features |= features;
> +		ndev->features |= features;
>  	}
>
>  	fec_restart(ndev, 0);
> @@ -2110,13 +2265,6 @@ fec_probe(struct platform_device *pdev)
>  	/* setup board info structure */
>  	fep = netdev_priv(ndev);
>
> -#if !defined(CONFIG_M5272)
> -	/* default enable pause frame auto negotiation */
> -	if (pdev->id_entry &&
> -	    (pdev->id_entry->driver_data & FEC_QUIRK_HAS_GBIT))
> -		fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
> -#endif
> -
>  	r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
>  	fep->hwp = devm_ioremap_resource(&pdev->dev, r);
>  	if (IS_ERR(fep->hwp)) {
> @@ -2126,6 +2274,14 @@ fec_probe(struct platform_device *pdev)
>
>  	fep->pdev = pdev;
>  	fep->dev_id = dev_id++;
> +	if (pdev->id_entry)
> +		fep->quirks = pdev->id_entry->driver_data;
> +
> +#if !defined(CONFIG_M5272)
> +	/* default enable pause frame auto negotiation */
> +	if (fep->quirks & FEC_QUIRK_HAS_GBIT)
> +		fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
> +#endif
>
>  	fep->bufdesc_ex = 0;
>
> @@ -2160,8 +2316,7 @@ fec_probe(struct platform_device *pdev)
>  		fep->clk_enet_out = NULL;
>
>  	fep->clk_ptp = devm_clk_get(&pdev->dev, "ptp");
> -	fep->bufdesc_ex =
> -		pdev->id_entry->driver_data & FEC_QUIRK_HAS_BUFDESC_EX;
> +	fep->bufdesc_ex = fep->quirks & FEC_QUIRK_HAS_BUFDESC_EX;
>  	if (IS_ERR(fep->clk_ptp)) {
>  		fep->clk_ptp = NULL;
>  		fep->bufdesc_ex = 0;
>
> --
> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
> improving, and getting towards what was expected from it.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
I also tried the patch, but unfortunately it doesn't work for me either. I get the same error as Robert Daniels. My target is an i.MX6 (Cortex-A9, single core) running Linux kernel version 3.10.9-rt5+.

Test done:

Start the iperf server on the target
$ iperf -s -u &

Let the iperf client run for 10 minutes on the host
$ iperf -c <ip_target> -u -b 100m -t 600 -l 256

Also start a ping from host to target, to easily detect when the error occurs
$ ping <ip_target>

Without the patch applied I see the following behaviour: after some time the ping from host to target stopped. As soon as iperf finished, I started a ping from target to host. By doing that the network came back to life: the ping from host to target resumed.

With the patch applied the following happens: after some amount of time the ping stops, but I've seen large ping times (up to 9000 ms) before it does. After iperf is done, the connection is dead, just like without the patch. The difference now is that the ping from host to target started reporting "Destination Host Unreachable" after iperf finished, and the ping from target to host didn't bring the connection back to life: it reported 100% packet loss.
dmesg output on the target shows the following:

------------[ cut here ]------------
WARNING: at /home/bfa/linux/net/sched/sch_generic.c:255 dev_watchdog+0x380/0x390()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G O 3.10.9-rt5+ #3
Backtrace:
[<80012cf0>] (dump_backtrace+0x0/0x118) from [<800130f0>] (show_stack+0x20/0x24)
 r6:000000ff r5:8046caf8 r4:9f8d1d80 r3:00000000
[<800130d0>] (show_stack+0x0/0x24) from [<8054c9f0>] (dump_stack+0x24/0x28)
[<8054c9cc>] (dump_stack+0x0/0x28) from [<80020d84>] (warn_slowpath_common+0x64/0x78)
[<80020d20>] (warn_slowpath_common+0x0/0x78) from [<80020e54>] (warn_slowpath_fmt+0x40/0x48)
 r8:8076ea10 r7:8072c2c0 r6:00000000 r5:9f837214 r4:9f837000 r3:00000009
[<80020e14>] (warn_slowpath_fmt+0x0/0x48) from [<8046caf8>] (dev_watchdog+0x380/0x390)
 r3:9f837000 r2:8069ab98
[<8046c778>] (dev_watchdog+0x0/0x390) from [<800326c8>] (call_timer_fn+0x54/0x1ac)
[<80032674>] (call_timer_fn+0x0/0x1ac) from [<80032a58>] (run_timer_softirq+0x238/0x2ec)
[<80032820>] (run_timer_softirq+0x0/0x2ec) from [<8002a180>] (handle_softirq+0x84/0x1e8)
[<8002a0fc>] (handle_softirq+0x0/0x1e8) from [<8002a344>] (do_single_softirq+0x60/0x90)
[<8002a2e4>] (do_single_softirq+0x0/0x90) from [<8002a4e4>] (do_current_softirqs+0x170/0x1a8)
 r7:80735620 r6:807b2440 r5:9f8d0000 r4:00000001
[<8002a374>] (do_current_softirqs+0x0/0x1a8) from [<8002a55c>] (run_ksoftirqd+0x40/0x60)
[<8002a51c>] (run_ksoftirqd+0x0/0x60) from [<800516a8>] (smpboot_thread_fn+0x23c/0x2b0)
 r5:9f8d0000 r4:9f827540
[<8005146c>] (smpboot_thread_fn+0x0/0x2b0) from [<80047684>] (kthread+0xb4/0xb8)
[<800475d0>] (kthread+0x0/0xb8) from [<8000edd8>] (ret_from_fork+0x14/0x20)
 r7:00000000 r6:00000000 r5:800475d0 r4:9f8c3e5c
---[ end trace 0000000000000002 ]---
fec 2188000.ethernet eth0: TX ring dump
Nr     SC     addr       len  SKB
 0   0x1c00  0x2e853000  98  9e8b6000
 1   0x1c00  0x2e853800  98  9e9bb6c0
 2   0x1c00  0x2e9c4000  98  9e841a80
 3   0x1c00  0x2e9c4800  98  9e841000
 4   0x1c00  0x2e9c5000  98  9e8c3480
 5   0x1c00  0x2e9c5800  98  9e9bb540
 6   0x1c00  0x2e9c6000  98  9e8b4f00
 7   0x1c00  0x2e9c6800  98  9e941e40
 8   0x1c00  0x2e9c7000  98  9e941300
 9   0x1c00  0x2e9c7800  98  9e941780
10   0x1c00  0x2e9cc000  98  9e8b4d80
11   0x1c00  0x2e9cc800  98  9e8b4b40
12   0x1c00  0x2e9cd000  98  9e8b4a80
13   0x1c00  0x2e9cd800  98  9e8b4600
14   0x1c00  0x2e9ce000  98  9e941600
15   0x1c00  0x2e9ce800  98  9e8b4540
16   0x1c00  0x2e9cf000  98  9e8b4180
17   0x1c00  0x2e9cf800  98  9e9419c0
18   0x1c00  0x2e9f8000  98  9e8b4300
19   0x1c00  0x2e9f8800  98  9e941840
20   0x1c00  0x2e9f9000  98  9e8b4cc0
21   0x1c00  0x2e9f9800  98  9e8b49c0
22   0x1c00  0x2e9fa000  98  9e8416c0
23   0x1c00  0x2e9fa800  98  9e8c1b40
24   0x1c00  0x2e9fb000  98  9e9ad900
25   0x1c00  0x2e9fb800  98  9e9ad300
26   0x1c00  0x2e9fc000  98  9e9ad840
27   0x1c00  0x2e9fc800  98  9e9ad000
28   0x1c00  0x2e9fd000  98  9e941b40
29   0x1c00  0x2e9fd800  98  9e8b6240
30   0x1c00  0x2e9fe000  98  9e941f00
31   0x1c00  0x2e9fe800  98  9e941000
32   0x1c00  0x2e9ff000  98  9e941a80
33   0x1c00  0x2e9ff800  98  9e8b4e40
34   0x1c00  0x2ea00000  98  9e9ad600
35   0x1c00  0x2ea00800  98  9e8413c0
36   0x1c00  0x2ea01000  98  9e841c00
37   0x1c00  0x2ea01800  98  9e8b6300
38   0x1c00  0x2ea02000  98  9e8b6c00
39   0x1c00  0x2ea02800  98  9e8b40c0
40   0x1c00  0x2ea03000  98  9e841e40
41   0x1c00  0x2ea03800  98  9e8c1600
42   0x1c00  0x2ea04000  98  9e8c1c00
43   0x1c00  0x2ea04800  98  9e9ad6c0
44   0x1c00  0x2ea05000  98  9e9bb0c0
45   0x1c00  0x2ea05800  98  9e9ad480
46   0x1c00  0x2ea06000  98  9e841900
47   0x1c00  0x2ea06800  98  9e8b4240
48   0x1c00  0x2ea07000  98  9e8b4000
49   0x1c00  0x2ea07800  98  9e8b4840
50   0x1c00  0x2ea08000  98  9e9ad240
51   0x1c00  0x2ea08800  98  9e9b19c0
52   0x1c00  0x2ea09000  98  9e941240
53   0x1c00  0x2ea09800  98  9e841600
54   0x1c00  0x2ea0a000  98  9e9bb780
55   0x1c00  0x2ea0a800  98  9e9b1480
56   0x1c00  0x2ea0b000  98  9e9bb600
57   0x1c00  0x2ea0b800  98  9e9b13c0
58   0x1c00  0x2ea0c000  98  9e9ada80
59   0x1c00  0x2ea0c800  98  9e9ad0c0
60 SH 0x1c00  0x00000000  98  (null)
61   0x9c00  0x2ea0d800  98  9e841b40
62   0x1c00  0x2ea0e000  98  9e9bb240
63   0x3c00  0x2ea0e800  98  9e8c3f00

Both situations are very reproducible. If you have suggestions on what I can do or test, please let me know.

regards,
Bernd Faust

On 21 March 2014 02:43, Fabio Estevam <festevam@gmail.com> wrote:
> On Thu, Mar 20, 2014 at 10:36 PM, Fabio Estevam <festevam@gmail.com> wrote:
>
>> Robert's tests were made on a mx53 (single CortexA9), and its cache
>> controller is not the L310.
>
> Ops, I meant CortexA8.
On Fri, Mar 21, 2014 at 10:50:41AM -0500, Frank Li wrote:
> On Wed, Mar 19, 2014 at 5:51 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > 2. The ARM can re-order writes. The writes it can re-order are:
> >
> > 	bdp->cbd_bufaddr
> > 	ebdp->cbd_bdu
> > 	ebdp->cbd_esc
> > 	ebdp->cbd_sc
> >
> > Hence, it's entirely possible for the FEC to see the updated descriptor
> > status before the rest of the descriptor has been written. What's missing
> > is a barrier between the descriptor writes, and the final write to
> > bdp->cbd_sc.
>
> Possible. I also consider this when debug the other issues.
> Just bdp->bufaddr and bdp->buflen is important.
> bdu, hardware don't care it.
> esc is the same for each BD. When software go though BD ring once,
> esc will be not changed again, even though we write.

If you assume that all you're sending is traffic which needs the headers checksummed (in other words, only IP traffic and nothing else) then you're correct. Reality is that other traffic gets included, such as ARPs, which have skb->ip_summed = CHECKSUM_NONE.

In any case, you are wrong about cbd_esc. For this code:

	ebdp->cbd_esc = BD_ENET_TX_INT;

	/* Enable protocol checksum flags
	 * We do not bother with the IP Checksum bits as they
	 * are done by the kernel
	 */
	if (skb->ip_summed == CHECKSUM_PARTIAL)
		ebdp->cbd_esc |= BD_ENET_TX_PINS;

this is the assembly which the compiler will generate:

	820:	e3a03101	mov	r3, #1073741824	; 0x40000000
	824:	e5863008	str	r3, [r6, #8]
	828:	e5d53064	ldrb	r3, [r5, #100]	; 0x64
	82c:	e203300c	and	r3, r3, #12
	830:	e353000c	cmp	r3, #12
	834:	03a03205	moveq	r3, #1342177280	; 0x50000000
	838:	05863008	streq	r3, [r6, #8]

That's a write to cbd_esc with BD_ENET_TX_INT set and BD_ENET_TX_PINS clear, followed by another write with both set if the packet has been partially checksummed. Depending on the CPU and timing, that could be visible to the device. Thankfully, fb8ef788680 fixed it for CPUs which are strongly ordered.
However, with a weakly ordered CPU, all it would take is an interrupt between those two writes to make them visible as separate writes, and without the following barrier, the order in which cbd_esc and cbd_sc become visible is uncertain.

> but if sc write before buffaddr and bullen, there will be issue.
> Add memory barre here is better before ebdp->cbd_sc.
>
> I also want to add it before, But I have not found a test case to
> verify it is necessary.

In these situations, knowledge and theory are better than test-and-verify, especially when we're talking about trying to hit a small window with timings. We know ARMv6 and ARMv7 are weakly ordered. We know we see problems with transmit on these devices. We know that the specification calls for the TX ready bit to be set last. We know on these CPUs that writes can be reordered. Therefore... a barrier is definitely required.

> > The result is the handler is entered, FEC_IEVENT contains TXF and MII
> > events. Both these events are cleared down, (and thus no longer exist
> > as interrupt-causing events.) napi_schedule_prep() returns false as
> > the NAPI rx function is still running, and doesn't mark it for a re-run.
> > We then do the MII interrupt. Loop again, and int_events is zero,
> > we exit.
> >
> > Meanwhile, the NAPI rx function calls napi_complete() and re-enables
> > the receive interrupt. If you're unlucky enough that the RX ring is
> > also full... no RXF interrupt. So no further interrupts except maybe
> > MII interrupts.
> >
> > NAPI never gets scheduled. RX ring never gets emptied. TX ring never
> > gets reaped. The result is a timeout with a completely full TX ring.
>
> Do you see RX ring full?

I haven't added any debug for the RX ring, so I don't know for certain. What I do know is that I've had the situation where the TX ring is completely full of packets which have been sent and the TX ring hasn't been reaped, which leads to the timeout, and that is down to the NAPI functions not being scheduled.
When that happens, the three sets of flood pings thumping away at the iMX6Q don't get any response until the timeout occurs. In that scenario, if a packet were to be received (and there's many packets hitting the interface) then it _would_ trigger reaping of the TX ring.

> > +	do {
> > +		index = (index + 1) & (fep->tx_ring_size - 1);
> > +		bdp = fec_enet_tx_get(index, fep);
> >
> > -	while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
> > +		status = bdp->bd.cbd_sc;
> > +		if (status & BD_ENET_TX_READY)
> > +			break;
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In answer to your question below, here.

> >
> >  		/* current queue is empty */
> > -		if (bdp == fep->cur_tx)
> > +		if (index == fep->tx_next)
> >  			break;
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

and here.

> >
> > -		if (fep->bufdesc_ex)
> > -			index = (struct bufdesc_ex *)bdp -
> > -				(struct bufdesc_ex *)fep->tx_bd_base;
> > -		else
> > -			index = bdp - fep->tx_bd_base;
> > +		fec_enet_tx_unmap(&bdp->bd, fep);
> >
> >  		skb = fep->tx_skbuff[index];
> > -		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
> > -				DMA_TO_DEVICE);
> > -		bdp->cbd_bufaddr = 0;
> > +		fep->tx_skbuff[index] = NULL;
> >
> >  		/* Check for errors. */
> >  		if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
> > @@ -797,19 +826,18 @@ fec_enet_tx(struct net_device *ndev)
> >  				ndev->stats.tx_carrier_errors++;
> >  		} else {
> >  			ndev->stats.tx_packets++;
> > -			ndev->stats.tx_bytes += bdp->cbd_datlen;
> > +			ndev->stats.tx_bytes += bdp->bd.cbd_datlen;
> >  		}
> >
> >  		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
> >  		    fep->bufdesc_ex) {
> >  			struct skb_shared_hwtstamps shhwtstamps;
> >  			unsigned long flags;
> > -			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> >
> >  			memset(&shhwtstamps, 0, sizeof(shhwtstamps));
> >  			spin_lock_irqsave(&fep->tmreg_lock, flags);
> >  			shhwtstamps.hwtstamp = ns_to_ktime(
> > -				timecounter_cyc2time(&fep->tc, ebdp->ts));
> > +				timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
> >  			spin_unlock_irqrestore(&fep->tmreg_lock, flags);
> >  			skb_tstamp_tx(skb, &shhwtstamps);
> >  		}
> > @@ -825,45 +853,252 @@ fec_enet_tx(struct net_device *ndev)
> >
> >  		/* Free the sk buffer associated with this last transmit */
> >  		dev_kfree_skb_any(skb);
> > -		fep->tx_skbuff[index] = NULL;
> > -
> > -		fep->dirty_tx = bdp;
> > -
> > -		/* Update pointer to next buffer descriptor to be transmitted */
> > -		bdp = fec_enet_get_nextdesc(bdp, fep);
> >
> >  		/* Since we have freed up a buffer, the ring is no longer full
> >  		 */
> > -		if (fep->dirty_tx != fep->cur_tx) {
> > -			if (netif_queue_stopped(ndev))
> > -				netif_wake_queue(ndev);
> > +		if (netif_queue_stopped(ndev)) {
> > +			netif_wake_queue(ndev);
> >  		}
> > -	}
> > +
> > +		fep->tx_dirty = index;
> > +	} while (1);
>
> where break this while(1);
On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
> I also tried the patch, but unfortunately it doesn't work for me either.
> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
> single core) running Linux Kernel version 3.10.9-rt5+

Hmm, rt kernels. Does this happen without the rt patches applied?

> Test done:
> Start iperf server on target
> $ iperf -s -u &
>
> Let iperf client run for 10 minutes on the host
> $ iperf -c <ip_target> -u -b 100m -t 600 -l 256
>
> Also start a ping from host to target, to easily detect when the error
> occurs
> $ ping <ip_target>

Right, I've just done this here; the results aren't that good.

On the iMX6S:

------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 160 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.181 port 5001 connected with 192.168.1.11 port 45248
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0-600.2 sec   109 MBytes  1.52 Mbits/sec  13.726 ms 29469735/29914906 (99%)

On my x86 laptop:

--- 192.168.1.181 ping statistics ---
617 packets transmitted, 616 received, 0% packet loss, time 616014ms
rtt min/avg/max/mdev = 0.167/0.294/8.934/0.512 ms

------------------------------------------------------------
Client connecting to 192.168.1.181, UDP port 5001
Sending 256 byte datagrams
UDP buffer size: 208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.11 port 45248 connected with 192.168.1.181 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-600.0 sec  7.13 GBytes   102 Mbits/sec
[  3] Sent 29914976 datagrams
[  3] Server Report:
[  3]  0.0-600.2 sec   109 MBytes  1.52 Mbits/sec  13.726 ms 29469735/29914906 (99%)

The loss is disappointing and needs further investigation, but as you can see, none of the symptoms you have reported are present - no lockup and no 9000ms ping times (the longest was 9ms).
This is a plain 3.14-rc7 kernel configured for preempt, without the -rt patches.
On 21 March 2014 18:32, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
>> I also tried the patch, but unfortunately it doesn't work for me either.
>> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
>> single core) running Linux Kernel version 3.10.9-rt5+
>
> Hmm, rt kernels. Does this happen without the rt patches applied?
>

On a kernel without the RT patches the problem does not occur, so in my case the problem seems to be introduced/triggered by the RT patches.
On Sun, Mar 23, 2014 at 12:38:31PM +0100, Bernd Faust wrote:
> On 21 March 2014 18:32, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> > On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
> >> I also tried the patch, but unfortunately it doesn't work for me either.
> >> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
> >> single core) running Linux Kernel version 3.10.9-rt5+
> >
> > Hmm, rt kernels. Does this happen without the rt patches applied?
>
> On a kernel without RT patches the problem does not occur, so in my
> case the problem seems to be introduced/triggered by the RT patches.

Yes, this suggests it's some kind of race introduced by the RT patches. I've never looked at the RT patches, so I can't help there.

What I can say is that there are a number of other problems with this driver which are all races of one form or another, but they should not affect normal operation. For example:

- the FEC_QUIRK_ERR006358 workaround (for iMX6 only) can read the previous
  descriptor before the write to the new descriptor's TX status bit has
  become visible to the hardware.

- the FEC_QUIRK_ERR006358 workaround not quite having the desired effect
  (a scheduled delayed work can't be re-scheduled until after it has
  already fired.)

- the number of places that call fec_stop() without first stopping the
  transmit queue or dealing with the above work-around before reconfiguring
  the hardware and re-initialising the rings.

- the rx path having a similar problem to the above (writing descriptor
  entries after the descriptor has been handed over to uDMA.)

I haven't addressed any of these yet, but they're on my list. I'm pretty sure there's other problems in there too...
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 03/21/2014
11:32:53 AM:
> Hmm, rt kernels. Does this happen without the rt patches applied?
I am not using the rt patches and I still see the problem. I'm using
3.14.0-rc6+ with
fec patches on the i.MX53 Quick Start Board with the latest U-Boot.
When I run my test I immediately see packet loss, with the tx timeout
issue eventually following.
These are the steps I use to reproduce the problem:
Setup: Desktop (Ubuntu 12.04: webfs, iperf3, /srv/ftp/test.bmp ~27 MB)
i.MX53 Quick Start Board (linux 3.14.0-rc6+ with fec patches:
iperf3, wget)
Test:
Desktop:
iperf3 -s -V
i.MX53 QSB:
ssh 1> iperf3 -c 192.168.1.101 -u -l 64 -b 55M -V -t 1000
ssh 2> cd /tmp; while true; do date; wget
http://192.168.1.101:8000/test.bmp; rm -fv /tmp/test.bmp; done
Now for the verbose explanation, I installed webfs on my Ubuntu desktop
machine and put
a ~27 MB bmp file in the web_root directory for the i.MX53 QSB to access.
I built iperf3
and ran the server on my Ubuntu desktop machine in verbose mode.
On the i.MX53 QSB I connect with ssh and run iperf3 with the above options.
I then open
another ssh session and run the wget command from above which continuously
transfers the
test.bmp from my desktop. As soon as I start the wget I start seeing
packet loss
from iperf3. After a while I will also see the tx timeout.
The following is what I see from iperf3 on the desktop:
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Time: Mon, 24 Mar 2014 16:42:13 GMT
Accepted connection from 192.168.1.169, port 48609
Cookie: ic-ii-0.4872.965116.71cd138c70e6e36e
[ 5] local 192.168.1.101 port 5201 connected to 192.168.1.169 port 56375
Starting Test: protocol: UDP, 1 streams, 64 byte blocks, omitting 0
seconds, 1000 second test
[ ID] Interval Transfer Bandwidth Jitter Lost/Total
Datagrams
[ 5] 0.00-1.00 sec 879 KBytes 7.20 Mbits/sec 0.087 ms 0/14064
(0%)
[ 5] 1.00-2.00 sec 910 KBytes 7.46 Mbits/sec 0.088 ms 0/14568
(0%)
[ 5] 2.00-3.00 sec 910 KBytes 7.46 Mbits/sec 0.086 ms 0/14562
(0%)
[ 5] 3.00-4.00 sec 891 KBytes 7.30 Mbits/sec 0.086 ms 0/14259
(0%)
[ 5] 4.00-5.00 sec 891 KBytes 7.30 Mbits/sec 0.084 ms 0/14250
(0%)
[ 5] 5.00-6.00 sec 913 KBytes 7.48 Mbits/sec 0.084 ms 0/14612
(0%)
[ 5] 6.00-7.00 sec 913 KBytes 7.48 Mbits/sec 0.085 ms 0/14615
(0%)
[ 5] 7.00-8.00 sec 806 KBytes 6.60 Mbits/sec 0.246 ms 0/12893
(0%)
[ 5] 8.00-9.00 sec 327 KBytes 2.68 Mbits/sec 2.748 ms 1/5230
(0.019%) <----- Start of wget transfers
[ 5] 9.00-10.00 sec 294 KBytes 2.41 Mbits/sec 0.191 ms 13/4715
(0.28%)
[ 5] 10.00-11.00 sec 301 KBytes 2.46 Mbits/sec 0.184 ms 11/4820
(0.23%)
[ 5] 11.00-12.00 sec 351 KBytes 2.88 Mbits/sec 0.373 ms 0/5616 (0%)
[ 5] 12.00-13.00 sec 289 KBytes 2.36 Mbits/sec 0.086 ms 4/4622
(0.087%)
[ 5] 13.00-14.00 sec 303 KBytes 2.48 Mbits/sec 0.084 ms 18/4859
(0.37%)
[ 5] 14.00-15.00 sec 369 KBytes 3.03 Mbits/sec 0.094 ms 0/5911 (0%)
[ 5] 15.00-16.00 sec 283 KBytes 2.32 Mbits/sec 0.193 ms 6/4537
(0.13%)
[ 5] 16.00-17.00 sec 277 KBytes 2.27 Mbits/sec 0.146 ms 12/4441
(0.27%)
[ 5] 17.00-18.00 sec 366 KBytes 3.00 Mbits/sec 0.054 ms 0/5850 (0%)
[ 5] 18.00-19.00 sec 287 KBytes 2.35 Mbits/sec 0.085 ms 5/4604
(0.11%)
[ 5] 19.00-20.00 sec 276 KBytes 2.26 Mbits/sec 0.081 ms 10/4429
(0.23%)
[ 5] 20.00-21.00 sec 362 KBytes 2.97 Mbits/sec 0.417 ms 1/5798
(0.017%)
[ 5] 21.00-22.00 sec 301 KBytes 2.47 Mbits/sec 0.089 ms 6/4826
(0.12%)
[ 5] 22.00-23.00 sec 278 KBytes 2.28 Mbits/sec 0.228 ms 15/4471
(0.34%)
[ 5] 23.00-24.00 sec 379 KBytes 3.10 Mbits/sec 0.130 ms 7/6068
(0.12%)
[ 5] 24.00-25.00 sec 318 KBytes 2.60 Mbits/sec 0.083 ms 2/5087
(0.039%)
[ 5] 25.00-26.00 sec 279 KBytes 2.29 Mbits/sec 0.301 ms 11/4476
(0.25%)
[ 5] 26.00-27.00 sec 350 KBytes 2.86 Mbits/sec 0.059 ms 11/5605
(0.2%)
[ 5] 27.00-28.00 sec 324 KBytes 2.65 Mbits/sec 0.597 ms 1/5179
(0.019%)
[ 5] 28.00-29.00 sec 284 KBytes 2.32 Mbits/sec 0.088 ms 13/4549
(0.29%)
[ 5] 29.00-30.00 sec 339 KBytes 2.77 Mbits/sec 0.442 ms 12/5431
(0.22%)
[ 5] 30.00-31.00 sec 317 KBytes 2.60 Mbits/sec 0.122 ms 1/5074
(0.02%)
[ 5] 31.00-32.00 sec 278 KBytes 2.28 Mbits/sec 0.099 ms 10/4458
(0.22%)
[ 5] 32.00-33.00 sec 328 KBytes 2.68 Mbits/sec 0.043 ms 12/5252
(0.23%)
[ 5] 33.00-34.00 sec 316 KBytes 2.59 Mbits/sec 0.445 ms 1/5060
(0.02%)
[ 5] 34.00-35.00 sec 277 KBytes 2.27 Mbits/sec 1.432 ms 13/4445
(0.29%)
[ 5] 35.00-36.00 sec 345 KBytes 2.83 Mbits/sec 0.089 ms 12/5535
(0.22%)
[ 5] 36.00-37.00 sec 360 KBytes 2.95 Mbits/sec 0.143 ms 0/5768 (0%)
[ 5] 37.00-38.00 sec 292 KBytes 2.39 Mbits/sec 0.460 ms 3/4671
(0.064%)
[ 5] 38.00-39.00 sec 290 KBytes 2.37 Mbits/sec 0.095 ms 14/4649
(0.3%)
[ 5] 39.00-40.00 sec 353 KBytes 2.89 Mbits/sec 0.318 ms 1/5643
(0.018%)
[ 5] 40.00-41.00 sec 297 KBytes 2.44 Mbits/sec 0.707 ms 3/4761
(0.063%)
[ 5] 41.00-42.00 sec 270 KBytes 2.21 Mbits/sec 0.173 ms 18/4335
(0.42%)
[ 5] 42.00-43.00 sec 359 KBytes 2.94 Mbits/sec 0.452 ms 1/5750
(0.017%)
[ 5] 43.00-44.00 sec 294 KBytes 2.41 Mbits/sec 0.112 ms 4/4703
(0.085%)
[ 5] 44.00-45.00 sec 272 KBytes 2.23 Mbits/sec 0.134 ms 18/4371
(0.41%)
[ 5] 45.00-46.00 sec 385 KBytes 3.16 Mbits/sec 0.090 ms 3/6170
(0.049%)
[ 5] 46.00-47.00 sec 309 KBytes 2.53 Mbits/sec 0.735 ms 2/4949
(0.04%)
[ 5] 47.00-48.00 sec 288 KBytes 2.36 Mbits/sec 0.206 ms 13/4622
(0.28%)
[ 5] 48.00-49.00 sec 356 KBytes 2.92 Mbits/sec 0.107 ms 4/5707
(0.07%)
[ 5] 49.00-50.00 sec 292 KBytes 2.39 Mbits/sec 1.680 ms 4/4670
(0.086%)
[ 5] 50.00-51.00 sec 262 KBytes 2.15 Mbits/sec 0.646 ms 12/4202
(0.29%)
[ 5] 51.00-52.00 sec 349 KBytes 2.86 Mbits/sec 0.151 ms 11/5597
(0.2%)
[ 5] 52.00-53.00 sec 313 KBytes 2.56 Mbits/sec 0.076 ms 3/5011
(0.06%)
[ 5] 53.00-54.00 sec 278 KBytes 2.27 Mbits/sec 0.436 ms 16/4456
(0.36%)
[ 5] 54.00-55.00 sec 361 KBytes 2.95 Mbits/sec 0.086 ms 3/5772
(0.052%)
[ 5] 55.00-56.00 sec 312 KBytes 2.56 Mbits/sec 0.422 ms 1/4992
(0.02%)
[ 5] 56.00-57.00 sec 292 KBytes 2.39 Mbits/sec 0.134 ms 14/4683
(0.3%)
[ 5] 57.00-58.00 sec 344 KBytes 2.82 Mbits/sec 0.090 ms 6/5513
(0.11%)
[ 5] 58.00-59.00 sec 200 KBytes 1.64 Mbits/sec 0.811 ms 1/3199
(0.031%)
[ 5] 59.00-60.00 sec 0.00 Bytes 0.00 bits/sec 0.811 ms 0/0 (-nan%)
<----- Coincides with the tx timeout
[ 5] 60.00-61.00 sec 0.00 Bytes 0.00 bits/sec 0.811 ms 0/0 (-nan%)
[ 5] 61.00-62.00 sec 84.2 KBytes 690 Kbits/sec 0.032 ms 0/1347 (0%)
[ 5] 62.00-63.00 sec 307 KBytes 2.52 Mbits/sec 0.243 ms 0/4915 (0%)
[ 5] 63.00-64.00 sec 301 KBytes 2.47 Mbits/sec 0.044 ms 0/4820 (0%)
[ 5] 64.00-65.00 sec 391 KBytes 3.21 Mbits/sec 0.053 ms 1/6263
(0.016%)
[ 5] 65.00-66.00 sec 364 KBytes 2.98 Mbits/sec 0.431 ms 0/5828 (0%)
Here is another tx timeout dump:
fec 63fec000.ethernet eth0: TX ring dump
Nr SC addr len SKB
0 0x1c00 0xce485000 106 dec7bf00
1 0x1c00 0xce485800 106 dec7bcc0
2 0x1c00 0xce486000 106 dec7b480
3 0x1c00 0xce486800 106 dec7b840
4 0x1c00 0xce487000 106 dec7b3c0
5 0x1c00 0xce487800 106 dec7b600
6 0x1c00 0xce500000 106 dec7b780
7 0x1c00 0xce500800 106 dec7be40
8 0x1c00 0xce501000 106 dec7b9c0
9 0x1c00 0xce501800 106 dec7bb40
10 0x1c00 0xce502000 106 dec7b180
11 0x1c00 0xce502800 106 de7b2540
12 0x1c00 0xce503000 106 de7b2480
13 0x1c00 0xce503800 106 de7b2900
14 0x1c00 0xce504000 106 de7b2d80
15 0x1c00 0xce504800 106 de7b23c0
16 0x1c00 0xce505000 106 de7b2cc0
17 0x1c00 0xce505800 106 de7b2000
18 0x1c00 0xce506000 106 de7b2b40
19 0x1c00 0xce506800 106 de7b2840
20 0x1c00 0xce507000 106 de7b2780
21 0x1c00 0xce507800 106 de7b2300
22 0x1c00 0xce508000 106 de7b2e40
23 0x1c00 0xce508800 106 de7b2180
24 0x1c00 0xce509000 106 de7b2c00
25 0x1c00 0xce509800 106 de7b29c0
26 0x1c00 0xce50a000 106 de7b2240
27 0x1c00 0xce50a800 106 de7b2a80
28 0x1c00 0xce50b000 106 de7b2600
29 0x1c00 0xce50b800 106 de7b20c0
30 0x1c00 0xce50c000 106 de7ea780
31 0x1c00 0xce50c800 106 de7ea480
32 0x1c00 0xce50d000 106 de7ea900
33 SH 0x1c00 0x00000000 106 (null)
34 0x9c00 0xce50e000 106 de120e40
35 0x1c00 0xce50e800 106 de120f00
36 0x1c00 0xce50f000 106 de120840
37 0x1c00 0xce50f800 106 de120600
38 0x1c00 0xce510000 66 de1209c0
39 0x1c00 0xce510800 106 de120a80
40 0x1c00 0xce511000 106 de120cc0
41 0x1c00 0xce511800 66 de120000
42 0x1c00 0xce512000 106 de1200c0
43 0x1c00 0xce512800 106 de120780
44 0x1c00 0xce513000 106 de120c00
45 0x1c00 0xce513800 66 de120b40
46 0x1c00 0xce514000 106 de120540
47 0x1c00 0xce514800 106 de120900
48 0x1c00 0xce515000 66 de120240
49 0x1c00 0xce515800 106 de120d80
50 0x1c00 0xce516000 106 de1203c0
51 0x1c00 0xce516800 66 de120480
52 0x1c00 0xce517000 106 de120300
53 0x1c00 0xce517800 106 de120180
54 0x1c00 0xce518000 66 dec7b000
55 0x1c00 0xce518800 106 dec7b300
56 0x1c00 0xce519000 106 dec7b0c0
57 0x1c00 0xce519800 66 dec7b540
58 0x1c00 0xce51a000 106 dec7bc00
59 0x1c00 0xce51a800 106 dec7ba80
60 0x1c00 0xce51b000 66 dec7b900
61 0x1c00 0xce51b800 106 dec7bd80
62 0x1c00 0xce51c000 106 dec7b240
63 0x3c00 0xce51c800 106 dec7b6c0
As the test continues to run, the tx timeout will continue to happen. I
monitored the sequence of tx timeouts to see
if there was anything interesting about which buffer descriptors were
having problems - here is the sequence:
15, 60, 57, 54, 53, 6, 18, 47, 24, 54, 23, 9, 39, 14
This email, and any document attached hereto, may contain
confidential and/or privileged information. If you are not the
intended recipient (or have received this email in error) please
notify the sender immediately and destroy this email. Any
unauthorized, direct or indirect, copying, disclosure, distribution
or other use of the material or parts thereof is strictly
forbidden.
On Monday, March 24, 2014 at 06:57:18 PM, robert.daniels@vantagecontrols.com wrote:
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 03/21/2014
> 11:32:53 AM:
> > Hmm, rt kernels. Does this happen without the rt patches applied?
>
> I am not using the rt patches and I still see the problem. I'm using
> 3.14.0-rc6+ with fec patches on the i.MX53 Quick Start Board with the
> latest U-Boot.
>
> When I run my test I will immediately see packet loss with the eventual
> tx timeout issue.
>
> These are the steps I use to reproduce the problem:
>
> Setup: Desktop (Ubuntu 12.04: webfs, iperf3, /srv/ftp/test.bmp ~27 MB)
>        i.MX53 Quick Start Board (linux 3.14.0-rc6+ with fec patches:
>        iperf3, wget)
>
> Test:
>
> Desktop:
>   iperf3 -s -V
>
> i.MX53 QSB:
>   ssh 1> iperf3 -c 192.168.1.101 -u -l 64 -b 55M -V -t 1000

I think the UDP test might be lossy; can you try a TCP test, please?

Best regards,
Marek Vasut
Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:
> I think the UDP test might be lossy, can you try with TCP test please ?
>
> Best regards,
> Marek Vasut

Yes, I tried with TCP and there were no problems.
I then tried my same test with a crossover cable: no problems.
Then I started testing some different network configurations:

1. Desktop & i.MX53 both connected to the same router: no problems.
2. Desktop & i.MX53 both connected to the same hub: packet loss but no tx
   timeout
3. Desktop connected to router, hub connected to router, i.MX53 connected
   to hub: packet loss + tx timeout

So, the packet loss is due to the hub, not the fec driver.
I'm not sure why I'm getting the tx timeout. My router is a Cisco
ValetPlus and the hub is a Netgear DS108. The router and desktop are
gigabit and the hub is 10/100.

I assumed that the tx timeout should not be happening - is this an
incorrect assumption based on my setup?

Thanks,
- Robert Daniels
On Mon, Mar 24, 2014 at 04:37:20PM -0600, robert.daniels@vantagecontrols.com wrote:
> Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:
> > I think the UDP test might be lossy, can you try with TCP test please ?
> >
> > Best regards,
> > Marek Vasut
>
> Yes, I tried with TCP and there were no problems.
> I then tried my same test with a crossover cable: no problems.
> Then I started testing some different network configurations:
>
> 1. Desktop & i.MX53 both connected to the same router: no problems.
> 2. Desktop & i.MX53 both connected to the same hub: packet loss but no
>    tx timeout
> 3. Desktop connected to router, hub connected to router, i.MX53
>    connected to hub: packet loss + tx timeout
>
> So, the packet loss is due to the hub, not the fec driver.
> I'm not sure why I'm getting the tx timeout. My router is a Cisco
> ValetPlus and the hub is a Netgear DS108.
> The router and desktop are gigabit and the hub is 10/100.
>
> I assumed that the tx timeout should not be happening - is this an
> incorrect assumption based on my setup?

The first point is that the timeout should not be happening, because when
the link is up and running, packets should be transmitted.

The second point is that I think you have a setup which tickles a bug in
the FEC hardware. What I think is going on is that the driver is filling
the transmit ring correctly. For some reason, the FEC is either failing
to update the header after transmission, leaving the buffer owned by the
FEC (and therefore unable to be reaped), or the FEC is skipping over a
ring descriptor.

As I hinted in one of my previous replies, I think the only way we're
truly going to work out what's going on here is to come up with some way
to mark each packet that is transmitted with where it was in the ring, so
a tcpdump session on another machine can be used to inspect which packets
from the ring were actually transmitted on the wire.

Now, you mention packet loss above.
Even on half-duplex networks, I would not expect there to be much packet
loss because of the retransmissions after a collision, and 100Mbit
networks handle this much better than 10Mbit. It takes repeated
collisions to cause a packet to actually be dropped (and a packet being
dropped because of collisions should be logged by the transmitting host
in its interface statistics.)

Another possibility is that the FEC isn't properly negotiating with the
hub, and the two ends are ending up configured differently, or that it is
negotiating properly but the FEC isn't being configured correctly.

Now that we know a little more about your setup, including that it seems
to be one specific setup which is causing the problems, I'll do some more
testing - I have an old Allied Telesis 10/100 hub here which I'll try
putting the iMX6 on and re-run your tests.
On Tuesday, March 25, 2014 at 12:44:43 AM, Russell King - ARM Linux wrote:
> On Mon, Mar 24, 2014 at 04:37:20PM -0600, robert.daniels@vantagecontrols.com wrote:
> > Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:
> > > I think the UDP test might be lossy, can you try with TCP test please ?
> > >
> > > Best regards,
> > > Marek Vasut
> >
> > Yes, I tried with TCP and there were no problems.
> > I then tried my same test with a crossover cable: no problems.
> > Then I started testing some different network configurations:
> >
> > 1. Desktop & i.MX53 both connected to the same router: no problems.
> > 2. Desktop & i.MX53 both connected to the same hub: packet loss but no
> >    tx timeout
> > 3. Desktop connected to router, hub connected to router, i.MX53
> >    connected to hub: packet loss + tx timeout
> >
> > So, the packet loss is due to the hub, not the fec driver.
> > I'm not sure why I'm getting the tx timeout. My router is a Cisco
> > ValetPlus and the hub is a Netgear DS108.
> > The router and desktop are gigabit and the hub is 10/100.
> >
> > I assumed that the tx timeout should not be happening - is this an
> > incorrect assumption based on my setup?
>
> The first point is that the timeout should not be happening, because
> when the link is up and running, packets should be transmitted.
>
> The second point is that I think you have a setup which tickles a bug
> in the FEC hardware. What I think is going on is that the driver is
> filling the transmit ring correctly. For some reason, the FEC is
> either failing to update the header after transmission leaving the
> buffer owned by the FEC (and therefore unable to be reaped), or the
> FEC is skipping over a ring descriptor.

I think we might be seeing something similar on i.MX28 in U-Boot too, but
I dare not correlate it with the in-kernel FEC issue. We also sporadically
see the FEC not transmitting a packet. I am just raising this to let you
all know that something like this exists there as well.

[...]

Best regards,
Marek Vasut
On Tue, Mar 25, 2014 at 02:02:42AM +0100, Marek Vasut wrote:
> I think we might be seeing something similar on i.MX28 in U-Boot too,
> but I dare not to correlate this and the in-kernel FEC issue. We also
> sporadically see the FEC not transmitting a packet.

That could be the ERR006358 problem. This occurs when you've transmitted
a packet previously, and the indicator in FEC_X_DES_ACTIVE remains set at
the point in time when you load the next packet. The subsequent write to
FEC_X_DES_ACTIVE has no effect on the TX side, which has already shut
down.

The work-around for that which is in the kernel is to look at the
previous packet status: if that indicates it has been sent immediately
after setting everything up for the next packet (remembering to take care
of the store buffers between the CPU and memory), then you need to
re-ping the transmitter after a short delay. (I've cocked that
work-around up today... it can happen quite frequently depending on your
workload.)
On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
> Now, you mention packet loss above. Even on half-duplex networks, I
> would not expect there to be much packet loss because of the
> retransmissions after a collision, and 100Mbit networks handle this
> much better than 10Mbit. It takes repeated collisions to cause a
> packet to actually be dropped (and a packet being dropped because
> of collisions should be logged by the transmitting host in its
> interface statistics.)
>
> Another possibility is that the FEC isn't properly negotiating with
> the hub, and the two ends are ending up configured differently, or
> that it is negotiating properly but the FEC isn't being configured
> correctly.
>
> Now that we know a little more about your setup, including that it
> seems to be one specific setup which is causing the problems, I'll
> do some more testing - I have an old Allied Telesys 10/100 hub here
> which I'll try putting the iMX6 on and re-run your tests.

PC --- 100Mbit hub --- iMX6S

iperf (2.0.5) -c PC -r reports:

iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec

However, there's a marked difference when the iMX6 sends - the collision
LED is almost permanently on at full brightness, whereas when the PC is
sending, it's almost always off.

I'm seeing lots of late collisions and CRC errors in the statistics
registers.

It looks to me like the iMX6's transmitter is blathering at any moment,
not taking any notice whether there's already a carrier on the RX side.
I'm not sure what to make of this yet.
On Wed, Mar 26, 2014 at 12:11:35AM +0000, Russell King - ARM Linux wrote:
> On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
> > Now, you mention packet loss above. Even on half-duplex networks, I
> > would not expect there to be much packet loss because of the
> > retransmissions after a collision, and 100Mbit networks handle this
> > much better than 10Mbit. It takes repeated collisions to cause a
> > packet to actually be dropped (and a packet being dropped because
> > of collisions should be logged by the transmitting host in its
> > interface statistics.)
> >
> > Another possibility is that the FEC isn't properly negotiating with
> > the hub, and the two ends are ending up configured differently, or
> > that it is negotiating properly but the FEC isn't being configured
> > correctly.
> >
> > Now that we know a little more about your setup, including that it
> > seems to be one specific setup which is causing the problems, I'll
> > do some more testing - I have an old Allied Telesys 10/100 hub here
> > which I'll try putting the iMX6 on and re-run your tests.
>
> PC --- 100Mbit hub --- iMX6S
>
> iperf (2.0.5) -c PC -r reports:
>
> iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
> iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec
>
> However, there's a marked difference when the iMX6 sends - the
> collision LED is almost permanently on at full brightness, whereas
> when the PC is sending, it's almost always off.
>
> I'm seeing lots of late collisions and CRC errors in the statistics
> registers.
>
> It looks to me like the iMX6's transmitter is blathering at any moment,
> not taking any notice whether there's already a carrier on the RX side.
> I'm not yet sure what to make of this yet.
Last night, I performed a different test:

PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
              |
    (the rest of my network)

This shows that all the (late) collisions occur in the 10Mbit domain, and
very few in the 100Mbit domain, which puts the blame fairly squarely on
the iMX6 side.

I can see nothing wrong with the setup of the iMX6, nor of the AR8035 phy
which my board has. Both the phy and the FEC appear to correctly indicate
that they are configured for half-duplex with flow control disabled, and
I've been through the iomux settings for the RGMII interface.

So, I'm left with three possibilities:
- the AR8035 doesn't work with half-duplex
- there is something wrong with the signalling for carrier sense between
  the AR8035 and the FEC.
- the iMX6 FEC doesn't work with half-duplex

It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
them, so I can't really say what's at fault at the moment, but from what
I can see, my hardware fails to work correctly with half-duplex
connections.
Hi Russell,

On 04/01/2014 02:26 AM, Russell King - ARM Linux wrote:
> On Wed, Mar 26, 2014 at 12:11:35AM +0000, Russell King - ARM Linux wrote:
>> On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
>>> Now, you mention packet loss above. Even on half-duplex networks, I
>>> would not expect there to be much packet loss because of the
>>> retransmissions after a collision, and 100Mbit networks handle this
>>> much better than 10Mbit. It takes repeated collisions to cause a
>>> packet to actually be dropped (and a packet being dropped because
>>> of collisions should be logged by the transmitting host in its
>>> interface statistics.)
>>>
>>> Another possibility is that the FEC isn't properly negotiating with
>>> the hub, and the two ends are ending up configured differently, or
>>> that it is negotiating properly but the FEC isn't being configured
>>> correctly.
>>>
>>> Now that we know a little more about your setup, including that it
>>> seems to be one specific setup which is causing the problems, I'll
>>> do some more testing - I have an old Allied Telesys 10/100 hub here
>>> which I'll try putting the iMX6 on and re-run your tests.
>>
>> PC --- 100Mbit hub --- iMX6S
>>
>> iperf (2.0.5) -c PC -r reports:
>>
>> iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
>> iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec
>>
>> However, there's a marked difference when the iMX6 sends - the
>> collision LED is almost permanently on at full brightness, whereas
>> when the PC is sending, it's almost always off.
>>
>> I'm seeing lots of late collisions and CRC errors in the statistics
>> registers.
>>
>> It looks to me like the iMX6's transmitter is blathering at any moment,
>> not taking any notice whether there's already a carrier on the RX side.
>> I'm not yet sure what to make of this yet.
>
> Last night, I performed a different test:
>
> PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
>               |
>     (the rest of my network)
>
> This shows that all the (late) collisions occur in the 10Mbit domain,
> and very few in the 100Mbit domain, which puts the blame fairly
> squarely on the iMX6 side.
>
> I can see nothing wrong with the setup of the iMX6, nor of the AR8035
> phy which my board has. Both the phy and the FEC appear to correctly
> indicate that they are configured for half-duplex with flow control
> disabled, and I've been through the iomux settings for the RGMII
> interface.
>
> So, I'm left with three possibilities:
> - the AR8035 doesn't work with half-duplex
> - there is something wrong with the signalling for carrier sense between
>   the AR8035 and the FEC.
> - the iMX6 FEC doesn't work with half-duplex
>
> It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
> them, so I can't really say what's at fault at the moment, but from what
> I can see, my hardware fails to work correctly with half-duplex
> connections.

What value do you have for pad control register
IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?

I notice in some of the DTS files that this has a value of 0x100b0, which
seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling that seems
wrong (also bit 7 seems wrong, as it's undefined in the RM).
On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
> Hi Russell,
>
> What value do you have for pad control register
> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?

You're referring to the iMX6Q; the Hummingboard is iMX6S, so that's
0x20e06bc. I have the value 0x1b030.

> I notice in some of the DTS files that this has a value of 0x100b0,
> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).

Yes, bit 7 should not be set as it's reserved, but this is ignored.
Setting the hysteresis bit affects the input thresholds, and this signal
is only ever used as an output. While selecting HYS mode for an output
may seem odd, that should not have any effect here.

In any case, TX_CTL doesn't have much to do with it. I've fully proved
that the interface works fine under full duplex conditions, achieving
500Mbps transmit and 570-600Mbps receive with TCP connections (which is
nicely beyond what freescale claim the hardware is capable of.)

The phy needs TX_CTL asserted in order to transmit anything, so as it
works in full duplex mode, this must be happening correctly.

The FEC needs RX_CTL asserted to indicate that valid data is being
received by the phy - if this was not being recognised by the FEC, full
duplex mode wouldn't work.

The last scenario is the error signals encoded onto TX_CTL/RX_CTL. This
is one area I can't comment on yet - monitoring these signals is not
exactly easy (and requires two steady hands.) However, I have briefly
monitored these signals when running at 100mbit half-duplex. With the
scope on RX_CTL and triggering on it, I do see TX_CTL being raised by the
FEC.

Whether TX_CTL is raised by the FEC just before or after RX_CTL has been
asserted I can't say yet.

However, TX_CTL raised while RX_CTL is raised or toggling means that a
collision has occurred.
TX_CTL should never be raised while RX_CTL indicates that there is a
"carrier" present in half-duplex mode (and therefore there is already
another ethernet station transmitting on the wire.)

Remember, I'm seeing late collisions. The only way that happens is if
something starts transmitting after 64 bytes into an ethernet frame, and
it implies that something is not noticing that the link is busy before it
starts blurting out.

I know that my hubs are fine - the 10mbit hub has been used over _many_
years (it's more than 20 years old now) and has always been rock solid...
unlike the 10/100 hub, which has a dodgy switch between the two segments:
I can lock out the 10mbit segment with enough 100mbit traffic.

Let's also not forget that these signals run at 125MHz at gigabit speeds,
and 2.5MHz at 10mbit... if they were marginal at 10mbit, gigabit
certainly would not work.
On 04/01/2014 10:29 AM, Russell King - ARM Linux wrote:
> On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
>> Hi Russell,
>>
>> What value do you have for pad control register
>> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?
>
> You're referring to the iMX6Q, the Hummingboard is iMX6S, so that's
> 0x20e06bc. I have value 0x1b030.

Right. I missed that you were running on Solo.

>> I notice in some of the DTS files that this has a value of 0x100b0,
>> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
>> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).
>
> Yes, bit 7 should not be set as it's reserved, but this is ignored.
> Setting the hysteresis bit affects the input thresholds, and this signal
> is only ever used as an output. While selecting HYS mode for an output
> may seem odd, that should not have any effect here.
>
> In any case, TX_CTL doesn't have much to do with it. I've fully proved
> that the interface works fine under full duplex conditions, achieving
> 500Mbps transmit and 570-600Mbps receive with TCP connections (which is
> nicely beyond what freescale claim the hardware is capable of.)
>
> The phy needs TX_CTL asserted in order to transmit anything, so as it
> works in full duplex mode, this must be happening correctly.

It appears I'm gender-challenged and confused TX_CTL and RX_CTL.

My thought (and question) was whether the pad setup for "transmit enable"
to the i.MX is either being missed entirely, or arriving late because of
an internal pullup/pull down or other pad misconfiguration.

The pad settings have every indication of being cut/pasted, which made me
question them.

> The FEC needs RX_CTL asserted to indicate that valid data is being
> received by the phy - if this was not being recognised by the FEC, full
> duplex mode wouldn't work.
>
> The last scenario is the error signals encoded onto TX_CTL/RX_CTL. This
> is one area I can't comment on yet - monitoring these signals is not
> exactly easy (and requires two steady hands.) However, I have briefly
> monitored these signals when running at 100mbit half-duplex. With the
> scope RX_CTL and triggering on it, I do see TX_CTL being raised by the
> FEC.
>
> Whether TX_CTL is raised by the FEC just before or after RX_CTL has been
> asserted I can't say yet.
>
> However, TX_CTL raised while RX_CTL is raised or toggling means that a
> collision has occurred. TX_CTL should never be raised while RX_CTL
> indicates that there is a "carrier" present in half-duplex mode (and
> therefore there is already another ethernet station transmitting on
> the wire.)
>
> Remember, I'm seeing late collisions. The only way that happens is if
> something starts transmitting after 64 bytes into an ethernet frame, and
> it implies that something is not noticing that the link is busy before it
> starts blurting out.

I guess we can rule out slew rate.

> I know that my hubs are fine - the 10mbit hub has been used over _many_
> years (it's more than 20 years old now) and has always been rock solid...
> unlike the 10/100 hub with a dodgy switch between the two segments which
> I can lock out the 10 mbit segment with enough 100mbit traffic.
>
> Let's also not forget that these signals run at 125MHz at gigabit speeds,
> and 2.5MHz at 10mbit... if they were marginal at 10mbit, gigabit
> certainly would not work.

Based on this, I don't see how a badly-configured pad could produce the
symptoms you're seeing.
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014 03:26:38 AM:
> Last night, I performed a different test:
>
> PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
>               |
>     (the rest of my network)
>
> This shows that all the (late) collisions occur in the 10Mbit domain,
> and very few in the 100Mbit domain, which puts the blame fairly
> squarely on the iMX6 side.
>
> I can see nothing wrong with the setup of the iMX6, nor of the AR8035
> phy which my board has. Both the phy and the FEC appear to correctly
> indicate that they are configured for half-duplex with flow control
> disabled, and I've been through the iomux settings for the RGMII
> interface.
>
> So, I'm left with three possibilities:
> - the AR8035 doesn't work with half-duplex
> - there is something wrong with the signalling for carrier sense between
>   the AR8035 and the FEC.
> - the iMX6 FEC doesn't work with half-duplex
>
> It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
> them, so I can't really say what's at fault at the moment, but from what
> I can see, my hardware fails to work correctly with half-duplex
> connections.

I'm not sure where this factors in, but I originally saw this problem
using the Freescale 2.6.35 kernel. The driver there exhibits this problem
differently, although it could very well be a different problem. What I
observed was that when the FEC got into this bad state the driver would
attempt to transmit a socket buffer but for some reason the buffer would
not actually get transmitted.

The driver would continue transmitting packets until it got all the way
around in the ring buffer to the buffer descriptor right before the one
that was never transmitted. When this buffer descriptor was set to
transmit you'd get a double transmit - the new packet and the previously
untransmitted buffer.

This results in out-of-order packets being sent directly from the i.MX53.
To illustrate this, I got my i.MX53 into this bad state and then ran ping
with the following results:

PING 192.168.1.101 (192.168.1.101) 56(84) bytes of data.
64 bytes from 192.168.1.101: icmp_seq=1 ttl=64 time=6.69 ms
64 bytes from 192.168.1.101: icmp_seq=2 ttl=64 time=0.306 ms
64 bytes from 192.168.1.101: icmp_seq=4 ttl=64 time=0.314 ms
64 bytes from 192.168.1.101: icmp_seq=5 ttl=64 time=0.325 ms
64 bytes from 192.168.1.101: icmp_seq=6 ttl=64 time=0.351 ms
64 bytes from 192.168.1.101: icmp_seq=7 ttl=64 time=0.323 ms
64 bytes from 192.168.1.101: icmp_seq=9 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=10 ttl=64 time=0.319 ms
64 bytes from 192.168.1.101: icmp_seq=11 ttl=64 time=0.353 ms
64 bytes from 192.168.1.101: icmp_seq=12 ttl=64 time=0.321 ms
64 bytes from 192.168.1.101: icmp_seq=13 ttl=64 time=0.308 ms
64 bytes from 192.168.1.101: icmp_seq=14 ttl=64 time=0.329 ms
64 bytes from 192.168.1.101: icmp_seq=15 ttl=64 time=0.335 ms
64 bytes from 192.168.1.101: icmp_seq=16 ttl=64 time=0.316 ms
64 bytes from 192.168.1.101: icmp_seq=17 ttl=64 time=0.326 ms
64 bytes from 192.168.1.101: icmp_seq=3 ttl=64 time=14006 ms
64 bytes from 192.168.1.101: icmp_seq=19 ttl=64 time=0.323 ms
64 bytes from 192.168.1.101: icmp_seq=20 ttl=64 time=0.314 ms
64 bytes from 192.168.1.101: icmp_seq=21 ttl=64 time=0.310 ms
64 bytes from 192.168.1.101: icmp_seq=22 ttl=64 time=0.317 ms
64 bytes from 192.168.1.101: icmp_seq=23 ttl=64 time=0.331 ms
64 bytes from 192.168.1.101: icmp_seq=8 ttl=64 time=15000 ms
64 bytes from 192.168.1.101: icmp_seq=25 ttl=64 time=0.322 ms
64 bytes from 192.168.1.101: icmp_seq=26 ttl=64 time=0.333 ms
64 bytes from 192.168.1.101: icmp_seq=27 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=28 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=29 ttl=64 time=0.335 ms
64 bytes from 192.168.1.101: icmp_seq=30 ttl=64 time=0.325 ms
64 bytes from 192.168.1.101: icmp_seq=31 ttl=64 time=0.307 ms
64 bytes from 192.168.1.101: icmp_seq=32 ttl=64 time=0.333 ms
64 bytes from 192.168.1.101: icmp_seq=18 ttl=64 time=14006 ms
64 bytes from 192.168.1.101: icmp_seq=34 ttl=64 time=0.330 ms
64 bytes from 192.168.1.101: icmp_seq=35 ttl=64 time=0.342 ms

Here you can see that icmp_seq=3 wasn't replied to until after
icmp_seq=17 was sent. Once these buffer descriptors get into this state,
they stay that way until the FEC is reset.

I don't see this exact behavior when I run the test with the 3.14 kernel,
but I'm starting to wonder if it's because the 3.14 kernel is receiving
the transmit timeout, which allows the driver to reset itself and start
working again, whereas the 2.6.35 driver is not.

I hope this additional information is useful. I don't know enough about
these low-level networking details to contribute much, but it's possible
that what I've seen in the 2.6.35 kernel is actually the same issue that
I'm seeing in the 3.14 kernel, just handled better.

Thanks,
Robert Daniels
On Tue, Apr 01, 2014 at 01:38:37PM -0600, robert.daniels@vantagecontrols.com wrote:
> I'm not sure where this factors in, but I originally saw this problem
> using the Freescale 2.6.35 kernel. The driver there exhibits this
> problem differently, although it could very well be a different problem.
> What I observed was that when the FEC got into this bad state the driver
> would attempt to transmit a socket buffer but for some reason the buffer
> would not actually get transmitted.
>
> The driver would continue transmitting packets until it got all the way
> around in the ring buffer to the buffer descriptor right before the one
> that was never transmitted. When this buffer descriptor was set to
> transmit you'd get a double transmit - the new packet and the previously
> untransmitted buffer.
>
> This results in out-of-order packets being sent directly from the i.MX53.

At first glance, this is consistent with my idea of the FEC skipping a
ring entry on the initial pass around.

Let's say that the problem entry is number 12, which has been skipped.
When we get back around to entry 11, the FEC will transmit entries 11 and
12, as you rightly point out, and it will then look at entry 13 for the
next packet.

However, the driver loads the next packet into entry 12, and hits the FEC
to transmit it. The FEC re-reads entry 13, finds no packet, so does
nothing.

Then the next packet is submitted to the driver, and it enters it into
entry 13, again hitting the FEC. The FEC now sees the entry at 13, while
the entry at 12 is still pending.

> I hope this additional information is useful, I don't know enough
> about these low-level networking details to contribute much but
> it's possible that what I've seen in the 2.6.35 kernel is actually
> the same issue that I'm seeing in the 3.14 kernel but handled
> better.

It confirms the theory, but doesn't really provide many clues towards a
solution at the moment.
However, I've had something of a breakthrough with iMX6 and half-duplex.
I think much of the problem comes down to the ERR006358 workaround
implemented in the driver (this apparently doesn't affect your device.)
The delayed-work implementation, and my delayed-timer implementation of
the same, are fundamentally wrong according to the erratum documentation
- as is the version implemented in the Freescale BSP.

Implementing what the erratum describes as an acceptable workaround
improves things tremendously - I see iperf on a 10Mbit hub go from
1-2Mbps up to 8Mbps, though still with loads of collisions. That said,
I'm not that trusting of the error bits indicated by the FEC.

The reason I mention it here is that I wonder if less whacking of the
FEC_X_DES_ACTIVE register may help your problem.

In 3.14, in the fec_enet_start_xmit function, find the
"writel(0, fep->hwp + FEC_X_DES_ACTIVE);" and change it to:

	wmb();

	/* Trigger transmission start */
	if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
		writel(0, fep->hwp + FEC_X_DES_ACTIVE);

and see whether that helps your problem(s).
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014 04:51:49 PM: > At initial glance, this is coherent with my idea of the FEC skipping a > ring entry on the initial pass around. Then when a new entry is loaded, > > Let's say that the problem entry is number 12 that has been skipped. > When we get back around to entry 11, the FEC will transmit entries 11 > and 12, as you rightly point out, and it will then look at entry 13 > for the next packet. > > However, the driver loads the next packet into entry 12, and hits the > FEC to transmit it. The FEC re-reads entry 13, finds no packet, so > does nothing. > > Then the next packet is submitted to the driver, and it enters it into > entry 13, again hitting the FEC. The FEC now sees the entry at 13, > meanwhile the entry at 12 is still pending. > > > I hope this additional information is useful, I don't know enough > > about these low-level networking details to contribute much but > > it's possible that what I've seen in the 2.6.35 kernel is actually > > the same issue that I'm seeing in the 3.14 kernel but handled > > better. > > It confirms the theory, but doesn't really provide much clues for a > solution at the moment. So according to this theory, if we can detect this situation we should be able to skip entry 12 and then everything should get back into sink? I did a little test where I marked '12' as needing to be skipped and then the next time it was to be used I skipped it... and I got an out of order packet but then things got back on track - that's the good news. Unfortunately, I think there must be something wrong with my hack because after running for a while longer ethernet stops working entirely. Not sure why but I'll see if I can figure that out. This was on my 2.6.35 kernel. > However, I've had something of a breakthrough with iMX6 and half-duplex. 
> I think much of the problem comes down to the ERR006358 workaround
> implemented in the driver (this apparently doesn't affect your device.)
> The delayed work implementation, and my delayed timer implementation of
> the same, are fundamentally wrong with respect to the erratum
> documentation - as is the version implemented in the Freescale BSP.
>
> Implementing what the erratum describes as an acceptable workaround
> improves things tremendously - I see iperf on a 10Mbit hub go from
> 1-2Mbps up to 8Mbps, though still with loads of collisions. That said,
> I'm not that trusting of the error bits indicated by the FEC.
>
> The reason I mention it here is that I wonder if less whacking of the
> FEC_X_DES_ACTIVE register may help your problem.
>
> In 3.14, in the fec_enet_start_xmit function, find the
> "writel(0, fep->hwp + FEC_X_DES_ACTIVE);" and change it to:
>
>	wmb();
>
>	/* Trigger transmission start */
>	if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
>		writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>
> and see whether that helps your problem(s).

I tried less whacking and I still have the same tx transmit timeout issue.

Thanks,
Robert Daniels
On 04/02/2014 1:30 AM, Russell King - ARM Linux wrote:
>On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
>> Hi Russell,
>>
>> What value do you have for pad control register
>> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?
>
>You're referring to the iMX6Q, the Hummingboard is iMX6S, so that's 0x20e06bc.
>I have value 0x1b030.
>
>> I notice in some of the DTS files that this has a value of 0x100b0,
>> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
>> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).
>
>Yes, bit 7 should not be set as it's reserved, but this is ignored.
>Setting the hysteresis bit affects the input thresholds, and this signal is
>only ever used as an output. While selecting HYS mode for an output may seem
>odd, that should not have any effect here.
>
>In any case, TX_CTL doesn't have much to do with it. I've fully proved that
>the interface works fine under full duplex conditions, achieving 500Mbps
>transmit and 570-600Mbps receive with TCP connections (which is nicely beyond
>what freescale claim the hardware is capable of.)
>
Hi, Russell,

The imx6q/dl/s silicon ENET has a bandwidth limitation - the bandwidth is 400 ~ 700 Mbps, according to the SoC architects - so we claim our ENET performance as 460Mbps. In fact, for our kernel 3.0.35 series release and kernel 3.10.17 release:

TCP tx: 470Mbps (stable)
TCP rx: 600Mbps (stable)

[...]

Thanks,
Andy
Russell King wrote on 04/01/2014 04:51:49 PM: >On Tue, Apr 01, 2014 at 01:38:37PM -0600, robert.daniels@vantagecontrols.com >wrote: >> I'm not sure where this factors in, but I originally saw this problem >> using the Freescale 2.6.35 kernel. The driver there exhibits this >> problem differently, although it could very well be a different >> problem. What I observed was that when the FEC got into this bad state >> the driver would attempt to transmit a socket buffer but for some >> reason the buffer would not actually get transmitted. >> >> The driver would continue transmitting packets until it got all the >> way around in the ring buffer to the buffer descriptor right before >> the one that was never transmitted. When this buffer descriptor was >> set to transmit you'd get a double transmit - the new packet and the >> previously untransmitted buffer. >> >> This results in out-of-order packets being sent directly from the i.MX53. > >At initial glance, this is coherent with my idea of the FEC skipping a ring >entry on the initial pass around. Then when a new entry is loaded, > >Let's say that the problem entry is number 12 that has been skipped. >When we get back around to entry 11, the FEC will transmit entries 11 and 12, >as you rightly point out, and it will then look at entry 13 for the next packet. > >However, the driver loads the next packet into entry 12, and hits the FEC to >transmit it. The FEC re-reads entry 13, finds no packet, so does nothing. > >Then the next packet is submitted to the driver, and it enters it into entry 13, >again hitting the FEC. The FEC now sees the entry at 13, meanwhile the entry >at 12 is still pending. > >> I hope this additional information is useful, I don't know enough >> about these low-level networking details to contribute much but it's >> possible that what I've seen in the 2.6.35 kernel is actually the same >> issue that I'm seeing in the 3.14 kernel but handled better. 
>
>It confirms the theory, but doesn't really provide many clues for a
>solution at the moment.
>
>However, I've had something of a breakthrough with iMX6 and half-duplex.
>I think much of the problem comes down to the ERR006358 workaround
>implemented in the driver (this apparently doesn't affect your device.)
>The delayed work implementation, and my delayed timer implementation of
>the same, are fundamentally wrong with respect to the erratum
>documentation - as is the version implemented in the Freescale BSP.
>
>Implementing what the erratum describes as an acceptable workaround
>improves things tremendously - I see iperf on a 10Mbit hub go from
>1-2Mbps up to 8Mbps, though still with loads of collisions. That said,
>I'm not that trusting of the error bits indicated by the FEC.

I don't have a 10Mbit hub, but I used ethtool to change the PHY speed/duplex mode on the imx6q sabresd platform:
- 100M half: works fine
- 10M half: works fine; in the iperf test, the tx bandwidth is 6.2 ~ 11 Mbps.

Log:

root@freescale /data/ptp-debug$ ./ethtool -s eth0 autoneg off speed 10 duplex half
root@freescale /data/ptp-debug$ PHY: 1:01 - Link is Down
root@freescale /data/ptp-debug$ PHY: 1:01 - Link is Up - 10/Half
root@freescale /data/ptp-debug$ iperf -c 10.192.242.202 -t 100 -i 1
------------------------------------------------------------
Client connecting to 10.192.242.202, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.192.242.124 port 52725 connected with 10.192.242.202 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   768 KBytes  6.29 Mbits/sec
[  3]  1.0- 2.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3]  2.0- 3.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  3.0- 4.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  4.0- 5.0 sec   768 KBytes  6.29 Mbits/sec
[  3]  5.0- 6.0 sec  1.38 MBytes  11.5 Mbits/sec
[  3]  6.0- 7.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  7.0- 8.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3]  8.0- 9.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  9.0-10.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 10.0-11.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 11.0-12.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 12.0-13.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 13.0-14.0 sec  1.38 MBytes  11.5 Mbits/sec
[  3] 14.0-15.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 15.0-16.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 16.0-17.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 17.0-18.0 sec   512 KBytes  4.19 Mbits/sec
[  3] 18.0-19.0 sec  1.62 MBytes  13.6 Mbits/sec
[  3] 19.0-20.0 sec   640 KBytes  5.24 Mbits/sec
[  3] 20.0-21.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3] 21.0-22.0 sec   640 KBytes  5.24 Mbits/sec
[  3] 22.0-23.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 23.0-24.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3] 24.0-25.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 25.0-26.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 26.0-27.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 27.0-28.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 28.0-29.0 sec  1.00 MBytes  8.39 Mbits/sec
....

>
>The reason I mention it here is that I wonder if less whacking of the
>FEC_X_DES_ACTIVE register may help your problem.
>
>In 3.14, in the fec_enet_start_xmit function, find the
>"writel(0, fep->hwp + FEC_X_DES_ACTIVE);" and change it to:
>
>	wmb();
>
>	/* Trigger transmission start */
>	if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
>		writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>
>and see whether that helps your problem(s).
>
So far, I don't see wmb() having any effect on the FEC. Some years ago, we wanted to add wmb() before setting the BD "Ready" bit, but we didn't find any issue without wmb(); on the contrary, it introduced more system overhead, especially for Gbps networking. And now, on imx6sx silicon (single core), adding wmb() causes tx performance to drop a lot, since cpu loading is the bottleneck.

In our BSP release, we don't find the Tx watchdog timeout issue, and we don't find the ping order reversal issue after an overnight stress test.

Thanks,
Andy
On Wed, Apr 02, 2014 at 03:19:58AM +0000, fugang.duan@freescale.com wrote:
> >	wmb();
> >
> >	/* Trigger transmission start */
> >	if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
> >		writel(0, fep->hwp + FEC_X_DES_ACTIVE);
> >
> >and see whether that helps your problem(s).
> >
> So far, I don't see wmb() having any effect on the FEC. Some years ago,
> we wanted to add wmb() before setting the BD "Ready" bit, but we didn't
> find any issue without wmb(); on the contrary, it introduced more system
> overhead, especially for Gbps networking.

I wonder whether you understand what is going on here, and why it is required. I doubt it somehow from your comments. Maybe if you were to read about the operation of the store buffer in the PL310, it may open your eyes to why it would be necessary for reliable operation.

> And now, on imx6sx silicon (single core), adding wmb() causes tx
> performance to drop a lot, since cpu loading is the bottleneck.

I don't find any measurable performance drop from adding it on either iMX6Q or iMX6S with the L2 cache code fixed up.

The reality is, with the mainline ethernet driver with no changes, the best it can do is around 300Mbps transmit and 400Mbps receive. With both my L2 and FEC changes, this has increased to 500Mbps transmit and 570Mbps receive. Plus it now works reasonably (though with lots of collisions) on half-duplex, giving a 4 to 8 fold increase in speed there.

I really don't care how your private BSP kernels perform. It's not mainline, so it's not of any interest.
From: Russell King - ARM Linux <linux@arm.linux.org.uk> Data: Wednesday, April 02, 2014 4:59 PM >To: Duan Fugang-B38611 >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky; >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds] > >On Wed, Apr 02, 2014 at 03:19:58AM +0000, fugang.duan@freescale.com wrote: >> > wmb(); >> > >> > /* Trigger transmission start */ >> > if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0) >> > writel(0, fep->hwp + FEC_X_DES_ACTIVE); >> > >> >and see whether that helps your problem(s). >> > >> So far, I don't see wmb() bring any effort on FEC. Some years ago, we >> want to add wmb() before set BD "Ready" bit, but we don't find any >> issue without wmb(), on the contrary, it introduce more System >> overhead special for Gbps networking. > >I wonder whether you understand what is going on here, and why it is required. >I doubt it somehow from your comments. Maybe if you were to read about the >operation of the store buffer in the PL310, it may open your eyes to why it >would be necessary for reliable operation. > In kernel 3.0.35 internal BSP, BD memory is non-cacheable, non-bufferable (we add new api to support it: dma_alloc_noncacheable()), So wmb() is not necessary. We don't find net watchdog timeout/ ping order issue. >> And now, imx6sx sillicon (sigle core), add wmb() cause tx performance >> drop much since cpu loading is the bottleneck. > >I don't find any measurable performance drop from adding it on either iMX6Q or >iMX6S with the L2 cache code fixed up. > Yes, it don't impact imx6q since cpu loading is not bottleneck due rx/tx bandwidth is slow and multi-cores. But for imx6sx, enet rx can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core. Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb() is very expensive which cause tx performance drop much. 
>The reality is, with the mainline ethernet driver with no changes, the best it
>can do is around 300Mbps transmit and 400Mbps receive. With both my L2 and FEC
>changes, this has increased to 500Mbps transmit and 570Mbps receive. Plus it
>now works reasonably (though with lots of collisions) on half-duplex giving a
>4 to 8 fold increase in speed there.
>
>I really don't care how your private BSP kernels perform. It's not mainline,
>so it's not of any interest.
>
Yes, I agree. There are some arch/driver patches we need to upstream to align the performance with the internal BSP.

>--
>FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
>improving, and getting towards what was expected from it.
>
Thanks,
Andy
On Wed, Apr 02, 2014 at 09:40:53AM +0000, fugang.duan@freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> Date: Wednesday, April 02, 2014 4:59 PM
> >I wonder whether you understand what is going on here, and why it is required.
> >I doubt it somehow from your comments. Maybe if you were to read about the
> >operation of the store buffer in the PL310, it may open your eyes to why it
> >would be necessary for reliable operation.
>
> In kernel 3.0.35 internal BSP, BD memory is non-cacheable, non-bufferable
> (we add new api to support it: dma_alloc_noncacheable()),

As is the memory you get from dma_alloc_coherent(). So, why did you invent a new API which does something which the mainline kernel APIs already do?

Maybe yours is doing something different but you haven't explained it in correct terminology.

> So wmb() is not necessary.

Even on non-cacheable normal memory, the wmb() is required. Please read up in the ARM architecture reference manual about memory types and their various attributes, followed by the memory ordering chapters.

> Yes, it don't impact imx6q since cpu loading is not bottleneck due
> rx/tx bandwidth is slow and multi-cores. But for imx6sx, enet rx can
> reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.

What netdev features do you support to achieve that?

> Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb()
> is very expensive which cause tx performance drop much.

wmb() is very expensive because of the L2 cache code using a sledge hammer with it - particularly the spinlock, which has a large overhead if lockdep or spinlock debugging is enabled.

> Yes, I agree. There have some arch/driver patches need to upstream to
> align the performance with internal bsp.

Well, post these patches so that people can test them. If your patches are indeed the world's best thing since sliced bread, I've just wasted over two months of solid work on the iMX6 ethernet driver.
However, hacks like dma_alloc_noncacheable will not be acceptable for mainline.
From: Russell King - ARM Linux <linux@arm.linux.org.uk> Data: Wednesday, April 02, 2014 6:47 PM >To: Duan Fugang-B38611 >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky; >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds] > >On Wed, Apr 02, 2014 at 09:40:53AM +0000, fugang.duan@freescale.com wrote: >> From: Russell King - ARM Linux <linux@arm.linux.org.uk> >> Data: Wednesday, April 02, 2014 4:59 PM >> >I wonder whether you understand what is going on here, and why it is >required. >> >I doubt it somehow from your comments. Maybe if you were to read >> >about the operation of the store buffer in the PL310, it may open >> >your eyes to why it would be necessary for reliable operation. >> >> In kernel 3.0.35 internal BSP, BD memory is non-cacheable, >> non-bufferable (we add new api to support it: >> dma_alloc_noncacheable()), > >As is the memory you get from dma_alloc_coherent(). So, why did you invent a >new API which does something which the mainline kernel APIs already do? > >Maybe yours is doing something different but you haven't explained it in >correct terminology. In kernel 3.0.35, there have no CMA memory allocate mechanism. Below Kernel configs are enabled: CONFIG_ARM_DMA_MEM_BUFFERABLE CONFIG_SMP If use dma_alloc_coherent() allocate memory, it must be non-cacheable, but bufferable. The new invented api "dma_alloc_noncacheable()" allocate memory is non-cacheable, non-bufferable, the memory type is Strongly ordered. > >> So wmb() is not necessary. > >Even on non-cacheable normal memory, the wmb() is required. Please read up in >the ARM architecture reference manual about memory types and their various >attributes, followed by the memory ordering chapters. > >> Yes, it don't impact imx6q since cpu loading is not bottleneck due >> rx/tx bandwidth is slow and multi-cores. 
>> But for imx6sx, enet rx can
>> reach 940Mbps, tx can reach 900Mbps; imx6sx is single core.
>
>What netdev features do you support to achieve that?
>
The imx6sx ENET acceleration features include crc checksum and interrupt coalescing, so we enable those two features.

>> Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb()
>> is very expensive which cause tx performance drop much.
>
>wmb() is very expensive because of the L2 cache code using a sledge hammer with
>it - particularly the spinlock, which has a large overhead if lockdep or
>spinlock debugging is enabled.
>
Yes, if we add wmb() to xmit(), imx6sx ENET performance will drop by more than 100Mbps.

[...]

Thanks,
Andy
On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan@freescale.com wrote: > In kernel 3.0.35, there have no CMA memory allocate mechanism. > Below Kernel configs are enabled: > CONFIG_ARM_DMA_MEM_BUFFERABLE > CONFIG_SMP > > If use dma_alloc_coherent() allocate memory, it must be non-cacheable, > but bufferable. The new invented api "dma_alloc_noncacheable()" > allocate memory is non-cacheable, non-bufferable, the memory type is > Strongly ordered. Right, so what you've just said is that it's fine to violate the requirements of the architecture L1 memory model by setting up a strongly ordered memory mapping for the same physical addresses as an existing mapping which is mapped as normal memory. Sorry, I'm not going to listen to you anymore, you just lost any kind of authority on this matter. > >> So wmb() is not necessary. > > > >Even on non-cacheable normal memory, the wmb() is required. Please read up in > >the ARM architecture reference manual about memory types and their various > >attributes, followed by the memory ordering chapters. > > > >> Yes, it don't impact imx6q since cpu loading is not bottleneck due > >> rx/tx bandwidth is slow and multi-cores. But for imx6sx, enet rx can > >> reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core. > > > >What netdev features do you support to achieve that? > > > Imx6sx enet accleration feature support crc checksum, interrupt coalescing. > So we enable the two features. Checksum and... presumably you're referring to NAPI don't get you to that kind of speed. Even on x86, you can't get close to wire speed without GSO, which you need scatter-gather for, and you don't support that. So I don't believe your 900Mbps figure. Plus, as you're memcpy'ing every packet received, I don't believe you can reach 940Mbps receive either. > >> Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb() > >> is very expensive which cause tx performance drop much. 
> >wmb() is very expensive because of the L2 cache code using a sledge hammer with
> >it - particularly the spinlock, which has a large overhead if lockdep or
> >spinlock debugging is enabled.
>
> Yes, if add wmb() to xmit(), imx6sx enet performance will drop more
> than 100Mbps.

In any case, I suspect that isn't directly attributable to wmb() itself. What I've noticed is that even changing an unsigned short to an unsigned int *can* result in a substantial performance drop. Although the unsigned int results in fewer instructions, it's lower performance because they're placed differently, and the efficiency of the instruction cache changes, resulting in different throughput.

What this means is that even changing compiler versions can get you significantly different performance figures. So I don't attribute very much credence to "wmb() causes performance to drop 100Mbps". It may very well go back up with some other changes which result in a slightly different placement of the instructions.
From: Russell King - ARM Linux <linux@arm.linux.org.uk> Data: Thursday, April 03, 2014 12:51 AM >To: Duan Fugang-B38611 >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky; >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds] > >On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan@freescale.com wrote: >> In kernel 3.0.35, there have no CMA memory allocate mechanism. >> Below Kernel configs are enabled: >> CONFIG_ARM_DMA_MEM_BUFFERABLE >> CONFIG_SMP >> >> If use dma_alloc_coherent() allocate memory, it must be non-cacheable, >> but bufferable. The new invented api "dma_alloc_noncacheable()" >> allocate memory is non-cacheable, non-bufferable, the memory type is >> Strongly ordered. > >Right, so what you've just said is that it's fine to violate the requirements >of the architecture L1 memory model by setting up a strongly ordered memory >mapping for the same physical addresses as an existing mapping which is mapped >as normal memory. > >Sorry, I'm not going to listen to you anymore, you just lost any kind of >authority on this matter. > >> >> So wmb() is not necessary. >> > >> >Even on non-cacheable normal memory, the wmb() is required. Please >> >read up in the ARM architecture reference manual about memory types >> >and their various attributes, followed by the memory ordering chapters. >> > >> >> Yes, it don't impact imx6q since cpu loading is not bottleneck due >> >> rx/tx bandwidth is slow and multi-cores. But for imx6sx, enet rx >> >> can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core. >> > >> >What netdev features do you support to achieve that? >> > >> Imx6sx enet accleration feature support crc checksum, interrupt coalescing. >> So we enable the two features. > >Checksum and... presumably you're referring to NAPI don't get you to that kind >of speed. 
Even on x86, you can't get close to wire speed without GSO, which
>you need scatter-gather for, and you don't support that. So I don't believe
>your 900Mbps figure.
>
>Plus, as you're memcpy'ing every packet received, I don't believe you can reach
>940Mbps receive either.
>
Since the imx6sx ENET still doesn't support TSO and jumbo packets, scatter-gather cannot improve ethernet performance in most cases, especially for the iperf test.

Imx6sx: single core, cpu frequency is 996MHz, cpu governor is performance.
Kernel config: SMP disabled.

For the rx path:
- hw acceleration: crc checksum, interrupt coalescing.
- software part: napi, new skb allocation instead of memory copy.
- Test result: 940Mbps, 8% cpu idle

For the tx path:
- hw acceleration: crc checksum, interrupt coalescing.
- software part: napi, no memory copy in the driver since the tx DMA supports byte-aligned data buffers.
- Test result: 900Mbps, cpu loading near 100%

>> >> Enet IP don't support TSO feaure, cpu loading is the bottleneck.
>> >> Wmb() is very expensive which cause tx performance drop much.
>> >
>> >wmb() is very expensive because of the L2 cache code using a sledge
>> >hammer with it - particularly the spinlock, which has a large
>> >overhead if lockdep or spinlock debugging is enabled.
>>
>> Yes, if add wmb() to xmit(), imx6sx enet performance will drop more
>> than 100Mbps.
>
>In any case, I suspect that isn't directly attributable to wmb() itself.
>What I've noticed is that even changing an unsigned short to an unsigned int
>*can* result in a substantial performance drop. Although the unsigned int
>results in fewer instructions, it's lower performance because they're placed
>differently, and the efficiency of the instruction cache changes, resulting in
>different throughput.
>
It is interesting; I will try it.

>What this means is that even changing compiler versions can get you
>significantly different performance figures. So I don't attribute very much
>credence to "wmb() causes performance to drop 100Mbps".
>It may very well go
>back up with some other changes which result in a slightly different placement
>of the instructions.
>
I tested the performance with three compilers: gcc-4.4.4-glibc-2.11.1-multilib-1.0, 4.7, and 4.8.1. The test results are similar.

Thanks,
Andy
On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan@freescale.com wrote: > From: Russell King - ARM Linux <linux@arm.linux.org.uk> > Data: Thursday, April 03, 2014 12:51 AM > > >To: Duan Fugang-B38611 > >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky; > >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org > >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds] > > > >On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan@freescale.com wrote: > >> In kernel 3.0.35, there have no CMA memory allocate mechanism. > >> Below Kernel configs are enabled: > >> CONFIG_ARM_DMA_MEM_BUFFERABLE > >> CONFIG_SMP > >> > >> If use dma_alloc_coherent() allocate memory, it must be non-cacheable, > >> but bufferable. The new invented api "dma_alloc_noncacheable()" > >> allocate memory is non-cacheable, non-bufferable, the memory type is > >> Strongly ordered. > > > >Right, so what you've just said is that it's fine to violate the requirements > >of the architecture L1 memory model by setting up a strongly ordered memory > >mapping for the same physical addresses as an existing mapping which is mapped > >as normal memory. > > > >Sorry, I'm not going to listen to you anymore, you just lost any kind of > >authority on this matter. > > > >> >> So wmb() is not necessary. > >> > > >> >Even on non-cacheable normal memory, the wmb() is required. Please > >> >read up in the ARM architecture reference manual about memory types > >> >and their various attributes, followed by the memory ordering chapters. > >> > > >> >> Yes, it don't impact imx6q since cpu loading is not bottleneck due > >> >> rx/tx bandwidth is slow and multi-cores. But for imx6sx, enet rx > >> >> can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core. > >> > > >> >What netdev features do you support to achieve that? > >> > > >> Imx6sx enet accleration feature support crc checksum, interrupt coalescing. > >> So we enable the two features. 
> > > >Checksum and... presumably you're referring to NAPI don't get you to that kind > >of speed. Even on x86, you can't get close to wire speed without GSO, which > >you need scatter-gather for, and you don't support that. So I don't believe > >your 900Mbps figure. > > > >Plus, as you're memcpy'ing every packet received, I don't believe you can reach > >940Mbps receive either. > > > Since Imx6sx enet still don't support TSO and Jumbo packet, scatter-gather > cannot improve ethernet performance in > Most cases special for iperf test. Again, you are losing credibility every time you deny stuff like this. I'm now at the point of just not listening to you anymore because you're contradicting what I know to be solid fact through my own measurements. This seems to be Freescale's overall attitude - as I've read on Freescale's forums. Your customers/users are always wrong, you're always right. Eg, any performance issues are not the fault of Freescale stuff, it's tarnished connectors or similar.
From: Russell King - ARM Linux <linux@arm.linux.org.uk> Data: Thursday, April 03, 2014 4:57 PM >To: Duan Fugang-B38611 >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky; >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds] > >On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan@freescale.com wrote: >> From: Russell King - ARM Linux <linux@arm.linux.org.uk> >> Data: Thursday, April 03, 2014 12:51 AM >> >> >To: Duan Fugang-B38611 >> >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; >> >Troy Kisky; Grant Likely; Bernd Faust; Fabio Estevam; >> >linux-arm-kernel@lists.infradead.org >> >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds] >> > >> >On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan@freescale.com wrote: >> >> In kernel 3.0.35, there have no CMA memory allocate mechanism. >> >> Below Kernel configs are enabled: >> >> CONFIG_ARM_DMA_MEM_BUFFERABLE >> >> CONFIG_SMP >> >> >> >> If use dma_alloc_coherent() allocate memory, it must be >> >> non-cacheable, but bufferable. The new invented api >"dma_alloc_noncacheable()" >> >> allocate memory is non-cacheable, non-bufferable, the memory type >> >> is Strongly ordered. >> > >> >Right, so what you've just said is that it's fine to violate the >> >requirements of the architecture L1 memory model by setting up a >> >strongly ordered memory mapping for the same physical addresses as an >> >existing mapping which is mapped as normal memory. >> > >> >Sorry, I'm not going to listen to you anymore, you just lost any kind >> >of authority on this matter. >> > >> >> >> So wmb() is not necessary. >> >> > >> >> >Even on non-cacheable normal memory, the wmb() is required. >> >> >Please read up in the ARM architecture reference manual about >> >> >memory types and their various attributes, followed by the memory >ordering chapters. 
>> >> >> Yes, it don't impact imx6q since cpu loading is not bottleneck
>> >> >> due rx/tx bandwidth is slow and multi-cores. But for imx6sx,
>> >> >> enet rx can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is
>> >> >> single core.
>> >> >
>> >> >What netdev features do you support to achieve that?
>> >> >
>> >> Imx6sx enet acceleration features support crc checksum, interrupt
>> >> coalescing. So we enable the two features.
>> >
>> >Checksum and... presumably you're referring to NAPI don't get you to
>> >that kind of speed. Even on x86, you can't get close to wire speed
>> >without GSO, which you need scatter-gather for, and you don't support
>> >that. So I don't believe your 900Mbps figure.
>> >
>> >Plus, as you're memcpy'ing every packet received, I don't believe you
>> >can reach 940Mbps receive either.
>> >
>> Since Imx6sx enet still don't support TSO and Jumbo packet,
>> scatter-gather cannot improve ethernet performance in most cases,
>> especially for the iperf test.
>
>Again, you are losing credibility every time you deny stuff like this.
>I'm now at the point of just not listening to you anymore because you're
>contradicting what I know to be solid fact through my own measurements.
>
>This seems to be Freescale's overall attitude - as I've read on Freescale's
>forums. Your customers/users are always wrong, you're always right. Eg, any
>performance issues are not the fault of Freescale stuff, it's tarnished
>connectors or similar.
>
Hi, Russell,

I don't contradict your thinking/solution or your measurements. You are an expert on ARM and its modules; we keep a learning attitude when discussing with you. For imx6sx, we did indeed get that result. For the imx6q/dl Linux upstream, you did a great job on performance tuning, and the test result is similar to our internal test result. Your suggestions for the optimization are meaningful. Please understand my thinking.

Thanks,
Andy
On Thu, Apr 03, 2014 at 09:55:06AM +0000, fugang.duan@freescale.com wrote: > From: Russell King - ARM Linux <linux@arm.linux.org.uk> > Data: Thursday, April 03, 2014 4:57 PM > > >To: Duan Fugang-B38611 > >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky; > >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org > >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds] > > > >On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan@freescale.com wrote: > >> From: Russell King - ARM Linux <linux@arm.linux.org.uk> > >> Data: Thursday, April 03, 2014 12:51 AM > >> > >> >Checksum and... presumably you're referring to NAPI don't get you to > >> >that kind of speed. Even on x86, you can't get close to wire speed > >> >without GSO, which you need scatter-gather for, and you don't support > >> >that. So I don't believe your 900Mbps figure. > >> > > >> >Plus, as you're memcpy'ing every packet received, I don't believe you > >> >can reach 940Mbps receive either. > >> > > >> Since Imx6sx enet still don't support TSO and Jumbo packet, > >> scatter-gather cannot improve ethernet performance in Most cases > >> special for iperf test. > > > >Again, you are losing credibility every time you deny stuff like this. > >I'm now at the point of just not listening to you anymore because you're > >contradicting what I know to be solid fact through my own measurements. > > > >This seems to be Freescale's overall attitude - as I've read on Freescale's > >forums. Your customers/users are always wrong, you're always right. Eg, any > >performance issues are not the fault of Freescale stuff, it's tarnished > >connectors or similar. > > > Hi, Russell, > > I don't contradict your thinking/solution and measurements. You are > expert on arm/modules, we keep study attitude to dicuss with you. > For imx6sx, we indeed get the result. 
For imx6q/dl linux upstream, you > did great job on performance tuning, and the test result is similar > To our internal test result. Your suggestion for the optimiztion is > meaningful. Pls understand my thinking. The reason I said what I said is because I'm not talking about TSO. I'm talking about GSO. They're similar features, but are done in totally different ways. TSO requires either hardware support, or driver cooperation to segment the data. GSO does not. There are several points here which make me discount your figures: 1. The overhead of routing packets is not insignificant. GSO is a feature where the higher levels (such as TCP) can submit large socket buffers to the lower levels, all the way through to packet queues to just before the device. At the very last moment, just before the buffer is handed over to the driver's start_xmit function, the buffer is carved into appropriately sized chunks, and new skbuffs allocated. Each skbuff "head" contains the protocol headers, and the associated fragments contain pointers/lengths into the large buffer. (This is why SG is required for GSO.) What this means is that all the overhead from layers such as TCP / IP / routing / packet queuing is reduced, because rather than those code paths having to be run for every single 1500 byte packet, they're run once for maybe 16 to 32K of data to be sent. This avoids the overhead from a lot of code. I have monitored the RGMII TX_CTL signal while performing iperf tests without GSO, and I can see from the gap between packets that it is very rare for it to get to the point of transmitting two packets back to back. 2. At 500Mbps transmit on the iMX6S, performance testing shows the majority of CPU time is spent in the cache cleaning/flushing code. This overhead is necessary and can't be removed without risking data corruption. 3.
I discount your receive figure because even with all my work, if I set my x86 box to perform a UDP iperf test with x86 sending to iMX6S, the result remains extremely poor, with 95% (modified) to 99% (unmodified) of UDP packets lost. This is not entirely the fault of the FEC. Around 70% of the packet loss is in the UDP receive path - iperf seems to be unable to read the packets from UDP fast enough. (This is confirmed by checking the statistics in /proc/net/snmp.) This means 30% of the packet loss is due to the FEC not keeping up with the on-wire packet rate (which is around 810Mbps.) Therefore, I place the maximum receive packet rate of the FEC with the current packet-copy strategy in the receive path and a 128-entry receive ring at around 550Mbps, and again, much of the overhead comes from the cache handling code according to perf.
On Thu, Apr 03, 2014 at 11:32:06AM +0100, Russell King - ARM Linux wrote: > > Hi, Russell, > > > > I don't contradict your thinking/solution and measurements. You are > > expert on arm/modules, we keep study attitude to dicuss with you. > > For imx6sx, we indeed get the result. For imx6q/dl linux upstream, you > > did great job on performance tuning, and the test result is similar > > To our internal test result. Your suggestion for the optimiztion is > > meaningful. Pls understand my thinking. > > The reason I said what I said is because I'm not talking about TSO. I'm > talking about GSO. They're similar features, but are done in totally > different ways. > > TSO requires either hardware support, or driver cooperation to segment > the data. GSO does not. > > There's several points here which make me discount your figures: Russell, In case there is any confusion: the 900Mbps figure that Fugang quoted is not from any of the i.MX6 SoCs that are publicly available - i.MX6SoloLite (imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q) - but from a new member of the i.MX6 family, i.MX6SoloX (imx6sx). This new SoC hasn't been announced by Freescale yet. One major improvement of this new SoC over its ancestors is the FEC throughput: it claims 1Gbps throughput support. So it's really a hardware optimization rather than anything that software can do. Shawn
On Thu, Apr 03, 2014 at 09:36:41PM +0800, Shawn Guo wrote: > On Thu, Apr 03, 2014 at 11:32:06AM +0100, Russell King - ARM Linux wrote: > > > Hi, Russell, > > > > > > I don't contradict your thinking/solution and measurements. You are > > > expert on arm/modules, we keep study attitude to dicuss with you. > > > For imx6sx, we indeed get the result. For imx6q/dl linux upstream, you > > > did great job on performance tuning, and the test result is similar > > > To our internal test result. Your suggestion for the optimiztion is > > > meaningful. Pls understand my thinking. > > > > The reason I said what I said is because I'm not talking about TSO. I'm > > talking about GSO. They're similar features, but are done in totally > > different ways. > > > > TSO requires either hardware support, or driver cooperation to segment > > the data. GSO does not. > > > > There's several points here which make me discount your figures: > > Russell, > > In case there is a confusion. The 900Mbps figure that Fugang said is > not on any of i.MX6 SoCs that are publicly available - i.MX6SoloLite > (imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q), but on > a new member of i.MX6 family - i.MX6SoloX (imx6sx). This new SoC hasn't > been announced by Freescale yet. One major improvement of this new SoC > over its ancestors is the FEC throughput. It claims 1Gbps throughput > support. So it's really a hardware optimization instead of anything > that software can do. That means it's irrelevant to this discussion because it's different hardware, with who knows what changes to the memory subsystem and CPU/cache implementation. Hence, it can't be compared in any way to the performance I see on iMX6Q and iMX6S. What matters here is fixing the issues with the mainline driver on hardware that people have today - not only the bugs and instability that people are seeing, but also the low performance.
If some of that involves doing stuff "The Right Way" by using memory barriers rather than violating the architecture specification and introducing new kernel APIs in order to do so, we're going to do it the right way, even if that means that those changes may not be in the best interest of this unreleased part. There is no room for compromise on that.
On Thu, Apr 03, 2014 at 02:45:10PM +0100, Russell King - ARM Linux wrote: > On Thu, Apr 03, 2014 at 09:36:41PM +0800, Shawn Guo wrote: > > Russell, > > > > In case there is a confusion. The 900Mbps figure that Fugang said is > > not on any of i.MX6 SoCs that are publicly available - i.MX6SoloLite > > (imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q), but on > > a new member of i.MX6 family - i.MX6SoloX (imx6sx). This new SoC hasn't > > been announced by Freescale yet. One major improvement of this new SoC > > over its ancestors is the FEC throughput. It claims 1Gbps throughput > > support. So it's really a hardware optimization instead of anything > > that software can do. > > That means it's irrelevant to this discussion because it's different > hardware, with who knows what changes to the memory subsystem and > CPU/cache implementation. Hence, it can't be compared in any way to > the performance I see on iMX6Q and iMX6S. Exactly. I don't think Fugang should have brought the i.MX6SX figure into the discussion either; from reading the thread today, I can see that's where the confusion came from. Shawn
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014 04:51:49 PM: > At initial glance, this is coherent with my idea of the FEC skipping a > ring entry on the initial pass around. Then when a new entry is loaded, > > Let's say that the problem entry is number 12 that has been skipped. > When we get back around to entry 11, the FEC will transmit entries 11 > and 12, as you rightly point out, and it will then look at entry 13 > for the next packet. > > However, the driver loads the next packet into entry 12, and hits the > FEC to transmit it. The FEC re-reads entry 13, finds no packet, so > does nothing. > > Then the next packet is submitted to the driver, and it enters it into > entry 13, again hitting the FEC. The FEC now sees the entry at 13, > meanwhile the entry at 12 is still pending. I've explored the option of providing a work around for this observed problem. In the 2.6.35 kernel, I've used the BD_ENET_TX_INTR flag which is marked as TO2 in the RM and which was being set but never used by the software to help detect the skip. In the interrupt handler I clear this bit which allows me to use it to know when the bd goes from ready -> clean. In the error situation above, you would end up with 12 marked with both READY and INTR and at some point you would have bd 13 marked as INTR only (as it would have been transmitted but not cleaned up.) This allows me to clean up bd 12 despite not actually being transmitted. This gets the driver back in sync with the FEC and things continue on normally... until it happens again. Of course, this results in the occasional dropped packet but I feel like for now (until Freescale figures out what's going on) this is better than nothing. At least the driver is able to recover somewhat from the situation. I'm not sure if the mainline driver could benefit from a strategy like this or not, especially since it manifested this problem differently (tx transmit timeout). 
Also, in my opinion this is an undesirable hack to make things work acceptably. There could also be some inherent problem with this strategy that I'm unaware of, since dealing with linux kernel ethernet drivers is not exactly my area of expertise. Any thoughts or new insights? Thanks, Robert Daniels This email, and any document attached hereto, may contain confidential and/or privileged information. If you are not the intended recipient (or have received this email in error) please notify the sender immediately and destroy this email. Any unauthorized, direct or indirect, copying, disclosure, distribution or other use of the material or parts thereof is strictly forbidden.
Hello all, I tried the FEC patches from Russell (http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing), but with the following test the FEC completely stops receiving frames: Linux PC <-----> Gigabit Ethernet switch <------> SabreSD board Both ethernet links run at 1 Gigabit, full duplex. The SabreSD board runs the fec-testing kernel from Russell. On the SabreSD board we run "iperf -s -u" On the Linux PC we - run the iperf client (command used: "while [ 1 ]; do iperf -c 'IP address of SabreSD' -u -b 100m -t 300 -l 256; sleep 1; done") - ping the SabreSD board every second After a while (sometimes 10 seconds, sometimes a couple of minutes), we see that the SabreSD board stops replying to the pings from the Linux PC. Closer inspection shows that the FEC doesn't generate receive frame interrupts anymore. Analysis with a debugger shows that the RXF interrupt is enabled and that the FEC is still receiving ethernet frames (because the IEEE_R_FRAME_OK event counter is increasing), but no receive frame interrupts are occurring. We only see MII interrupts occurring. The RDAR register has the value 0, meaning that the FEC cannot write the received frames to main memory because of a lack of available free receive descriptors. When the FEC had stopped generating receive interrupts, we used the ping command on the Sabre board to send some frames from the Sabre board to the Linux PC. The FEC then starts to generate receive interrupts again. The driver source code shows that the receive buffer is also emptied when a transmit interrupt occurs. So it seems that the FEC completely stops generating receive interrupts when frames are received and there are no more empty descriptors. Any ideas or insights?
Regards, Jaccon 2014-04-04 22:21 GMT+02:00 <robert.daniels@vantagecontrols.com>: > Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014 > 04:51:49 PM: > >> At initial glance, this is coherent with my idea of the FEC skipping a >> ring entry on the initial pass around. Then when a new entry is loaded, >> >> Let's say that the problem entry is number 12 that has been skipped. >> When we get back around to entry 11, the FEC will transmit entries 11 >> and 12, as you rightly point out, and it will then look at entry 13 >> for the next packet. >> >> However, the driver loads the next packet into entry 12, and hits the >> FEC to transmit it. The FEC re-reads entry 13, finds no packet, so >> does nothing. >> >> Then the next packet is submitted to the driver, and it enters it into >> entry 13, again hitting the FEC. The FEC now sees the entry at 13, >> meanwhile the entry at 12 is still pending. > > I've explored the option of providing a work around for this observed > problem. > In the 2.6.35 kernel, I've used the BD_ENET_TX_INTR flag which is marked as > TO2 in the RM and which was being set but never used by the software to > help > detect the skip. In the interrupt handler I clear this bit which allows me > to use it to know when the bd goes from ready -> clean. In the error > situation > above, you would end up with 12 marked with both READY and INTR and at some > point > you would have bd 13 marked as INTR only (as it would have been transmitted > but > not cleaned up.) This allows me to clean up bd 12 despite not actually > being > transmitted. This gets the driver back in sync with the FEC and things > continue > on normally... until it happens again. > > Of course, this results in the occasional dropped packet but I feel like > for now > (until Freescale figures out what's going on) this is better than nothing. > At > least the driver is able to recover somewhat from the situation. 
> > I'm not sure if the mainline driver could benefit from a strategy like this > or not, > especially since it manifested this problem differently (tx transmit > timeout). > Also, in my opinion this is an undesirable hack to make things work > acceptably. > There could also be some inherent problem with this strategy that I'm > unaware of, > since dealing with linux kernel ethernet drivers is not exactly my area of > expertise. > > Any thoughts or new insights? > > Thanks, > > Robert Daniels
On Tue, Apr 29, 2014 at 11:05:04AM +0200, Jaccon Bastiaansen wrote: > Hello all, > > I tried the FEC patches from Russel > (http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing), > but with the following test the FEC complety stops receiving frames: > > > Linux PC <-----> Gigabit Ethernet switch <------> SabreSD board > > Both ethernet links run at 1 Gigabit, full duplex. > > The SabreSD board runs the fec-testing kernel from Russel. > > On the SabreSD board we run "iperf -s -u" > > On the Linux PC we > - run the iperf client (command used: “while [ 1 ]; do iperf –c ‘IP > address of SabreSD’ -u -b 100m -t 300 -l 256;sleep 1;done”) > - ping the SabreSD board every second > > After a while (sometimes 10 seconds, sometimes a couple of minutes), > we see that the SabreSD board stops replying to the pings from the > Linux PC. Closer inspection shows that the FEC doesn't generate > receive frame interrupts anymore. The attached debugger screenshots > show the FEC registers when it is in this state. The RXF interrupt is > enabled and we know that the FEC is still receiving ethernet frames > (because the IEEE_R_FRAME_OK event counter is increasing), but no > receive frame interrupts are occurring. We only see MII interrupts > occurring. If the RXF interrupt is enabled, it means that we must have received less than NAPI_POLL_WEIGHT (64) frames from the ring during the previous NAPI poll - this is the only circumstance when we will re-enable interrupts. There are only two conditions where we stop receiving frames from the receive ring: 1. if we reach the NAPI poll weight number of frames received. 2. if we encounter a packet descriptor marked empty. It can't be (1) because we would have left the TXF/RXF interrupts disabled and waited for the next NAPI poll. So, it can only be (2). (2) implies that there is a ring descriptor which is marked as being owned by the FEC, which means that there are descriptors free. 
Whether it's the one which the FEC is expecting to be free or not is something that's impossible to tell (the FEC hardware doesn't tell us where it is in its ring, which is a big minus point against debugging.) It would be useful to see the state of the receive ring, as well as the rx_next index. That may be difficult to get though.
Hello all, 2014-05-02 13:41 GMT+02:00 Russell King - ARM Linux <linux@arm.linux.org.uk>: > On Tue, Apr 29, 2014 at 11:05:04AM +0200, Jaccon Bastiaansen wrote: >> Hello all, >> >> I tried the FEC patches from Russel >> (http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing), >> but with the following test the FEC complety stops receiving frames: >> >> >> Linux PC <-----> Gigabit Ethernet switch <------> SabreSD board >> >> Both ethernet links run at 1 Gigabit, full duplex. >> >> The SabreSD board runs the fec-testing kernel from Russel. >> >> On the SabreSD board we run "iperf -s -u" >> >> On the Linux PC we >> - run the iperf client (command used: “while [ 1 ]; do iperf –c ‘IP >> address of SabreSD’ -u -b 100m -t 300 -l 256;sleep 1;done”) >> - ping the SabreSD board every second >> >> After a while (sometimes 10 seconds, sometimes a couple of minutes), >> we see that the SabreSD board stops replying to the pings from the >> Linux PC. Closer inspection shows that the FEC doesn't generate >> receive frame interrupts anymore. The attached debugger screenshots >> show the FEC registers when it is in this state. The RXF interrupt is >> enabled and we know that the FEC is still receiving ethernet frames >> (because the IEEE_R_FRAME_OK event counter is increasing), but no >> receive frame interrupts are occurring. We only see MII interrupts >> occurring. > > If the RXF interrupt is enabled, it means that we must have received > less than NAPI_POLL_WEIGHT (64) frames from the ring during the previous > NAPI poll - this is the only circumstance when we will re-enable > interrupts. > > There are only two conditions where we stop receiving frames from the > receive ring: > 1. if we reach the NAPI poll weight number of frames received. > 2. if we encounter a packet descriptor marked empty. > > It can't be (1) because we would have left the TXF/RXF interrupts > disabled and waited for the next NAPI poll. So, it can only be (2). 
> > (2) implies that there is a ring descriptor which is marked as being > owned by the FEC, which means that there are descriptors free. Whether > it's the one which the FEC is expecting to be free or not is something > that's impossible to tell (the FEC hardware doesn't tell us where it > is in its ring, which is a big minus point against debugging.) > > It would be useful to see the state of the receive ring, as well as > the rx_next index. That may be difficult to get though. > I added some code to show the RX ring when the reception of frames has stopped. See the patch below. Reading the fec_rx_ring file gives the following output (I stripped the bit decoding here to reduce the number of lines): 0: c0a0d000 0x0800 (== rx_bd_base) 1: c0a0d020 0x0800 2: c0a0d040 0x0800 3: c0a0d060 0x0800 4: c0a0d080 0x0800 5: c0a0d0a0 0x0800 6: c0a0d0c0 0x0800 (== rx_next) 7: c0a0d0e0 0x0800 8: c0a0d100 0x0800 9: c0a0d120 0x0800 10: c0a0d140 0x0800 11: c0a0d160 0x0800 12: c0a0d180 0x0800 13: c0a0d1a0 0x0800 14: c0a0d1c0 0x0800 15: c0a0d1e0 0x0800 16: c0a0d200 0x0800 17: c0a0d220 0x0800 18: c0a0d240 0x0800 19: c0a0d260 0x0800 20: c0a0d280 0x0800 21: c0a0d2a0 0x0800 22: c0a0d2c0 0x0800 23: c0a0d2e0 0x0800 24: c0a0d300 0x0800 25: c0a0d320 0x0800 26: c0a0d340 0x0800 27: c0a0d360 0x0800 28: c0a0d380 0x0800 29: c0a0d3a0 0x0800 30: c0a0d3c0 0x0800 31: c0a0d3e0 0x0800 32: c0a0d400 0x0800 33: c0a0d420 0x0800 34: c0a0d440 0x0800 35: c0a0d460 0x0800 36: c0a0d480 0x0800 37: c0a0d4a0 0x0800 38: c0a0d4c0 0x0800 39: c0a0d4e0 0x0800 40: c0a0d500 0x0800 41: c0a0d520 0x0800 42: c0a0d540 0x0800 43: c0a0d560 0x0800 44: c0a0d580 0x0800 45: c0a0d5a0 0x0800 46: c0a0d5c0 0x0800 47: c0a0d5e0 0x0800 48: c0a0d600 0x0800 49: c0a0d620 0x0800 50: c0a0d640 0x0800 51: c0a0d660 0x0800 52: c0a0d680 0x0800 53: c0a0d6a0 0x0800 54: c0a0d6c0 0x0800 55: c0a0d6e0 0x0800 56: c0a0d700 0x0800 57: c0a0d720 0x0800 58: c0a0d740 0x0800 59: c0a0d760 0x0800 60: c0a0d780 0x0800 61: c0a0d7a0 0x0800 62: c0a0d7c0 0x0800 63: 
c0a0d7e0 0x0800 64: c0a0d800 0x0800 65: c0a0d820 0x0800 66: c0a0d840 0x0800 67: c0a0d860 0x0800 68: c0a0d880 0x0800 69: c0a0d8a0 0x0800 70: c0a0d8c0 0x0800 71: c0a0d8e0 0x0800 72: c0a0d900 0x0800 73: c0a0d920 0x0800 74: c0a0d940 0x0800 75: c0a0d960 0x0800 76: c0a0d980 0x0800 77: c0a0d9a0 0x0800 78: c0a0d9c0 0x0800 79: c0a0d9e0 0x0800 80: c0a0da00 0x0800 81: c0a0da20 0x0800 82: c0a0da40 0x0800 83: c0a0da60 0x0800 84: c0a0da80 0x0800 85: c0a0daa0 0x0800 86: c0a0dac0 0x0800 87: c0a0dae0 0x0800 88: c0a0db00 0x0800 89: c0a0db20 0x0800 90: c0a0db40 0x0800 91: c0a0db60 0x0800 92: c0a0db80 0x0800 93: c0a0dba0 0x0800 94: c0a0dbc0 0x0800 95: c0a0dbe0 0x0800 96: c0a0dc00 0x0800 97: c0a0dc20 0x0800 98: c0a0dc40 0x0800 99: c0a0dc60 0x0800 100: c0a0dc80 0x0800 101: c0a0dca0 0x0800 102: c0a0dcc0 0x0800 103: c0a0dce0 0x0800 104: c0a0dd00 0x0800 105: c0a0dd20 0x0800 106: c0a0dd40 0x0800 107: c0a0dd60 0x0800 108: c0a0dd80 0x0800 109: c0a0dda0 0x0800 110: c0a0ddc0 0x0800 111: c0a0dde0 0x0800 112: c0a0de00 0x0800 113: c0a0de20 0x0800 114: c0a0de40 0x0800 115: c0a0de60 0x0800 116: c0a0de80 0x0800 117: c0a0dea0 0x0800 118: c0a0dec0 0x0800 119: c0a0dee0 0x0800 120: c0a0df00 0x0800 121: c0a0df20 0x0800 122: c0a0df40 0x0800 123: c0a0df60 0x0800 124: c0a0df80 0x0800 125: c0a0dfa0 0x0800 126: c0a0dfc0 0x0800 127: c0a0dfe0 0x2800 This shows that the complete receive ring is filled with received frames, which matches with the value 0 in the RDAR register. The patched FEC driver also logs "FEC receive ring full" when the receive ring is full after the napi_complete() call and before the enabling of the RXF interrupt. The kernel logging shows this message a couple of times. So the RX ring has been filled completely between the fec_enet_rx() call and the napi_complete() call. In this testcase (100Megabit, 256 byte frames) it takes less than 3 milliseconds to completely fill the RX ring. 
So a preemption between the fec_enet_rx() call and the napi_complete() call, or a long execution time of the fec_enet_tx() function (which is called between fec_enet_rx() and napi_complete()), results in the RXF interrupt being enabled while the RX ring is completely full. It seems that in that case the RXF interrupt will not be generated anymore.

@@ -23,17 +23,20 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/ptrace.h>
+#include <linux/debugfs.h>
 #include <linux/errno.h>
 #include <linux/ioport.h>
 #include <linux/slab.h>
 #include <linux/interrupt.h>
 #include <linux/delay.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
 #include <linux/skbuff.h>
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
 #include <linux/bitops.h>
 #include <linux/io.h>
@@ -1164,11 +1167,17 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 	fec_enet_tx(ndev);

 	if (pkts < budget) {
 		napi_complete(napi);
-		writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+		if (readl(fep->hwp + FEC_R_DES_ACTIVE) == 0) {
+			writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+			printk(KERN_ERR "%llu FEC: receive ring full!!!\n",
+			       sched_clock());
+		} else {
+			writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+		}
 	}
 	return pkts;
 }

 /* ------------------------------------------------------------------------- */
@@ -2404,10 +2413,79 @@ static void fec_reset_phy(struct platform_device *pdev)
 	 * by machine code.
 	 */
 }
 #endif /* CONFIG_OF */

+#define BDRX_EMPTY	(1 << 15)
+#define BDRX_RO1	(1 << 14)
+#define BDRX_WRAP	(1 << 13)
+#define BDRX_RO2	(1 << 12)
+#define BDRX_LAST	(1 << 11)
+#define BDRX_UNDEF10	(1 << 10)
+#define BDRX_UNDEF9	(1 << 9)
+#define BDRX_MISS	(1 << 8)
+#define BDRX_BCAST	(1 << 7)
+#define BDRX_MCAST	(1 << 6)
+#define BDRX_LENERR	(1 << 5)
+#define BDRX_ALGNERR	(1 << 4)
+#define BDRX_UNDEF3	(1 << 3)
+#define BDRX_CRCERR	(1 << 2)
+#define BDRX_OVERRUN	(1 << 1)
+#define BDRX_TRUNC	(1 << 0)
+
+static int fec_rx_ring_show(struct seq_file *s, void *p)
+{
+	struct fec_enet_private *fep = s->private;
+	int nbds;
+	union bufdesc_u *bd;
+
+	bd = fep->rx_bd_base;
+	for (nbds = 0; nbds < fep->rx_ring_size; nbds++) {
+		seq_printf(s, "%d: %p 0x%.4x", nbds, bd, bd->ebd.desc.cbd_sc);
+		if (bd == fep->rx_bd_base)
+			seq_printf(s, " (== rx_bd_base)");
+		if (nbds == fep->rx_next)
+			seq_printf(s, " (== rx_next)");
+		seq_printf(s, "\n");
+		seq_printf(s,
+			   " [15]%s [14]%s [13]%s [12]%s [11]%s [10]%s [ 9]%s [ 8]%s"
+			   " [ 7]%s [ 6]%s [ 5]%s [ 4]%s [ 3]%s [ 2]%s [ 1]%s [ 0]%s\n",
+			   bd->ebd.desc.cbd_sc & BDRX_EMPTY ? "EMP" : "emp",
+			   bd->ebd.desc.cbd_sc & BDRX_RO1 ? "RO1" : "ro1",
+			   bd->ebd.desc.cbd_sc & BDRX_WRAP ? "WRP" : "wrp",
+			   bd->ebd.desc.cbd_sc & BDRX_RO2 ? "RO2" : "ro2",
+			   bd->ebd.desc.cbd_sc & BDRX_LAST ? "LST" : "lst",
+			   bd->ebd.desc.cbd_sc & BDRX_UNDEF10 ? "UNA" : "una",
+			   bd->ebd.desc.cbd_sc & BDRX_UNDEF9 ? "UN9" : "un9",
+			   bd->ebd.desc.cbd_sc & BDRX_MISS ? "MIS" : "mis",
+			   bd->ebd.desc.cbd_sc & BDRX_BCAST ? "BCS" : "bcs",
+			   bd->ebd.desc.cbd_sc & BDRX_MCAST ? "MCS" : "mcs",
+			   bd->ebd.desc.cbd_sc & BDRX_LENERR ? "LEN" : "len",
+			   bd->ebd.desc.cbd_sc & BDRX_ALGNERR ? "ALN" : "aln",
+			   bd->ebd.desc.cbd_sc & BDRX_UNDEF3 ? "UN3" : "un3",
+			   bd->ebd.desc.cbd_sc & BDRX_CRCERR ? "CRC" : "crc",
+			   bd->ebd.desc.cbd_sc & BDRX_OVERRUN ? "OVR" : "ovr",
+			   bd->ebd.desc.cbd_sc & BDRX_TRUNC ? "TRC" : "trc");
+		bd += 1;
+	}
+	return 0;
+}
+
+static int fec_rx_ring_open(struct inode *inode, struct file *file)
+{
+	struct fec_enet_private *fep = inode->i_private;
+	return single_open(file, fec_rx_ring_show, fep);
+}
+
+static const struct file_operations fec_rx_ring_fops = {
+	.open = fec_rx_ring_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
 static int
 fec_probe(struct platform_device *pdev)
 {
 	struct fec_enet_private *fep;
 	struct fec_platform_data *pdata;
@@ -2558,10 +2636,18 @@ fec_probe(struct platform_device *pdev)
 	if (fep->flags & FEC_FLAG_BUFDESC_EX && fep->ptp_clock)
 		netdev_info(ndev, "registered PHC device %d\n", fep->dev_id);

 	INIT_WORK(&fep->tx_timeout_work, fec_enet_timeout_work);
+
+	if (!debugfs_create_file("fec_rx_ring",
+				 S_IFREG | S_IRUGO,
+				 NULL,
+				 fep,
+				 &fec_rx_ring_fops))
+		goto failed_register;
+
 	return 0;

 failed_register:
 	fec_enet_mii_remove(fep);
 failed_mii_init:

Regards, Jaccon
diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h index 3b8d6d19ff05..510580eeae4b 100644 --- a/drivers/net/ethernet/freescale/fec.h +++ b/drivers/net/ethernet/freescale/fec.h @@ -170,6 +170,11 @@ struct bufdesc_ex { unsigned short res0[4]; }; +union bufdesc_u { + struct bufdesc bd; + struct bufdesc_ex ebd; +}; + /* * The following definitions courtesy of commproc.h, which where * Copyright (c) 1997 Dan Malek (dmalek@jlc.net). @@ -240,14 +245,14 @@ struct bufdesc_ex { * the skbuffer directly. */ -#define FEC_ENET_RX_PAGES 8 +#define FEC_ENET_RX_PAGES 32 #define FEC_ENET_RX_FRSIZE 2048 #define FEC_ENET_RX_FRPPG (PAGE_SIZE / FEC_ENET_RX_FRSIZE) #define RX_RING_SIZE (FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES) #define FEC_ENET_TX_FRSIZE 2048 #define FEC_ENET_TX_FRPPG (PAGE_SIZE / FEC_ENET_TX_FRSIZE) -#define TX_RING_SIZE 16 /* Must be power of two */ -#define TX_RING_MOD_MASK 15 /* for this to work */ +#define TX_RING_SIZE 64 /* Must be power of two */ +#define TX_RING_MOD_MASK 63 /* for this to work */ #define BD_ENET_RX_INT 0x00800000 #define BD_ENET_RX_PTP ((ushort)0x0400) @@ -289,12 +294,12 @@ struct fec_enet_private { /* CPM dual port RAM relative addresses */ dma_addr_t bd_dma; /* Address of Rx and Tx buffers */ - struct bufdesc *rx_bd_base; - struct bufdesc *tx_bd_base; + union bufdesc_u *rx_bd_base; + union bufdesc_u *tx_bd_base; /* The next free ring entry */ - struct bufdesc *cur_rx, *cur_tx; - /* The ring entries to be free()ed */ - struct bufdesc *dirty_tx; + unsigned short tx_next; + unsigned short tx_dirty; + unsigned short rx_next; unsigned short tx_ring_size; unsigned short rx_ring_size; @@ -335,6 +340,9 @@ struct fec_enet_private { struct timer_list time_keep; struct fec_enet_delayed_work delay_work; struct regulator *reg_phy; + unsigned long quirks; + + }; void fec_ptp_init(struct platform_device *pdev); diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index 
03a351300013..8105697d5a99 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -101,6 +101,8 @@ static void set_multicast_list(struct net_device *ndev);
  * ENET_TDAR[TDAR].
  */
 #define FEC_QUIRK_ERR006358	(1 << 7)
+/* Controller has ability to offset rx packets */
+#define FEC_QUIRK_RX_SHIFT16	(1 << 8)
 
 static struct platform_device_id fec_devtype[] = {
 	{
@@ -120,7 +122,8 @@ static struct platform_device_id fec_devtype[] = {
 		.name = "imx6q-fec",
 		.driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT |
 				FEC_QUIRK_HAS_BUFDESC_EX | FEC_QUIRK_HAS_CSUM |
-				FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358,
+				FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358 |
+				FEC_QUIRK_RX_SHIFT16,
 	}, {
 		.name = "mvf600-fec",
 		.driver_data = FEC_QUIRK_ENET_MAC,
@@ -200,6 +203,7 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 /* FEC receive acceleration */
 #define FEC_RACC_IPDIS		(1 << 1)
 #define FEC_RACC_PRODIS		(1 << 2)
+#define FEC_RACC_SHIFT16	BIT(7)
 #define FEC_RACC_OPTIONS	(FEC_RACC_IPDIS | FEC_RACC_PRODIS)
 
 /*
@@ -233,57 +237,54 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 
 static int mii_cnt;
 
-static inline
-struct bufdesc *fec_enet_get_nextdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
+static unsigned copybreak = 200;
+module_param(copybreak, uint, 0644);
+MODULE_PARM_DESC(copybreak,
+		 "Maximum size of packet that is copied to a new buffer on receive");
+
+static bool fec_enet_rx_zerocopy(struct fec_enet_private *fep, unsigned pktlen)
 {
-	struct bufdesc *new_bd = bdp + 1;
-	struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp + 1;
-	struct bufdesc_ex *ex_base;
-	struct bufdesc *base;
-	int ring_size;
-
-	if (bdp >= fep->tx_bd_base) {
-		base = fep->tx_bd_base;
-		ring_size = fep->tx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
-	} else {
-		base = fep->rx_bd_base;
-		ring_size = fep->rx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
-	}
+#ifndef CONFIG_M5272
+	if (fep->quirks & FEC_QUIRK_RX_SHIFT16 && pktlen >= copybreak)
+		return true;
+#endif
+	return false;
+}
+
+static union bufdesc_u *
+fec_enet_tx_get(unsigned index, struct fec_enet_private *fep)
+{
+	union bufdesc_u *base = fep->tx_bd_base;
+	union bufdesc_u *bdp;
+
+	index &= fep->tx_ring_size - 1;
 
 	if (fep->bufdesc_ex)
-		return (struct bufdesc *)((ex_new_bd >= (ex_base + ring_size)) ?
-			ex_base : ex_new_bd);
+		bdp = (union bufdesc_u *)(&base->ebd + index);
 	else
-		return (new_bd >= (base + ring_size)) ?
-			base : new_bd;
+		bdp = (union bufdesc_u *)(&base->bd + index);
+
+	return bdp;
 }
 
-static inline
-struct bufdesc *fec_enet_get_prevdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
+static union bufdesc_u *
+fec_enet_rx_get(unsigned index, struct fec_enet_private *fep)
 {
-	struct bufdesc *new_bd = bdp - 1;
-	struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp - 1;
-	struct bufdesc_ex *ex_base;
-	struct bufdesc *base;
-	int ring_size;
-
-	if (bdp >= fep->tx_bd_base) {
-		base = fep->tx_bd_base;
-		ring_size = fep->tx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
-	} else {
-		base = fep->rx_bd_base;
-		ring_size = fep->rx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
-	}
+	union bufdesc_u *base = fep->rx_bd_base;
+	union bufdesc_u *bdp;
+
+	index &= fep->rx_ring_size - 1;
 
 	if (fep->bufdesc_ex)
-		return (struct bufdesc *)((ex_new_bd < ex_base) ?
-			(ex_new_bd + ring_size) : ex_new_bd);
+		bdp = (union bufdesc_u *)(&base->ebd + index);
 	else
-		return (new_bd < base) ?
-			(new_bd + ring_size) : new_bd;
+		bdp = (union bufdesc_u *)(&base->bd + index);
+
+	return bdp;
 }
 
 static void *swap_buffer(void *bufaddr, int len)
@@ -297,6 +298,26 @@ static void *swap_buffer(void *bufaddr, int len)
 	return bufaddr;
 }
 
+static void fec_dump(struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	unsigned index = 0;
+
+	netdev_info(ndev, "TX ring dump\n");
+	pr_info("Nr	SC	addr	len	SKB\n");
+
+	for (index = 0; index < fep->tx_ring_size; index++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(index, fep);
+
+		pr_info("%2u %c%c 0x%04x 0x%08lx %4u %p\n",
+			index,
+			index == fep->tx_next ? 'S' : ' ',
+			index == fep->tx_dirty ? 'H' : ' ',
+			bdp->bd.cbd_sc, bdp->bd.cbd_bufaddr, bdp->bd.cbd_datlen,
+			fep->tx_skbuff[index]);
+	}
+}
+
 static int
 fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 {
@@ -312,21 +333,42 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 	return 0;
 }
 
+static void
+fec_enet_tx_unmap(struct bufdesc *bdp, struct fec_enet_private *fep)
+{
+	dma_addr_t addr = bdp->cbd_bufaddr;
+	unsigned length = bdp->cbd_datlen;
+
+	bdp->cbd_bufaddr = 0;
+
+	dma_unmap_single(&fep->pdev->dev, addr, length, DMA_TO_DEVICE);
+}
+
 static netdev_tx_t
 fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
-	struct bufdesc *bdp, *bdp_pre;
+	union bufdesc_u *bdp, *bdp_pre;
 	void *bufaddr;
 	unsigned short status;
-	unsigned int index;
+	unsigned index;
+	unsigned length;
+	dma_addr_t addr;
 
 	/* Fill in a Tx ring entry */
-	bdp = fep->cur_tx;
+	index = fep->tx_next;
 
-	status = bdp->cbd_sc;
+	bdp = fec_enet_tx_get(index, fep);
+	status = bdp->bd.cbd_sc;
 
 	if (status & BD_ENET_TX_READY) {
 		/* Ooops.  All transmit buffers are full.  Bail out.
@@ -347,21 +389,15 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 
 	/* Set buffer length and buffer pointer */
 	bufaddr = skb->data;
-	bdp->cbd_datlen = skb->len;
+	length = skb->len;
 
 	/*
 	 * On some FEC implementations data must be aligned on
 	 * 4-byte boundaries. Use bounce buffers to copy data
 	 * and get it aligned. Ugh.
 	 */
-	if (fep->bufdesc_ex)
-		index = (struct bufdesc_ex *)bdp -
-			(struct bufdesc_ex *)fep->tx_bd_base;
-	else
-		index = bdp - fep->tx_bd_base;
-
 	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
-		memcpy(fep->tx_bounce[index], skb->data, skb->len);
+		memcpy(fep->tx_bounce[index], skb->data, length);
 		bufaddr = fep->tx_bounce[index];
 	}
 
@@ -370,70 +406,72 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	 * the system that it's running on. As the result, driver has to
 	 * swap every frame going to and coming from the controller.
 	 */
-	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
-		swap_buffer(bufaddr, skb->len);
+	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(bufaddr, length);
 
-	/* Save skb pointer */
-	fep->tx_skbuff[index] = skb;
-
-	/* Push the data cache so the CPM does not get stale memory
-	 * data.
-	 */
-	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
-			skb->len, DMA_TO_DEVICE);
-	if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
-		bdp->cbd_bufaddr = 0;
-		fep->tx_skbuff[index] = NULL;
+	/* Push the data cache so the CPM does not get stale memory data. */
+	addr = dma_map_single(&fep->pdev->dev, bufaddr, length, DMA_TO_DEVICE);
+	if (dma_mapping_error(&fep->pdev->dev, addr)) {
 		dev_kfree_skb_any(skb);
 		if (net_ratelimit())
 			netdev_err(ndev, "Tx DMA memory map failed\n");
 		return NETDEV_TX_OK;
 	}
 
-	if (fep->bufdesc_ex) {
+	/* Save skb pointer */
+	fep->tx_skbuff[index] = skb;
 
-		struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-		ebdp->cbd_bdu = 0;
+	bdp->bd.cbd_datlen = length;
+	bdp->bd.cbd_bufaddr = addr;
+
+	if (fep->bufdesc_ex) {
+		bdp->ebd.cbd_bdu = 0;
 		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
 			fep->hwts_tx_en)) {
-			ebdp->cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
+			bdp->ebd.cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
 			skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 		} else {
-			ebdp->cbd_esc = BD_ENET_TX_INT;
+			bdp->ebd.cbd_esc = BD_ENET_TX_INT;
 
 			/* Enable protocol checksum flags
 			 * We do not bother with the IP Checksum bits as they
 			 * are done by the kernel
 			 */
			if (skb->ip_summed == CHECKSUM_PARTIAL)
-				ebdp->cbd_esc |= BD_ENET_TX_PINS;
+				bdp->ebd.cbd_esc |= BD_ENET_TX_PINS;
 		}
 	}
 
+	/*
+	 * We need the preceding stores to the descriptor to complete
+	 * before updating the status field, which hands it over to the
+	 * hardware.  The corresponding rmb() is "in the hardware".
+	 */
+	wmb();
+
 	/* Send it on its way.  Tell FEC it's ready, interrupt when done,
 	 * it's the last BD of the frame, and to put the CRC on the end.
 	 */
 	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
 			| BD_ENET_TX_LAST | BD_ENET_TX_TC);
-	bdp->cbd_sc = status;
+	bdp->bd.cbd_sc = status;
 
-	bdp_pre = fec_enet_get_prevdesc(bdp, fep);
-	if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
-	    !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
+	bdp_pre = fec_enet_tx_get(index - 1, fep);
+	if ((fep->quirks & FEC_QUIRK_ERR006358) &&
+	    !(bdp_pre->bd.cbd_sc & BD_ENET_TX_READY)) {
 		fep->delay_work.trig_tx = true;
 		schedule_delayed_work(&(fep->delay_work.delay_work),
 					msecs_to_jiffies(1));
 	}
 
-	/* If this was the last BD in the ring, start at the beginning again. */
-	bdp = fec_enet_get_nextdesc(bdp, fep);
-
 	skb_tx_timestamp(skb);
 
-	fep->cur_tx = bdp;
+	fep->tx_next = (index + 1) & (fep->tx_ring_size - 1);
 
-	if (fep->cur_tx == fep->dirty_tx)
+	if (fep->tx_next == fep->tx_dirty) {
 		netif_stop_queue(ndev);
+	}
 
 	/* Trigger transmission start */
 	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
@@ -446,46 +484,43 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 static void fec_enet_bd_init(struct net_device *dev)
 {
 	struct fec_enet_private *fep = netdev_priv(dev);
-	struct bufdesc *bdp;
 	unsigned int i;
 
 	/* Initialize the receive buffer descriptors. */
-	bdp = fep->rx_bd_base;
 	for (i = 0; i < fep->rx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
 
 		/* Initialize the BD for every fragment in the page. */
-		if (bdp->cbd_bufaddr)
-			bdp->cbd_sc = BD_ENET_RX_EMPTY;
+		if (bdp->bd.cbd_bufaddr)
+			bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
 		else
-			bdp->cbd_sc = 0;
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+			bdp->bd.cbd_sc = 0;
+		if (i == fep->rx_ring_size - 1)
+			bdp->bd.cbd_sc |= BD_SC_WRAP;
 	}
 
-	/* Set the last buffer to wrap */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
-
-	fep->cur_rx = fep->rx_bd_base;
+	fep->rx_next = 0;
 
 	/* ...and the same for transmit */
-	bdp = fep->tx_bd_base;
-	fep->cur_tx = bdp;
 	for (i = 0; i < fep->tx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
 
 		/* Initialize the BD for every fragment in the page.
*/ - bdp->cbd_sc = 0; - if (bdp->cbd_bufaddr && fep->tx_skbuff[i]) { + /* Set the last buffer to wrap */ + if (i == fep->tx_ring_size - 1) + bdp->bd.cbd_sc = BD_SC_WRAP; + else + bdp->bd.cbd_sc = 0; + if (bdp->bd.cbd_bufaddr) + fec_enet_tx_unmap(&bdp->bd, fep); + if (fep->tx_skbuff[i]) { dev_kfree_skb_any(fep->tx_skbuff[i]); fep->tx_skbuff[i] = NULL; } - bdp->cbd_bufaddr = 0; - bdp = fec_enet_get_nextdesc(bdp, fep); } - /* Set the last buffer to wrap */ - bdp = fec_enet_get_prevdesc(bdp, fep); - bdp->cbd_sc |= BD_SC_WRAP; - fep->dirty_tx = bdp; + fep->tx_next = 0; + fep->tx_dirty = fep->tx_ring_size - 1; } /* This function is called to start or restart the FEC during a link @@ -496,8 +531,6 @@ static void fec_restart(struct net_device *ndev, int duplex) { struct fec_enet_private *fep = netdev_priv(ndev); - const struct platform_device_id *id_entry = - platform_get_device_id(fep->pdev); int i; u32 val; u32 temp_mac[2]; @@ -519,7 +552,7 @@ fec_restart(struct net_device *ndev, int duplex) * enet-mac reset will reset mac address registers too, * so need to reconfigure it. */ - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) { + if (fep->quirks & FEC_QUIRK_ENET_MAC) { memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN); writel(cpu_to_be32(temp_mac[0]), fep->hwp + FEC_ADDR_LOW); writel(cpu_to_be32(temp_mac[1]), fep->hwp + FEC_ADDR_HIGH); @@ -568,6 +601,8 @@ fec_restart(struct net_device *ndev, int duplex) #if !defined(CONFIG_M5272) /* set RX checksum */ val = readl(fep->hwp + FEC_RACC); + if (fep->quirks & FEC_QUIRK_RX_SHIFT16) + val |= FEC_RACC_SHIFT16; if (fep->csum_flags & FLAG_RX_CSUM_ENABLED) val |= FEC_RACC_OPTIONS; else @@ -579,7 +614,7 @@ fec_restart(struct net_device *ndev, int duplex) * The phy interface and speed need to get configured * differently on enet-mac. 
*/ - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) { + if (fep->quirks & FEC_QUIRK_ENET_MAC) { /* Enable flow control and length check */ rcntl |= 0x40000000 | 0x00000020; @@ -602,7 +637,7 @@ fec_restart(struct net_device *ndev, int duplex) } } else { #ifdef FEC_MIIGSK_ENR - if (id_entry->driver_data & FEC_QUIRK_USE_GASKET) { + if (fep->quirks & FEC_QUIRK_USE_GASKET) { u32 cfgr; /* disable the gasket and wait */ writel(0, fep->hwp + FEC_MIIGSK_ENR); @@ -655,7 +690,7 @@ fec_restart(struct net_device *ndev, int duplex) writel(0, fep->hwp + FEC_HASH_TABLE_LOW); #endif - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) { + if (fep->quirks & FEC_QUIRK_ENET_MAC) { /* enable ENET endian swap */ ecntl |= (1 << 8); /* enable ENET store and forward mode */ @@ -692,8 +727,6 @@ static void fec_stop(struct net_device *ndev) { struct fec_enet_private *fep = netdev_priv(ndev); - const struct platform_device_id *id_entry = - platform_get_device_id(fep->pdev); u32 rmii_mode = readl(fep->hwp + FEC_R_CNTRL) & (1 << 8); /* We cannot expect a graceful transmit stop without link !!! 
*/ @@ -711,7 +744,7 @@ fec_stop(struct net_device *ndev) writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK); /* We have to keep ENET enabled to have MII interrupt stay working */ - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) { + if (fep->quirks & FEC_QUIRK_ENET_MAC) { writel(2, fep->hwp + FEC_ECNTRL); writel(rmii_mode, fep->hwp + FEC_R_CNTRL); } @@ -723,6 +756,8 @@ fec_timeout(struct net_device *ndev) { struct fec_enet_private *fep = netdev_priv(ndev); + fec_dump(ndev); + ndev->stats.tx_errors++; fep->delay_work.timeout = true; @@ -751,34 +786,28 @@ static void fec_enet_work(struct work_struct *work) static void fec_enet_tx(struct net_device *ndev) { - struct fec_enet_private *fep; - struct bufdesc *bdp; + struct fec_enet_private *fep = netdev_priv(ndev); + union bufdesc_u *bdp; unsigned short status; struct sk_buff *skb; - int index = 0; - - fep = netdev_priv(ndev); - bdp = fep->dirty_tx; + unsigned index = fep->tx_dirty; - /* get next bdp of dirty_tx */ - bdp = fec_enet_get_nextdesc(bdp, fep); + do { + index = (index + 1) & (fep->tx_ring_size - 1); + bdp = fec_enet_tx_get(index, fep); - while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) { + status = bdp->bd.cbd_sc; + if (status & BD_ENET_TX_READY) + break; /* current queue is empty */ - if (bdp == fep->cur_tx) + if (index == fep->tx_next) break; - if (fep->bufdesc_ex) - index = (struct bufdesc_ex *)bdp - - (struct bufdesc_ex *)fep->tx_bd_base; - else - index = bdp - fep->tx_bd_base; + fec_enet_tx_unmap(&bdp->bd, fep); skb = fep->tx_skbuff[index]; - dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len, - DMA_TO_DEVICE); - bdp->cbd_bufaddr = 0; + fep->tx_skbuff[index] = NULL; /* Check for errors. 
*/ if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC | @@ -797,19 +826,18 @@ fec_enet_tx(struct net_device *ndev) ndev->stats.tx_carrier_errors++; } else { ndev->stats.tx_packets++; - ndev->stats.tx_bytes += bdp->cbd_datlen; + ndev->stats.tx_bytes += bdp->bd.cbd_datlen; } if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) && fep->bufdesc_ex) { struct skb_shared_hwtstamps shhwtstamps; unsigned long flags; - struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp; memset(&shhwtstamps, 0, sizeof(shhwtstamps)); spin_lock_irqsave(&fep->tmreg_lock, flags); shhwtstamps.hwtstamp = ns_to_ktime( - timecounter_cyc2time(&fep->tc, ebdp->ts)); + timecounter_cyc2time(&fep->tc, bdp->ebd.ts)); spin_unlock_irqrestore(&fep->tmreg_lock, flags); skb_tstamp_tx(skb, &shhwtstamps); } @@ -825,45 +853,252 @@ fec_enet_tx(struct net_device *ndev) /* Free the sk buffer associated with this last transmit */ dev_kfree_skb_any(skb); - fep->tx_skbuff[index] = NULL; - - fep->dirty_tx = bdp; - - /* Update pointer to next buffer descriptor to be transmitted */ - bdp = fec_enet_get_nextdesc(bdp, fep); /* Since we have freed up a buffer, the ring is no longer full */ - if (fep->dirty_tx != fep->cur_tx) { - if (netif_queue_stopped(ndev)) - netif_wake_queue(ndev); + if (netif_queue_stopped(ndev)) { + + + + + + netif_wake_queue(ndev); + } - } + + fep->tx_dirty = index; + } while (1); return; } -/* During a receive, the cur_rx points to the current incoming buffer. 
+static void +fec_enet_receive(struct sk_buff *skb, union bufdesc_u *bdp, struct net_device *ndev) +{ + struct fec_enet_private *fep = netdev_priv(ndev); + + skb->protocol = eth_type_trans(skb, ndev); + + /* Get receive timestamp from the skb */ + if (fep->hwts_rx_en && fep->bufdesc_ex) { + struct skb_shared_hwtstamps *shhwtstamps = + skb_hwtstamps(skb); + unsigned long flags; + + memset(shhwtstamps, 0, sizeof(*shhwtstamps)); + + spin_lock_irqsave(&fep->tmreg_lock, flags); + shhwtstamps->hwtstamp = ns_to_ktime( + timecounter_cyc2time(&fep->tc, bdp->ebd.ts)); + spin_unlock_irqrestore(&fep->tmreg_lock, flags); + } + + if (fep->csum_flags & FLAG_RX_CSUM_ENABLED) { + if (!(bdp->ebd.cbd_esc & FLAG_RX_CSUM_ERROR)) { + /* don't check it */ + skb->ip_summed = CHECKSUM_UNNECESSARY; + } else { + skb_checksum_none_assert(skb); + } + } + + napi_gro_receive(&fep->napi, skb); +} + +static void noinline +fec_enet_receive_copy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev) +{ + struct fec_enet_private *fep = netdev_priv(ndev); + struct sk_buff *skb; + unsigned char *data; + bool vlan_packet_rcvd = false; + + /* + * Detect the presence of the VLAN tag, and adjust + * the packet length appropriately. + */ + if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX && + bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) { + pkt_len -= VLAN_HLEN; + vlan_packet_rcvd = true; + } + + /* This does 16 byte alignment, exactly what we need. */ + skb = netdev_alloc_skb(ndev, pkt_len + NET_IP_ALIGN); + if (unlikely(!skb)) { + ndev->stats.rx_dropped++; + return; + } + + dma_sync_single_for_cpu(&fep->pdev->dev, bdp->bd.cbd_bufaddr, + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); + + data = fep->rx_skbuff[index]->data; + +#ifndef CONFIG_M5272 + /* + * If we have enabled this feature, we need to discard + * the two bytes at the beginning of the packet before + * copying it. 
+ */ + if (fep->quirks & FEC_QUIRK_RX_SHIFT16) { + pkt_len -= 2; + data += 2; + } +#endif + + if (fep->quirks & FEC_QUIRK_SWAP_FRAME) + swap_buffer(data, pkt_len); + + skb_reserve(skb, NET_IP_ALIGN); + skb_put(skb, pkt_len); /* Make room */ + + /* If this is a VLAN packet remove the VLAN Tag */ + if (vlan_packet_rcvd) { + struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN); + + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), + ntohs(vlan->h_vlan_TCI)); + + /* Extract the frame data without the VLAN header. */ + skb_copy_to_linear_data(skb, data, 2 * ETH_ALEN); + skb_copy_to_linear_data_offset(skb, 2 * ETH_ALEN, + data + 2 * ETH_ALEN + VLAN_HLEN, + pkt_len - 2 * ETH_ALEN); + } else { + skb_copy_to_linear_data(skb, data, pkt_len); + } + + dma_sync_single_for_device(&fep->pdev->dev, bdp->bd.cbd_bufaddr, + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); + + fec_enet_receive(skb, bdp, ndev); +} + +static void noinline +fec_enet_receive_nocopy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev) +{ + struct fec_enet_private *fep = netdev_priv(ndev); + struct sk_buff *skb, *skb_new; + unsigned char *data; + dma_addr_t addr; + +#if 0 + skb_new = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE); + if (!skb_new) { + ndev->stats.rx_dropped++; + return; + } + + addr = dma_map_single(&fep->pdev->dev, skb_new->data, + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); + if (dma_mapping_error(&fep->pdev->dev, addr)) { + dev_kfree_skb(skb_new); + ndev->stats.rx_dropped++; + return; + } +#else + skb_new = NULL; + addr = 0; +#endif + + /* + * We have the new skb, so proceed to deal with the + * received data. + */ + dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr, + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); + + skb = fep->rx_skbuff[index]; + + /* Now subsitute in the new skb */ + fep->rx_skbuff[index] = skb_new; + bdp->bd.cbd_bufaddr = addr; + + /* + * Update the skb length according to the raw packet + * length. Then remove the two bytes of additional + * padding. 
+ */ + skb_put(skb, pkt_len); + data = skb_pull_inline(skb, 2); + + if (fep->quirks & FEC_QUIRK_SWAP_FRAME) + swap_buffer(data, skb->len); + + /* + * Now juggle things for the VLAN tag - if the hardware + * flags this as present, we need to read the tag, and + * then shuffle the ethernet addresses up. + */ + if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX && + bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) { + struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN); + + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), + ntohs(vlan->h_vlan_TCI)); + + memmove(data + VLAN_HLEN, data, 2 * ETH_ALEN); + skb_pull_inline(skb, VLAN_HLEN); + } + + fec_enet_receive(skb, bdp, ndev); +} + +static int +fec_enet_refill_ring(unsigned first, unsigned last, struct net_device *ndev) +{ + struct fec_enet_private *fep = netdev_priv(ndev); + unsigned i = first; + + do { + union bufdesc_u *bdp = fec_enet_rx_get(i, fep); + struct sk_buff *skb; + dma_addr_t addr; + + if (!fep->rx_skbuff[i]) { + skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE); + if (!skb) + return -ENOMEM; + + addr = dma_map_single(&fep->pdev->dev, skb->data, + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); + if (dma_mapping_error(&fep->pdev->dev, addr)) { + dev_kfree_skb(skb); + return -ENOMEM; + } + + fep->rx_skbuff[i] = skb; + bdp->bd.cbd_bufaddr = addr; + } + + bdp->bd.cbd_sc = (bdp->bd.cbd_sc & BD_SC_WRAP) | + BD_ENET_RX_EMPTY; + + if (fep->bufdesc_ex) { + bdp->ebd.cbd_esc = BD_ENET_RX_INT; + bdp->ebd.cbd_prot = 0; + bdp->ebd.cbd_bdu = 0; + } + i = (i + 1) & (fep->rx_ring_size - 1); + } while (i != last); + + return 0; +} + +/* During a receive, the rx_next points to the current incoming buffer. * When we update through the ring, if the next incoming buffer has * not been given to the system, we just set the empty indicator, * effectively tossing the packet. 
*/ -static int +static int noinline fec_enet_rx(struct net_device *ndev, int budget) { struct fec_enet_private *fep = netdev_priv(ndev); - const struct platform_device_id *id_entry = - platform_get_device_id(fep->pdev); - struct bufdesc *bdp; unsigned short status; - struct sk_buff *skb; - ushort pkt_len; - __u8 *data; + unsigned pkt_len; int pkt_received = 0; - struct bufdesc_ex *ebdp = NULL; - bool vlan_packet_rcvd = false; - u16 vlan_tag; - int index = 0; + unsigned index = fep->rx_next; #ifdef CONFIG_M532x flush_cache_all(); @@ -872,12 +1107,16 @@ fec_enet_rx(struct net_device *ndev, int budget) /* First, grab all of the stats for the incoming packet. * These get messed up if we get called due to a busy condition. */ - bdp = fep->cur_rx; + do { + union bufdesc_u *bdp = fec_enet_rx_get(index, fep); - while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) { + status = bdp->bd.cbd_sc; + if (status & BD_ENET_RX_EMPTY) + break; if (pkt_received >= budget) break; + pkt_received++; /* Since we have allocated space to hold a complete frame, @@ -917,124 +1156,33 @@ fec_enet_rx(struct net_device *ndev, int budget) /* Process the incoming frame. 
*/ ndev->stats.rx_packets++; - pkt_len = bdp->cbd_datlen; - ndev->stats.rx_bytes += pkt_len; - - if (fep->bufdesc_ex) - index = (struct bufdesc_ex *)bdp - - (struct bufdesc_ex *)fep->rx_bd_base; - else - index = bdp - fep->rx_bd_base; - data = fep->rx_skbuff[index]->data; - dma_sync_single_for_cpu(&fep->pdev->dev, bdp->cbd_bufaddr, - FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); - - if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME) - swap_buffer(data, pkt_len); - - /* Extract the enhanced buffer descriptor */ - ebdp = NULL; - if (fep->bufdesc_ex) - ebdp = (struct bufdesc_ex *)bdp; - - /* If this is a VLAN packet remove the VLAN Tag */ - vlan_packet_rcvd = false; - if ((ndev->features & NETIF_F_HW_VLAN_CTAG_RX) && - fep->bufdesc_ex && (ebdp->cbd_esc & BD_ENET_RX_VLAN)) { - /* Push and remove the vlan tag */ - struct vlan_hdr *vlan_header = - (struct vlan_hdr *) (data + ETH_HLEN); - vlan_tag = ntohs(vlan_header->h_vlan_TCI); - pkt_len -= VLAN_HLEN; - - vlan_packet_rcvd = true; - } - /* This does 16 byte alignment, exactly what we need. - * The packet length includes FCS, but we don't want to - * include that when passing upstream as it messes up - * bridging applications. + /* + * The packet length includes FCS, but we don't want + * to include that when passing upstream as it messes + * up bridging applications. */ - skb = netdev_alloc_skb(ndev, pkt_len - 4 + NET_IP_ALIGN); + pkt_len = bdp->bd.cbd_datlen - 4; + ndev->stats.rx_bytes += pkt_len; - if (unlikely(!skb)) { - ndev->stats.rx_dropped++; + if (fec_enet_rx_zerocopy(fep, pkt_len)) { + fec_enet_receive_nocopy(pkt_len, index, bdp, ndev); } else { - int payload_offset = (2 * ETH_ALEN); - skb_reserve(skb, NET_IP_ALIGN); - skb_put(skb, pkt_len - 4); /* Make room */ - - /* Extract the frame data without the VLAN header. 
*/ - skb_copy_to_linear_data(skb, data, (2 * ETH_ALEN)); - if (vlan_packet_rcvd) - payload_offset = (2 * ETH_ALEN) + VLAN_HLEN; - skb_copy_to_linear_data_offset(skb, (2 * ETH_ALEN), - data + payload_offset, - pkt_len - 4 - (2 * ETH_ALEN)); - - skb->protocol = eth_type_trans(skb, ndev); - - /* Get receive timestamp from the skb */ - if (fep->hwts_rx_en && fep->bufdesc_ex) { - struct skb_shared_hwtstamps *shhwtstamps = - skb_hwtstamps(skb); - unsigned long flags; - - memset(shhwtstamps, 0, sizeof(*shhwtstamps)); - - spin_lock_irqsave(&fep->tmreg_lock, flags); - shhwtstamps->hwtstamp = ns_to_ktime( - timecounter_cyc2time(&fep->tc, ebdp->ts)); - spin_unlock_irqrestore(&fep->tmreg_lock, flags); - } - - if (fep->bufdesc_ex && - (fep->csum_flags & FLAG_RX_CSUM_ENABLED)) { - if (!(ebdp->cbd_esc & FLAG_RX_CSUM_ERROR)) { - /* don't check it */ - skb->ip_summed = CHECKSUM_UNNECESSARY; - } else { - skb_checksum_none_assert(skb); - } - } - - /* Handle received VLAN packets */ - if (vlan_packet_rcvd) - __vlan_hwaccel_put_tag(skb, - htons(ETH_P_8021Q), - vlan_tag); - - napi_gro_receive(&fep->napi, skb); + fec_enet_receive_copy(pkt_len, index, bdp, ndev); } - dma_sync_single_for_device(&fep->pdev->dev, bdp->cbd_bufaddr, - FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); rx_processing_done: - /* Clear the status flags for this buffer */ - status &= ~BD_ENET_RX_STATS; - - /* Mark the buffer empty */ - status |= BD_ENET_RX_EMPTY; - bdp->cbd_sc = status; - - if (fep->bufdesc_ex) { - struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp; - - ebdp->cbd_esc = BD_ENET_RX_INT; - ebdp->cbd_prot = 0; - ebdp->cbd_bdu = 0; - } - - /* Update BD pointer to next entry */ - bdp = fec_enet_get_nextdesc(bdp, fep); + index = (index + 1) & (fep->rx_ring_size - 1); + if (index == fep->rx_next) + break; + } while (1); - /* Doing this here will keep the FEC running while we process - * incoming frames. On a heavily loaded network, we should be - * able to keep up at the expense of system resources. 
- */ + if (pkt_received) { + fec_enet_refill_ring(fep->rx_next, index, ndev); writel(0, fep->hwp + FEC_R_DES_ACTIVE); } - fep->cur_rx = bdp; + + fep->rx_next = index; return pkt_received; } @@ -1044,29 +1192,25 @@ fec_enet_interrupt(int irq, void *dev_id) { struct net_device *ndev = dev_id; struct fec_enet_private *fep = netdev_priv(ndev); + const unsigned napi_mask = FEC_ENET_RXF | FEC_ENET_TXF; uint int_events; irqreturn_t ret = IRQ_NONE; - do { - int_events = readl(fep->hwp + FEC_IEVENT); - writel(int_events, fep->hwp + FEC_IEVENT); + int_events = readl(fep->hwp + FEC_IEVENT); + writel(int_events & ~napi_mask, fep->hwp + FEC_IEVENT); - if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) { - ret = IRQ_HANDLED; + if (int_events & napi_mask) { + ret = IRQ_HANDLED; - /* Disable the RX interrupt */ - if (napi_schedule_prep(&fep->napi)) { - writel(FEC_RX_DISABLED_IMASK, - fep->hwp + FEC_IMASK); - __napi_schedule(&fep->napi); - } - } + /* Disable the NAPI interrupts */ + writel(FEC_ENET_MII, fep->hwp + FEC_IMASK); + napi_schedule(&fep->napi); + } - if (int_events & FEC_ENET_MII) { - ret = IRQ_HANDLED; - complete(&fep->mdio_done); - } - } while (int_events); + if (int_events & FEC_ENET_MII) { + ret = IRQ_HANDLED; + complete(&fep->mdio_done); + } return ret; } @@ -1074,10 +1218,24 @@ fec_enet_interrupt(int irq, void *dev_id) static int fec_enet_rx_napi(struct napi_struct *napi, int budget) { struct net_device *ndev = napi->dev; - int pkts = fec_enet_rx(ndev, budget); struct fec_enet_private *fep = netdev_priv(ndev); + unsigned status; + int pkts = 0; + + status = readl(fep->hwp + FEC_IEVENT) & (FEC_ENET_RXF | FEC_ENET_TXF); + if (status) { + /* + * Clear any pending transmit or receive interrupts before + * processing the rings to avoid racing with the hardware. 
+ */ + writel(status, fep->hwp + FEC_IEVENT); - fec_enet_tx(ndev); + if (status & FEC_ENET_RXF) + pkts = fec_enet_rx(ndev, budget); + + if (status & FEC_ENET_TXF) + fec_enet_tx(ndev); + } if (pkts < budget) { napi_complete(napi); @@ -1263,8 +1421,6 @@ static int fec_enet_mdio_reset(struct mii_bus *bus) static int fec_enet_mii_probe(struct net_device *ndev) { struct fec_enet_private *fep = netdev_priv(ndev); - const struct platform_device_id *id_entry = - platform_get_device_id(fep->pdev); struct phy_device *phy_dev = NULL; char mdio_bus_id[MII_BUS_ID_SIZE]; char phy_name[MII_BUS_ID_SIZE + 3]; @@ -1302,7 +1458,7 @@ static int fec_enet_mii_probe(struct net_device *ndev) } /* mask with MAC supported features */ - if (id_entry->driver_data & FEC_QUIRK_HAS_GBIT) { + if (fep->quirks & FEC_QUIRK_HAS_GBIT) { phy_dev->supported &= PHY_GBIT_FEATURES; #if !defined(CONFIG_M5272) phy_dev->supported |= SUPPORTED_Pause; @@ -1329,8 +1485,6 @@ static int fec_enet_mii_init(struct platform_device *pdev) static struct mii_bus *fec0_mii_bus; struct net_device *ndev = platform_get_drvdata(pdev); struct fec_enet_private *fep = netdev_priv(ndev); - const struct platform_device_id *id_entry = - platform_get_device_id(fep->pdev); int err = -ENXIO, i; /* @@ -1349,7 +1503,7 @@ static int fec_enet_mii_init(struct platform_device *pdev) * mdio interface in board design, and need to be configured by * fec0 mii_bus. */ - if ((id_entry->driver_data & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) { + if ((fep->quirks & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) { /* fec1 uses fec0 mii_bus */ if (mii_cnt && fec0_mii_bus) { fep->mii_bus = fec0_mii_bus; @@ -1370,7 +1524,7 @@ static int fec_enet_mii_init(struct platform_device *pdev) * document. 
*/ fep->phy_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ahb), 5000000); - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) + if (fep->quirks & FEC_QUIRK_ENET_MAC) fep->phy_speed--; fep->phy_speed <<= 1; writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED); @@ -1405,7 +1559,7 @@ static int fec_enet_mii_init(struct platform_device *pdev) mii_cnt++; /* save fec0 mii_bus */ - if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) + if (fep->quirks & FEC_QUIRK_ENET_MAC) fec0_mii_bus = fep->mii_bus; return 0; @@ -1694,23 +1848,24 @@ static void fec_enet_free_buffers(struct net_device *ndev) struct fec_enet_private *fep = netdev_priv(ndev); unsigned int i; struct sk_buff *skb; - struct bufdesc *bdp; - bdp = fep->rx_bd_base; for (i = 0; i < fep->rx_ring_size; i++) { - skb = fep->rx_skbuff[i]; + union bufdesc_u *bdp = fec_enet_rx_get(i, fep); - if (bdp->cbd_bufaddr) - dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, + skb = fep->rx_skbuff[i]; + if (skb) { + dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr, FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); - if (skb) dev_kfree_skb(skb); - bdp = fec_enet_get_nextdesc(bdp, fep); + } } - bdp = fep->tx_bd_base; - for (i = 0; i < fep->tx_ring_size; i++) + for (i = 0; i < fep->tx_ring_size; i++) { + union bufdesc_u *bdp = fec_enet_tx_get(i, fep); + if (bdp->bd.cbd_bufaddr) + fec_enet_tx_unmap(&bdp->bd, fep); kfree(fep->tx_bounce[i]); + } } static int fec_enet_alloc_buffers(struct net_device *ndev) @@ -1718,58 +1873,54 @@ static int fec_enet_alloc_buffers(struct net_device *ndev) struct fec_enet_private *fep = netdev_priv(ndev); unsigned int i; struct sk_buff *skb; - struct bufdesc *bdp; - bdp = fep->rx_bd_base; for (i = 0; i < fep->rx_ring_size; i++) { + union bufdesc_u *bdp = fec_enet_rx_get(i, fep); + dma_addr_t addr; + skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE); if (!skb) { fec_enet_free_buffers(ndev); return -ENOMEM; } - fep->rx_skbuff[i] = skb; - bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, skb->data, - FEC_ENET_RX_FRSIZE, 
DMA_FROM_DEVICE); - if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) { + addr = dma_map_single(&fep->pdev->dev, skb->data, + FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE); + if (dma_mapping_error(&fep->pdev->dev, addr)) { + dev_kfree_skb(skb); fec_enet_free_buffers(ndev); if (net_ratelimit()) netdev_err(ndev, "Rx DMA memory map failed\n"); return -ENOMEM; } - bdp->cbd_sc = BD_ENET_RX_EMPTY; - if (fep->bufdesc_ex) { - struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp; - ebdp->cbd_esc = BD_ENET_RX_INT; - } + fep->rx_skbuff[i] = skb; - bdp = fec_enet_get_nextdesc(bdp, fep); - } + bdp->bd.cbd_bufaddr = addr; + bdp->bd.cbd_sc = BD_ENET_RX_EMPTY; + /* Set the last buffer to wrap. */ + if (i == fep->rx_ring_size - 1) + bdp->bd.cbd_sc |= BD_SC_WRAP; - /* Set the last buffer to wrap. */ - bdp = fec_enet_get_prevdesc(bdp, fep); - bdp->cbd_sc |= BD_SC_WRAP; + if (fep->bufdesc_ex) + bdp->ebd.cbd_esc = BD_ENET_RX_INT; + } - bdp = fep->tx_bd_base; for (i = 0; i < fep->tx_ring_size; i++) { + union bufdesc_u *bdp = fec_enet_tx_get(i, fep); fep->tx_bounce[i] = kmalloc(FEC_ENET_TX_FRSIZE, GFP_KERNEL); - bdp->cbd_sc = 0; - bdp->cbd_bufaddr = 0; - - if (fep->bufdesc_ex) { - struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp; - ebdp->cbd_esc = BD_ENET_TX_INT; - } + /* Set the last buffer to wrap. */ + if (i == fep->tx_ring_size - 1) + bdp->bd.cbd_sc = BD_SC_WRAP; + else + bdp->bd.cbd_sc = 0; + bdp->bd.cbd_bufaddr = 0; - bdp = fec_enet_get_nextdesc(bdp, fep); + if (fep->bufdesc_ex) + bdp->ebd.cbd_esc = BD_ENET_TX_INT; } - /* Set the last buffer to wrap. 
*/ - bdp = fec_enet_get_prevdesc(bdp, fep); - bdp->cbd_sc |= BD_SC_WRAP; - return 0; } @@ -1990,9 +2141,7 @@ static const struct net_device_ops fec_netdev_ops = { static int fec_enet_init(struct net_device *ndev) { struct fec_enet_private *fep = netdev_priv(ndev); - const struct platform_device_id *id_entry = - platform_get_device_id(fep->pdev); - struct bufdesc *cbd_base; + union bufdesc_u *cbd_base; /* Allocate memory for buffer descriptors. */ cbd_base = dma_alloc_coherent(NULL, PAGE_SIZE, &fep->bd_dma, @@ -2014,10 +2163,11 @@ static int fec_enet_init(struct net_device *ndev) /* Set receive and transmit descriptor base. */ fep->rx_bd_base = cbd_base; if (fep->bufdesc_ex) - fep->tx_bd_base = (struct bufdesc *) - (((struct bufdesc_ex *)cbd_base) + fep->rx_ring_size); + fep->tx_bd_base = (union bufdesc_u *) + (&cbd_base->ebd + fep->rx_ring_size); else - fep->tx_bd_base = cbd_base + fep->rx_ring_size; + fep->tx_bd_base = (union bufdesc_u *) + (&cbd_base->bd + fep->rx_ring_size); /* The FEC Ethernet specific entries in the device structure */ ndev->watchdog_timeo = TX_TIMEOUT; @@ -2027,19 +2177,24 @@ static int fec_enet_init(struct net_device *ndev) writel(FEC_RX_DISABLED_IMASK, fep->hwp + FEC_IMASK); netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT); - if (id_entry->driver_data & FEC_QUIRK_HAS_VLAN) { - /* enable hw VLAN support */ - ndev->features |= NETIF_F_HW_VLAN_CTAG_RX; - ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX; - } + if (fep->bufdesc_ex) { + /* Features which require the enhanced buffer descriptors */ + netdev_features_t features = 0; - if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) { - /* enable hw accelerator */ - ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM - | NETIF_F_RXCSUM); - ndev->hw_features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM - | NETIF_F_RXCSUM); - fep->csum_flags |= FLAG_RX_CSUM_ENABLED; + if (fep->quirks & FEC_QUIRK_HAS_VLAN) { + /* enable hw VLAN support */ + features |= NETIF_F_HW_VLAN_CTAG_RX; + } + + if 
(fep->quirks & FEC_QUIRK_HAS_CSUM) { + /* enable hw accelerator */ + features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | + NETIF_F_RXCSUM; + fep->csum_flags |= FLAG_RX_CSUM_ENABLED; + } + + ndev->hw_features |= features; + ndev->features |= features; } fec_restart(ndev, 0); @@ -2110,13 +2265,6 @@ fec_probe(struct platform_device *pdev) /* setup board info structure */ fep = netdev_priv(ndev); -#if !defined(CONFIG_M5272) - /* default enable pause frame auto negotiation */ - if (pdev->id_entry && - (pdev->id_entry->driver_data & FEC_QUIRK_HAS_GBIT)) - fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG; -#endif - r = platform_get_resource(pdev, IORESOURCE_MEM, 0); fep->hwp = devm_ioremap_resource(&pdev->dev, r); if (IS_ERR(fep->hwp)) { @@ -2126,6 +2274,14 @@ fec_probe(struct platform_device *pdev) fep->pdev = pdev; fep->dev_id = dev_id++; + if (pdev->id_entry) + fep->quirks = pdev->id_entry->driver_data; + +#if !defined(CONFIG_M5272) + /* default enable pause frame auto negotiation */ + if (fep->quirks & FEC_QUIRK_HAS_GBIT) + fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG; +#endif fep->bufdesc_ex = 0; @@ -2160,8 +2316,7 @@ fec_probe(struct platform_device *pdev) fep->clk_enet_out = NULL; fep->clk_ptp = devm_clk_get(&pdev->dev, "ptp"); - fep->bufdesc_ex = - pdev->id_entry->driver_data & FEC_QUIRK_HAS_BUFDESC_EX; + fep->bufdesc_ex = fep->quirks & FEC_QUIRK_HAS_BUFDESC_EX; if (IS_ERR(fep->clk_ptp)) { fep->clk_ptp = NULL; fep->bufdesc_ex = 0;