PL310 errata workarounds

Message ID 20140319225137.GM7528@n2100.arm.linux.org.uk (mailing list archive)

Commit Message

Russell King - ARM Linux March 19, 2014, 10:51 p.m. UTC
On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:
> On Tuesday, March 18, 2014 at 06:26:15 PM, Russell King - ARM Linux wrote:
> > On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
> > > Setting prefetch enables and early BRESP could all be done
> > > unconditionally in the core code.
> > 
> > I think we can do a few things here, if we know that the CPUs we're
> > connected to are all Cortex-A9:
> > 
> > 1. Enable BRESP.
> > 
> > 2. Enable I+D prefetching - but we really need to tune the prefetch offset
> >    for this to be worthwhile.  The value depends on the L3 memory system
> >    latency, so isn't something that should be specified at the SoC level.
> >    It may also change with different operating points.
> > 
> > 3. Full line of zeros - I think this is a difficult one to achieve
> > properly. The required sequence:
> > 
> >    - enable FLZ in L2 cache
> >    - enable L2 cache
> >    - enable FLZ in Cortex A9
> > 
> >    I'd also assume that when we turn the L2 cache off, we need the reverse
> >    sequence too.  So this sequence can't be done entirely by the boot
> > loader.
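The enable/disable ordering in item 3 can be sketched as a host-side model.
This is emphatically not kernel code: the registers are emulated variables,
and the bit positions (L310 AUX_CTRL bit 0 for full line of zero, Cortex-A9
ACTLR bit 3 for write-full-line-of-zeros mode) plus the helper names are
assumptions for illustration only.

```c
#include <stdint.h>

/* Emulated registers standing in for the real L310 and Cortex-A9 state. */
static uint32_t l2_aux;      /* L310 auxiliary control                 */
static int      l2_enabled;  /* L310 control register, enable bit      */
static uint32_t a9_actlr;    /* Cortex-A9 auxiliary control (ACTLR)    */

#define L310_AUX_FLZ  (1u << 0)  /* assumed: full line of zero enable  */
#define A9_ACTLR_FOZ  (1u << 3)  /* assumed: write full line of zeros  */

/* Enable: FLZ in the L2 first, then the L2 itself, then FLZ in the A9.
 * The core must never emit full-line-of-zero transactions while the L2
 * cannot absorb them, hence this ordering. */
static void l2_flz_enable(void)
{
    l2_aux |= L310_AUX_FLZ;     /* 1. enable FLZ in the L2 cache   */
    l2_enabled = 1;             /* 2. enable the L2 cache          */
    a9_actlr |= A9_ACTLR_FOZ;   /* 3. enable FLZ in the Cortex-A9  */
}

/* Disable: the exact reverse sequence, for the same reason. */
static void l2_flz_disable(void)
{
    a9_actlr &= ~A9_ACTLR_FOZ;
    l2_enabled = 0;
    l2_aux &= ~L310_AUX_FLZ;
}
```

Because step 3 touches a CP15 register on the running core and steps 1-2
touch the outer cache controller, the full sequence has to live in the
kernel rather than the boot loader, as noted above.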
> > 
> > With (1) enabled and (2) properly tuned, I see a performance increase of
> > around 60Mbps on transmission, bringing the Cubox-i4 up from 250Mbps to
> > 315Mbps transmit on its gigabit interface with cpufreq ondemand enabled.
> > With "performance", this goes up to [323, 323, 321, 325, 322]Mbps.  On
> > receive [446, 603, 605, 605, 601]Mbps, which hasn't really changed
> > very much (and still impressively exceeds the Freescale stated maximum
> > total bandwidth of the gigabit interface.)
> 
> Speaking of FEC and slightly off-topic, have you ever seen this on your box
> [1]/[2]/[3]?  I wonder if this might be cache-related as well, since I saw a
> similar issue on MX6 with PCIe-connected ethernet.  I cannot put a finger on
> it though.

I think I've seen something similar once or twice.

Let me just pull up the transmit function.  This is the code which
writes to the descriptors (which frankly is pretty horrid):

        bdp->cbd_datlen = skb->len;
        bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
                        skb->len, DMA_TO_DEVICE); <-- barrier inside
                ebdp->cbd_bdu = 0;
                        ebdp->cbd_esc = BD_ENET_TX_INT;
                                ebdp->cbd_esc |= BD_ENET_TX_PINS;
        bdp->cbd_sc = status;
        writel(0, fep->hwp + FEC_X_DES_ACTIVE); <-- barrier before write

A couple of points here:

1. The hardware operates in a ring - it can read the next descriptor if it
   sees the BD_ENET_TX_READY bit set when it has just finished processing
   the previous descriptor - this can happen before the write to
   FEC_X_DES_ACTIVE.

2. The ARM can re-order writes.  The writes it can re-order are:

	bdp->cbd_bufaddr
	ebdp->cbd_bdu
	ebdp->cbd_esc
	ebdp->cbd_sc

Hence, it's entirely possible for the FEC to see the updated descriptor
status before the rest of the descriptor has been written.  What's missing
is a barrier between the descriptor writes, and the final write to
bdp->cbd_sc.
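The missing barrier can be sketched as follows. This is a host-compilable
illustration, not the actual patch: the descriptor struct is trimmed to
three fields, wmb() is stubbed as a compiler barrier (on ARM it expands to
a real DSB/DMB), and fec_tx_queue() is a hypothetical helper name.

```c
#include <stdint.h>

/* Trimmed FEC buffer descriptor; field names follow the driver. */
struct bufdesc {
    uint16_t cbd_sc;       /* control/status; TX_READY hands it to HW */
    uint16_t cbd_datlen;
    uint32_t cbd_bufaddr;
};

#define BD_ENET_TX_READY 0x8000

/* Host stub; the kernel's wmb() orders these stores for the device. */
#define wmb() __asm__ __volatile__("" ::: "memory")

static void fec_tx_queue(struct bufdesc *bdp, uint32_t dma_addr,
                         uint16_t len, uint16_t status)
{
    /* Fill in everything the hardware will read first... */
    bdp->cbd_datlen  = len;
    bdp->cbd_bufaddr = dma_addr;

    /*
     * We need the preceding stores to the descriptor to complete
     * before updating the status field, which hands it over to the
     * hardware.  The corresponding rmb() is "in the hardware".
     */
    wmb();

    /* ...and only then flip the ready bit that hands it over. */
    bdp->cbd_sc = status | BD_ENET_TX_READY;
}
```

Without the wmb(), the store buffer is free to let the cbd_sc write
overtake the address/length writes, and the FEC can then consume a
half-written descriptor exactly as described above.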

Had I not got distracted by the L2 issues, I'd have posted my FEC patches
by now... in any case, my current xmit function looks a little different -
I've organised it such that:

1. We don't modify anything until we're past the point where things can
   error out.

2. Writes to the descriptor are localised in one area.

3. There's a wmb() barrier between cbd_sc and the previous writes - I
   discussed this with Will Deacon, which resulted in this documentation
   for the barrier:

        /*
         * We need the preceding stores to the descriptor to complete
         * before updating the status field, which hands it over to the
         * hardware.  The corresponding rmb() is "in the hardware".
         */

The second thing that causes the transmit timeouts is the horrid way
NAPI has been added to the driver - it's racy.  NAPI itself isn't the
problem, it's this (compressed a bit to show only the relevant bits):

        do {
                int_events = readl(fep->hwp + FEC_IEVENT);
                writel(int_events, fep->hwp + FEC_IEVENT);
                if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
                        if (napi_schedule_prep(&fep->napi)) {
                                writel(FEC_RX_DISABLED_IMASK,
                                        fep->hwp + FEC_IMASK);
                                __napi_schedule(&fep->napi);
                        }
                }
                if (int_events & FEC_ENET_MII) {
                        complete(&fep->mdio_done);
                }
        } while (int_events);

Consider what happens here if:
- we talk on the MII bus and receive an MII interrupt
- we're just finishing NAPI processing but haven't quite got around to
  calling napi_complete()
- the ethernet has sent all packets, and has also raised a transmit
  interrupt

The result is that the handler is entered with FEC_IEVENT containing TXF
and MII events.  Both these events are cleared down (and thus no longer
exist as interrupt-causing events).  napi_schedule_prep() returns false as
the NAPI rx function is still running, and doesn't mark it for a re-run.
We then handle the MII interrupt.  Loop again, int_events is zero,
we exit.

Meanwhile, the NAPI rx function calls napi_complete() and re-enables
the receive interrupt.  If you're unlucky enough that the RX ring is
also full... no RXF interrupt.  So no further interrupts except maybe
MII interrupts.

NAPI never gets scheduled.  RX ring never gets emptied.  TX ring never
gets reaped.  The result is a timeout with a completely full TX ring.
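One way to close this race can be sketched with a host-side model; this is
not the driver's actual fix, and napi_running plus the simplified fec_isr()
are illustrative stand-ins.  The idea: only ack the RXF/TXF bits in the
(write-1-to-clear) FEC_IEVENT register when NAPI was actually scheduled for
them, so a failed napi_schedule_prep() leaves the events latched and the
interrupt re-fires once the poller completes and re-enables it.

```c
#include <stdint.h>
#include <stdbool.h>

#define FEC_ENET_RXF 0x02000000u
#define FEC_ENET_TXF 0x08000000u
#define FEC_ENET_MII 0x00800000u

static uint32_t ievent;        /* latched event register (W1C model)    */
static bool napi_running;      /* poll in progress, napi_complete() not
                                * yet called - the window in the race    */
static bool napi_scheduled;

static bool napi_schedule_prep(void) { return !napi_running; }

static void fec_isr(void)
{
    uint32_t int_events = ievent;
    uint32_t ack = int_events & FEC_ENET_MII;   /* MII handled inline */

    if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
        if (napi_schedule_prep()) {
            napi_scheduled = true;
            ack |= int_events & (FEC_ENET_RXF | FEC_ENET_TXF);
        }
        /* else: leave RXF/TXF latched so the line re-fires once the
         * poller calls napi_complete() and unmasks the interrupt. */
    }
    ievent &= ~ack;             /* write-1-to-clear */
}
```

In the lost-wakeup scenario above, the TXF event survives the first ISR
entry instead of being thrown away, so NAPI eventually runs and the rings
get serviced.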

I think I've seen both cases: I've seen the case where the TX ring has been
fully transmitted, but it hasn't been reaped.  I've also seen the case
where the TX ring contains packets to be transmitted but the hardware
isn't sending them.

That all said - with the patch below I haven't seen problems since.
(which is the quickest way I can think of to get you a copy of what
I'm presently running - I've killed a number of debug bits denoted by
all the blank lines - and this is against -rc7.)  You may notice that
I added some TX ring dumping code to the driver - always useful in
these situations. ;-)

This patch is of course the consolidated version: individually, this
would be at least 19 patches with nice commit messages describing
each change...

As far as the 600Mbps receive - you need the right conditions for that.
I select the performance cpufreq governor after boot, and let the boot
quiesce.  It doesn't take much for it to drop back to 460Mbps - another
running process other than iperf -s is sufficient to do that.

Let me know how you get on with this.

Comments

Marek Vasut March 19, 2014, 11:05 p.m. UTC | #1
On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:
> On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:

[...]

> > Speaking of FEC and slightly off-topic, have you ever seen this on your
> > box [1]/[2]/[3] ? I wonder if this might be cache-related as well, since
> > I saw similar issue on MX6 with PCIe-connected ethernet. I cannot put a
> > finger on this though.
> 
> I think I've seen something similar once or twice.

Nice, thanks for confirming this.

> Let me just pull up the transmit function.  This is the code which
> writes to the descriptors (which frankly is pretty horrid):

Yes.

>         bdp->cbd_datlen = skb->len;
>         bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
>                         skb->len, DMA_TO_DEVICE); <-- barrier inside
>                 ebdp->cbd_bdu = 0;
>                         ebdp->cbd_esc = BD_ENET_TX_INT;
>                                 ebdp->cbd_esc |= BD_ENET_TX_PINS;
>         bdp->cbd_sc = status;
>         writel(0, fep->hwp + FEC_X_DES_ACTIVE); <-- barrier before write
> 
> A couple of points here:
> 
> 1. The hardware operates in a ring - it can read the next descriptor if it
>    sees the BD_ENET_TX_READY bit set when it has just finished processing
>    the previous descriptor - this can happen before the write to
>    FEC_X_DES_ACTIVE.
> 
> 2. The ARM can re-order writes.  The writes it can re-order are:
> 
> 	bdp->cbd_bufaddr
> 	ebdp->cbd_bdu
> 	ebdp->cbd_esc
> 	ebdp->cbd_sc
> 
> Hence, it's entirely possible for the FEC to see the updated descriptor
> status before the rest of the descriptor has been written.  What's missing
> is a barrier between the descriptor writes, and the final write to
> bdp->cbd_sc.

I think there might even be another thing at play here, but please correct me
if I am wrong. If you look at fb8ef788680d48523321e5f150b23700a1caf980, that
patch actually has no functional impact on the driver, right?

It makes no functional change at all, yet it changed something for Fugang.
This makes me believe there might clearly be something afoot with the DMA
descriptors. Reading Documentation/DMA-API.txt, I noticed the ring with the
descriptors is allocated with dma_alloc_coherent(). Yet DMA-API.txt, section
I, states:

"
 29 Consistent memory is memory for which a write by either the device or
 30 the processor can immediately be read by the processor or device
 31 without having to worry about caching effects.  (You may however need
 32 to make sure to flush the processor's write buffers before telling
 33 devices to read that memory.)
"

So the writes into the descriptor from the CPU might even be stuck in the CPU's 
write buffer when this code below from fec_main.c is called:

        /* Trigger transmission start */
        writel(0, fep->hwp + FEC_X_DES_ACTIVE);

Does this also make sense, or is my idea completely flawed?

> Had I not got distracted by the L2 issues, I'd have posted my FEC patches
> by now... in any case, my current xmit function looks a little different -
> I've organised it such that:
> 
> 1. We don't modify anything until we're past the point where things can
>    error out.
> 
> 2. Writes to the descriptor are localised in one area.
> 
> 3. There's a wmb() barrier between cbd_sc and the previous writes - I
>    discussed this with Will Deacon, which resulted in this documentation
>    for the barrier:
> 
>         /*
>          * We need the preceding stores to the descriptor to complete
>          * before updating the status field, which hands it over to the
>          * hardware.  The corresponding rmb() is "in the hardware".
>          */
> 
> The second thing that causes the transmit timeouts is the horrid way
> NAPI has been added to the driver - it's racy.  NAPI itself isn't the
> problem, it's this (compressed a bit to show only the relevant bits):

I fully agree there are multiple race conditions in the driver as it is now.

[...]

> That all said - with the patch below I haven't seen problems since.
> (which is the quickest way I can think of to get you a copy of what
> I'm presently running - I've killed a number of debug bits denoted by
> all the blank lines - and this is against -rc7.)  You may notice that
> I added some TX ring dumping code to the driver - always useful in
> these situations. ;-)

I will test this ASAP on multiple MX6es and let you know of the result.

> This patch is of course the consolidated version: individually, this
> would be at least 19 patches with nice commit messages describing
> each change...

Please keep me on CC of those, I'd be very interested to test them.

> As far as the 600Mbps receive - you need the right conditions for that.
> I select the performance cpufreq governor after boot, and let the boot
> quiesce.  It doesn't take much for it to drop back to 460Mbps - another
> running process other than iperf -s is sufficient to do that.
> 
> Let me know how you get on with this.

Roger that, will do, thank you very much !
Marek Vasut March 20, 2014, 4:01 a.m. UTC | #2
On Thursday, March 20, 2014 at 12:05:56 AM, Marek Vasut wrote:
> On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:

[...]

> > As far as the 600Mbps receive - you need the right conditions for that.
> > I select the performance cpufreq governor after boot, and let the boot
> > quiesce.  It doesn't take much for it to drop back to 460Mbps - another
> > running process other than iperf -s is sufficient to do that.
> > 
> > Let me know how you get on with this.
> 
> Roger that, will do, thank you very much !

I did some tests and observed no dips in the speed; nice work. Thanks!

I will try to get feedback from more boards and get back to you with the 
results.

Best regards,
Marek Vasut
Fabio Estevam March 20, 2014, 10:27 p.m. UTC | #3
Hi Russell,

On Thu, Mar 20, 2014 at 1:01 AM, Marek Vasut <marex@denx.de> wrote:
> On Thursday, March 20, 2014 at 12:05:56 AM, Marek Vasut wrote:
>> On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:
>
> [...]
>
>> > As far as the 600Mbps receive - you need the right conditions for that.
>> > I select the performance cpufreq governor after boot, and let the boot
>> > quiesce.  It doesn't take much for it to drop back to 460Mbps - another
>> > running process other than iperf -s is sufficient to do that.
>> >
>> > Let me know how you get on with this.
>>
>> Roger that, will do, thank you very much !
>
> I did some tests and observed no dips in the speed; nice work. Thanks!
>
> I will try to get feedback from more boards and get back to you with the
> results.

Thanks for the patch.

Robert Daniels on Cc tried it on mx53 and sent the feedback below.

Thanks

"I tried the patch and it seems a little better but still has problems.
When I first started the test I saw dropped packets.  After about 2 minutes
I got the pause again with the accompanying kernel backtrace.  It looks
like I got some more debug information from the backtrace however (I think
that was part of the patch.)

This is what it looks like now:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0
at /home/robertd/Development/IC/Dev/BoardSupport/ic-ii/linux-mainline/net/sched/sch_generic.c:264
 dev_watchdog+0x288/0x2ac()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc6+ #5
Backtrace:
[<800121bc>] (dump_backtrace) from [<800124a0>] (show_stack+0x18/0x1c)
 r6:8051f29c r5:00000000 r4:808edb9c r3:00000000
[<80012488>] (show_stack) from [<80652098>] (dump_stack+0x84/0x9c)
[<80652014>] (dump_stack) from [<80027d1c>] (warn_slowpath_common
+0x70/0x94)
 r5:00000009 r4:808c9d60
[<80027cac>] (warn_slowpath_common) from [<80027d78>] (warn_slowpath_fmt
+0x38/0x40)
 r8:ded37b40 r7:808c8000 r6:ded37b00 r5:dec34000 r4:00000000
[<80027d44>] (warn_slowpath_fmt) from [<8051f29c>] (dev_watchdog
+0x288/0x2ac)
 r3:dec34000 r2:808335f0
[<8051f014>] (dev_watchdog) from [<80031e68>] (call_timer_fn+0x74/0xf4)
 r10:80031df4 r9:dec34000 r8:8051f014 r7:808c8000 r6:00000100 r5:808c8000
 r4:808c9dd0
[<80031df4>] (call_timer_fn) from [<80032598>] (run_timer_softirq
+0x19c/0x234)
 r10:8051f014 r9:dec34000 r8:00200200 r7:00000000 r6:808c9e20 r5:80929fc0
 r4:dec34284
[<800323fc>] (run_timer_softirq) from [<8002c2f4>] (__do_softirq
+0x110/0x2b4)
 r10:00000100 r9:00000001 r8:40000001 r7:808c8000 r6:808ca080 r5:808ca084
 r4:00000000
[<8002c1e4>] (__do_softirq) from [<8002c7ac>] (irq_exit+0xb8/0x10c)
 r10:8065c4cc r9:00000001 r8:00000000 r7:00000037 r6:808c8000 r5:808c5fe8
 r4:808c8028
[<8002c6f4>] (irq_exit) from [<8000f2a0>] (handle_IRQ+0x5c/0xbc)
 r5:808c5fe8 r4:808d0d24
[<8000f244>] (handle_IRQ) from [<80008590>] (tzic_handle_irq+0x78/0xa8)
 r8:808c9f10 r7:00000001 r6:00000020 r5:80928fd8 r4:00000000 r3:00000080
[<80008518>] (tzic_handle_irq) from [<800130a4>] (__irq_svc+0x44/0x5c)
Exception stack(0x808c9f10 to 0x808c9f58)
9f00:                                     00000001 00000001 00000000
808d3e70
9f20: 808c8000 808d099c 808d0938 8092837d 00000000 808c8000 8065c4cc
808c9f64
9f40: 808c9f28 808c9f58 800638e0 8000f674 20000013 ffffffff
 r9:808c8000 r8:00000000 r7:808c9f44 r6:ffffffff r5:20000013 r4:8000f674
[<8000f64c>] (arch_cpu_idle) from [<8006e874>] (cpu_startup_entry
+0x108/0x160)
[<8006e76c>] (cpu_startup_entry) from [<8064ce8c>] (rest_init+0xb4/0xdc)
 r7:808b8360
[<8064cdd8>] (rest_init) from [<80879b58>] (start_kernel+0x328/0x38c)
 r6:ffffffff r5:808d0880 r4:808d0a30
[<80879830>] (start_kernel) from [<70008074>] (0x70008074)
---[ end trace 18ea3643c6be04df ]---
fec 63fec000.ethernet eth0: TX ring dump
Nr    SC     addr       len  SKB
 0    0x1c00 0xce485000  106 de5db480
 1    0x1c00 0xce485800  106 de5b7540
 2    0x1c00 0xce486000  106 de5b7300
 3    0x1c00 0xce486800  106 de5b73c0
 4    0x1c00 0xce487000  106 de5b7e40
 5    0x1c00 0xce487800  106 de5b7840
 6    0x1c00 0xce500000  106 de5b7b40
 7    0x1c00 0xce500800  106 de5b7600
 8    0x1c00 0xce501000  106 de5b7d80
 9    0x1c00 0xce501800  106 de5b7180
10    0x1c00 0xce502000  106 de5b7a80
11    0x1c00 0xce502800  106 de5b7240
12    0x1c00 0xce503000  106 de5b7000
13 SH 0x1c00 0x00000000   66   (null)
14    0x9c00 0xce504000   66 de5b90c0
15    0x1c00 0xce504800   66 de5b9a80
16    0x1c00 0xce505000   66 de5b9840
17    0x1c00 0xce505800   66 de5b9000
18    0x1c00 0xce506000   66 de5b9480
19    0x1c00 0xce506800   66 de5b9b40
20    0x1c00 0xce507000   66 de5b9600
21    0x1c00 0xce507800   66 de5b9900
22    0x1c00 0xce508000   66 de5b9180
23    0x1c00 0xce508800   66 de5b9240
24    0x1c00 0xce509000   66 de5b93c0
25    0x1c00 0xce509800   66 de5b9540
26    0x1c00 0xce50a000  106 de5b9780
27    0x1c00 0xce50a800   66 de5b9d80
28    0x1c00 0xce50b000  106 de5b9cc0
29    0x1c00 0xce50b800   66 de5b99c0
30    0x1c00 0xce50c000  106 de5b9300
31    0x1c00 0xce50c800  106 de7e9900
32    0x1c00 0xce50d000   66 de7e9e40
33    0x1c00 0xce50d800  106 de7e96c0
34    0x1c00 0xce50e000   66 de7e9c00
35    0x1c00 0xce50e800  106 de7e9d80
36    0x1c00 0xce50f000  106 de7e9cc0
37    0x1c00 0xce50f800   66 de7e9f00
38    0x1c00 0xce510000  106 de7e9a80
39    0x1c00 0xce510800  106 de7e90c0
40    0x1c00 0xce511000   66 de7e9240
41    0x1c00 0xce511800  106 de7e9000
42    0x1c00 0xce512000  106 de7e9780
43    0x1c00 0xce512800   66 de7e9300
44    0x1c00 0xce513000  106 de7e9840
45    0x1c00 0xce513800  106 de5dbd80
46    0x1c00 0xce514000   66 de5dbb40
47    0x1c00 0xce514800  106 de5db780
48    0x1c00 0xce515000  106 de5db9c0
49    0x1c00 0xce515800   66 de5db600
50    0x1c00 0xce516000  106 de5db540
51    0x1c00 0xce516800  106 de5db000
52    0x1c00 0xce517000   66 de5dbc00
53    0x1c00 0xce517800  106 de5db6c0
54    0x1c00 0xce518000   66 de5db3c0
55    0x1c00 0xce518800  106 de5db300
56    0x1c00 0xce519000  106 de5dbf00
57    0x1c00 0xce519800   66 de5db0c0
58    0x1c00 0xce51a000  106 de5db900
59    0x1c00 0xce51a800  106 de5db240
60    0x1c00 0xce51b000   66 de5dbe40
61    0x1c00 0xce51b800  106 de5db840
62    0x1c00 0xce51c000  106 de5dbcc0
63    0x3c00 0xce51c800  106 de5db180
"
Russell King - ARM Linux March 20, 2014, 11:06 p.m. UTC | #4
On Thu, Mar 20, 2014 at 07:27:43PM -0300, Fabio Estevam wrote:
> "I tried the patch and it seems a little better but still has problems.
> When I first started the test I saw dropped packets.  After about 2 minutes
> I got the pause again with the accompanying kernel backtrace.  It looks
> like I got some more debug information from the backtrace however (I think
> that was part of the patch.)

Thanks.  Now that we can see the ring, we can see what's going on.
Every entry with a non-NULL skb is a packet which has been queued up
for transmission, but which hasn't been reaped.  Any entry with a "SC"
value of 0x1c00 has been transmitted.

S marks the ring entry where the next packet to be transmitted would
be placed into the ring.  H points at the previous entry which was
reaped - in other words, the next packet which would be reaped if it
were transmitted would be ring entry 14.

However, this hasn't happened, because that packet is still marked for
the FEC to transmit it.  The really odd thing is, it has transmitted
the subsequent packets from 15 to 63, and 0 to 12.

The obvious question is: was the packet at 14 actually transmitted,
and has the update to the descriptor been lost somehow - that would
require a log of the last 64-ish transmitted packets, and knowledge
of what those packets should contain to really confirm - just wondering
whether we could transmit an additional byte containing the ring index
on these small packets which we could then capture.  We would have to
be certain that the capture itself hasn't dropped any packets.

I'll look into some further patches which would allow that to be
confirmed - I can't do that immediately as I've rather a lot to deal
with at the moment.
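The ring-index tagging idea could look roughly like this; everything here
(the helper name, the buffer layout, the TAG_MAX_LEN threshold) is
hypothetical, just to show the shape of the debug aid rather than real
driver code.

```c
#include <stdint.h>
#include <stddef.h>

/* Only tag small frames, per the discussion; assumed cutoff. */
#define TAG_MAX_LEN 128

/* Append one extra byte carrying the TX ring index to a short frame,
 * so a capture on the far end reveals which descriptor each frame was
 * queued at.  Returns the (possibly extended) frame length. */
static size_t tag_with_ring_index(uint8_t *frame, size_t len,
                                  unsigned ring_index)
{
    if (len >= TAG_MAX_LEN)
        return len;                       /* leave big frames untouched */
    frame[len] = (uint8_t)ring_index;     /* ring has 64 entries: fits  */
    return len + 1;
}
```

As noted above, the capture side would still have to be verified lossless
before any conclusion could be drawn about whether descriptor 14's packet
actually went out on the wire.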

> fec 63fec000.ethernet eth0: TX ring dump
> Nr    SC     addr       len  SKB
>  0    0x1c00 0xce485000  106 de5db480
>  1    0x1c00 0xce485800  106 de5b7540
>  2    0x1c00 0xce486000  106 de5b7300
>  3    0x1c00 0xce486800  106 de5b73c0
>  4    0x1c00 0xce487000  106 de5b7e40
>  5    0x1c00 0xce487800  106 de5b7840
>  6    0x1c00 0xce500000  106 de5b7b40
>  7    0x1c00 0xce500800  106 de5b7600
>  8    0x1c00 0xce501000  106 de5b7d80
>  9    0x1c00 0xce501800  106 de5b7180
> 10    0x1c00 0xce502000  106 de5b7a80
> 11    0x1c00 0xce502800  106 de5b7240
> 12    0x1c00 0xce503000  106 de5b7000
> 13 SH 0x1c00 0x00000000   66   (null)
> 14    0x9c00 0xce504000   66 de5b90c0
> 15    0x1c00 0xce504800   66 de5b9a80
> 16    0x1c00 0xce505000   66 de5b9840
> 17    0x1c00 0xce505800   66 de5b9000
> 18    0x1c00 0xce506000   66 de5b9480
> 19    0x1c00 0xce506800   66 de5b9b40
> 20    0x1c00 0xce507000   66 de5b9600
> 21    0x1c00 0xce507800   66 de5b9900
> 22    0x1c00 0xce508000   66 de5b9180
> 23    0x1c00 0xce508800   66 de5b9240
> 24    0x1c00 0xce509000   66 de5b93c0
> 25    0x1c00 0xce509800   66 de5b9540
> 26    0x1c00 0xce50a000  106 de5b9780
> 27    0x1c00 0xce50a800   66 de5b9d80
> 28    0x1c00 0xce50b000  106 de5b9cc0
> 29    0x1c00 0xce50b800   66 de5b99c0
> 30    0x1c00 0xce50c000  106 de5b9300
> 31    0x1c00 0xce50c800  106 de7e9900
> 32    0x1c00 0xce50d000   66 de7e9e40
> 33    0x1c00 0xce50d800  106 de7e96c0
> 34    0x1c00 0xce50e000   66 de7e9c00
> 35    0x1c00 0xce50e800  106 de7e9d80
> 36    0x1c00 0xce50f000  106 de7e9cc0
> 37    0x1c00 0xce50f800   66 de7e9f00
> 38    0x1c00 0xce510000  106 de7e9a80
> 39    0x1c00 0xce510800  106 de7e90c0
> 40    0x1c00 0xce511000   66 de7e9240
> 41    0x1c00 0xce511800  106 de7e9000
> 42    0x1c00 0xce512000  106 de7e9780
> 43    0x1c00 0xce512800   66 de7e9300
> 44    0x1c00 0xce513000  106 de7e9840
> 45    0x1c00 0xce513800  106 de5dbd80
> 46    0x1c00 0xce514000   66 de5dbb40
> 47    0x1c00 0xce514800  106 de5db780
> 48    0x1c00 0xce515000  106 de5db9c0
> 49    0x1c00 0xce515800   66 de5db600
> 50    0x1c00 0xce516000  106 de5db540
> 51    0x1c00 0xce516800  106 de5db000
> 52    0x1c00 0xce517000   66 de5dbc00
> 53    0x1c00 0xce517800  106 de5db6c0
> 54    0x1c00 0xce518000   66 de5db3c0
> 55    0x1c00 0xce518800  106 de5db300
> 56    0x1c00 0xce519000  106 de5dbf00
> 57    0x1c00 0xce519800   66 de5db0c0
> 58    0x1c00 0xce51a000  106 de5db900
> 59    0x1c00 0xce51a800  106 de5db240
> 60    0x1c00 0xce51b000   66 de5dbe40
> 61    0x1c00 0xce51b800  106 de5db840
> 62    0x1c00 0xce51c000  106 de5dbcc0
> 63    0x3c00 0xce51c800  106 de5db180
> "
Troy Kisky March 21, 2014, 12:24 a.m. UTC | #5
On 3/20/2014 4:06 PM, Russell King - ARM Linux wrote:
> On Thu, Mar 20, 2014 at 07:27:43PM -0300, Fabio Estevam wrote:
>> "I tried the patch and it seems a little better but still has problems.
>> When I first started the test I saw dropped packets.  After about 2 minutes
>> I got the pause again with the accompanying kernel backtrace.  It looks
>> like I got some more debug information from the backtrace however (I think
>> that was part of the patch.)
> 
> Thanks.  Now that we can see the ring, we can see what's going on.
> Every entry with a non-NULL skb is a packet which has been queued up
> for transmission, but which hasn't been reaped.  Any entry with a "SC"
> value of 0x1c00 has been transmitted.
> 
> S marks the ring entry where the next packet to be transmitted would
> be placed into the ring.  H points at the previous entry which was
> reaped - in other words, the next packet which would be reaped if it
> were transmitted would be ring entry 14.
> 
> However, this hasn't happened, because that packet is still marked for
> the FEC to transmit it.  The really odd thing is, it has transmitted
> the subsequent packets from 15 to 63, and 0 to 12.
> 
> The obvious question is: was the packet at 14 actually transmitted,
> and has the update to the descriptor been lost somehow - that would
> require a log of the last 64-ish transmitted packets, and knowledge
> of what those packets should contain to really confirm - just wondering
> whether we could transmit an additional byte containing the ring index
> on these small packets which we could then capture.  We would have to
> be certain that the capture itself hasn't dropped any packets.

My question is: on a non-cacheable, bufferable write, can an entire 64-byte
line be written?  Since our descriptors are only 32 bytes, is there
contention between the CPU queuing the next packet and the ENET completing
the one before?

13 SH 0x1c00 0x00000000   66   (null)
14    0x9c00 0xce504000   66 de5b90c0

Note that the problematic descriptor (14) is even.  That would fit with
the above.  In your testing, is it always even?


Troy
Russell King - ARM Linux March 21, 2014, 1:18 a.m. UTC | #6
On Thu, Mar 20, 2014 at 05:24:07PM -0700, Troy Kisky wrote:
> My question is: on a non-cacheable, bufferable write, can an entire
> 64-byte line be written?  Since our descriptors are only 32 bytes, is there
> contention between the CPU queuing the next packet and the ENET completing
> the one before?

Well, 64-byte lines would be covering descriptors 0,1 2,3 4,5 ... 12,13
14,15 etc.

However, it wouldn't really make sense because the store buffers in the
L310 are 32 bytes in size, and there are three of them.  Given that with
my patch the descriptor updates go through two drainings of that buffer
per update - once after the address, length and other parameters are
written, then again after the status bits have been written - I think
it's unlikely that what you're suggesting could be a possibility,
unless there's some bug we don't know about.

Remember, DMA memory is "normal, non-cacheable" so there shouldn't be
any cache effects going on here - only store buffer effects which may
delay, merge and/or reorder writes, and that's why we have barriers.

> 13 SH 0x1c00 0x00000000   66   (null)
> 14    0x9c00 0xce504000   66 de5b90c0
> 
> Note that the problematic descriptor(14) is even. That would fit with
> the above. In your testing, is it always even?

Since running with my patch on an iMX6Q, I haven't seen any timeouts,
and I've run many iperf sessions in both directions (even multiple
machines hammering the poor iMX6Q, running iperf and/or flood pings
with various packet sizes.)

What may be useful to know is which iMX device these are still happening
on, with my patch applied.  I have both iMX6Q and iMX6S, the 6Q has been
the focus of my hammering so far...
Fabio Estevam March 21, 2014, 1:36 a.m. UTC | #7
On Thu, Mar 20, 2014 at 10:18 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:

> Well, 64-byte lines would be covering descriptors 0,1 2,3 4,5 ... 12,13
> 14,15 etc.
>
> However, it wouldn't really make sense because the store buffers in the
> L310 are 32-bytes in size, and there's three of them.  Given that with
> my patch the descriptor updates go through two draining of that buffer
> per update - once after the address, length and other parameters are
> written, then again after the status bits have been written, I think
> it's unlikely that what you're suggesting could be a possibility,
> unless there's some bug we don't know about.
>
> Remember, DMA memory is "normal, non-cacheable" so there shouldn't be
> any cache effects going on here - only store buffer effects which may
> delay, merge and/or reorder writes, and that's why we have barriers.
>
>> 13 SH 0x1c00 0x00000000   66   (null)
>> 14    0x9c00 0xce504000   66 de5b90c0
>>
>> Note that the problematic descriptor(14) is even. That would fit with
>> the above. In your testing, is it always even?
>
> Since running with my patch on an iMX6Q, I haven't seen any timeouts,
> and I've run many iperf sessions in both directions (even multiple
> machines hammering the poor iMX6Q, running iperf and/or flood pings
> with various packet sizes.)
>
> What may be useful to know is which iMX device these are still happening
> on, with my patch applied.  I have both iMX6Q and iMX6S, the 6Q has been
> the focus of my hammering so far...

Robert's tests were made on a mx53 (single CortexA9), and its cache
controller is not the L310.

Regards,

Fabio Estevam
Fabio Estevam March 21, 2014, 1:43 a.m. UTC | #8
On Thu, Mar 20, 2014 at 10:36 PM, Fabio Estevam <festevam@gmail.com> wrote:

> Robert's tests were made on a mx53 (single CortexA9), and its cache
> controller is not the L310.

Oops, I meant Cortex-A8.
Zhi Li March 21, 2014, 3:50 p.m. UTC | #9
On Wed, Mar 19, 2014 at 5:51 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:
>> On Tuesday, March 18, 2014 at 06:26:15 PM, Russell King - ARM Linux wrote:
>> > On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
>> > > Setting prefetch enables and early BRESP could all be done
>> > > unconditionally in the core code.
>> >
>> > I think we can do a few things here, if we know that the CPUs we're
>> > connected to are all Cortex-A9:
>> >
>> > 1. Enable BRESP.
>> >
>> > 2. Enable I+D prefetching - but we really need to tune the prefetch offset
>> >    for this to be worthwhile.  The value depends on the L3 memory system
>> >    latency, so isn't something that should be specified at the SoC level.
>> >    It may also change with different operating points.
>> >
>> > 3. Full line of zeros - I think this is a difficult one to achieve
>> > properly. The required sequence:
>> >
>> >    - enable FLZ in L2 cache
>> >    - enable L2 cache
>> >    - enable FLZ in Cortex A9
>> >
>> >    I'd also assume that when we turn the L2 cache off, we need the reverse
>> >    sequence too.  So this sequence can't be done entirely by the boot
>> > loader.
>> >
>> > With (1) enabled and (2) properly tuned, I see a performance increase of
>> > around 60Mbps on transmission, bringing the Cubox-i4 up from 250Mbps to
>> > 315Mbps transmit on its gigabit interface with cpufreq ondemand enabled.
>> > With "performance", this goes up to [323, 323, 321, 325, 322]Mbps.  On
>> > receive [446, 603, 605, 605, 601]Mbps, which hasn't really changed
>> > very much (and still impressively exceeds the Freescale stated maximum
>> > total bandwidth of the gigabit interface.)
>>
>> Speaking of FEC and slightly off-topic, have you ever seen this on your box
>> [1]/[2]/[3] ? I wonder if this might be cache-related as well, since I saw
>> similar issue on MX6 with PCIe-connected ethernet. I cannot put a finger on
>> this though.
>
> I think I've seen something similar once or twice.
>
> Let me just pull up the transmit function.  This is the code which
> writes to the descriptors (which frankly is pretty horrid):
>
>         bdp->cbd_datlen = skb->len;
>         bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
>                         skb->len, DMA_TO_DEVICE); <-- barrier inside
>                 ebdp->cbd_bdu = 0;
>                         ebdp->cbd_esc = BD_ENET_TX_INT;
>                                 ebdp->cbd_esc |= BD_ENET_TX_PINS;
>         bdp->cbd_sc = status;
>         writel(0, fep->hwp + FEC_X_DES_ACTIVE); <-- barrier before write
>
> A couple of points here:
>
> 1. The hardware operates in a ring - it can read the next ring entry if
>    it sees the BD_ENET_TX_READY bit set once it has finished processing
>    the previous descriptor - this can happen before the write to
>    FEC_X_DES_ACTIVE.

Correct. The uDMA will start the transfer when it sees the
BD_ENET_TX_READY bit set. I think other network MACs work similarly to
this one.

>
> 2. The ARM can re-order writes.  The writes it can re-order are:
>
>         bdp->cbd_bufaddr
>         ebdp->cbd_bdu
>         ebdp->cbd_esc
>         bdp->cbd_sc
>
> Hence, it's entirely possible for the FEC to see the updated descriptor
> status before the rest of the descriptor has been written.  What's missing
> is a barrier between the descriptor writes, and the final write to
> bdp->cbd_sc.

Possible. I also considered this while debugging the other issues.
Only bdp->cbd_bufaddr and bdp->cbd_datlen matter here.
The hardware doesn't care about bdu, and esc is the same for every BD:
once software has gone through the BD ring once, esc never changes
again, even though we keep writing it.

But if the sc write lands before bufaddr and datlen, there will be a
problem.  Adding a memory barrier before the write to ebdp->cbd_sc is
the better approach.

I also wanted to add one before, but I have not found a test case that
verifies it is necessary.
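For anyone following along, the publish sequence under discussion (fill in the descriptor body, barrier, then hand ownership to the hardware via the status word) can be sketched in user space like this. The struct layout, the helper name, and the C11 release fence standing in for the kernel's wmb() are illustrative assumptions, not the driver's actual code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified FEC buffer descriptor; field names mirror
 * the driver's struct bufdesc but this is a sketch, not the real
 * hardware layout. */
struct bufdesc {
    uint16_t cbd_sc;        /* status/control: hardware owns it when READY */
    uint16_t cbd_datlen;
    uint32_t cbd_bufaddr;
};

#define BD_ENET_TX_READY 0x8000

/* Publish a descriptor to a (hypothetical) DMA engine.  All stores to
 * the descriptor body must be visible before the READY bit is set; the
 * driver uses wmb() for this on ARM, modelled here with a C11 release
 * fence. */
static void fec_publish_desc(struct bufdesc *bdp, uint32_t addr,
                             uint16_t len)
{
    bdp->cbd_bufaddr = addr;
    bdp->cbd_datlen  = len;

    /* Corresponds to the wmb() in the discussion: order the body
     * stores before the status store that hands over ownership. */
    atomic_thread_fence(memory_order_release);

    bdp->cbd_sc |= BD_ENET_TX_READY;
}
```

Without the fence, a weakly ordered CPU may make the READY bit visible to the DMA engine before bufaddr/datlen, which is exactly the reordering described above.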

>
> Had I not got distracted by the L2 issues, I'd have posted my FEC patches
> by now... in any case, my current xmit function looks a little different -
> I've organised it such that:
>
> 1. We don't modify anything until we're past the point where things can
>    error out.
>
> 2. Writes to the descriptor are localised in one area.
>
> 3. There's a wmb() barrier between cbd_sc and the previous writes - I
>    discussed this with Will Deacon, which resulted in this documentation
>    for the barrier:
>
>         /*
>          * We need the preceding stores to the descriptor to complete
>          * before updating the status field, which hands it over to the
>          * hardware.  The corresponding rmb() is "in the hardware".
>          */

It is better.
>
> The second thing that causes the transmit timeouts is the horrid way
> NAPI has been added to the driver - it's racy.  NAPI itself isn't the
> problem, it's this (compressed a bit to show only the relevant bits):
>
>         do {
>                 int_events = readl(fep->hwp + FEC_IEVENT);
>                 writel(int_events, fep->hwp + FEC_IEVENT);
>                 if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
>                         if (napi_schedule_prep(&fep->napi)) {
>                                 writel(FEC_RX_DISABLED_IMASK,
>                                         fep->hwp + FEC_IMASK);
>                                 __napi_schedule(&fep->napi);
>                         }
>                 }
>                 if (int_events & FEC_ENET_MII) {
>                         complete(&fep->mdio_done);
>                 }
>         } while (int_events);
>
> Consider what happens here if:
> - we talk on the MII bus and receive an MII interrupt
> - we're just finishing NAPI processing but haven't quite got around to
>   calling napi_complete()
> - the ethernet has sent all packets, and has also raised a transmit
>   interrupt
>
> The result is the handler is entered, FEC_IEVENT contains TXF and MII
> events.  Both these events are cleared down, (and thus no longer exist
> as interrupt-causing events.)  napi_schedule_prep() returns false as
> the NAPI rx function is still running, and doesn't mark it for a re-run.
> We then do the MII interrupt.  Loop again, and int_events is zero,
> we exit.
>
> Meanwhile, the NAPI rx function calls napi_complete() and re-enables
> the receive interrupt.  If you're unlucky enough that the RX ring is
> also full... no RXF interrupt.  So no further interrupts except maybe
> MII interrupts.
>
> NAPI never gets scheduled.  RX ring never gets emptied.  TX ring never
> gets reaped.  The result is a timeout with a completely full TX ring.
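One race-free shape for the handler — a simplified user-space model for illustration, not the actual fix in the patch below — is to ack RXF/TXF only when NAPI is guaranteed to run, leaving the events latched otherwise so they are not silently discarded:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event bits, mirroring the driver's FEC_ENET_* masks. */
#define EV_RXF 0x02000000u
#define EV_TXF 0x08000000u
#define EV_MII 0x00800000u

/* Simulated hardware/driver state for illustration. */
static uint32_t ievent;        /* pending events (write-1-to-clear)  */
static bool napi_running;      /* poll function currently active     */
static bool napi_rescheduled;  /* poll must run again after complete */

/* Only clear RXF/TXF once NAPI is guaranteed to process them.  In the
 * buggy handler quoted above, the events are cleared unconditionally,
 * so if napi_schedule_prep() fails they are lost forever. */
static void fec_isr(void)
{
    uint32_t ev = ievent;

    if (ev & (EV_RXF | EV_TXF)) {
        if (!napi_running) {
            ievent &= ~(EV_RXF | EV_TXF);  /* ack: W1C in real hw */
            napi_running = true;           /* __napi_schedule()   */
        } else {
            napi_rescheduled = true;       /* poll will re-run    */
        }
    }
    if (ev & EV_MII)
        ievent &= ~EV_MII;                 /* complete(&mdio_done) */
}
```

The key property is that a TXF/RXF event raised while the poll function is still running is never consumed without scheduling another poll pass.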

Did you actually see the RX ring full?

>
> I think I've seen both cases: I've seen the case where the TX ring is
> completely empty, but it hasn't been reaped.   I've also seen the case
> where the TX ring contains packets to be transmitted but the hardware
> isn't sending them.
>
> That all said - with the patch below I haven't seen problems since.
> (which is the quickest way I can think of to get you a copy of what
> I'm presently running - I've killed a number of debug bits denoted by
> all the blank lines - and this is against -rc7.)  You may notice that
> I added some TX ring dumping code to the driver - always useful in
> these situations. ;-)
>
> This patch is of course the consolidated version: individually, this
> would be at least 19 patches with nice commit messages describing
> each change...
>
> As far as the 600Mbps receive - you need the right conditions for that.
> I select the performance cpufreq governor after boot, and let the boot
> quiesce.  It doesn't take much for it to drop back to 460Mbps - another
> running process other than iperf -s is sufficient to do that.
>
> Let me know how you get on with this.
>
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 3b8d6d19ff05..510580eeae4b 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -170,6 +170,11 @@ struct bufdesc_ex {
>         unsigned short res0[4];
>  };
>
> +union bufdesc_u {
> +       struct bufdesc bd;
> +       struct bufdesc_ex ebd;
> +};
> +
>  /*
>   *     The following definitions courtesy of commproc.h, which where
>   *     Copyright (c) 1997 Dan Malek (dmalek@jlc.net).
> @@ -240,14 +245,14 @@ struct bufdesc_ex {
>   * the skbuffer directly.
>   */
>
> -#define FEC_ENET_RX_PAGES      8
> +#define FEC_ENET_RX_PAGES      32
>  #define FEC_ENET_RX_FRSIZE     2048
>  #define FEC_ENET_RX_FRPPG      (PAGE_SIZE / FEC_ENET_RX_FRSIZE)
>  #define RX_RING_SIZE           (FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
>  #define FEC_ENET_TX_FRSIZE     2048
>  #define FEC_ENET_TX_FRPPG      (PAGE_SIZE / FEC_ENET_TX_FRSIZE)
> -#define TX_RING_SIZE           16      /* Must be power of two */
> -#define TX_RING_MOD_MASK       15      /*   for this to work */
> +#define TX_RING_SIZE           64      /* Must be power of two */
> +#define TX_RING_MOD_MASK       63      /*   for this to work */
>
>  #define BD_ENET_RX_INT          0x00800000
>  #define BD_ENET_RX_PTP          ((ushort)0x0400)
> @@ -289,12 +294,12 @@ struct fec_enet_private {
>         /* CPM dual port RAM relative addresses */
>         dma_addr_t      bd_dma;
>         /* Address of Rx and Tx buffers */
> -       struct bufdesc  *rx_bd_base;
> -       struct bufdesc  *tx_bd_base;
> +       union bufdesc_u *rx_bd_base;
> +       union bufdesc_u *tx_bd_base;
>         /* The next free ring entry */
> -       struct bufdesc  *cur_rx, *cur_tx;
> -       /* The ring entries to be free()ed */
> -       struct bufdesc  *dirty_tx;
> +       unsigned short tx_next;
> +       unsigned short tx_dirty;
> +       unsigned short rx_next;
>
>         unsigned short tx_ring_size;
>         unsigned short rx_ring_size;
> @@ -335,6 +340,9 @@ struct fec_enet_private {
>         struct timer_list time_keep;
>         struct fec_enet_delayed_work delay_work;
>         struct regulator *reg_phy;
> +       unsigned long quirks;
> +
> +
>  };
>
>  void fec_ptp_init(struct platform_device *pdev);
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 03a351300013..8105697d5a99 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -101,6 +101,8 @@ static void set_multicast_list(struct net_device *ndev);
>   * ENET_TDAR[TDAR].
>   */
>  #define FEC_QUIRK_ERR006358            (1 << 7)
> +/* Controller has ability to offset rx packets */
> +#define FEC_QUIRK_RX_SHIFT16           (1 << 8)
>
>  static struct platform_device_id fec_devtype[] = {
>         {
> @@ -120,7 +122,8 @@ static struct platform_device_id fec_devtype[] = {
>                 .name = "imx6q-fec",
>                 .driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT |
>                                 FEC_QUIRK_HAS_BUFDESC_EX | FEC_QUIRK_HAS_CSUM |
> -                               FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358,
> +                               FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358 |
> +                               FEC_QUIRK_RX_SHIFT16,
>         }, {
>                 .name = "mvf600-fec",
>                 .driver_data = FEC_QUIRK_ENET_MAC,
> @@ -200,6 +203,7 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
>  /* FEC receive acceleration */
>  #define FEC_RACC_IPDIS         (1 << 1)
>  #define FEC_RACC_PRODIS                (1 << 2)
> +#define FEC_RACC_SHIFT16       BIT(7)
>  #define FEC_RACC_OPTIONS       (FEC_RACC_IPDIS | FEC_RACC_PRODIS)
>
>  /*
> @@ -233,57 +237,54 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
>
>  static int mii_cnt;
>
> -static inline
> -struct bufdesc *fec_enet_get_nextdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
> +static unsigned copybreak = 200;
> +module_param(copybreak, uint, 0644);
> +MODULE_PARM_DESC(copybreak,
> +                "Maximum size of packet that is copied to a new buffer on receive");
> +
> +
> +
> +
> +
> +static bool fec_enet_rx_zerocopy(struct fec_enet_private *fep, unsigned pktlen)
>  {
> -       struct bufdesc *new_bd = bdp + 1;
> -       struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp + 1;
> -       struct bufdesc_ex *ex_base;
> -       struct bufdesc *base;
> -       int ring_size;
> -
> -       if (bdp >= fep->tx_bd_base) {
> -               base = fep->tx_bd_base;
> -               ring_size = fep->tx_ring_size;
> -               ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
> -       } else {
> -               base = fep->rx_bd_base;
> -               ring_size = fep->rx_ring_size;
> -               ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
> -       }
> +#ifndef CONFIG_M5272
> +       if (fep->quirks & FEC_QUIRK_RX_SHIFT16 && pktlen >= copybreak)
> +               return true;
> +#endif
> +       return false;
> +}
> +
> +static union bufdesc_u *
> +fec_enet_tx_get(unsigned index, struct fec_enet_private *fep)
> +{
> +       union bufdesc_u *base = fep->tx_bd_base;
> +       union bufdesc_u *bdp;
> +
> +       index &= fep->tx_ring_size - 1;
>
>         if (fep->bufdesc_ex)
> -               return (struct bufdesc *)((ex_new_bd >= (ex_base + ring_size)) ?
> -                       ex_base : ex_new_bd);
> +               bdp = (union bufdesc_u *)(&base->ebd + index);
>         else
> -               return (new_bd >= (base + ring_size)) ?
> -                       base : new_bd;
> +               bdp = (union bufdesc_u *)(&base->bd + index);
> +
> +       return bdp;
>  }
>
> -static inline
> -struct bufdesc *fec_enet_get_prevdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
> +static union bufdesc_u *
> +fec_enet_rx_get(unsigned index, struct fec_enet_private *fep)
>  {
> -       struct bufdesc *new_bd = bdp - 1;
> -       struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp - 1;
> -       struct bufdesc_ex *ex_base;
> -       struct bufdesc *base;
> -       int ring_size;
> -
> -       if (bdp >= fep->tx_bd_base) {
> -               base = fep->tx_bd_base;
> -               ring_size = fep->tx_ring_size;
> -               ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
> -       } else {
> -               base = fep->rx_bd_base;
> -               ring_size = fep->rx_ring_size;
> -               ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
> -       }
> +       union bufdesc_u *base = fep->rx_bd_base;
> +       union bufdesc_u *bdp;
> +
> +       index &= fep->rx_ring_size - 1;
>
>         if (fep->bufdesc_ex)
> -               return (struct bufdesc *)((ex_new_bd < ex_base) ?
> -                       (ex_new_bd + ring_size) : ex_new_bd);
> +               bdp = (union bufdesc_u *)(&base->ebd + index);
>         else
> -               return (new_bd < base) ? (new_bd + ring_size) : new_bd;
> +               bdp = (union bufdesc_u *)(&base->bd + index);
> +
> +       return bdp;
>  }
>
>  static void *swap_buffer(void *bufaddr, int len)
> @@ -297,6 +298,26 @@ static void *swap_buffer(void *bufaddr, int len)
>         return bufaddr;
>  }
>
> +static void fec_dump(struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       unsigned index = 0;
> +
> +       netdev_info(ndev, "TX ring dump\n");
> +       pr_info("Nr    SC     addr       len  SKB\n");
> +
> +       for (index = 0; index < fep->tx_ring_size; index++) {
> +               union bufdesc_u *bdp = fec_enet_tx_get(index, fep);
> +
> +               pr_info("%2u %c%c 0x%04x 0x%08lx %4u %p\n",
> +                       index,
> +                       index == fep->tx_next ? 'S' : ' ',
> +                       index == fep->tx_dirty ? 'H' : ' ',
> +                       bdp->bd.cbd_sc, bdp->bd.cbd_bufaddr, bdp->bd.cbd_datlen,
> +                       fep->tx_skbuff[index]);
> +       }
> +}
> +
>  static int
>  fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
>  {
> @@ -312,21 +333,42 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
>         return 0;
>  }
>
> +static void
> +fec_enet_tx_unmap(struct bufdesc *bdp, struct fec_enet_private *fep)
> +{
> +       dma_addr_t addr = bdp->cbd_bufaddr;
> +       unsigned length = bdp->cbd_datlen;
> +
> +       bdp->cbd_bufaddr = 0;
> +
> +       dma_unmap_single(&fep->pdev->dev, addr, length, DMA_TO_DEVICE);
> +}
> +
>  static netdev_tx_t
>  fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
> -       struct bufdesc *bdp, *bdp_pre;
> +       union bufdesc_u *bdp, *bdp_pre;
>         void *bufaddr;
>         unsigned short  status;
> -       unsigned int index;
> +       unsigned index;
> +       unsigned length;
> +       dma_addr_t addr;
> +
> +
> +
> +
> +
> +
> +
> +
> +
>
>         /* Fill in a Tx ring entry */
> -       bdp = fep->cur_tx;
> +       index = fep->tx_next;
>
> -       status = bdp->cbd_sc;
> +       bdp = fec_enet_tx_get(index, fep);
> +       status = bdp->bd.cbd_sc;
>
>         if (status & BD_ENET_TX_READY) {
>                 /* Ooops.  All transmit buffers are full.  Bail out.
> @@ -347,21 +389,15 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>
>         /* Set buffer length and buffer pointer */
>         bufaddr = skb->data;
> -       bdp->cbd_datlen = skb->len;
> +       length = skb->len;
>
>         /*
>          * On some FEC implementations data must be aligned on
>          * 4-byte boundaries. Use bounce buffers to copy data
>          * and get it aligned. Ugh.
>          */
> -       if (fep->bufdesc_ex)
> -               index = (struct bufdesc_ex *)bdp -
> -                       (struct bufdesc_ex *)fep->tx_bd_base;
> -       else
> -               index = bdp - fep->tx_bd_base;
> -
>         if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
> -               memcpy(fep->tx_bounce[index], skb->data, skb->len);
> +               memcpy(fep->tx_bounce[index], skb->data, length);
>                 bufaddr = fep->tx_bounce[index];
>         }
>
> @@ -370,70 +406,72 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>          * the system that it's running on. As the result, driver has to
>          * swap every frame going to and coming from the controller.
>          */
> -       if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
> -               swap_buffer(bufaddr, skb->len);
> +       if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +               swap_buffer(bufaddr, length);
>
> -       /* Save skb pointer */
> -       fep->tx_skbuff[index] = skb;
> -
> -       /* Push the data cache so the CPM does not get stale memory
> -        * data.
> -        */
> -       bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> -                       skb->len, DMA_TO_DEVICE);
> -       if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
> -               bdp->cbd_bufaddr = 0;
> -               fep->tx_skbuff[index] = NULL;
> +       /* Push the data cache so the CPM does not get stale memory data. */
> +       addr = dma_map_single(&fep->pdev->dev, bufaddr, length, DMA_TO_DEVICE);
> +       if (dma_mapping_error(&fep->pdev->dev, addr)) {
>                 dev_kfree_skb_any(skb);
>                 if (net_ratelimit())
>                         netdev_err(ndev, "Tx DMA memory map failed\n");
>                 return NETDEV_TX_OK;
>         }
>
> -       if (fep->bufdesc_ex) {
> +       /* Save skb pointer */
> +       fep->tx_skbuff[index] = skb;
>
> -               struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -               ebdp->cbd_bdu = 0;
> +       bdp->bd.cbd_datlen = length;
> +       bdp->bd.cbd_bufaddr = addr;
> +
> +       if (fep->bufdesc_ex) {
> +               bdp->ebd.cbd_bdu = 0;
>                 if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
>                         fep->hwts_tx_en)) {
> -                       ebdp->cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
> +                       bdp->ebd.cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
>                         skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
>                 } else {
> -                       ebdp->cbd_esc = BD_ENET_TX_INT;
> +                       bdp->ebd.cbd_esc = BD_ENET_TX_INT;
>
>                         /* Enable protocol checksum flags
>                          * We do not bother with the IP Checksum bits as they
>                          * are done by the kernel
>                          */
>                         if (skb->ip_summed == CHECKSUM_PARTIAL)
> -                               ebdp->cbd_esc |= BD_ENET_TX_PINS;
> +                               bdp->ebd.cbd_esc |= BD_ENET_TX_PINS;
>                 }
>         }
>
> +       /*
> +        * We need the preceding stores to the descriptor to complete
> +        * before updating the status field, which hands it over to the
> +        * hardware.  The corresponding rmb() is "in the hardware".
> +        */
> +       wmb();
> +
>         /* Send it on its way.  Tell FEC it's ready, interrupt when done,
>          * it's the last BD of the frame, and to put the CRC on the end.
>          */
>         status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
>                         | BD_ENET_TX_LAST | BD_ENET_TX_TC);
> -       bdp->cbd_sc = status;
> +       bdp->bd.cbd_sc = status;
>
> -       bdp_pre = fec_enet_get_prevdesc(bdp, fep);
> -       if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
> -           !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
> +       bdp_pre = fec_enet_tx_get(index - 1, fep);
> +       if ((fep->quirks & FEC_QUIRK_ERR006358) &&
> +           !(bdp_pre->bd.cbd_sc & BD_ENET_TX_READY)) {
>                 fep->delay_work.trig_tx = true;
>                 schedule_delayed_work(&(fep->delay_work.delay_work),
>                                         msecs_to_jiffies(1));
>         }
>
> -       /* If this was the last BD in the ring, start at the beginning again. */
> -       bdp = fec_enet_get_nextdesc(bdp, fep);
> -
>         skb_tx_timestamp(skb);
>
> -       fep->cur_tx = bdp;
> +       fep->tx_next = (index + 1) & (fep->tx_ring_size - 1);
>
> -       if (fep->cur_tx == fep->dirty_tx)
> +       if (fep->tx_next == fep->tx_dirty) {
> +
>                 netif_stop_queue(ndev);
> +       }
>
>         /* Trigger transmission start */
>         writel(0, fep->hwp + FEC_X_DES_ACTIVE);
> @@ -446,46 +484,43 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  static void fec_enet_bd_init(struct net_device *dev)
>  {
>         struct fec_enet_private *fep = netdev_priv(dev);
> -       struct bufdesc *bdp;
>         unsigned int i;
>
>         /* Initialize the receive buffer descriptors. */
> -       bdp = fep->rx_bd_base;
>         for (i = 0; i < fep->rx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
>
>                 /* Initialize the BD for every fragment in the page. */
> -               if (bdp->cbd_bufaddr)
> -                       bdp->cbd_sc = BD_ENET_RX_EMPTY;
> +               if (bdp->bd.cbd_bufaddr)
> +                       bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
>                 else
> -                       bdp->cbd_sc = 0;
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> +                       bdp->bd.cbd_sc = 0;
> +               if (i == fep->rx_ring_size - 1)
> +                       bdp->bd.cbd_sc |= BD_SC_WRAP;
>         }
>
> -       /* Set the last buffer to wrap */
> -       bdp = fec_enet_get_prevdesc(bdp, fep);
> -       bdp->cbd_sc |= BD_SC_WRAP;
> -
> -       fep->cur_rx = fep->rx_bd_base;
> +       fep->rx_next = 0;
>
>         /* ...and the same for transmit */
> -       bdp = fep->tx_bd_base;
> -       fep->cur_tx = bdp;
>         for (i = 0; i < fep->tx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
>
>                 /* Initialize the BD for every fragment in the page. */
> -               bdp->cbd_sc = 0;
> -               if (bdp->cbd_bufaddr && fep->tx_skbuff[i]) {
> +               /* Set the last buffer to wrap */
> +               if (i == fep->tx_ring_size - 1)
> +                       bdp->bd.cbd_sc = BD_SC_WRAP;
> +               else
> +                       bdp->bd.cbd_sc = 0;
> +               if (bdp->bd.cbd_bufaddr)
> +                       fec_enet_tx_unmap(&bdp->bd, fep);
> +               if (fep->tx_skbuff[i]) {
>                         dev_kfree_skb_any(fep->tx_skbuff[i]);
>                         fep->tx_skbuff[i] = NULL;
>                 }
> -               bdp->cbd_bufaddr = 0;
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
>         }
>
> -       /* Set the last buffer to wrap */
> -       bdp = fec_enet_get_prevdesc(bdp, fep);
> -       bdp->cbd_sc |= BD_SC_WRAP;
> -       fep->dirty_tx = bdp;
> +       fep->tx_next = 0;
> +       fep->tx_dirty = fep->tx_ring_size - 1;
>  }
>
>  /* This function is called to start or restart the FEC during a link
> @@ -496,8 +531,6 @@ static void
>  fec_restart(struct net_device *ndev, int duplex)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
>         int i;
>         u32 val;
>         u32 temp_mac[2];
> @@ -519,7 +552,7 @@ fec_restart(struct net_device *ndev, int duplex)
>          * enet-mac reset will reset mac address registers too,
>          * so need to reconfigure it.
>          */
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC) {
>                 memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN);
>                 writel(cpu_to_be32(temp_mac[0]), fep->hwp + FEC_ADDR_LOW);
>                 writel(cpu_to_be32(temp_mac[1]), fep->hwp + FEC_ADDR_HIGH);
> @@ -568,6 +601,8 @@ fec_restart(struct net_device *ndev, int duplex)
>  #if !defined(CONFIG_M5272)
>         /* set RX checksum */
>         val = readl(fep->hwp + FEC_RACC);
> +       if (fep->quirks & FEC_QUIRK_RX_SHIFT16)
> +               val |= FEC_RACC_SHIFT16;
>         if (fep->csum_flags & FLAG_RX_CSUM_ENABLED)
>                 val |= FEC_RACC_OPTIONS;
>         else
> @@ -579,7 +614,7 @@ fec_restart(struct net_device *ndev, int duplex)
>          * The phy interface and speed need to get configured
>          * differently on enet-mac.
>          */
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC) {
>                 /* Enable flow control and length check */
>                 rcntl |= 0x40000000 | 0x00000020;
>
> @@ -602,7 +637,7 @@ fec_restart(struct net_device *ndev, int duplex)
>                 }
>         } else {
>  #ifdef FEC_MIIGSK_ENR
> -               if (id_entry->driver_data & FEC_QUIRK_USE_GASKET) {
> +               if (fep->quirks & FEC_QUIRK_USE_GASKET) {
>                         u32 cfgr;
>                         /* disable the gasket and wait */
>                         writel(0, fep->hwp + FEC_MIIGSK_ENR);
> @@ -655,7 +690,7 @@ fec_restart(struct net_device *ndev, int duplex)
>         writel(0, fep->hwp + FEC_HASH_TABLE_LOW);
>  #endif
>
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC) {
>                 /* enable ENET endian swap */
>                 ecntl |= (1 << 8);
>                 /* enable ENET store and forward mode */
> @@ -692,8 +727,6 @@ static void
>  fec_stop(struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
>         u32 rmii_mode = readl(fep->hwp + FEC_R_CNTRL) & (1 << 8);
>
>         /* We cannot expect a graceful transmit stop without link !!! */
> @@ -711,7 +744,7 @@ fec_stop(struct net_device *ndev)
>         writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
>
>         /* We have to keep ENET enabled to have MII interrupt stay working */
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC) {
>                 writel(2, fep->hwp + FEC_ECNTRL);
>                 writel(rmii_mode, fep->hwp + FEC_R_CNTRL);
>         }
> @@ -723,6 +756,8 @@ fec_timeout(struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
>
> +       fec_dump(ndev);
> +
>         ndev->stats.tx_errors++;
>
>         fep->delay_work.timeout = true;
> @@ -751,34 +786,28 @@ static void fec_enet_work(struct work_struct *work)
>  static void
>  fec_enet_tx(struct net_device *ndev)
>  {
> -       struct  fec_enet_private *fep;
> -       struct bufdesc *bdp;
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       union bufdesc_u *bdp;
>         unsigned short status;
>         struct  sk_buff *skb;
> -       int     index = 0;
> -
> -       fep = netdev_priv(ndev);
> -       bdp = fep->dirty_tx;
> +       unsigned index = fep->tx_dirty;
>
> -       /* get next bdp of dirty_tx */
> -       bdp = fec_enet_get_nextdesc(bdp, fep);
> +       do {
> +               index = (index + 1) & (fep->tx_ring_size - 1);
> +               bdp = fec_enet_tx_get(index, fep);
>
> -       while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
> +               status = bdp->bd.cbd_sc;
> +               if (status & BD_ENET_TX_READY)
> +                       break;
>
>                 /* current queue is empty */
> -               if (bdp == fep->cur_tx)
> +               if (index == fep->tx_next)
>                         break;
>
> -               if (fep->bufdesc_ex)
> -                       index = (struct bufdesc_ex *)bdp -
> -                               (struct bufdesc_ex *)fep->tx_bd_base;
> -               else
> -                       index = bdp - fep->tx_bd_base;
> +               fec_enet_tx_unmap(&bdp->bd, fep);
>
>                 skb = fep->tx_skbuff[index];
> -               dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
> -                               DMA_TO_DEVICE);
> -               bdp->cbd_bufaddr = 0;
> +               fep->tx_skbuff[index] = NULL;
>
>                 /* Check for errors. */
>                 if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
> @@ -797,19 +826,18 @@ fec_enet_tx(struct net_device *ndev)
>                                 ndev->stats.tx_carrier_errors++;
>                 } else {
>                         ndev->stats.tx_packets++;
> -                       ndev->stats.tx_bytes += bdp->cbd_datlen;
> +                       ndev->stats.tx_bytes += bdp->bd.cbd_datlen;
>                 }
>
>                 if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
>                         fep->bufdesc_ex) {
>                         struct skb_shared_hwtstamps shhwtstamps;
>                         unsigned long flags;
> -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
>
>                         memset(&shhwtstamps, 0, sizeof(shhwtstamps));
>                         spin_lock_irqsave(&fep->tmreg_lock, flags);
>                         shhwtstamps.hwtstamp = ns_to_ktime(
> -                               timecounter_cyc2time(&fep->tc, ebdp->ts));
> +                               timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
>                         spin_unlock_irqrestore(&fep->tmreg_lock, flags);
>                         skb_tstamp_tx(skb, &shhwtstamps);
>                 }
> @@ -825,45 +853,252 @@ fec_enet_tx(struct net_device *ndev)
>
>                 /* Free the sk buffer associated with this last transmit */
>                 dev_kfree_skb_any(skb);
> -               fep->tx_skbuff[index] = NULL;
> -
> -               fep->dirty_tx = bdp;
> -
> -               /* Update pointer to next buffer descriptor to be transmitted */
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
>
>                 /* Since we have freed up a buffer, the ring is no longer full
>                  */
> -               if (fep->dirty_tx != fep->cur_tx) {
> -                       if (netif_queue_stopped(ndev))
> -                               netif_wake_queue(ndev);
> +               if (netif_queue_stopped(ndev)) {
> +
> +
> +
> +
> +
> +                       netif_wake_queue(ndev);
> +
>                 }
> -       }
> +
> +               fep->tx_dirty = index;
> +       } while (1);

Where does this while (1) loop break?
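For what it's worth, stripped to control flow the reap loop above exits through its two break statements: either at a descriptor the hardware still owns (READY set), or when the walk catches up with tx_next. A minimal model (hypothetical ring, simplified fields) is:

```c
#include <stdint.h>

#define RING_SIZE 8
#define BD_ENET_TX_READY 0x8000

/* Model of the do/while reap loop in fec_enet_tx(): walk forward from
 * tx_dirty, freeing descriptors until we hit one the hardware still
 * owns or the ring is empty.  Returns how many were reaped. */
static unsigned fec_reap(const uint16_t *sc, unsigned tx_dirty,
                         unsigned tx_next)
{
    unsigned index = tx_dirty;
    unsigned reaped = 0;

    do {
        index = (index + 1) & (RING_SIZE - 1);
        if (sc[index] & BD_ENET_TX_READY)
            break;              /* hardware hasn't finished this one */
        if (index == tx_next)
            break;              /* current queue is empty */
        reaped++;               /* descriptor can be unmapped/freed */
    } while (1);

    return reaped;
}
```

The ring-size mask relies on TX_RING_SIZE being a power of two, as the header comment in the patch requires.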

>         return;
>  }
>
>
> -/* During a receive, the cur_rx points to the current incoming buffer.
> +static void
> +fec_enet_receive(struct sk_buff *skb, union bufdesc_u *bdp, struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +
> +       skb->protocol = eth_type_trans(skb, ndev);
> +
> +       /* Get receive timestamp from the skb */
> +       if (fep->hwts_rx_en && fep->bufdesc_ex) {
> +               struct skb_shared_hwtstamps *shhwtstamps =
> +                                                   skb_hwtstamps(skb);
> +               unsigned long flags;
> +
> +               memset(shhwtstamps, 0, sizeof(*shhwtstamps));
> +
> +               spin_lock_irqsave(&fep->tmreg_lock, flags);
> +               shhwtstamps->hwtstamp = ns_to_ktime(
> +                   timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
> +               spin_unlock_irqrestore(&fep->tmreg_lock, flags);
> +       }
> +
> +       if (fep->csum_flags & FLAG_RX_CSUM_ENABLED) {
> +               if (!(bdp->ebd.cbd_esc & FLAG_RX_CSUM_ERROR)) {
> +                       /* don't check it */
> +                       skb->ip_summed = CHECKSUM_UNNECESSARY;
> +               } else {
> +                       skb_checksum_none_assert(skb);
> +               }
> +       }
> +
> +       napi_gro_receive(&fep->napi, skb);
> +}
> +
> +static void noinline
> +fec_enet_receive_copy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       struct sk_buff *skb;
> +       unsigned char *data;
> +       bool vlan_packet_rcvd = false;
> +
> +       /*
> +        * Detect the presence of the VLAN tag, and adjust
> +        * the packet length appropriately.
> +        */
> +       if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX &&
> +           bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) {
> +               pkt_len -= VLAN_HLEN;
> +               vlan_packet_rcvd = true;
> +       }
> +
> +       /* This does 16 byte alignment, exactly what we need. */
> +       skb = netdev_alloc_skb(ndev, pkt_len + NET_IP_ALIGN);
> +       if (unlikely(!skb)) {
> +               ndev->stats.rx_dropped++;
> +               return;
> +       }
> +
> +       dma_sync_single_for_cpu(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
> +                               FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +
> +       data = fep->rx_skbuff[index]->data;
> +
> +#ifndef CONFIG_M5272
> +       /*
> +        * If we have enabled this feature, we need to discard
> +        * the two bytes at the beginning of the packet before
> +        * copying it.
> +        */
> +       if (fep->quirks & FEC_QUIRK_RX_SHIFT16) {
> +               pkt_len -= 2;
> +               data += 2;
> +       }
> +#endif
> +
> +       if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +               swap_buffer(data, pkt_len);
> +
> +       skb_reserve(skb, NET_IP_ALIGN);
> +       skb_put(skb, pkt_len);  /* Make room */
> +
> +       /* If this is a VLAN packet remove the VLAN Tag */
> +       if (vlan_packet_rcvd) {
> +               struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN);
> +
> +               __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
> +                                      ntohs(vlan->h_vlan_TCI));
> +
> +               /* Extract the frame data without the VLAN header. */
> +               skb_copy_to_linear_data(skb, data, 2 * ETH_ALEN);
> +               skb_copy_to_linear_data_offset(skb, 2 * ETH_ALEN,
> +                                              data + 2 * ETH_ALEN + VLAN_HLEN,
> +                                              pkt_len - 2 * ETH_ALEN);
> +       } else {
> +               skb_copy_to_linear_data(skb, data, pkt_len);
> +       }
> +
> +       dma_sync_single_for_device(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
> +                                  FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +
> +       fec_enet_receive(skb, bdp, ndev);
> +}
> +
> +static void noinline
> +fec_enet_receive_nocopy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       struct sk_buff *skb, *skb_new;
> +       unsigned char *data;
> +       dma_addr_t addr;
> +
> +#if 0
> +       skb_new = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
> +       if (!skb_new) {
> +               ndev->stats.rx_dropped++;
> +               return;
> +       }
> +
> +       addr = dma_map_single(&fep->pdev->dev, skb_new->data,
> +                             FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +       if (dma_mapping_error(&fep->pdev->dev, addr)) {
> +               dev_kfree_skb(skb_new);
> +               ndev->stats.rx_dropped++;
> +               return;
> +       }
> +#else
> +       skb_new = NULL;
> +       addr = 0;
> +#endif
> +
> +       /*
> +        * We have the new skb, so proceed to deal with the
> +        * received data.
> +        */
> +       dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
> +                        FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +
> +       skb = fep->rx_skbuff[index];
> +
> +       /* Now substitute in the new skb */
> +       fep->rx_skbuff[index] = skb_new;
> +       bdp->bd.cbd_bufaddr = addr;
> +
> +       /*
> +        * Update the skb length according to the raw packet
> +        * length.  Then remove the two bytes of additional
> +        * padding.
> +        */
> +       skb_put(skb, pkt_len);
> +       data = skb_pull_inline(skb, 2);
> +
> +       if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +               swap_buffer(data, skb->len);
> +
> +       /*
> +        * Now juggle things for the VLAN tag - if the hardware
> +        * flags this as present, we need to read the tag, and
> +        * then shuffle the ethernet addresses up.
> +        */
> +       if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX &&
> +           bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) {
> +               struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN);
> +
> +               __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
> +                                      ntohs(vlan->h_vlan_TCI));
> +
> +               memmove(data + VLAN_HLEN, data, 2 * ETH_ALEN);
> +               skb_pull_inline(skb, VLAN_HLEN);
> +       }
> +
> +       fec_enet_receive(skb, bdp, ndev);
> +}
> +
> +static int
> +fec_enet_refill_ring(unsigned first, unsigned last, struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       unsigned i = first;
> +
> +       do {
> +               union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
> +               struct sk_buff *skb;
> +               dma_addr_t addr;
> +
> +               if (!fep->rx_skbuff[i]) {
> +                       skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
> +                       if (!skb)
> +                               return -ENOMEM;
> +
> +                       addr = dma_map_single(&fep->pdev->dev, skb->data,
> +                                             FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +                       if (dma_mapping_error(&fep->pdev->dev, addr)) {
> +                               dev_kfree_skb(skb);
> +                               return -ENOMEM;
> +                       }
> +
> +                       fep->rx_skbuff[i] = skb;
> +                       bdp->bd.cbd_bufaddr = addr;
> +               }
> +
> +               bdp->bd.cbd_sc = (bdp->bd.cbd_sc & BD_SC_WRAP) |
> +                                BD_ENET_RX_EMPTY;
> +
> +               if (fep->bufdesc_ex) {
> +                       bdp->ebd.cbd_esc = BD_ENET_RX_INT;
> +                       bdp->ebd.cbd_prot = 0;
> +                       bdp->ebd.cbd_bdu = 0;
> +               }
> +               i = (i + 1) & (fep->rx_ring_size - 1);
> +       } while (i != last);
> +
> +       return 0;
> +}
> +
> +/* During a receive, the rx_next points to the current incoming buffer.
>   * When we update through the ring, if the next incoming buffer has
>   * not been given to the system, we just set the empty indicator,
>   * effectively tossing the packet.
>   */
> -static int
> +static int noinline
>  fec_enet_rx(struct net_device *ndev, int budget)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
> -       struct bufdesc *bdp;
>         unsigned short status;
> -       struct  sk_buff *skb;
> -       ushort  pkt_len;
> -       __u8 *data;
> +       unsigned pkt_len;
>         int     pkt_received = 0;
> -       struct  bufdesc_ex *ebdp = NULL;
> -       bool    vlan_packet_rcvd = false;
> -       u16     vlan_tag;
> -       int     index = 0;
> +       unsigned index = fep->rx_next;
>
>  #ifdef CONFIG_M532x
>         flush_cache_all();
> @@ -872,12 +1107,16 @@ fec_enet_rx(struct net_device *ndev, int budget)
>         /* First, grab all of the stats for the incoming packet.
>          * These get messed up if we get called due to a busy condition.
>          */
> -       bdp = fep->cur_rx;
> +       do {
> +               union bufdesc_u *bdp = fec_enet_rx_get(index, fep);
>
> -       while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
> +               status = bdp->bd.cbd_sc;
> +               if (status & BD_ENET_RX_EMPTY)
> +                       break;
>
>                 if (pkt_received >= budget)
>                         break;
> +
>                 pkt_received++;
>
>                 /* Since we have allocated space to hold a complete frame,
> @@ -917,124 +1156,33 @@ fec_enet_rx(struct net_device *ndev, int budget)
>
>                 /* Process the incoming frame. */
>                 ndev->stats.rx_packets++;
> -               pkt_len = bdp->cbd_datlen;
> -               ndev->stats.rx_bytes += pkt_len;
> -
> -               if (fep->bufdesc_ex)
> -                       index = (struct bufdesc_ex *)bdp -
> -                               (struct bufdesc_ex *)fep->rx_bd_base;
> -               else
> -                       index = bdp - fep->rx_bd_base;
> -               data = fep->rx_skbuff[index]->data;
> -               dma_sync_single_for_cpu(&fep->pdev->dev, bdp->cbd_bufaddr,
> -                                       FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> -
> -               if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
> -                       swap_buffer(data, pkt_len);
> -
> -               /* Extract the enhanced buffer descriptor */
> -               ebdp = NULL;
> -               if (fep->bufdesc_ex)
> -                       ebdp = (struct bufdesc_ex *)bdp;
> -
> -               /* If this is a VLAN packet remove the VLAN Tag */
> -               vlan_packet_rcvd = false;
> -               if ((ndev->features & NETIF_F_HW_VLAN_CTAG_RX) &&
> -                   fep->bufdesc_ex && (ebdp->cbd_esc & BD_ENET_RX_VLAN)) {
> -                       /* Push and remove the vlan tag */
> -                       struct vlan_hdr *vlan_header =
> -                                       (struct vlan_hdr *) (data + ETH_HLEN);
> -                       vlan_tag = ntohs(vlan_header->h_vlan_TCI);
> -                       pkt_len -= VLAN_HLEN;
> -
> -                       vlan_packet_rcvd = true;
> -               }
>
> -               /* This does 16 byte alignment, exactly what we need.
> -                * The packet length includes FCS, but we don't want to
> -                * include that when passing upstream as it messes up
> -                * bridging applications.
> +               /*
> +                * The packet length includes FCS, but we don't want
> +                * to include that when passing upstream as it messes
> +                * up bridging applications.
>                  */
> -               skb = netdev_alloc_skb(ndev, pkt_len - 4 + NET_IP_ALIGN);
> +               pkt_len = bdp->bd.cbd_datlen - 4;
> +               ndev->stats.rx_bytes += pkt_len;
>
> -               if (unlikely(!skb)) {
> -                       ndev->stats.rx_dropped++;
> +               if (fec_enet_rx_zerocopy(fep, pkt_len)) {
> +                       fec_enet_receive_nocopy(pkt_len, index, bdp, ndev);
>                 } else {
> -                       int payload_offset = (2 * ETH_ALEN);
> -                       skb_reserve(skb, NET_IP_ALIGN);
> -                       skb_put(skb, pkt_len - 4);      /* Make room */
> -
> -                       /* Extract the frame data without the VLAN header. */
> -                       skb_copy_to_linear_data(skb, data, (2 * ETH_ALEN));
> -                       if (vlan_packet_rcvd)
> -                               payload_offset = (2 * ETH_ALEN) + VLAN_HLEN;
> -                       skb_copy_to_linear_data_offset(skb, (2 * ETH_ALEN),
> -                                                      data + payload_offset,
> -                                                      pkt_len - 4 - (2 * ETH_ALEN));
> -
> -                       skb->protocol = eth_type_trans(skb, ndev);
> -
> -                       /* Get receive timestamp from the skb */
> -                       if (fep->hwts_rx_en && fep->bufdesc_ex) {
> -                               struct skb_shared_hwtstamps *shhwtstamps =
> -                                                           skb_hwtstamps(skb);
> -                               unsigned long flags;
> -
> -                               memset(shhwtstamps, 0, sizeof(*shhwtstamps));
> -
> -                               spin_lock_irqsave(&fep->tmreg_lock, flags);
> -                               shhwtstamps->hwtstamp = ns_to_ktime(
> -                                   timecounter_cyc2time(&fep->tc, ebdp->ts));
> -                               spin_unlock_irqrestore(&fep->tmreg_lock, flags);
> -                       }
> -
> -                       if (fep->bufdesc_ex &&
> -                           (fep->csum_flags & FLAG_RX_CSUM_ENABLED)) {
> -                               if (!(ebdp->cbd_esc & FLAG_RX_CSUM_ERROR)) {
> -                                       /* don't check it */
> -                                       skb->ip_summed = CHECKSUM_UNNECESSARY;
> -                               } else {
> -                                       skb_checksum_none_assert(skb);
> -                               }
> -                       }
> -
> -                       /* Handle received VLAN packets */
> -                       if (vlan_packet_rcvd)
> -                               __vlan_hwaccel_put_tag(skb,
> -                                                      htons(ETH_P_8021Q),
> -                                                      vlan_tag);
> -
> -                       napi_gro_receive(&fep->napi, skb);
> +                       fec_enet_receive_copy(pkt_len, index, bdp, ndev);
>                 }
>
> -               dma_sync_single_for_device(&fep->pdev->dev, bdp->cbd_bufaddr,
> -                                       FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
>  rx_processing_done:
> -               /* Clear the status flags for this buffer */
> -               status &= ~BD_ENET_RX_STATS;
> -
> -               /* Mark the buffer empty */
> -               status |= BD_ENET_RX_EMPTY;
> -               bdp->cbd_sc = status;
> -
> -               if (fep->bufdesc_ex) {
> -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -
> -                       ebdp->cbd_esc = BD_ENET_RX_INT;
> -                       ebdp->cbd_prot = 0;
> -                       ebdp->cbd_bdu = 0;
> -               }
> -
> -               /* Update BD pointer to next entry */
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> +               index = (index + 1) & (fep->rx_ring_size - 1);
> +               if (index == fep->rx_next)
> +                       break;
> +       } while (1);
>
> -               /* Doing this here will keep the FEC running while we process
> -                * incoming frames.  On a heavily loaded network, we should be
> -                * able to keep up at the expense of system resources.
> -                */
> +       if (pkt_received) {
> +               fec_enet_refill_ring(fep->rx_next, index, ndev);
>                 writel(0, fep->hwp + FEC_R_DES_ACTIVE);
>         }
> -       fep->cur_rx = bdp;
> +
> +       fep->rx_next = index;
>
>         return pkt_received;
>  }
> @@ -1044,29 +1192,25 @@ fec_enet_interrupt(int irq, void *dev_id)
>  {
>         struct net_device *ndev = dev_id;
>         struct fec_enet_private *fep = netdev_priv(ndev);
> +       const unsigned napi_mask = FEC_ENET_RXF | FEC_ENET_TXF;
>         uint int_events;
>         irqreturn_t ret = IRQ_NONE;
>
> -       do {
> -               int_events = readl(fep->hwp + FEC_IEVENT);
> -               writel(int_events, fep->hwp + FEC_IEVENT);
> +       int_events = readl(fep->hwp + FEC_IEVENT);
> +       writel(int_events & ~napi_mask, fep->hwp + FEC_IEVENT);
>
> -               if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
> -                       ret = IRQ_HANDLED;
> +       if (int_events & napi_mask) {
> +               ret = IRQ_HANDLED;
>
> -                       /* Disable the RX interrupt */
> -                       if (napi_schedule_prep(&fep->napi)) {
> -                               writel(FEC_RX_DISABLED_IMASK,
> -                                       fep->hwp + FEC_IMASK);
> -                               __napi_schedule(&fep->napi);
> -                       }
> -               }
> +               /* Disable the NAPI interrupts */
> +               writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
> +               napi_schedule(&fep->napi);
> +       }
>
> -               if (int_events & FEC_ENET_MII) {
> -                       ret = IRQ_HANDLED;
> -                       complete(&fep->mdio_done);
> -               }
> -       } while (int_events);
> +       if (int_events & FEC_ENET_MII) {
> +               ret = IRQ_HANDLED;
> +               complete(&fep->mdio_done);
> +       }
>
>         return ret;
>  }
> @@ -1074,10 +1218,24 @@ fec_enet_interrupt(int irq, void *dev_id)
>  static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
>  {
>         struct net_device *ndev = napi->dev;
> -       int pkts = fec_enet_rx(ndev, budget);
>         struct fec_enet_private *fep = netdev_priv(ndev);
> +       unsigned status;
> +       int pkts = 0;
> +
> +       status = readl(fep->hwp + FEC_IEVENT) & (FEC_ENET_RXF | FEC_ENET_TXF);
> +       if (status) {
> +               /*
> +                * Clear any pending transmit or receive interrupts before
> +                * processing the rings to avoid racing with the hardware.
> +                */
> +               writel(status, fep->hwp + FEC_IEVENT);
>
> -       fec_enet_tx(ndev);
> +               if (status & FEC_ENET_RXF)
> +                       pkts = fec_enet_rx(ndev, budget);
> +
> +               if (status & FEC_ENET_TXF)
> +                       fec_enet_tx(ndev);
> +       }
>
>         if (pkts < budget) {
>                 napi_complete(napi);
> @@ -1263,8 +1421,6 @@ static int fec_enet_mdio_reset(struct mii_bus *bus)
>  static int fec_enet_mii_probe(struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
>         struct phy_device *phy_dev = NULL;
>         char mdio_bus_id[MII_BUS_ID_SIZE];
>         char phy_name[MII_BUS_ID_SIZE + 3];
> @@ -1302,7 +1458,7 @@ static int fec_enet_mii_probe(struct net_device *ndev)
>         }
>
>         /* mask with MAC supported features */
> -       if (id_entry->driver_data & FEC_QUIRK_HAS_GBIT) {
> +       if (fep->quirks & FEC_QUIRK_HAS_GBIT) {
>                 phy_dev->supported &= PHY_GBIT_FEATURES;
>  #if !defined(CONFIG_M5272)
>                 phy_dev->supported |= SUPPORTED_Pause;
> @@ -1329,8 +1485,6 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>         static struct mii_bus *fec0_mii_bus;
>         struct net_device *ndev = platform_get_drvdata(pdev);
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
>         int err = -ENXIO, i;
>
>         /*
> @@ -1349,7 +1503,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>          * mdio interface in board design, and need to be configured by
>          * fec0 mii_bus.
>          */
> -       if ((id_entry->driver_data & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
> +       if ((fep->quirks & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
>                 /* fec1 uses fec0 mii_bus */
>                 if (mii_cnt && fec0_mii_bus) {
>                         fep->mii_bus = fec0_mii_bus;
> @@ -1370,7 +1524,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>          * document.
>          */
>         fep->phy_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ahb), 5000000);
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC)
>                 fep->phy_speed--;
>         fep->phy_speed <<= 1;
>         writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED);
> @@ -1405,7 +1559,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>         mii_cnt++;
>
>         /* save fec0 mii_bus */
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC)
>                 fec0_mii_bus = fep->mii_bus;
>
>         return 0;
> @@ -1694,23 +1848,24 @@ static void fec_enet_free_buffers(struct net_device *ndev)
>         struct fec_enet_private *fep = netdev_priv(ndev);
>         unsigned int i;
>         struct sk_buff *skb;
> -       struct bufdesc  *bdp;
>
> -       bdp = fep->rx_bd_base;
>         for (i = 0; i < fep->rx_ring_size; i++) {
> -               skb = fep->rx_skbuff[i];
> +               union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
>
> -               if (bdp->cbd_bufaddr)
> -                       dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
> +               skb = fep->rx_skbuff[i];
> +               if (skb) {
> +                       dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
>                                         FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> -               if (skb)
>                         dev_kfree_skb(skb);
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> +               }
>         }
>
> -       bdp = fep->tx_bd_base;
> -       for (i = 0; i < fep->tx_ring_size; i++)
> +       for (i = 0; i < fep->tx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
> +               if (bdp->bd.cbd_bufaddr)
> +                       fec_enet_tx_unmap(&bdp->bd, fep);
>                 kfree(fep->tx_bounce[i]);
> +       }
>  }
>
>  static int fec_enet_alloc_buffers(struct net_device *ndev)
> @@ -1718,58 +1873,54 @@ static int fec_enet_alloc_buffers(struct net_device *ndev)
>         struct fec_enet_private *fep = netdev_priv(ndev);
>         unsigned int i;
>         struct sk_buff *skb;
> -       struct bufdesc  *bdp;
>
> -       bdp = fep->rx_bd_base;
>         for (i = 0; i < fep->rx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
> +               dma_addr_t addr;
> +
>                 skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
>                 if (!skb) {
>                         fec_enet_free_buffers(ndev);
>                         return -ENOMEM;
>                 }
> -               fep->rx_skbuff[i] = skb;
>
> -               bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, skb->data,
> -                               FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> -               if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
> +               addr = dma_map_single(&fep->pdev->dev, skb->data,
> +                                     FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +               if (dma_mapping_error(&fep->pdev->dev, addr)) {
> +                       dev_kfree_skb(skb);
>                         fec_enet_free_buffers(ndev);
>                         if (net_ratelimit())
>                                 netdev_err(ndev, "Rx DMA memory map failed\n");
>                         return -ENOMEM;
>                 }
> -               bdp->cbd_sc = BD_ENET_RX_EMPTY;
>
> -               if (fep->bufdesc_ex) {
> -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -                       ebdp->cbd_esc = BD_ENET_RX_INT;
> -               }
> +               fep->rx_skbuff[i] = skb;
>
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> -       }
> +               bdp->bd.cbd_bufaddr = addr;
> +               bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
> +               /* Set the last buffer to wrap. */
> +               if (i == fep->rx_ring_size - 1)
> +                       bdp->bd.cbd_sc |= BD_SC_WRAP;
>
> -       /* Set the last buffer to wrap. */
> -       bdp = fec_enet_get_prevdesc(bdp, fep);
> -       bdp->cbd_sc |= BD_SC_WRAP;
> +               if (fep->bufdesc_ex)
> +                       bdp->ebd.cbd_esc = BD_ENET_RX_INT;
> +       }
>
> -       bdp = fep->tx_bd_base;
>         for (i = 0; i < fep->tx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
>                 fep->tx_bounce[i] = kmalloc(FEC_ENET_TX_FRSIZE, GFP_KERNEL);
>
> -               bdp->cbd_sc = 0;
> -               bdp->cbd_bufaddr = 0;
> -
> -               if (fep->bufdesc_ex) {
> -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -                       ebdp->cbd_esc = BD_ENET_TX_INT;
> -               }
> +               /* Set the last buffer to wrap. */
> +               if (i == fep->tx_ring_size - 1)
> +                       bdp->bd.cbd_sc = BD_SC_WRAP;
> +               else
> +                       bdp->bd.cbd_sc = 0;
> +               bdp->bd.cbd_bufaddr = 0;
>
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> +               if (fep->bufdesc_ex)
> +                       bdp->ebd.cbd_esc = BD_ENET_TX_INT;
>         }
>
> -       /* Set the last buffer to wrap. */
> -       bdp = fec_enet_get_prevdesc(bdp, fep);
> -       bdp->cbd_sc |= BD_SC_WRAP;
> -
>         return 0;
>  }
>
> @@ -1990,9 +2141,7 @@ static const struct net_device_ops fec_netdev_ops = {
>  static int fec_enet_init(struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
> -       struct bufdesc *cbd_base;
> +       union bufdesc_u *cbd_base;
>
>         /* Allocate memory for buffer descriptors. */
>         cbd_base = dma_alloc_coherent(NULL, PAGE_SIZE, &fep->bd_dma,
> @@ -2014,10 +2163,11 @@ static int fec_enet_init(struct net_device *ndev)
>         /* Set receive and transmit descriptor base. */
>         fep->rx_bd_base = cbd_base;
>         if (fep->bufdesc_ex)
> -               fep->tx_bd_base = (struct bufdesc *)
> -                       (((struct bufdesc_ex *)cbd_base) + fep->rx_ring_size);
> +               fep->tx_bd_base = (union bufdesc_u *)
> +                       (&cbd_base->ebd + fep->rx_ring_size);
>         else
> -               fep->tx_bd_base = cbd_base + fep->rx_ring_size;
> +               fep->tx_bd_base = (union bufdesc_u *)
> +                       (&cbd_base->bd + fep->rx_ring_size);
>
>         /* The FEC Ethernet specific entries in the device structure */
>         ndev->watchdog_timeo = TX_TIMEOUT;
> @@ -2027,19 +2177,24 @@ static int fec_enet_init(struct net_device *ndev)
>         writel(FEC_RX_DISABLED_IMASK, fep->hwp + FEC_IMASK);
>         netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT);
>
> -       if (id_entry->driver_data & FEC_QUIRK_HAS_VLAN) {
> -               /* enable hw VLAN support */
> -               ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
> -               ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX;
> -       }
> +       if (fep->bufdesc_ex) {
> +               /* Features which require the enhanced buffer descriptors */
> +               netdev_features_t features = 0;
>
> -       if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
> -               /* enable hw accelerator */
> -               ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
> -                               | NETIF_F_RXCSUM);
> -               ndev->hw_features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
> -                               | NETIF_F_RXCSUM);
> -               fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
> +               if (fep->quirks & FEC_QUIRK_HAS_VLAN) {
> +                       /* enable hw VLAN support */
> +                       features |= NETIF_F_HW_VLAN_CTAG_RX;
> +               }
> +
> +               if (fep->quirks & FEC_QUIRK_HAS_CSUM) {
> +                       /* enable hw accelerator */
> +                       features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> +                                   NETIF_F_RXCSUM;
> +                       fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
> +               }
> +
> +               ndev->hw_features |= features;
> +               ndev->features |= features;
>         }
>
>         fec_restart(ndev, 0);
> @@ -2110,13 +2265,6 @@ fec_probe(struct platform_device *pdev)
>         /* setup board info structure */
>         fep = netdev_priv(ndev);
>
> -#if !defined(CONFIG_M5272)
> -       /* default enable pause frame auto negotiation */
> -       if (pdev->id_entry &&
> -           (pdev->id_entry->driver_data & FEC_QUIRK_HAS_GBIT))
> -               fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
> -#endif
> -
>         r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
>         fep->hwp = devm_ioremap_resource(&pdev->dev, r);
>         if (IS_ERR(fep->hwp)) {
> @@ -2126,6 +2274,14 @@ fec_probe(struct platform_device *pdev)
>
>         fep->pdev = pdev;
>         fep->dev_id = dev_id++;
> +       if (pdev->id_entry)
> +               fep->quirks = pdev->id_entry->driver_data;
> +
> +#if !defined(CONFIG_M5272)
> +       /* default enable pause frame auto negotiation */
> +       if (fep->quirks & FEC_QUIRK_HAS_GBIT)
> +               fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
> +#endif
>
>         fep->bufdesc_ex = 0;
>
> @@ -2160,8 +2316,7 @@ fec_probe(struct platform_device *pdev)
>                 fep->clk_enet_out = NULL;
>
>         fep->clk_ptp = devm_clk_get(&pdev->dev, "ptp");
> -       fep->bufdesc_ex =
> -               pdev->id_entry->driver_data & FEC_QUIRK_HAS_BUFDESC_EX;
> +       fep->bufdesc_ex = fep->quirks & FEC_QUIRK_HAS_BUFDESC_EX;
>         if (IS_ERR(fep->clk_ptp)) {
>                 fep->clk_ptp = NULL;
>                 fep->bufdesc_ex = 0;
>
>
> --
> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
> improving, and getting towards what was expected from it.
>
Bernd Faust March 21, 2014, 4:37 p.m. UTC | #10
I also tried the patch, but unfortunately it doesn't work for me either.
I get the same error as Robert Daniels. My target is an i.MX6 (Cortex-A9,
single core) running Linux kernel version 3.10.9-rt5+.

Test done:
  Start the iperf server on the target:
    $ iperf -s -u &

  Let the iperf client run for 10 minutes on the host:
    $ iperf -c <ip_target> -u -b 100m -t 600 -l 256

  Also start a ping from host to target, to easily detect when the error occurs:
    $ ping <ip_target>


Without the patch applied I see the following behavior:
After some time the ping from host to target stopped.
As soon as the iperf run had finished, I started a ping from target to
host. Doing that brought the network back to life: the ping from host
to target resumed.

With the patch applied the following happens:
After some amount of time the ping stopped, though I saw large ping
times (up to 9000 ms) before it did.
After iperf finished, the connection was dead, just as without the patch.
The difference is that the ping from host to target started reporting
"Destination Host Unreachable" once iperf had finished, and the ping
from target to host did not bring the connection back to life: it
reported 100% packet loss.

dmesg output on target shows the following:

------------[ cut here ]------------
WARNING: at /home/bfa/linux/net/sched/sch_generic.c:255
dev_watchdog+0x380/0x390()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G           O 3.10.9-rt5+ #3
Backtrace:
[<80012cf0>] (dump_backtrace+0x0/0x118) from [<800130f0>] (show_stack+0x20/0x24)
 r6:000000ff r5:8046caf8 r4:9f8d1d80 r3:00000000
[<800130d0>] (show_stack+0x0/0x24) from [<8054c9f0>] (dump_stack+0x24/0x28)
[<8054c9cc>] (dump_stack+0x0/0x28) from [<80020d84>]
(warn_slowpath_common+0x64/0x78)
[<80020d20>] (warn_slowpath_common+0x0/0x78) from [<80020e54>]
(warn_slowpath_fmt+0x40/0x48)
 r8:8076ea10 r7:8072c2c0 r6:00000000 r5:9f837214 r4:9f837000
r3:00000009
[<80020e14>] (warn_slowpath_fmt+0x0/0x48) from [<8046caf8>]
(dev_watchdog+0x380/0x390)
 r3:9f837000 r2:8069ab98
[<8046c778>] (dev_watchdog+0x0/0x390) from [<800326c8>]
(call_timer_fn+0x54/0x1ac)
[<80032674>] (call_timer_fn+0x0/0x1ac) from [<80032a58>]
(run_timer_softirq+0x238/0x2ec)
[<80032820>] (run_timer_softirq+0x0/0x2ec) from [<8002a180>]
(handle_softirq+0x84/0x1e8)
[<8002a0fc>] (handle_softirq+0x0/0x1e8) from [<8002a344>]
(do_single_softirq+0x60/0x90)
[<8002a2e4>] (do_single_softirq+0x0/0x90) from [<8002a4e4>]
(do_current_softirqs+0x170/0x1a8)
 r7:80735620 r6:807b2440 r5:9f8d0000 r4:00000001
[<8002a374>] (do_current_softirqs+0x0/0x1a8) from [<8002a55c>]
(run_ksoftirqd+0x40/0x60)
[<8002a51c>] (run_ksoftirqd+0x0/0x60) from [<800516a8>]
(smpboot_thread_fn+0x23c/0x2b0)
 r5:9f8d0000 r4:9f827540
[<8005146c>] (smpboot_thread_fn+0x0/0x2b0) from [<80047684>] (kthread+0xb4/0xb8)
[<800475d0>] (kthread+0x0/0xb8) from [<8000edd8>] (ret_from_fork+0x14/0x20)
 r7:00000000 r6:00000000 r5:800475d0 r4:9f8c3e5c
---[ end trace 0000000000000002 ]---
fec 2188000.ethernet eth0: TX ring dump
Nr    SC     addr       len  SKB
 0    0x1c00 0x2e853000   98 9e8b6000
 1    0x1c00 0x2e853800   98 9e9bb6c0
 2    0x1c00 0x2e9c4000   98 9e841a80
 3    0x1c00 0x2e9c4800   98 9e841000
 4    0x1c00 0x2e9c5000   98 9e8c3480
 5    0x1c00 0x2e9c5800   98 9e9bb540
 6    0x1c00 0x2e9c6000   98 9e8b4f00
 7    0x1c00 0x2e9c6800   98 9e941e40
 8    0x1c00 0x2e9c7000   98 9e941300
 9    0x1c00 0x2e9c7800   98 9e941780
10    0x1c00 0x2e9cc000   98 9e8b4d80
11    0x1c00 0x2e9cc800   98 9e8b4b40
12    0x1c00 0x2e9cd000   98 9e8b4a80
13    0x1c00 0x2e9cd800   98 9e8b4600
14    0x1c00 0x2e9ce000   98 9e941600
15    0x1c00 0x2e9ce800   98 9e8b4540
16    0x1c00 0x2e9cf000   98 9e8b4180
17    0x1c00 0x2e9cf800   98 9e9419c0
18    0x1c00 0x2e9f8000   98 9e8b4300
19    0x1c00 0x2e9f8800   98 9e941840
20    0x1c00 0x2e9f9000   98 9e8b4cc0
21    0x1c00 0x2e9f9800   98 9e8b49c0
22    0x1c00 0x2e9fa000   98 9e8416c0
23    0x1c00 0x2e9fa800   98 9e8c1b40
24    0x1c00 0x2e9fb000   98 9e9ad900
25    0x1c00 0x2e9fb800   98 9e9ad300
26    0x1c00 0x2e9fc000   98 9e9ad840
27    0x1c00 0x2e9fc800   98 9e9ad000
28    0x1c00 0x2e9fd000   98 9e941b40
29    0x1c00 0x2e9fd800   98 9e8b6240
30    0x1c00 0x2e9fe000   98 9e941f00
31    0x1c00 0x2e9fe800   98 9e941000
32    0x1c00 0x2e9ff000   98 9e941a80
33    0x1c00 0x2e9ff800   98 9e8b4e40
34    0x1c00 0x2ea00000   98 9e9ad600
35    0x1c00 0x2ea00800   98 9e8413c0
36    0x1c00 0x2ea01000   98 9e841c00
37    0x1c00 0x2ea01800   98 9e8b6300
38    0x1c00 0x2ea02000   98 9e8b6c00
39    0x1c00 0x2ea02800   98 9e8b40c0
40    0x1c00 0x2ea03000   98 9e841e40
41    0x1c00 0x2ea03800   98 9e8c1600
42    0x1c00 0x2ea04000   98 9e8c1c00
43    0x1c00 0x2ea04800   98 9e9ad6c0
44    0x1c00 0x2ea05000   98 9e9bb0c0
45    0x1c00 0x2ea05800   98 9e9ad480
46    0x1c00 0x2ea06000   98 9e841900
47    0x1c00 0x2ea06800   98 9e8b4240
48    0x1c00 0x2ea07000   98 9e8b4000
49    0x1c00 0x2ea07800   98 9e8b4840
50    0x1c00 0x2ea08000   98 9e9ad240
51    0x1c00 0x2ea08800   98 9e9b19c0
52    0x1c00 0x2ea09000   98 9e941240
53    0x1c00 0x2ea09800   98 9e841600
54    0x1c00 0x2ea0a000   98 9e9bb780
55    0x1c00 0x2ea0a800   98 9e9b1480
56    0x1c00 0x2ea0b000   98 9e9bb600
57    0x1c00 0x2ea0b800   98 9e9b13c0
58    0x1c00 0x2ea0c000   98 9e9ada80
59    0x1c00 0x2ea0c800   98 9e9ad0c0
60 SH 0x1c00 0x00000000   98   (null)
61    0x9c00 0x2ea0d800   98 9e841b40
62    0x1c00 0x2ea0e000   98 9e9bb240
63    0x3c00 0x2ea0e800   98 9e8c3f00


Both situations are very reproducible.
If you have suggestions on what I can do or test, please let me know.

regards,
Bernd Faust

On 21 March 2014 02:43, Fabio Estevam <festevam@gmail.com> wrote:
> On Thu, Mar 20, 2014 at 10:36 PM, Fabio Estevam <festevam@gmail.com> wrote:
>
>> Robert's tests were made on a mx53 (single CortexA9), and its cache
>> controller is not the L310.
>
> Oops, I meant Cortex-A8.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Russell King - ARM Linux March 21, 2014, 5:12 p.m. UTC | #11
On Fri, Mar 21, 2014 at 10:50:41AM -0500, Frank Li wrote:
> On Wed, Mar 19, 2014 at 5:51 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > 2. The ARM can re-order writes.  The writes it can re-order are:
> >
> >         bdp->cbd_bufaddr
> >         ebdp->cbd_bdu
> >         ebdp->cbd_esc
> >         ebdp->cbd_sc
> >
> > Hence, it's entirely possible for the FEC to see the updated descriptor
> > status before the rest of the descriptor has been written.  What's missing
> > is a barrier between the descriptor writes, and the final write to
> > bdp->cbd_sc.
> 
> Possible. I also considered this when debugging the other issues.
> Only bdp->bufaddr and bdp->buflen are important.
> The hardware doesn't care about bdu.
> esc is the same for each BD.  Once software has gone through the BD
> ring once, esc will not change again, even though we write it.

If you assume that all you're sending is traffic which needs the headers
checksummed (in other words, only IP traffic and nothing else) then you're
correct.  Reality is that other traffic gets included, such as ARPs which
have skb->ip_summed = CHECKSUM_NONE.

In any case, you are wrong about cbd_esc.  For this code:

			ebdp->cbd_esc = BD_ENET_TX_INT;
 
			/* Enable protocol checksum flags
			 * We do not bother with the IP Checksum bits as they
			 * are done by the kernel
			 */
			if (skb->ip_summed == CHECKSUM_PARTIAL)
				ebdp->cbd_esc |= BD_ENET_TX_PINS;

this is the assembly which the compiler will generate:

     820:       e3a03101        mov     r3, #1073741824 ; 0x40000000
     824:       e5863008        str     r3, [r6, #8]
     828:       e5d53064        ldrb    r3, [r5, #100]  ; 0x64
     82c:       e203300c        and     r3, r3, #12
     830:       e353000c        cmp     r3, #12
     834:       03a03205        moveq   r3, #1342177280 ; 0x50000000
     838:       05863008        streq   r3, [r6, #8]

That's a write to cbd_esc with BD_ENET_TX_INT set and BD_ENET_TX_PINS
clear, followed by another write with both set if the packet has been
partially checksummed.

Depending on the CPU and timing, that could be visible to the device.
Thankfully, fb8ef788680 fixed it for CPUs which are strongly ordered.
However, with a weakly ordered CPU, all it would take is an interrupt
between those two writes to make them visible as separate writes, and
without the following barrier, the order of cbd_esc and cbd_sc
becoming visible is uncertain.

> But if sc is written before bufaddr and buflen, there will be an issue.
> Adding a memory barrier before the write to ebdp->cbd_sc is better.
> 
> I also wanted to add it before, but I have not found a test case to
> verify that it is necessary.

In these situations, knowledge and theory are better than test-and-verify,
especially when we're talking about trying to hit a small window with
timings.

We know ARMv6 and ARMv7 are weakly ordered.  We know we see problems
with transmit on these devices.  We know that the specification calls
for the TX ready bit to be set last.  We know on these CPUs that
writes can be reordered.  Therefore... a barrier is definitely
required.

> > The result is the handler is entered, FEC_IEVENT contains TXF and MII
> > events.  Both these events are cleared down, (and thus no longer exist
> > as interrupt-causing events.)  napi_schedule_prep() returns false as
> > the NAPI rx function is still running, and doesn't mark it for a re-run.
> > We then do the MII interrupt.  Loop again, and int_events is zero,
> > we exit.
> >
> > Meanwhile, the NAPI rx function calls napi_complete() and re-enables
> > the receive interrupt.  If you're unlucky enough that the RX ring is
> > also full... no RXF interrupt.  So no further interrupts except maybe
> > MII interrupts.
> >
> > NAPI never gets scheduled.  RX ring never gets emptied.  TX ring never
> > gets reaped.  The result is a timeout with a completely full TX ring.
> 
> Do you see RX ring full?

I haven't added any debug for the RX ring, so I don't know for certain.
What I do know is that I've had the situation where the TX ring is
completely full of packets which have been sent and the TX ring hasn't
been reaped, which leads to the timeout, and that is down to the NAPI
functions not being scheduled.

When that happens, the three sets of flood pings thumping away at the
iMX6Q don't get any response until the timeout occurs.

In that scenario, if a packet were to be received (and there's many
packets hitting the interface) then it _would_ trigger reaping of the
TX ring.

> > +       do {
> > +               index = (index + 1) & (fep->tx_ring_size - 1);
> > +               bdp = fec_enet_tx_get(index, fep);
> >
> > -       while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
> > +               status = bdp->bd.cbd_sc;
> > +               if (status & BD_ENET_TX_READY)
> > +                       break;

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In answer to your question below, here.

> >
> >                 /* current queue is empty */
> > -               if (bdp == fep->cur_tx)
> > +               if (index == fep->tx_next)
> >                         break;

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and here.

> >
> > -               if (fep->bufdesc_ex)
> > -                       index = (struct bufdesc_ex *)bdp -
> > -                               (struct bufdesc_ex *)fep->tx_bd_base;
> > -               else
> > -                       index = bdp - fep->tx_bd_base;
> > +               fec_enet_tx_unmap(&bdp->bd, fep);
> >
> >                 skb = fep->tx_skbuff[index];
> > -               dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
> > -                               DMA_TO_DEVICE);
> > -               bdp->cbd_bufaddr = 0;
> > +               fep->tx_skbuff[index] = NULL;
> >
> >                 /* Check for errors. */
> >                 if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
> > @@ -797,19 +826,18 @@ fec_enet_tx(struct net_device *ndev)
> >                                 ndev->stats.tx_carrier_errors++;
> >                 } else {
> >                         ndev->stats.tx_packets++;
> > -                       ndev->stats.tx_bytes += bdp->cbd_datlen;
> > +                       ndev->stats.tx_bytes += bdp->bd.cbd_datlen;
> >                 }
> >
> >                 if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
> >                         fep->bufdesc_ex) {
> >                         struct skb_shared_hwtstamps shhwtstamps;
> >                         unsigned long flags;
> > -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> >
> >                         memset(&shhwtstamps, 0, sizeof(shhwtstamps));
> >                         spin_lock_irqsave(&fep->tmreg_lock, flags);
> >                         shhwtstamps.hwtstamp = ns_to_ktime(
> > -                               timecounter_cyc2time(&fep->tc, ebdp->ts));
> > +                               timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
> >                         spin_unlock_irqrestore(&fep->tmreg_lock, flags);
> >                         skb_tstamp_tx(skb, &shhwtstamps);
> >                 }
> > @@ -825,45 +853,252 @@ fec_enet_tx(struct net_device *ndev)
> >
> >                 /* Free the sk buffer associated with this last transmit */
> >                 dev_kfree_skb_any(skb);
> > -               fep->tx_skbuff[index] = NULL;
> > -
> > -               fep->dirty_tx = bdp;
> > -
> > -               /* Update pointer to next buffer descriptor to be transmitted */
> > -               bdp = fec_enet_get_nextdesc(bdp, fep);
> >
> >                 /* Since we have freed up a buffer, the ring is no longer full
> >                  */
> > -               if (fep->dirty_tx != fep->cur_tx) {
> > -                       if (netif_queue_stopped(ndev))
> > -                               netif_wake_queue(ndev);
> > +               if (netif_queue_stopped(ndev)) {
> > +
> > +
> > +
> > +
> > +
> > +                       netif_wake_queue(ndev);
> > +
> >                 }
> > -       }
> > +
> > +               fep->tx_dirty = index;
> > +       } while (1);
> 
> Where does this while (1) loop break?
Russell King - ARM Linux March 21, 2014, 5:32 p.m. UTC | #12
On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
> I also tried the patch, but unfortunately it doesn't work for me either.
> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
> single core) running Linux Kernel version 3.10.9-rt5+

Hmm, rt kernels.  Does this happen without the rt patches applied?

> Test done:
>   Start iperf server on target
>     $ iperf -s -u &
> 
>    Let iperf client run for 10 minutes on the host
>     $ iperf -c <ip_target> -u -b 100m -t 600 -l 256
> 
>   Also start a ping from host to target, to easily detect when the error
> occurs
>     $ ping <ip_target>

Right, I've just done this here, the results aren't that good:

On the iMX6S:
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  160 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.181 port 5001 connected with 192.168.1.11 port 45248
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0-600.2 sec   109 MBytes  1.52 Mbits/sec  13.726 ms 29469735/29914906 (99%)

On my x86 laptop:

--- 192.168.1.181 ping statistics ---
617 packets transmitted, 616 received, 0% packet loss, time 616014ms
rtt min/avg/max/mdev = 0.167/0.294/8.934/0.512 ms

------------------------------------------------------------
Client connecting to 192.168.1.181, UDP port 5001
Sending 256 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.11 port 45248 connected with 192.168.1.181 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-600.0 sec  7.13 GBytes   102 Mbits/sec
[  3] Sent 29914976 datagrams
[  3] Server Report:
[  3]  0.0-600.2 sec   109 MBytes  1.52 Mbits/sec  13.726 ms 29469735/29914906 (99%)

The loss is disappointing and needs further investigation, but as you
can see, none of the symptoms you have reported are present - no lockup
and no 9000ms ping times (the longest was 9ms).

This is a plain 3.14-rc7 kernel configured for preempt, without the
-rt patches.
Bernd Faust March 23, 2014, 11:38 a.m. UTC | #13
On 21 March 2014 18:32, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
>> I also tried the patch, but unfortunately it doesn't work for me either.
>> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
>> single core) running Linux Kernel version 3.10.9-rt5+
>
> Hmm, rt kernels.  Does this happen without the rt patches applied?
>

On a kernel without RT patches the problem does not occur, so in my
case the problem seems to be introduced/triggered by the RT patches.
Russell King - ARM Linux March 23, 2014, 1:44 p.m. UTC | #14
On Sun, Mar 23, 2014 at 12:38:31PM +0100, Bernd Faust wrote:
> On 21 March 2014 18:32, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> > On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
> >> I also tried the patch, but unfortunately it doesn't work for me either.
> >> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
> >> single core) running Linux Kernel version 3.10.9-rt5+
> >
> > Hmm, rt kernels.  Does this happen without the rt patches applied?
> >
> 
> On a kernel without RT patches the problem does not occur, so in my
> case the problem seems to be introduced/triggered by the RT patches.

Yes, this suggests it's some kind of race introduced by the RT patches.
I've never looked at the RT patches, so I can't help there.

What I can say is that there are a number of other problems with this
driver which are all races of one form or another, but they should not
affect normal operation.

For example:
- the FEC_QUIRK_ERR006358 workaround (for iMX6 only) can read the previous
  descriptor before the write to new descriptor's TX status bit has become
  visible to the hardware.
- FEC_QUIRK_ERR006358 workaround not quite having the desired effect
  (a scheduled delayed work can't be re-scheduled until after it has
  already fired.)
- the number of places that call fec_stop() without first stopping the
  transmit queue or dealing with the above work-around before
  reconfiguring the hardware and re-initialising the rings.
- the rx path having a similar problem to the above (writing descriptor
  entries after the descriptor has been handed over to uDMA.)

I haven't addressed any of these yet, but they're on my list.  I'm pretty
sure there's other problems in there too...
robert.daniels@vantagecontrols.com March 24, 2014, 5:57 p.m. UTC | #15
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 03/21/2014
11:32:53 AM:

> Hmm, rt kernels.  Does this happen without the rt patches applied?

I am not using the rt patches and I still see the problem.  I'm using
3.14.0-rc6+ with
fec patches on the i.MX53 Quick Start Board with the latest U-Boot.

When I run my test I will immediately see packet loss with the eventual tx
timeout issue.

These are the steps I use to reproduce the problem:

Setup: Desktop (Ubuntu 12.04: webfs, iperf3, /srv/ftp/test.bmp ~27 MB)
       i.MX53 Quick Start Board (linux 3.14.0-rc6+ with fec patches:
iperf3, wget)

Test:

  Desktop:
    iperf3 -s -V

  i.MX53 QSB:
    ssh 1> iperf3 -c 192.168.1.101 -u -l 64 -b 55M -V -t 1000
    ssh 2> cd /tmp; while true; do date; wget
http://192.168.1.101:8000/test.bmp; rm -fv /tmp/test.bmp; done

Now for the verbose explanation, I installed webfs on my Ubuntu desktop
machine and put
a ~27 MB bmp file in the web_root directory for the i.MX53 QSB to access.
I built iperf3
and ran the server on my Ubuntu desktop machine in verbose mode.

On the i.MX53 QSB I connect with ssh and run iperf3 with the above options.
I then open
another ssh session and run the wget command from above which continuously
transfers the
test.bmp from my desktop.  As soon as I start the wget I start seeing
packet loss
from iperf3.  After a while I will also see the tx timeout.

The following is what I see from iperf3 on the desktop:

-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Time: Mon, 24 Mar 2014 16:42:13 GMT
Accepted connection from 192.168.1.169, port 48609
      Cookie: ic-ii-0.4872.965116.71cd138c70e6e36e
[  5] local 192.168.1.101 port 5201 connected to 192.168.1.169 port 56375
Starting Test: protocol: UDP, 1 streams, 64 byte blocks, omitting 0
seconds, 1000 second test
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total
Datagrams
[  5]   0.00-1.00   sec   879 KBytes  7.20 Mbits/sec  0.087 ms  0/14064
(0%)
[  5]   1.00-2.00   sec   910 KBytes  7.46 Mbits/sec  0.088 ms  0/14568
(0%)
[  5]   2.00-3.00   sec   910 KBytes  7.46 Mbits/sec  0.086 ms  0/14562
(0%)
[  5]   3.00-4.00   sec   891 KBytes  7.30 Mbits/sec  0.086 ms  0/14259
(0%)
[  5]   4.00-5.00   sec   891 KBytes  7.30 Mbits/sec  0.084 ms  0/14250
(0%)
[  5]   5.00-6.00   sec   913 KBytes  7.48 Mbits/sec  0.084 ms  0/14612
(0%)
[  5]   6.00-7.00   sec   913 KBytes  7.48 Mbits/sec  0.085 ms  0/14615
(0%)
[  5]   7.00-8.00   sec   806 KBytes  6.60 Mbits/sec  0.246 ms  0/12893
(0%)
[  5]   8.00-9.00   sec   327 KBytes  2.68 Mbits/sec  2.748 ms  1/5230
(0.019%)   <----- Start of wget transfers
[  5]   9.00-10.00  sec   294 KBytes  2.41 Mbits/sec  0.191 ms  13/4715
(0.28%)
[  5]  10.00-11.00  sec   301 KBytes  2.46 Mbits/sec  0.184 ms  11/4820
(0.23%)
[  5]  11.00-12.00  sec   351 KBytes  2.88 Mbits/sec  0.373 ms  0/5616 (0%)
[  5]  12.00-13.00  sec   289 KBytes  2.36 Mbits/sec  0.086 ms  4/4622
(0.087%)
[  5]  13.00-14.00  sec   303 KBytes  2.48 Mbits/sec  0.084 ms  18/4859
(0.37%)
[  5]  14.00-15.00  sec   369 KBytes  3.03 Mbits/sec  0.094 ms  0/5911 (0%)
[  5]  15.00-16.00  sec   283 KBytes  2.32 Mbits/sec  0.193 ms  6/4537
(0.13%)
[  5]  16.00-17.00  sec   277 KBytes  2.27 Mbits/sec  0.146 ms  12/4441
(0.27%)
[  5]  17.00-18.00  sec   366 KBytes  3.00 Mbits/sec  0.054 ms  0/5850 (0%)
[  5]  18.00-19.00  sec   287 KBytes  2.35 Mbits/sec  0.085 ms  5/4604
(0.11%)
[  5]  19.00-20.00  sec   276 KBytes  2.26 Mbits/sec  0.081 ms  10/4429
(0.23%)
[  5]  20.00-21.00  sec   362 KBytes  2.97 Mbits/sec  0.417 ms  1/5798
(0.017%)
[  5]  21.00-22.00  sec   301 KBytes  2.47 Mbits/sec  0.089 ms  6/4826
(0.12%)
[  5]  22.00-23.00  sec   278 KBytes  2.28 Mbits/sec  0.228 ms  15/4471
(0.34%)
[  5]  23.00-24.00  sec   379 KBytes  3.10 Mbits/sec  0.130 ms  7/6068
(0.12%)
[  5]  24.00-25.00  sec   318 KBytes  2.60 Mbits/sec  0.083 ms  2/5087
(0.039%)
[  5]  25.00-26.00  sec   279 KBytes  2.29 Mbits/sec  0.301 ms  11/4476
(0.25%)
[  5]  26.00-27.00  sec   350 KBytes  2.86 Mbits/sec  0.059 ms  11/5605
(0.2%)
[  5]  27.00-28.00  sec   324 KBytes  2.65 Mbits/sec  0.597 ms  1/5179
(0.019%)
[  5]  28.00-29.00  sec   284 KBytes  2.32 Mbits/sec  0.088 ms  13/4549
(0.29%)
[  5]  29.00-30.00  sec   339 KBytes  2.77 Mbits/sec  0.442 ms  12/5431
(0.22%)
[  5]  30.00-31.00  sec   317 KBytes  2.60 Mbits/sec  0.122 ms  1/5074
(0.02%)
[  5]  31.00-32.00  sec   278 KBytes  2.28 Mbits/sec  0.099 ms  10/4458
(0.22%)
[  5]  32.00-33.00  sec   328 KBytes  2.68 Mbits/sec  0.043 ms  12/5252
(0.23%)
[  5]  33.00-34.00  sec   316 KBytes  2.59 Mbits/sec  0.445 ms  1/5060
(0.02%)
[  5]  34.00-35.00  sec   277 KBytes  2.27 Mbits/sec  1.432 ms  13/4445
(0.29%)
[  5]  35.00-36.00  sec   345 KBytes  2.83 Mbits/sec  0.089 ms  12/5535
(0.22%)
[  5]  36.00-37.00  sec   360 KBytes  2.95 Mbits/sec  0.143 ms  0/5768 (0%)
[  5]  37.00-38.00  sec   292 KBytes  2.39 Mbits/sec  0.460 ms  3/4671
(0.064%)
[  5]  38.00-39.00  sec   290 KBytes  2.37 Mbits/sec  0.095 ms  14/4649
(0.3%)
[  5]  39.00-40.00  sec   353 KBytes  2.89 Mbits/sec  0.318 ms  1/5643
(0.018%)
[  5]  40.00-41.00  sec   297 KBytes  2.44 Mbits/sec  0.707 ms  3/4761
(0.063%)
[  5]  41.00-42.00  sec   270 KBytes  2.21 Mbits/sec  0.173 ms  18/4335
(0.42%)
[  5]  42.00-43.00  sec   359 KBytes  2.94 Mbits/sec  0.452 ms  1/5750
(0.017%)
[  5]  43.00-44.00  sec   294 KBytes  2.41 Mbits/sec  0.112 ms  4/4703
(0.085%)
[  5]  44.00-45.00  sec   272 KBytes  2.23 Mbits/sec  0.134 ms  18/4371
(0.41%)
[  5]  45.00-46.00  sec   385 KBytes  3.16 Mbits/sec  0.090 ms  3/6170
(0.049%)
[  5]  46.00-47.00  sec   309 KBytes  2.53 Mbits/sec  0.735 ms  2/4949
(0.04%)
[  5]  47.00-48.00  sec   288 KBytes  2.36 Mbits/sec  0.206 ms  13/4622
(0.28%)
[  5]  48.00-49.00  sec   356 KBytes  2.92 Mbits/sec  0.107 ms  4/5707
(0.07%)
[  5]  49.00-50.00  sec   292 KBytes  2.39 Mbits/sec  1.680 ms  4/4670
(0.086%)
[  5]  50.00-51.00  sec   262 KBytes  2.15 Mbits/sec  0.646 ms  12/4202
(0.29%)
[  5]  51.00-52.00  sec   349 KBytes  2.86 Mbits/sec  0.151 ms  11/5597
(0.2%)
[  5]  52.00-53.00  sec   313 KBytes  2.56 Mbits/sec  0.076 ms  3/5011
(0.06%)
[  5]  53.00-54.00  sec   278 KBytes  2.27 Mbits/sec  0.436 ms  16/4456
(0.36%)
[  5]  54.00-55.00  sec   361 KBytes  2.95 Mbits/sec  0.086 ms  3/5772
(0.052%)
[  5]  55.00-56.00  sec   312 KBytes  2.56 Mbits/sec  0.422 ms  1/4992
(0.02%)
[  5]  56.00-57.00  sec   292 KBytes  2.39 Mbits/sec  0.134 ms  14/4683
(0.3%)
[  5]  57.00-58.00  sec   344 KBytes  2.82 Mbits/sec  0.090 ms  6/5513
(0.11%)
[  5]  58.00-59.00  sec   200 KBytes  1.64 Mbits/sec  0.811 ms  1/3199
(0.031%)
[  5]  59.00-60.00  sec  0.00 Bytes  0.00 bits/sec  0.811 ms  0/0 (-nan%)
<----- Coincides with the tx timeout
[  5]  60.00-61.00  sec  0.00 Bytes  0.00 bits/sec  0.811 ms  0/0 (-nan%)
[  5]  61.00-62.00  sec  84.2 KBytes   690 Kbits/sec  0.032 ms  0/1347 (0%)
[  5]  62.00-63.00  sec   307 KBytes  2.52 Mbits/sec  0.243 ms  0/4915 (0%)
[  5]  63.00-64.00  sec   301 KBytes  2.47 Mbits/sec  0.044 ms  0/4820 (0%)
[  5]  64.00-65.00  sec   391 KBytes  3.21 Mbits/sec  0.053 ms  1/6263
(0.016%)
[  5]  65.00-66.00  sec   364 KBytes  2.98 Mbits/sec  0.431 ms  0/5828 (0%)

Here is another tx timeout dump:

Jan  1fec 63fec000.ethernet eth0: TX ring dump
Nr    SC     addr       len  SKB
 0    0x1c00 0xce485000  106 dec7bf00
 1    0x1c00 0xce485800  106 dec7bcc0
 2    0x1c00 0xce486000  106 dec7b480
 3    0x1c00 0xce486800  106 dec7b840
 4    0x1c00 0xce487000  106 dec7b3c0
 5    0x1c00 0xce487800  106 dec7b600
 6    0x1c00 0xce500000  106 dec7b780
 7    0x1c00 0xce500800  106 dec7be40
 8    0x1c00 0xce501000  106 dec7b9c0
 9    0x1c00 0xce501800  106 dec7bb40
10    0x1c00 0xce502000  106 dec7b180
11    0x1c00 0xce502800  106 de7b2540
12    0x1c00 0xce503000  106 de7b2480
13    0x1c00 0xce503800  106 de7b2900
14    0x1c00 0xce504000  106 de7b2d80
15    0x1c00 0xce504800  106 de7b23c0
16    0x1c00 0xce505000  106 de7b2cc0
17    0x1c00 0xce505800  106 de7b2000
18    0x1c00 0xce506000  106 de7b2b40
19    0x1c00 0xce506800  106 de7b2840
20    0x1c00 0xce507000  106 de7b2780
21    0x1c00 0xce507800  106 de7b2300
22    0x1c00 0xce508000  106 de7b2e40
23    0x1c00 0xce508800  106 de7b2180
24    0x1c00 0xce509000  106 de7b2c00
25    0x1c00 0xce509800  106 de7b29c0
26    0x1c00 0xce50a000  106 de7b2240
27    0x1c00 0xce50a800  106 de7b2a80
28    0x1c00 0xce50b000  106 de7b2600
29    0x1c00 0xce50b800  106 de7b20c0
30    0x1c00 0xce50c000  106 de7ea780
31    0x1c00 0xce50c800  106 de7ea480
32    0x1c00 0xce50d000  106 de7ea900
33 SH 0x1c00 0x00000000  106   (null)
34    0x9c00 0xce50e000  106 de120e40
35    0x1c00 0xce50e800  106 de120f00
36    0x1c00 0xce50f000  106 de120840
37    0x1c00 0xce50f800  106 de120600
38    0x1c00 0xce510000   66 de1209c0
39    0x1c00 0xce510800  106 de120a80
40    0x1c00 0xce511000  106 de120cc0
41    0x1c00 0xce511800   66 de120000
42    0x1c00 0xce512000  106 de1200c0
43    0x1c00 0xce512800  106 de120780
44    0x1c00 0xce513000  106 de120c00
45    0x1c00 0xce513800   66 de120b40
46    0x1c00 0xce514000  106 de120540
47    0x1c00 0xce514800  106 de120900
48    0x1c00 0xce515000   66 de120240
49    0x1c00 0xce515800  106 de120d80
50    0x1c00 0xce516000  106 de1203c0
51    0x1c00 0xce516800   66 de120480
52    0x1c00 0xce517000  106 de120300
53    0x1c00 0xce517800  106 de120180
54    0x1c00 0xce518000   66 dec7b000
55    0x1c00 0xce518800  106 dec7b300
56    0x1c00 0xce519000  106 dec7b0c0
57    0x1c00 0xce519800   66 dec7b540
58    0x1c00 0xce51a000  106 dec7bc00
59    0x1c00 0xce51a800  106 dec7ba80
60    0x1c00 0xce51b000   66 dec7b900
61    0x1c00 0xce51b800  106 dec7bd80
62    0x1c00 0xce51c000  106 dec7b240
63    0x3c00 0xce51c800  106 dec7b6c0


As the test continues to run, the tx timeout will continue to happen.  I
monitored the sequence of tx timeouts to see
if there was anything interesting about which buffer descriptors were
having problems - here is the sequence:

15, 60, 57, 54, 53, 6, 18, 47, 24, 54, 23, 9, 39, 14


Marek Vasut March 24, 2014, 8:21 p.m. UTC | #16
On Monday, March 24, 2014 at 06:57:18 PM, robert.daniels@vantagecontrols.com 
wrote:
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 03/21/2014
> 
> 11:32:53 AM:
> > Hmm, rt kernels.  Does this happen without the rt patches applied?
> 
> I am not using the rt patches and I still see the problem.  I'm using
> 3.14.0-rc6+ with
> fec patches on the i.MX53 Quick Start Board with the latest U-Boot.
> 
> When I run my test I will immediately see packet loss with the eventual tx
> timeout issue.
> 
> These are the steps I use to reproduce the problem:
> 
> Setup: Desktop (Ubuntu 12.04: webfs, iperf3, /srv/ftp/test.bmp ~27 MB)
>        i.MX53 Quick Start Board (linux 3.14.0-rc6+ with fec patches:
> iperf3, wget)
> 
> Test:
> 
>   Desktop:
>     iperf3 -s -V
> 
>   i.MX53 QSB:
>     ssh 1> iperf3 -c 192.168.1.101 -u -l 64 -b 55M -V -t 1000

I think the UDP test might be lossy; can you try with a TCP test, please?

Best regards,
Marek Vasut
robert.daniels@vantagecontrols.com March 24, 2014, 10:37 p.m. UTC | #17
Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:

>
> I think the UDP test might be lossy, can you try with TCP test please ?
>
> Best regards,
> Marek Vasut

Yes, I tried with TCP and there were no problems.
I then tried my same test with a crossover cable: no problems.
Then I started testing some different network configurations:

1. Desktop & i.MX53 both connected to the same router: no problems.
2. Desktop & i.MX53 both connected to the same hub: packet loss but no tx
timeout
3. Desktop connected to router, hub connected to router, i.MX53 connected
to hub: packet loss + tx timeout

So, the packet loss is due to the hub, not the fec driver.
I'm not sure why I'm getting the tx timeout.  My router is a Cisco
ValetPlus and the hub is a Netgear DS108.
The router and desktop are gigabit and the hub is 10/100.

I assumed that the tx timeout should not be happening - is this an
incorrect assumption based on my setup?

Thanks,
- Robert Daniels

Russell King - ARM Linux March 24, 2014, 11:44 p.m. UTC | #18
On Mon, Mar 24, 2014 at 04:37:20PM -0600, robert.daniels@vantagecontrols.com wrote:
> Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:
> > I think the UDP test might be lossy, can you try with TCP test please ?
> >
> > Best regards,
> > Marek Vasut
> 
> Yes, I tried with TCP and there were no problems.
> I then tried my same test with a crossover cable: no problems.
> Then I started testing some different network configurations:
> 
> 1. Desktop & i.MX53 both connected to the same router: no problems.
> 2. Desktop & i.MX53 both connected to the same hub: packet loss but no tx
> timeout
> 3. Desktop connected to router, hub connected to router, i.MX53 connected
> to hub: packet loss + tx timeout
> 
> So, the packet loss is due to the hub, not the fec driver.
> I'm not sure why I'm getting the tx timeout.  My router is a Cisco
> ValetPlus and the hub is a Netgear DS108.
> The router and desktop are gigabit and the hub is 10/100.
> 
> I assumed that the tx timeout should not be happening - is this an
> incorrect assumption based on my setup?

The first point is that the timeout should not be happening, because
when the link is up and running, packets should be transmitted.

The second point is that I think you have a setup which tickles a bug
in the FEC hardware.  What I think is going on is that the driver is
filling the transmit ring correctly.  For some reason, the FEC is
either failing to update the header after transmission leaving the
buffer owned by the FEC (and therefore unable to be reaped), or the
FEC is skipping over a ring descriptor.

As I hinted in one of my previous replies, I think the only way we're
truly going to work out what's going on here is to come up with some
way to mark each packet that is transmitted with where it was in the
ring, so a tcpdump session on another machine can be used to inspect
which packets from the ring were actually transmitted on the wire.

Now, you mention packet loss above.  Even on half-duplex networks, I
would not expect there to be much packet loss because of the
retransmissions after a collision, and 100Mbit networks handle this
much better than 10Mbit.  It takes repeated collisions to cause a
packet to actually be dropped (and a packet being dropped because
of collisions should be logged by the transmitting host in its
interface statistics.)

Another possibility is that the FEC isn't properly negotiating with
the hub, and the two ends are ending up configured differently, or
that it is negotiating properly but the FEC isn't being configured
correctly.

Now that we know a little more about your setup, including that it
seems to be one specific setup which is causing the problems, I'll
do some more testing - I have an old Allied Telesys 10/100 hub here
which I'll try putting the iMX6 on and re-run your tests.
Marek Vasut March 25, 2014, 1:02 a.m. UTC | #19
On Tuesday, March 25, 2014 at 12:44:43 AM, Russell King - ARM Linux wrote:
> On Mon, Mar 24, 2014 at 04:37:20PM -0600, robert.daniels@vantagecontrols.com wrote:
> > Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:
> > > I think the UDP test might be lossy, can you try with TCP test please ?
> > > 
> > > Best regards,
> > > Marek Vasut
> > 
> > Yes, I tried with TCP and there were no problems.
> > I then tried my same test with a crossover cable: no problems.
> > Then I started testing some different network configurations:
> > 
> > 1. Desktop & i.MX53 both connected to the same router: no problems.
> > 2. Desktop & i.MX53 both connected to the same hub: packet loss but no tx
> > timeout
> > 3. Desktop connected to router, hub connected to router, i.MX53 connected
> > to hub: packet loss + tx timeout
> > 
> > So, the packet loss is due to the hub, not the fec driver.
> > I'm not sure why I'm getting the tx timeout.  My router is a Cisco
> > ValetPlus and the hub is a Netgear DS108.
> > The router and desktop are gigabit and the hub is 10/100.
> > 
> > I assumed that the tx timeout should not be happening - is this an
> > incorrect assumption based on my setup?
> 
> The first point is that the timeout should not be happening, because
> when the link is up and running, packets should be transmitted.
> 
> The second point is that I think you have a setup which tickles a bug
> in the FEC hardware.  What I think is going on is that the driver is
> filling the transmit ring correctly.  For some reason, the FEC is
> either failing to update the header after transmission leaving the
> buffer owned by the FEC (and therefore unable to be reaped), or the
> FEC is skipping over a ring descriptor.

I think we might be seeing something similar on i.MX28 in U-Boot too, but I
dare not correlate this with the in-kernel FEC issue. We also sporadically
see the FEC fail to transmit a packet. I am just raising this to let you all
know that something like that exists there as well.

[...]

Best regards,
Marek Vasut
Russell King - ARM Linux March 25, 2014, 11:10 p.m. UTC | #20
On Tue, Mar 25, 2014 at 02:02:42AM +0100, Marek Vasut wrote:
> I think we might be seeing something similar on i.MX28 in U-Boot too,
> but I dare not to correlate this and the in-kernel FEC issue. We also
> sporadically see the FEC not transmitting a packet.

That could be the ERR006358 problem.

This occurs when you've transmitted a packet previously, and the
indicator in FEC_X_DES_ACTIVE remains set at the point in time when
you load the next packet.  The subsequent write to FEC_X_DES_ACTIVE
has no effect on the TX side, which has already shut down.

The work-around for that in the kernel is to look at the previous
packet's status immediately after setting everything up for the next
packet (remembering to take care of the store buffers between the CPU
and memory): if it indicates that the packet has been sent, then you
need to re-ping the transmitter after a short delay.

(I've cocked that work-around up today... it can happen quite frequently
depending on your workload.)
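The lost-trigger behaviour and the re-ping work-around can be modelled sequentially. The following is a toy userspace model under assumed semantics, not the real driver or hardware (all names are hypothetical; the real race depends on hardware timing that sequential code can only approximate):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the ERR006358-style race: a write to the "descriptor
 * active" trigger is lost if it lands while the TX engine is winding
 * down from the previous transmission. */
static int  queued;            /* packets sitting in the ring */
static int  sent;
static bool tdar;              /* FEC_X_DES_ACTIVE analogue */
static bool shutdown_window;   /* TX engine busy winding down */

static void kick(void)
{
    if (!shutdown_window)      /* the erratum: this write can be lost */
        tdar = true;
}

static void hw_step(void)
{
    if (tdar) {
        while (queued > 0) {   /* a kick flushes every pending packet */
            queued--;
            sent++;
        }
        shutdown_window = true;   /* next trigger write is vulnerable */
    } else {
        shutdown_window = false;
    }
    tdar = false;
}

/* Erratum-style workaround: after queuing, if the trigger reads back
 * clear even though we just wrote it, the kick was swallowed, so
 * re-ping the transmitter after a short delay (modelled here by the
 * shutdown window having passed). */
static void xmit_with_workaround(void)
{
    queued++;
    kick();
    if (!tdar) {
        shutdown_window = false;  /* stands in for the short delay */
        kick();                   /* re-ping the transmitter */
    }
}
```

In the model, a plain `kick()` issued during the shutdown window leaves the packet stranded in the ring, while `xmit_with_workaround()` notices the swallowed trigger and recovers.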
Russell King - ARM Linux March 26, 2014, 12:11 a.m. UTC | #21
On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
> Now, you mention packet loss above.  Even on half-duplex networks, I
> would not expect there to be much packet loss because of the
> retransmissions after a collision, and 100Mbit networks handle this
> much better than 10Mbit.  It takes repeated collisions to cause a
> packet to actually be dropped (and a packet being dropped because
> of collisions should be logged by the transmitting host in its
> interface statistics.)
> 
> Another possibility is that the FEC isn't properly negotiating with
> the hub, and the two ends are ending up configured differently, or
> that it is negotiating properly but the FEC isn't being configured
> correctly.
> 
> Now that we know a little more about your setup, including that it
> seems to be one specific setup which is causing the problems, I'll
> do some more testing - I have an old Allied Telesys 10/100 hub here
> which I'll try putting the iMX6 on and re-run your tests.

PC --- 100Mbit hub --- iMX6S

iperf (2.0.5) -c PC -r reports:

iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec

However, there's a marked difference when the iMX6 sends - the
collision LED is almost permanently on at full brightness, whereas
when the PC is sending, it's almost always off.

I'm seeing lots of late collisions and CRC errors in the statistics
registers.

It looks to me like the iMX6's transmitter is blathering at any moment,
not taking any notice of whether there's already a carrier on the RX side.
I'm not sure what to make of this yet.
Russell King - ARM Linux April 1, 2014, 9:26 a.m. UTC | #22
On Wed, Mar 26, 2014 at 12:11:35AM +0000, Russell King - ARM Linux wrote:
> On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
> > Now, you mention packet loss above.  Even on half-duplex networks, I
> > would not expect there to be much packet loss because of the
> > retransmissions after a collision, and 100Mbit networks handle this
> > much better than 10Mbit.  It takes repeated collisions to cause a
> > packet to actually be dropped (and a packet being dropped because
> > of collisions should be logged by the transmitting host in its
> > interface statistics.)
> > 
> > Another possibility is that the FEC isn't properly negotiating with
> > the hub, and the two ends are ending up configured differently, or
> > that it is negotiating properly but the FEC isn't being configured
> > correctly.
> > 
> > Now that we know a little more about your setup, including that it
> > seems to be one specific setup which is causing the problems, I'll
> > do some more testing - I have an old Allied Telesys 10/100 hub here
> > which I'll try putting the iMX6 on and re-run your tests.
> 
> PC --- 100Mbit hub --- iMX6S
> 
> iperf (2.0.5) -c PC -r reports:
> 
> iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
> iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec
> 
> However, there's a marked difference when the iMX6 sends - the
> collision LED is almost permanently on at full brightness, whereas
> when the PC is sending, it's almost always off.
> 
> I'm seeing lots of late collisions and CRC errors in the statistics
> registers.
> 
> It looks to me like the iMX6's transmitter is blathering at any moment,
> not taking any notice whether there's already a carrier on the RX side.
> I'm not yet sure what to make of this yet.

Last night, I performed a different test:

PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
           |
  (the rest of my network)

This shows that all the (late) collisions occur in the 10Mbit domain,
and very few in the 100Mbit domain, which puts the blame fairly
squarely on the iMX6 side.

I can see nothing wrong with the setup of the iMX6, nor of the AR8035
phy which my board has.  Both the phy and the FEC appear to correctly
indicate that they are configured for half-duplex with flow control
disabled, and I've been through the iomux settings for the RGMII
interface.

So, I'm left with three possibilities:
- the AR8035 doesn't work with half-duplex
- there is something wrong with the signalling for carrier sense between
  the AR8035 and the FEC.
- the iMX6 FEC doesn't work with half-duplex

It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
them, so I can't really say what's at fault at the moment, but from what
I can see, my hardware fails to work correctly with half-duplex
connections.
Eric Nelson April 1, 2014, 2 p.m. UTC | #23
Hi Russell,

On 04/01/2014 02:26 AM, Russell King - ARM Linux wrote:
> On Wed, Mar 26, 2014 at 12:11:35AM +0000, Russell King - ARM Linux wrote:
>> On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
>>> Now, you mention packet loss above.  Even on half-duplex networks, I
>>> would not expect there to be much packet loss because of the
>>> retransmissions after a collision, and 100Mbit networks handle this
>>> much better than 10Mbit.  It takes repeated collisions to cause a
>>> packet to actually be dropped (and a packet being dropped because
>>> of collisions should be logged by the transmitting host in its
>>> interface statistics.)
>>>
>>> Another possibility is that the FEC isn't properly negotiating with
>>> the hub, and the two ends are ending up configured differently, or
>>> that it is negotiating properly but the FEC isn't being configured
>>> correctly.
>>>
>>> Now that we know a little more about your setup, including that it
>>> seems to be one specific setup which is causing the problems, I'll
>>> do some more testing - I have an old Allied Telesys 10/100 hub here
>>> which I'll try putting the iMX6 on and re-run your tests.
>>
>> PC --- 100Mbit hub --- iMX6S
>>
>> iperf (2.0.5) -c PC -r reports:
>>
>> iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
>> iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec
>>
>> However, there's a marked difference when the iMX6 sends - the
>> collision LED is almost permanently on at full brightness, whereas
>> when the PC is sending, it's almost always off.
>>
>> I'm seeing lots of late collisions and CRC errors in the statistics
>> registers.
>>
>> It looks to me like the iMX6's transmitter is blathering at any moment,
>> not taking any notice whether there's already a carrier on the RX side.
>> I'm not yet sure what to make of this yet.
>
> Last night, I performed a different test:
>
> PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
>             |
>    (the rest of my network)
>
> This shows that all the (late) collisions occur in the 10Mbit domain,
> and very few in the 100Mbit domain, which puts the blame fairly
> squarely on the iMX6 side.
>
> I can see nothing wrong with the setup of the iMX6, nor of the AR8035
> phy which my board has.  Both the phy and the FEC appear to correctly
> indicate that they are configured for half-duplex with flow control
> disabled, and I've been through the iomux settings for the RGMII
> interface.
>
> So, I'm left with three possibilities:
> - the AR8035 doesn't work with half-duplex
> - there is something wrong with the signalling for carrier sense between
>    the AR8035 and the FEC.
> - the iMX6 FEC doesn't work with half-duplex
>
> It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
> them, so I can't really say what's at fault at the moment, but from what
> I can see, my hardware fails to work correctly with half-duplex
> connections.
>

What value do you have for pad control register
IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?

I notice in some of the DTS files that this has a value of 0x100b0,
which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).
Russell King - ARM Linux April 1, 2014, 5:29 p.m. UTC | #24
On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
> Hi Russell,
>
> What value do you have for pad control register
> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?

You're referring to the iMX6Q, the Hummingboard is iMX6S, so that's
0x20e06bc.  I have value 0x1b030.

> I notice in some of the DTS files that this has a value of 0x100b0,
> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).

Yes, bit 7 should not be set as it's reserved, but this is ignored.
Setting the hysteresis bit affects the input thresholds, and this signal
is only ever used as an output.  While selecting HYS mode for an output
may seem odd, that should not have any effect here.

In any case, TX_CTL doesn't have much to do with it.  I've fully proved
that the interface works fine under full duplex conditions, achieving
500Mbps transmit and 570-600Mbps receive with TCP connections (which is
nicely beyond what freescale claim the hardware is capable of.)

The phy needs TX_CTL asserted in order to transmit anything, so as it
works in full duplex mode, this must be happening correctly.

The FEC needs RX_CTL asserted to indicate that valid data is being
received by the phy - if this was not being recognised by the FEC, full
duplex mode wouldn't work.

The last scenario is the error signals encoded onto TX_CTL/RX_CTL.  This
is one area I can't comment on yet - monitoring these signals is not
exactly easy (and requires two steady hands.)  However, I have briefly
monitored these signals when running at 100mbit half-duplex.  With the
scope on RX_CTL and triggering on it, I do see TX_CTL being raised by
the FEC.

Whether TX_CTL is raised by the FEC just before or after RX_CTL has been
asserted I can't say yet.

However, TX_CTL raised while RX_CTL is raised or toggling means that a
collision has occurred.  TX_CTL should never be raised while RX_CTL
indicates that there is a "carrier" present in half-duplex mode (and
therefore there is already another ethernet station transmitting on
the wire.)

Remember, I'm seeing late collisions.  The only way that happens is if
something starts transmitting after 64 bytes into an ethernet frame, and
it implies that something is not noticing that the link is busy before it
starts blurting out.
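The late-collision rule being described can be captured in a toy model. This is illustrative only; real CSMA/CD is specified in bit times (512 bit times equals the 64-byte slot size mentioned above), and a late collision, unlike an ordinary one, is not retried:

```c
#include <assert.h>

typedef enum { TX_OK, COLLISION, LATE_COLLISION } tx_result;

/* Classify what happens to a half-duplex transmission when another
 * station's carrier appears part-way through the frame.  A collision
 * within the first 64 bytes is ordinary (retried); later than that it
 * is a late collision, which implies the other station failed to
 * notice that the wire was already busy. */
static tx_result transmit(int carrier_appears_at_byte, int frame_len)
{
    if (carrier_appears_at_byte < 0 ||
        carrier_appears_at_byte >= frame_len)
        return TX_OK;              /* the wire stayed quiet */
    return carrier_appears_at_byte < 64 ? COLLISION : LATE_COLLISION;
}
```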

I know that my hubs are fine - the 10mbit hub has been used over _many_
years (it's more than 20 years old now) and has always been rock solid...
unlike the 10/100 hub, which has a dodgy bridge between its two segments:
I can lock out the 10mbit segment with enough 100mbit traffic.

Let's also not forget that these signals run at 125MHz at gigabit speeds,
and 2.5MHz at 10mbit...  if they were marginal at 10mbit, gigabit
certainly would not work.
Eric Nelson April 1, 2014, 6:11 p.m. UTC | #25
On 04/01/2014 10:29 AM, Russell King - ARM Linux wrote:
> On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
>> Hi Russell,
>>
>> What value do you have for pad control register
>> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?
>
> You're referring to the iMX6Q, the Hummingboard is iMX6S, so that's
> 0x20e06bc.  I have value 0x1b030.
>

Right. I missed that you were running on Solo.

>> I notice in some of the DTS files that this has a value of 0x100b0,
>> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
>> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).
>
> Yes, bit 7 should not be set as it's reserved, but this is ignored.
> Setting the hysteresis bit affects the input thresholds, and this signal
> is only ever used as an output.  While selecting HYS mode for an output
> may seem odd, that should not have any effect here.
>
> In any case, TX_CTL doesn't have much to do with it.  I've fully proved
> that the interface works fine under full duplex conditions, achieving
> 500Mbps transmit and 570-600Mbps receive with TCP connections (which is
> nicely beyond what freescale claim the hardware is capable of.)
>
> The phy needs TX_CTL asserted in order to transmit anything, so as it
> works in full duplex mode, this must be happening correctly.
>
It appears I'm gender-challenged and confused TX_CTL and RX_CTL.

My thought (and question) was whether the pad setup for "transmit
enable" to the i.MX is either being missed entirely, or arriving
late because of an internal pullup/pull down or other pad
misconfiguration.

The pad settings have every indication of being cut/pasted, which
made me question them.

> The FEC needs RX_CTL asserted to indicate that valid data is being
> received by the phy - if this was not being recognised by the FEC, full
> duplex mode wouldn't work.
>
> The last scenario is the error signals encoded onto TX_CTL/RX_CTL.  This
> is one area I can't comment on yet - monitoring these signals is not
> exactly easy (and requires two steady hands.)  However, I have briefly
> monitored these signals when running at 100mbit half-duplex. With the
> scope RX_CTL and triggering on it, I do see TX_CTL being raised by the
> FEC.
>
> Whether TX_CTL is raised by the FEC just before or after RX_CTL has been
> asserted I can't say yet.
>
> However, TX_CTL raised while RX_CTL is raised or toggling means that a
> collision has occurred.  TX_CTL should never be raised while RX_CTL
> indicates that there is a "carrier" present in half-duplex mode (and
> therefore there is already another ethernet station transmitting on
> the wire.)
>
> Remember, I'm seeing late collisions.  The only way that happens is if
> something starts transmitting after 64 bytes into an ethernet frame, and
> it implies that something is not noticing that the link is busy before it
> starts blurting out.
>
I guess we can rule out slew rate.

> I know that my hubs are fine - the 10mbit hub has been used over _many_
> years (it's more than 20 years old now) and has always been rock solid...
> unlike the 10/100 hub with a dodgy switch between the two segments which
> I can lock out the 10 mbit segment with enough 100mbit traffic.
>
> Let's also not forget that these signals run at 125MHz at gigabit speeds,
> and 2.5MHz at 10mbit...  if they were marginal at 10mbit, gigabit
> certainly would not work.
>

Based on this, I don't see how a badly-configured pad could produce the
symptoms you're seeing.
robert.daniels@vantagecontrols.com April 1, 2014, 7:38 p.m. UTC | #26
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014
03:26:38 AM:

>
> Last night, I performed a different test:
>
> PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
>            |
>   (the rest of my network)
>
> This shows that all the (late) collisions occur in the 10Mbit domain,
> and very few in the 100Mbit domain, which puts the blame fairly
> squarely on the iMX6 side.
>
> I can see nothing wrong with the setup of the iMX6, nor of the AR8035
> phy which my board has.  Both the phy and the FEC appear to correctly
> indicate that they are configured for half-duplex with flow control
> disabled, and I've been through the iomux settings for the RGMII
> interface.
>
> So, I'm left with three possibilities:
> - the AR8035 doesn't work with half-duplex
> - there is something wrong with the signalling for carrier sense between
>   the AR8035 and the FEC.
> - the iMX6 FEC doesn't work with half-duplex
>
> It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
> them, so I can't really say what's at fault at the moment, but from what
> I can see, my hardware fails to work correctly with half-duplex
> connections.

I'm not sure where this factors in, but I originally saw this problem
using the Freescale 2.6.35 kernel.  The driver there exhibits this
problem differently, although it could very well be a different problem.
What I observed was that when the FEC got into this bad state the driver
would attempt to transmit a socket buffer but for some reason the buffer
would not actually get transmitted.

The driver would continue transmitting packets until it got all the way
around in the ring buffer to the buffer descriptor right before the one
that was never transmitted.  When this buffer descriptor was set to
transmit you'd get a double transmit - the new packet and the previously
untransmitted buffer.

This results in out-of-order packets being sent directly from the i.MX53.

To illustrate this, I got my i.MX53 into this bad state and then ran
ping with the following results:

PING 192.168.1.101 (192.168.1.101) 56(84) bytes of data.
64 bytes from 192.168.1.101: icmp_seq=1 ttl=64 time=6.69 ms
64 bytes from 192.168.1.101: icmp_seq=2 ttl=64 time=0.306 ms
64 bytes from 192.168.1.101: icmp_seq=4 ttl=64 time=0.314 ms
64 bytes from 192.168.1.101: icmp_seq=5 ttl=64 time=0.325 ms
64 bytes from 192.168.1.101: icmp_seq=6 ttl=64 time=0.351 ms
64 bytes from 192.168.1.101: icmp_seq=7 ttl=64 time=0.323 ms
64 bytes from 192.168.1.101: icmp_seq=9 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=10 ttl=64 time=0.319 ms
64 bytes from 192.168.1.101: icmp_seq=11 ttl=64 time=0.353 ms
64 bytes from 192.168.1.101: icmp_seq=12 ttl=64 time=0.321 ms
64 bytes from 192.168.1.101: icmp_seq=13 ttl=64 time=0.308 ms
64 bytes from 192.168.1.101: icmp_seq=14 ttl=64 time=0.329 ms
64 bytes from 192.168.1.101: icmp_seq=15 ttl=64 time=0.335 ms
64 bytes from 192.168.1.101: icmp_seq=16 ttl=64 time=0.316 ms
64 bytes from 192.168.1.101: icmp_seq=17 ttl=64 time=0.326 ms
64 bytes from 192.168.1.101: icmp_seq=3 ttl=64 time=14006 ms
64 bytes from 192.168.1.101: icmp_seq=19 ttl=64 time=0.323 ms
64 bytes from 192.168.1.101: icmp_seq=20 ttl=64 time=0.314 ms
64 bytes from 192.168.1.101: icmp_seq=21 ttl=64 time=0.310 ms
64 bytes from 192.168.1.101: icmp_seq=22 ttl=64 time=0.317 ms
64 bytes from 192.168.1.101: icmp_seq=23 ttl=64 time=0.331 ms
64 bytes from 192.168.1.101: icmp_seq=8 ttl=64 time=15000 ms
64 bytes from 192.168.1.101: icmp_seq=25 ttl=64 time=0.322 ms
64 bytes from 192.168.1.101: icmp_seq=26 ttl=64 time=0.333 ms
64 bytes from 192.168.1.101: icmp_seq=27 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=28 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=29 ttl=64 time=0.335 ms
64 bytes from 192.168.1.101: icmp_seq=30 ttl=64 time=0.325 ms
64 bytes from 192.168.1.101: icmp_seq=31 ttl=64 time=0.307 ms
64 bytes from 192.168.1.101: icmp_seq=32 ttl=64 time=0.333 ms
64 bytes from 192.168.1.101: icmp_seq=18 ttl=64 time=14006 ms
64 bytes from 192.168.1.101: icmp_seq=34 ttl=64 time=0.330 ms
64 bytes from 192.168.1.101: icmp_seq=35 ttl=64 time=0.342 ms

Here you can see that icmp_seq=3 wasn't replied to until after
icmp_seq=17 was sent.

Once these buffer descriptors get into this state, they stay
that way until the FEC is reset.

I don't see this exact behavior when I run the test with the 3.14
kernel but I'm starting to wonder if it's because the 3.14 kernel
is receiving the transmit timeout which allows the driver to reset
itself and start working again whereas the 2.6.35 driver is not.

I hope this additional information is useful.  I don't know enough
about these low-level networking details to contribute much, but it's
possible that what I've seen in the 2.6.35 kernel is actually the same
issue that I'm seeing in the 3.14 kernel, just handled better.

Thanks,

Robert Daniels

This email, and any document attached hereto, may contain
confidential and/or privileged information.  If you are not the
intended recipient (or have received this email in error) please
notify the sender immediately and destroy this email.  Any
unauthorized, direct or indirect, copying, disclosure, distribution
or other use of the material or parts thereof is strictly
forbidden.
Russell King - ARM Linux April 1, 2014, 10:51 p.m. UTC | #27
On Tue, Apr 01, 2014 at 01:38:37PM -0600, robert.daniels@vantagecontrols.com wrote:
> I'm not sure where this factors in, but I originally saw this problem using
> the Freescale 2.6.35 kernel.  The driver there exhibits this problem
> differently, although it could very well be a different problem. What
> I observed was that when the FEC got into this bad state the driver would
> attempt to transmit a socket buffer but for some reason the buffer would
> not actually get transmitted.
>
> The driver would continue transmitting packets until it got all the way
> around in the ring buffer to the buffer descriptor right before the one
> that was never transmitted.  When this buffer descriptor was set to
> transmit you'd get a double transmit - the new packet and the previously
> untransmitted buffer.
>
> This results in out-of-order packets being sent directly from the i.MX53.

At initial glance, this is coherent with my idea of the FEC skipping a
ring entry on the initial pass around, and then going wrong when a new
entry is loaded into the skipped slot.

Let's say that the problem entry is number 12, which has been skipped.
When we get back around to entry 11, the FEC will transmit entries 11
and 12, as you rightly point out, and it will then look at entry 13
for the next packet.

However, the driver loads the next packet into entry 12, and hits the
FEC to transmit it.  The FEC re-reads entry 13, finds no packet, so
does nothing.

Then the next packet is submitted to the driver, and it enters it into
entry 13, again hitting the FEC.  The FEC now sees the entry at 13,
meanwhile the entry at 12 is still pending.
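This off-by-one scenario can be reproduced with a toy ring model. The following is an illustrative userspace sketch of the theory (not the real FEC hardware or driver), where the "glitch" of the scan pointer moving past a still-ready descriptor is injected by hand:

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 16

static bool ready[RING_SIZE];  /* descriptor handed to "hardware" */
static int  hw_pos;            /* where the FEC looks on the next kick */
static int  sent_count;
static int  last_sent = -1;

/* A kick transmits consecutive ready descriptors starting at hw_pos
 * and stops at the first empty one - so a ready descriptor sitting
 * behind hw_pos is invisible until the ring wraps back around to it. */
static void fec_kick(void)
{
    while (ready[hw_pos]) {
        ready[hw_pos] = false;
        sent_count++;
        last_sent = hw_pos;
        hw_pos = (hw_pos + 1) % RING_SIZE;
    }
}
```

With entry 12 ready but the scan pointer at 13, a kick finds nothing; the next packet loaded into 13 does go out, and 12 stays pending, exactly the pattern described above.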

> I hope this additional information is useful, I don't know enough
> about these low-level networking details to contribute much but
> it's possible that what I've seen in the 2.6.35 kernel is actually
> the same issue that I'm seeing in the 3.14 kernel but handled
> better.

It confirms the theory, but doesn't really provide many clues towards a
solution at the moment.

However, I've had something of a breakthrough with iMX6 and half-duplex.
I think much of the problem comes down to the ERR006358 workaround
implemented in the driver (this apparently doesn't affect your device.)
The delayed work implementation, and my delayed timer implementation of
the same, are fundamentally wrong according to the erratum documentation
- as is the version implemented in the Freescale BSP.

Implementing what the erratum says as an acceptable workaround improves
things tremendously - I see iperf on a 10Mbit hub go from 1-2Mbps up to
8Mbps, though still with loads of collisions.  That said, I'm not that
trusting of the error bits indicated from the FEC.

The reason I mention it here is that I wonder if less whacking of the
FEC_X_DES_ACTIVE register may help your problem.

In 3.14, in the fec_enet_start_xmit function, find the
"writel(0, fep->hwp + FEC_X_DES_ACTIVE);" and change it to:

	wmb();

	/* Trigger transmission start */
	if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
		writel(0, fep->hwp + FEC_X_DES_ACTIVE);

and see whether that helps your problem(s).
robert.daniels@vantagecontrols.com April 2, 2014, 12:37 a.m. UTC | #28
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014
04:51:49 PM:

> At initial glance, this is coherent with my idea of the FEC skipping a
> ring entry on the initial pass around.  Then when a new entry is loaded,
>
> Let's say that the problem entry is number 12 that has been skipped.
> When we get back around to entry 11, the FEC will transmit entries 11
> and 12, as you rightly point out, and it will then look at entry 13
> for the next packet.
>
> However, the driver loads the next packet into entry 12, and hits the
> FEC to transmit it.  The FEC re-reads entry 13, finds no packet, so
> does nothing.
>
> Then the next packet is submitted to the driver, and it enters it into
> entry 13, again hitting the FEC.  The FEC now sees the entry at 13,
> meanwhile the entry at 12 is still pending.
>
> > I hope this additional information is useful, I don't know enough
> > about these low-level networking details to contribute much but
> > it's possible that what I've seen in the 2.6.35 kernel is actually
> > the same issue that I'm seeing in the 3.14 kernel but handled
> > better.
>
> It confirms the theory, but doesn't really provide much clues for a
> solution at the moment.

So according to this theory, if we can detect this situation we should
be able to skip entry 12 and then everything should get back into sync?
I did a little test where I marked '12' as needing to be skipped and
then the next time it was to be used I skipped it... and I got an
out-of-order packet but then things got back on track - that's the good
news.

Unfortunately, I think there must be something wrong with my hack because
after running for a while longer ethernet stops working entirely.  Not sure
why but I'll see if I can figure that out.  This was on my 2.6.35 kernel.

> However, I've had something of a breakthrough with iMX6 and half-duplex.
> I think much of the problem comes down to this ERR006358 workaround
> implemented in the driver (this apparantly doesn't affect your device.)
> The delayed work implementation, and my delayed timer implementation of
> the same are fundamentally wrong to the erratum documentation - as is
> the version implemented in the Freescale BSP.
>
> Implementing what the erratum says as an acceptable workaround improves
> things tremendously - I see iperf on a 10Mbit hub go from 1-2Mbps up to
> 8Mbps, though still with loads of collisions.  That said, I'm not that
> trusting of the error bits indicated from the FEC.
>
> The reason I mention it here is that I wonder if less wacking of the
> FEC_X_DES_ACTIVE register may help your problem.
>
> In 3.14, in the fec_enet_start_xmit function, find the
> "writel(0, fep->hwp + FEC_X_DES_ACTIVE);" and change it to:
>
>    wmb();
>
>         /* Trigger transmission start */
>         if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
>                 writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>
> and see whether that helps your problem(s).

I tried less whacking and I still have the same tx timeout issue.

Thanks,

Robert Daniels


fugang.duan@freescale.com April 2, 2014, 2:26 a.m. UTC | #29
On 04/02/2014 1:30 AM, Russell King - ARM Linux wrote:
>On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
>> Hi Russell,
>>
>> What value do you have for pad control register
>> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?
>
>You're referring to the iMX6Q, the Hummingboard is iMX6S, so that's 0x20e06bc.
>I have value 0x1b030.
>
>> I notice in some of the DTS files that this has a value of 0x100b0,
>> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
>> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).
>
>Yes, bit 7 should not be set as it's reserved, but this is ignored.
>Setting the hysteresis bit affects the input thresholds, and this signal is
>only ever used as an output.  While selecting HYS mode for an output may seem
>odd, that should not have any effect here.
>
>In any case, TX_CTL doesn't have much to do with it.  I've fully proved that
>the interface works fine under full duplex conditions, achieving 500Mbps
>transmit and 570-600Mbps receive with TCP connections (which is nicely beyond
>what freescale claim the hardware is capable of.)
>
Hi, Russell,

Since the i.MX6 Q/DL/S silicon's ENET has a bandwidth limitation (the bandwidth is 400 ~ 700 Mbps, as stated by the SoC architects), we claim our ENET performance as 460Mbps.
In fact, for our kernel 3.0.35 series releases and the kernel 3.10.17 release:
TCP tx: 470Mbps (stable)
TCP rx: 600Mbps (stable)
[...]

Thanks,
Andy
fugang.duan@freescale.com April 2, 2014, 3:19 a.m. UTC | #30
Russell King wrote on 04/01/2014 04:51:49 PM:
>On Tue, Apr 01, 2014 at 01:38:37PM -0600, robert.daniels@vantagecontrols.com
>wrote:
>> I'm not sure where this factors in, but I originally saw this problem
>> using the Freescale 2.6.35 kernel.  The driver there exhibits this
>> problem differently, although it could very well be a different
>> problem. What I observed was that when the FEC got into this bad state
>> the driver would attempt to transmit a socket buffer but for some
>> reason the buffer would not actually get transmitted.
>>
>> The driver would continue transmitting packets until it got all the
>> way around in the ring buffer to the buffer descriptor right before
>> the one that was never transmitted.  When this buffer descriptor was
>> set to transmit you'd get a double transmit - the new packet and the
>> previously untransmitted buffer.
>>
>> This results in out-of-order packets being sent directly from the i.MX53.
>
>At initial glance, this is coherent with my idea of the FEC skipping a ring
>entry on the initial pass around.  Then when a new entry is loaded,
>
>Let's say that the problem entry is number 12 that has been skipped.
>When we get back around to entry 11, the FEC will transmit entries 11 and 12,
>as you rightly point out, and it will then look at entry 13 for the next packet.
>
>However, the driver loads the next packet into entry 12, and hits the FEC to
>transmit it.  The FEC re-reads entry 13, finds no packet, so does nothing.
>
>Then the next packet is submitted to the driver, and it enters it into entry 13,
>again hitting the FEC.  The FEC now sees the entry at 13, meanwhile the entry
>at 12 is still pending.
>
>> I hope this additional information is useful, I don't know enough
>> about these low-level networking details to contribute much but it's
>> possible that what I've seen in the 2.6.35 kernel is actually the same
>> issue that I'm seeing in the 3.14 kernel but handled better.
>
>It confirms the theory, but doesn't really provide many clues towards a
>solution at the moment.
>
>However, I've had something of a breakthrough with iMX6 and half-duplex.
>I think much of the problem comes down to the ERR006358 workaround implemented
>in the driver (this apparently doesn't affect your device.)  The delayed work
>implementation, and my delayed timer implementation of the same, are
>fundamentally at odds with the erratum documentation - as is the version
>implemented in the Freescale BSP.
>
>Implementing what the erratum says as an acceptable workaround improves things
>tremendously - I see iperf on a 10Mbit hub go from 1-2Mbps up to 8Mbps, though
>still with loads of collisions.  That said, I'm not that trusting of the error
>bits indicated from the FEC.
I don't have a 10Mbit hub, but I used ethtool to change the PHY speed/duplex mode on the imx6q sabresd platform:
- 100M half: works fine
- 10M half: works fine; in an iperf test, the tx bandwidth is 6.2 ~ 11 Mbps.

Log:
root@freescale /data/ptp-debug$ ./ethtool -s eth0 autoneg off speed 10 duplex half
root@freescale /data/ptp-debug$ PHY: 1:01 - Link is Down
root@freescale /data/ptp-debug$ PHY: 1:01 - Link is Up - 10/Half

root@freescale /data/ptp-debug$ iperf -c 10.192.242.202 -t 100 -i 1
------------------------------------------------------------
Client connecting to 10.192.242.202, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.192.242.124 port 52725 connected with 10.192.242.202 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   768 KBytes  6.29 Mbits/sec
[  3]  1.0- 2.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3]  2.0- 3.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  3.0- 4.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  4.0- 5.0 sec   768 KBytes  6.29 Mbits/sec
[  3]  5.0- 6.0 sec  1.38 MBytes  11.5 Mbits/sec
[  3]  6.0- 7.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  7.0- 8.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3]  8.0- 9.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  9.0-10.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 10.0-11.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 11.0-12.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 12.0-13.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 13.0-14.0 sec  1.38 MBytes  11.5 Mbits/sec
[  3] 14.0-15.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 15.0-16.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 16.0-17.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 17.0-18.0 sec   512 KBytes  4.19 Mbits/sec
[  3] 18.0-19.0 sec  1.62 MBytes  13.6 Mbits/sec
[  3] 19.0-20.0 sec   640 KBytes  5.24 Mbits/sec
[  3] 20.0-21.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3] 21.0-22.0 sec   640 KBytes  5.24 Mbits/sec
[  3] 22.0-23.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 23.0-24.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3] 24.0-25.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 25.0-26.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 26.0-27.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 27.0-28.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 28.0-29.0 sec  1.00 MBytes  8.39 Mbits/sec
....

>
>The reason I mention it here is that I wonder if less wacking of the
>FEC_X_DES_ACTIVE register may help your problem.
>
>In 3.14, in the fec_enet_start_xmit function, find the "writel(0, fep->hwp +
>FEC_X_DES_ACTIVE);" and change it to:
>
>	wmb();
>
>        /* Trigger transmission start */
>        if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
>                writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>
>and see whether that helps your problem(s).
>
So far, I don't see wmb() having any effect on the FEC.  Some years ago we wanted to add wmb() before setting the BD "Ready" bit, but we didn't find any issue without wmb(); on the contrary, it introduced more
system overhead, especially for Gbps networking.  And now, on imx6sx silicon (single core), adding wmb() causes tx performance to drop a lot since CPU loading is the bottleneck.  In our BSP release, we don't see the
tx watchdog timeout issue, and don't see the ping order reversal issue after overnight stress testing.

Thanks,
Andy
Russell King - ARM Linux April 2, 2014, 8:59 a.m. UTC | #31
On Wed, Apr 02, 2014 at 03:19:58AM +0000, fugang.duan@freescale.com wrote:
> >	wmb();
> >
> >        /* Trigger transmission start */
> >        if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
> >                writel(0, fep->hwp + FEC_X_DES_ACTIVE);
> >
> >and see whether that helps your problem(s).
> >
> So far, I don't see wmb() having any effect on the FEC.  Some years ago, we
> wanted to add wmb() before setting the BD "Ready" bit, but we didn't find
> any issue without wmb(); on the contrary, it introduced more system
> overhead, especially for Gbps networking.

I wonder whether you understand what is going on here, and why it is
required.  I doubt it somehow from your comments.  Maybe if you were
to read about the operation of the store buffer in the PL310, it may
open your eyes to why it would be necessary for reliable operation.

> And now, on imx6sx silicon (single core), adding wmb() causes tx performance
> to drop a lot since CPU loading is the bottleneck.

I don't find any measurable performance drop from adding it on either
iMX6Q or iMX6S with the L2 cache code fixed up.

The reality is, with the mainline ethernet driver with no changes, the
best it can do is around 300Mbps transmit and 400Mbps receive.  With
both my L2 and FEC changes, this has increased to 500Mbps transmit and
570Mbps receive.  Plus it now works reasonably (though with lots of
collisions) on half-duplex giving a 4 to 8 fold increase in speed there.

I really don't care how your private BSP kernels perform.  It's not
mainline, so it's not of any interest.
fugang.duan@freescale.com April 2, 2014, 9:40 a.m. UTC | #32
From: Russell King - ARM Linux <linux@arm.linux.org.uk>
Date: Wednesday, April 02, 2014 4:59 PM

>To: Duan Fugang-B38611
>Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
>Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org
>Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>
>On Wed, Apr 02, 2014 at 03:19:58AM +0000, fugang.duan@freescale.com wrote:
>> >	wmb();
>> >
>> >        /* Trigger transmission start */
>> >        if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
>> >                writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>> >
>> >and see whether that helps your problem(s).
>> >
>> So far, I don't see wmb() bring any effort on FEC.  Some years ago, we
>> want to add wmb() before set BD "Ready" bit, but we don't find any
>> issue without wmb(), on the contrary, it introduce more System
>> overhead special for Gbps networking.
>
>I wonder whether you understand what is going on here, and why it is required.
>I doubt it somehow from your comments.  Maybe if you were to read about the
>operation of the store buffer in the PL310, it may open your eyes to why it
>would be necessary for reliable operation.
>
In the kernel 3.0.35 internal BSP, BD memory is non-cacheable and non-bufferable (we added a new API to support it: dma_alloc_noncacheable()),
so wmb() is not necessary.  We don't see the net watchdog timeout / ping ordering issue.

>> And now, imx6sx sillicon (sigle core), add wmb() cause tx performance
>> drop much since cpu loading is the bottleneck.
>
>I don't find any measurable performance drop from adding it on either iMX6Q or
>iMX6S with the L2 cache code fixed up.
>
Yes, it doesn't impact imx6q, since CPU loading is not the bottleneck there: the rx/tx bandwidth is lower and there are multiple cores.
But for imx6sx, ENET rx can reach 940Mbps and tx can reach 900Mbps, and imx6sx is single core.  The ENET IP doesn't support the
TSO feature, so CPU loading is the bottleneck.  wmb() is very expensive, which causes tx performance to drop a lot.

>The reality is, with the mainline ethernet driver with no changes, the best it
>can do is around 300Mbps transmit and 400Mbps receive.  With both my L2 and FEC
>changes, this has increased to 500Mbps transmit and 570Mbps receive.  Plus it
>now works reasonably (though with lots of
>collisions) on half-duplex giving a 4 to 8 fold increase in speed there.
>
>I really don't care how your private BSP kernels perform.  It's not mainline,
>so it's not of any interest.
>
Yes, I agree.  There are some arch/driver patches that need to go upstream to align the performance with the internal BSP.

>--
>FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
>improving, and getting towards what was expected from it.
>

Thanks,
Andy
Russell King - ARM Linux April 2, 2014, 10:46 a.m. UTC | #33
On Wed, Apr 02, 2014 at 09:40:53AM +0000, fugang.duan@freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> Date: Wednesday, April 02, 2014 4:59 PM
> >I wonder whether you understand what is going on here, and why it is required.
> >I doubt it somehow from your comments.  Maybe if you were to read about the
> >operation of the store buffer in the PL310, it may open your eyes to why it
> >would be necessary for reliable operation.
> 
> In the kernel 3.0.35 internal BSP, BD memory is non-cacheable and
> non-bufferable (we added a new API to support it: dma_alloc_noncacheable()),

As is the memory you get from dma_alloc_coherent().  So, why did you
invent a new API which does something which the mainline kernel APIs
already do?

Maybe yours is doing something different but you haven't explained it
in correct terminology.

> So wmb() is not necessary.

Even on non-cacheable normal memory, the wmb() is required.  Please read
up in the ARM architecture reference manual about memory types and their
various attributes, followed by the memory ordering chapters.

> Yes, it doesn't impact imx6q, since CPU loading is not the bottleneck there:
> the rx/tx bandwidth is lower and there are multiple cores.  But for imx6sx,
> ENET rx can reach 940Mbps and tx can reach 900Mbps, and imx6sx is single core.

What netdev features do you support to achieve that?

> The ENET IP doesn't support the TSO feature, so CPU loading is the
> bottleneck.  wmb() is very expensive, which causes tx performance to drop a lot.

wmb() is very expensive because of the L2 cache code using a sledge hammer
with it - particularly the spinlock, which has a large overhead if lockdep
or spinlock debugging is enabled.

> Yes, I agree.  There are some arch/driver patches that need to go upstream
> to align the performance with the internal BSP.

Well, post these patches so that people can test them.  If your patches
are indeed the worlds best thing since sliced bread, I've just wasted
over two months of solid work on the iMX6 ethernet driver.

However, hacks like dma_alloc_noncacheable will not be acceptable for
mainline.
fugang.duan@freescale.com April 2, 2014, 11:33 a.m. UTC | #34
From: Russell King - ARM Linux <linux@arm.linux.org.uk>
Date: Wednesday, April 02, 2014 6:47 PM

>To: Duan Fugang-B38611
>Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
>Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org
>Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>
>On Wed, Apr 02, 2014 at 09:40:53AM +0000, fugang.duan@freescale.com wrote:
>> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
>> Data: Wednesday, April 02, 2014 4:59 PM
>> >I wonder whether you understand what is going on here, and why it is
>required.
>> >I doubt it somehow from your comments.  Maybe if you were to read
>> >about the operation of the store buffer in the PL310, it may open
>> >your eyes to why it would be necessary for reliable operation.
>>
>> In kernel 3.0.35 internal BSP,  BD memory is non-cacheable,
>> non-bufferable (we add new api to support it:
>> dma_alloc_noncacheable()),
>
>As is the memory you get from dma_alloc_coherent().  So, why did you invent a
>new API which does something which the mainline kernel APIs already do?
>
>Maybe yours is doing something different but you haven't explained it in
>correct terminology.

In kernel 3.0.35, there is no CMA memory allocation mechanism.
The following kernel configs are enabled:
CONFIG_ARM_DMA_MEM_BUFFERABLE
CONFIG_SMP

Memory allocated with dma_alloc_coherent() must be non-cacheable, but it is bufferable.
Memory allocated with the newly invented API "dma_alloc_noncacheable()" is non-cacheable and non-bufferable; the memory type is strongly ordered.

>
>> So wmb() is not necessary.
>
>Even on non-cacheable normal memory, the wmb() is required.  Please read up in
>the ARM architecture reference manual about memory types and their various
>attributes, followed by the memory ordering chapters.
>
>> Yes, it don't impact imx6q since cpu loading is not bottleneck due
>> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx can
>> reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.
>
>What netdev features do you support to achieve that?
>
The imx6sx ENET acceleration features support CRC checksum and interrupt coalescing,
so we enable those two features.

>> Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb()
>> is very expensive which cause tx performance drop much.
>
>wmb() is very expensive because of the L2 cache code using a sledge hammer with
>it - particularly the spinlock, which has a large overhead if lockdep or
>spinlock debugging is enabled.
>
Yes, if wmb() is added to xmit(), imx6sx ENET performance drops by more than 100Mbps.

[...]

Thanks,
Andy
Russell King - ARM Linux April 2, 2014, 4:51 p.m. UTC | #35
On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan@freescale.com wrote:
> In kernel 3.0.35, there is no CMA memory allocation mechanism.
> The following kernel configs are enabled:
> CONFIG_ARM_DMA_MEM_BUFFERABLE
> CONFIG_SMP
> 
> Memory allocated with dma_alloc_coherent() must be non-cacheable, but it
> is bufferable.  Memory allocated with the newly invented API
> "dma_alloc_noncacheable()" is non-cacheable and non-bufferable; the memory
> type is strongly ordered.

Right, so what you've just said is that it's fine to violate the
requirements of the architecture L1 memory model by setting up a
strongly ordered memory mapping for the same physical addresses as
an existing mapping which is mapped as normal memory.

Sorry, I'm not going to listen to you anymore, you just lost any kind
of authority on this matter.

> >> So wmb() is not necessary.
> >
> >Even on non-cacheable normal memory, the wmb() is required.  Please read up in
> >the ARM architecture reference manual about memory types and their various
> >attributes, followed by the memory ordering chapters.
> >
> >> Yes, it don't impact imx6q since cpu loading is not bottleneck due
> >> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx can
> >> reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.
> >
> >What netdev features do you support to achieve that?
> >
> The imx6sx ENET acceleration features support CRC checksum and interrupt
> coalescing, so we enable those two features.

Checksum and... presumably you're referring to NAPI don't get you to that
kind of speed.  Even on x86, you can't get close to wire speed without
GSO, which you need scatter-gather for, and you don't support that.  So
I don't believe your 900Mbps figure.

Plus, as you're memcpy'ing every packet received, I don't believe you can
reach 940Mbps receive either.

> >> Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb()
> >> is very expensive which cause tx performance drop much.
> >
> >wmb() is very expensive because of the L2 cache code using a sledge hammer with
> >it - particularly the spinlock, which has a large overhead if lockdep or
> >spinlock debugging is enabled.
>
> Yes, if wmb() is added to xmit(), imx6sx ENET performance drops by more
> than 100Mbps.

In any case, I suspect that isn't directly attributable to wmb() itself.
What I've noticed is that even changing an unsigned short to an unsigned
int *can* result in a substantial performance drop.  Although the
unsigned int results in fewer instructions, it's lower performance because
they're placed differently, and the efficiency of the instruction cache
changes, resulting in different throughput.

What this means is that even changing compiler versions can get you
significantly different performance figures.  So I don't attribute very
much credence to "wmb() causes performance to drop 100Mbps".  It may
very well go back up with some other changes which result in a slightly
different placement of the instructions.
fugang.duan@freescale.com April 3, 2014, 2:41 a.m. UTC | #36
From: Russell King - ARM Linux <linux@arm.linux.org.uk>
Date: Thursday, April 03, 2014 12:51 AM

>To: Duan Fugang-B38611
>Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
>Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org
>Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>
>On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan@freescale.com wrote:
>> In kernel 3.0.35, there have no CMA memory allocate mechanism.
>> Below Kernel configs are enabled:
>> CONFIG_ARM_DMA_MEM_BUFFERABLE
>> CONFIG_SMP
>>
>> If use dma_alloc_coherent() allocate memory, it must be non-cacheable,
>> but bufferable.  The new invented api "dma_alloc_noncacheable()"
>> allocate memory is non-cacheable, non-bufferable, the memory type is
>> Strongly ordered.
>
>Right, so what you've just said is that it's fine to violate the requirements
>of the architecture L1 memory model by setting up a strongly ordered memory
>mapping for the same physical addresses as an existing mapping which is mapped
>as normal memory.
>
>Sorry, I'm not going to listen to you anymore, you just lost any kind of
>authority on this matter.
>
>> >> So wmb() is not necessary.
>> >
>> >Even on non-cacheable normal memory, the wmb() is required.  Please
>> >read up in the ARM architecture reference manual about memory types
>> >and their various attributes, followed by the memory ordering chapters.
>> >
>> >> Yes, it don't impact imx6q since cpu loading is not bottleneck due
>> >> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx
>> >> can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.
>> >
>> >What netdev features do you support to achieve that?
>> >
>> Imx6sx enet accleration feature support crc checksum, interrupt coalescing.
>> So we enable the two features.
>
>Checksum and... presumably you're referring to NAPI don't get you to that kind
>of speed.  Even on x86, you can't get close to wire speed without GSO, which
>you need scatter-gather for, and you don't support that.  So I don't believe
>your 900Mbps figure.
>
>Plus, as you're memcpy'ing every packet received, I don't believe you can reach
>940Mbps receive either.
>
Since the imx6sx ENET still doesn't support TSO or jumbo packets, scatter-gather cannot improve ethernet performance in
most cases, especially for iperf tests.

imx6sx: single core, CPU frequency is 996MHz, cpufreq governor is performance.
Kernel config: SMP disabled
For the rx path:
	- hw acceleration: CRC checksum, interrupt coalescing.
	- software part: NAPI, new skb allocation instead of memory copy.
	- Test result: 940Mbps, 8% CPU idle
For the tx path:
	- hw acceleration: CRC checksum, interrupt coalescing.
	- software part: NAPI, no memory copy in the driver since the tx DMA supports byte-aligned data buffers.
	- Test result: 900Mbps, CPU loading near 100%

>> >> Enet IP don't support TSO feaure, cpu loading is the bottleneck.
>> >> Wmb() is very expensive which cause tx performance drop much.
>> >
>> >wmb() is very expensive because of the L2 cache code using a sledge
>> >hammer with it - particularly the spinlock, which has a large
>> >overhead if lockdep or spinlock debugging is enabled.
>>
>> Yes, if add wmb() to xmit(), imx6sx enet performance will drop more
>> than 100Mbps.
>
>In any case, I suspect that isn't directly attributable to wmb() itself.
>What I've noticed is that even changing an unsigned short to an unsigned int
>*can* result in a substantial performance drop.  Although the unsigned int
>results in fewer instructions, it's lower performance because they're placed
>differently, and the efficiency of the instruction cache changes, resulting in
>different throughput.
>
That is interesting; I will try it.

>What this means is that even changing compiler versions can get you
>significantly different performance figures.  So I don't attribute very much
>credence to "wmb() causes performance to drop 100Mbps".  It may very well go
>back up with some other changes which result in a slightly different placement
>of the instructions.
>
I tested the performance with three compilers:
gcc-4.4.4-glibc-2.11.1-multilib-1.0,
4.7
4.8.1

The test results are similar.

Thanks,
Andy
Russell King - ARM Linux April 3, 2014, 8:56 a.m. UTC | #37
On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan@freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> Date: Thursday, April 03, 2014 12:51 AM
> 
> >To: Duan Fugang-B38611
> >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
> >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org
> >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
> >
> >On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan@freescale.com wrote:
> >> In kernel 3.0.35, there have no CMA memory allocate mechanism.
> >> Below Kernel configs are enabled:
> >> CONFIG_ARM_DMA_MEM_BUFFERABLE
> >> CONFIG_SMP
> >>
> >> If use dma_alloc_coherent() allocate memory, it must be non-cacheable,
> >> but bufferable.  The new invented api "dma_alloc_noncacheable()"
> >> allocate memory is non-cacheable, non-bufferable, the memory type is
> >> Strongly ordered.
> >
> >Right, so what you've just said is that it's fine to violate the requirements
> >of the architecture L1 memory model by setting up a strongly ordered memory
> >mapping for the same physical addresses as an existing mapping which is mapped
> >as normal memory.
> >
> >Sorry, I'm not going to listen to you anymore, you just lost any kind of
> >authority on this matter.
> >
> >> >> So wmb() is not necessary.
> >> >
> >> >Even on non-cacheable normal memory, the wmb() is required.  Please
> >> >read up in the ARM architecture reference manual about memory types
> >> >and their various attributes, followed by the memory ordering chapters.
> >> >
> >> >> Yes, it don't impact imx6q since cpu loading is not bottleneck due
> >> >> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx
> >> >> can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.
> >> >
> >> >What netdev features do you support to achieve that?
> >> >
> >> Imx6sx enet accleration feature support crc checksum, interrupt coalescing.
> >> So we enable the two features.
> >
> >Checksum and... presumably you're referring to NAPI don't get you to that kind
> >of speed.  Even on x86, you can't get close to wire speed without GSO, which
> >you need scatter-gather for, and you don't support that.  So I don't believe
> >your 900Mbps figure.
> >
> >Plus, as you're memcpy'ing every packet received, I don't believe you can reach
> >940Mbps receive either.
> >
> Since the imx6sx ENET still doesn't support TSO or jumbo packets,
> scatter-gather cannot improve ethernet performance in most cases,
> especially for iperf tests.

Again, you are losing credibility every time you deny stuff like this.
I'm now at the point of just not listening to you anymore because you're
contradicting what I know to be solid fact through my own measurements.

This seems to be Freescale's overall attitude - as I've read on
Freescale's forums.  Your customers/users are always wrong, you're always
right.  Eg, any performance issues are not the fault of Freescale stuff,
it's tarnished connectors or similar.
fugang.duan@freescale.com April 3, 2014, 9:55 a.m. UTC | #38
From: Russell King - ARM Linux <linux@arm.linux.org.uk>
Date: Thursday, April 03, 2014 4:57 PM

>To: Duan Fugang-B38611
>Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
>Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org
>Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>
>On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan@freescale.com wrote:
>> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
>> Data: Thursday, April 03, 2014 12:51 AM
>>
>> >To: Duan Fugang-B38611
>> >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel;
>> >Troy Kisky; Grant Likely; Bernd Faust; Fabio Estevam;
>> >linux-arm-kernel@lists.infradead.org
>> >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>> >
>> >On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan@freescale.com wrote:
>> >> In kernel 3.0.35, there have no CMA memory allocate mechanism.
>> >> Below Kernel configs are enabled:
>> >> CONFIG_ARM_DMA_MEM_BUFFERABLE
>> >> CONFIG_SMP
>> >>
>> >> If use dma_alloc_coherent() allocate memory, it must be
>> >> non-cacheable, but bufferable.  The new invented api
>"dma_alloc_noncacheable()"
>> >> allocate memory is non-cacheable, non-bufferable, the memory type
>> >> is Strongly ordered.
>> >
>> >Right, so what you've just said is that it's fine to violate the
>> >requirements of the architecture L1 memory model by setting up a
>> >strongly ordered memory mapping for the same physical addresses as an
>> >existing mapping which is mapped as normal memory.
>> >
>> >Sorry, I'm not going to listen to you anymore, you just lost any kind
>> >of authority on this matter.
>> >
>> >> >> So wmb() is not necessary.
>> >> >
>> >> >Even on non-cacheable normal memory, the wmb() is required.
>> >> >Please read up in the ARM architecture reference manual about
>> >> >memory types and their various attributes, followed by the memory
>ordering chapters.
>> >> >
>> >> >> Yes, it don't impact imx6q since cpu loading is not bottleneck
>> >> >> due rx/tx bandwidth is slow and multi-cores.  But for imx6sx,
>> >> >> enet rx can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle
>core.
>> >> >
>> >> >What netdev features do you support to achieve that?
>> >> >
>> >> Imx6sx enet accleration feature support crc checksum, interrupt coalescing.
>> >> So we enable the two features.
>> >
>> >Checksum and... presumably you're referring to NAPI don't get you to
>> >that kind of speed.  Even on x86, you can't get close to wire speed
>> >without GSO, which you need scatter-gather for, and you don't support
>> >that.  So I don't believe your 900Mbps figure.
>> >
>> >Plus, as you're memcpy'ing every packet received, I don't believe you
>> >can reach 940Mbps receive either.
>> >
>> Since Imx6sx enet still don't support TSO and Jumbo packet,
>> scatter-gather cannot improve ethernet performance in Most cases
>> special for iperf test.
>
>Again, you are losing credibility every time you deny stuff like this.
>I'm now at the point of just not listening to you anymore because you're
>contradicting what I know to be solid fact through my own measurements.
>
>This seems to be Freescale's overall attitude - as I've read on Freescale's
>forums.  Your customers/users are always wrong, you're always right.  Eg, any
>performance issues are not the fault of Freescale stuff, it's tarnished
>connectors or similar.
>
Hi, Russell,

I don't dispute your thinking/solution and measurements.  You are an expert on ARM and these modules, and we keep a learning attitude when discussing with you.
For imx6sx, we did indeed get that result.  For the imx6q/dl Linux upstream work, you did a great job on performance tuning, and the test results are similar
to our internal test results.  Your suggestions for the optimization are meaningful.  Please understand my thinking.

Thanks,
Andy
Russell King - ARM Linux April 3, 2014, 10:32 a.m. UTC | #39
On Thu, Apr 03, 2014 at 09:55:06AM +0000, fugang.duan@freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> Date: Thursday, April 03, 2014 4:57 PM
> 
> >To: Duan Fugang-B38611
> >Cc: robert.daniels@vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
> >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel@lists.infradead.org
> >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
> >
> >On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan@freescale.com wrote:
> >> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> >> Data: Thursday, April 03, 2014 12:51 AM
> >>
> >> >Checksum and... presumably you're referring to NAPI don't get you to
> >> >that kind of speed.  Even on x86, you can't get close to wire speed
> >> >without GSO, which you need scatter-gather for, and you don't support
> >> >that.  So I don't believe your 900Mbps figure.
> >> >
> >> >Plus, as you're memcpy'ing every packet received, I don't believe you
> >> >can reach 940Mbps receive either.
> >> >
> >> Since Imx6sx enet still don't support TSO and Jumbo packet,
> >> scatter-gather cannot improve ethernet performance in Most cases
> >> special for iperf test.
> >
> >Again, you are losing credibility every time you deny stuff like this.
> >I'm now at the point of just not listening to you anymore because you're
> >contradicting what I know to be solid fact through my own measurements.
> >
> >This seems to be Freescale's overall attitude - as I've read on Freescale's
> >forums.  Your customers/users are always wrong, you're always right.  Eg, any
> >performance issues are not the fault of Freescale stuff, it's tarnished
> >connectors or similar.
> >
> Hi, Russell,
> 
> I don't dispute your thinking/solution and measurements.  You are an
> expert on ARM and these modules, and we keep a learning attitude when
> discussing with you.  For imx6sx, we did indeed get that result.  For the
> imx6q/dl Linux upstream work, you did a great job on performance tuning,
> and the test results are similar to our internal test results.  Your
> suggestions for the optimization are meaningful.  Please understand my thinking.

The reason I said what I said is because I'm not talking about TSO.  I'm
talking about GSO.  They're similar features, but are done in totally
different ways.

TSO requires either hardware support, or driver cooperation to segment
the data.  GSO does not.

There's several points here which make me discount your figures:

1. The overhead of routing packets is not insignificant.

   GSO is a feature where the higher levels (such as TCP) can submit
   large socket buffers to the lower levels, all the way through to
   packet queues to just before the device.

   At the very last moment, just before the buffer is handed over to
   the driver's start_xmit function, the buffer is carved into
   appropriately sized chunks, and new skbuffs allocated.  Each skbuff
   "head" contains the protocol headers, and the associated fragments
   contain pointers/lengths into the large buffer.  (This is why SG is
   required for GSO.)

   What this means is that all the overhead from the layers such as
   TCP / IP / routing / packet queuing are all reduced because rather
   than those code paths having to be run for every single 1500 byte
   packet, they're run once for maybe 16 to 32K of data to be sent.
   This avoids the overhead from a lot of code.

   I have monitored the RGMII TX_CTL signal while performing iperf
   tests without GSO, and I can see from that the gap between packets,
   it is very rare that it gets to the point of transmitting two
   packets back to back.

2. At 500Mbps transmit on the iMX6S, performance testing shows the
   majority of CPU time is spent in the cache cleaning/flushing code.
   This overhead is necessary and can't be removed without risking
   data corruption.

3. I discount your receive figure because even with all my work, if
   I set my x86 box to perform a UDP iperf test with x86 sending to
   iMX6S, the result remains extremely poor, with 95% (modified) to 99%
   (unmodified) of UDP packets lost.  This is not entirely the fault
   of the FEC.  Around 70% of the packet loss is in the UDP receive
   path - iperf seems to be unable to read the packets from UDP fast
   enough.  (This is confirmed by checking the statistics in /proc/net/snmp.)
   This means 30% of packet loss is due to the FEC not keeping up with
   the on-wire packet rate (which is around 810Mbps.)

   Therefore, I place the maximum receive packet rate of the FEC with
   the current packet-copy strategy in the receive path and 128 entry
   receive ring size at around 550Mbps, and again, much of the overhead
   comes from the cache handling code according to perf.
Shawn Guo April 3, 2014, 1:36 p.m. UTC | #40
On Thu, Apr 03, 2014 at 11:32:06AM +0100, Russell King - ARM Linux wrote:
> > Hi, Russell,
> > 
> > I don't contradict your thinking/solution and measurements.  You are
> > an expert on arm/modules; we keep a study attitude to discuss with you.
> > For imx6sx, we indeed get the result. For imx6q/dl linux upstream, you
> > did a great job on performance tuning, and the test result is similar
> > to our internal test result.  Your suggestion for the optimization is
> > meaningful.  Please understand my thinking.
> 
> The reason I said what I said is because I'm not talking about TSO.  I'm
> talking about GSO.  They're similar features, but are done in totally
> different ways.
> 
> TSO requires either hardware support, or driver cooperation to segment
> the data.  GSO does not.
> 
> There are several points here which make me discount your figures:

Russell,

In case there is any confusion: the 900Mbps figure that Fugang cited is
not from any of the i.MX6 SoCs that are publicly available - i.MX6SoloLite
(imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q) - but from
a new member of the i.MX6 family, i.MX6SoloX (imx6sx).  This new SoC hasn't
been announced by Freescale yet.  One major improvement of this new SoC
over its ancestors is the FEC throughput: it claims 1Gbps throughput
support.  So it's really a hardware optimization rather than anything
that software can do.

Shawn
Russell King - ARM Linux April 3, 2014, 1:45 p.m. UTC | #41
On Thu, Apr 03, 2014 at 09:36:41PM +0800, Shawn Guo wrote:
> On Thu, Apr 03, 2014 at 11:32:06AM +0100, Russell King - ARM Linux wrote:
> > > Hi, Russell,
> > > 
> > > I don't contradict your thinking/solution and measurements.  You are
> > > an expert on arm/modules; we keep a study attitude to discuss with you.
> > > For imx6sx, we indeed get the result. For imx6q/dl linux upstream, you
> > > did a great job on performance tuning, and the test result is similar
> > > to our internal test result.  Your suggestion for the optimization is
> > > meaningful.  Please understand my thinking.
> > 
> > The reason I said what I said is because I'm not talking about TSO.  I'm
> > talking about GSO.  They're similar features, but are done in totally
> > different ways.
> > 
> > TSO requires either hardware support, or driver cooperation to segment
> > the data.  GSO does not.
> > 
> > There are several points here which make me discount your figures:
> 
> Russell,
> 
> In case there is a confusion.  The 900Mbps figure that Fugang said is
> not on any of i.MX6 SoCs that are publicly available - i.MX6SoloLite
> (imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q), but on
> a new member of i.MX6 family - i.MX6SoloX (imx6sx).  This new SoC hasn't
> been announced by Freescale yet.  One major improvement of this new SoC
> over its ancestors is the FEC throughput.  It claims 1Gbps throughput
> support.  So it's really a hardware optimization instead of anything
> that software can do.

That means it's irrelevant to this discussion because it's different
hardware, with who knows what changes to the memory subsystem and
CPU/cache implementation.  Hence, it can't be compared in any way to
the performance I see on iMX6Q and iMX6S.

What matters here is fixing the issues with the mainline driver on
hardware that people have today - not only the bugs and instability
that people are seeing, but also the low performance.

If some of that involves doing stuff "The Right Way" by using memory
barriers rather than violating the architecture specification and
introducing new kernel APIs in order to do so, we're going to do it
the right way, even if that means that those changes may not be in the
best interest of this unreleased part.  There is no room for compromise
on that.
Shawn Guo April 3, 2014, 2 p.m. UTC | #42
On Thu, Apr 03, 2014 at 02:45:10PM +0100, Russell King - ARM Linux wrote:
> On Thu, Apr 03, 2014 at 09:36:41PM +0800, Shawn Guo wrote:
> > Russell,
> > 
> > In case there is a confusion.  The 900Mbps figure that Fugang said is
> > not on any of i.MX6 SoCs that are publicly available - i.MX6SoloLite
> > (imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q), but on
> > a new member of i.MX6 family - i.MX6SoloX (imx6sx).  This new SoC hasn't
> > been announced by Freescale yet.  One major improvement of this new SoC
> > over its ancestors is the FEC throughput.  It claims 1Gbps throughput
> > support.  So it's really a hardware optimization instead of anything
> > that software can do.
> 
> That means it's irrelevant to this discussion because it's different
> hardware, with who knows what changes to the memory subsystem and
> CPU/cache implementation.  Hence, it can't be compared in any way to
> the performance I see on iMX6Q and iMX6S.

Exactly.  I don't think that Fugang should bring the i.MX6SX figure into
the discussion either; from reading the thread today, I can see that is
where the confusion comes from.

Shawn
robert.daniels@vantagecontrols.com April 4, 2014, 8:21 p.m. UTC | #43
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014
04:51:49 PM:

> At initial glance, this is coherent with my idea of the FEC skipping a
> ring entry on the initial pass around.  Then when a new entry is loaded,
>
> Let's say that the problem entry is number 12 that has been skipped.
> When we get back around to entry 11, the FEC will transmit entries 11
> and 12, as you rightly point out, and it will then look at entry 13
> for the next packet.
>
> However, the driver loads the next packet into entry 12, and hits the
> FEC to transmit it.  The FEC re-reads entry 13, finds no packet, so
> does nothing.
>
> Then the next packet is submitted to the driver, and it enters it into
> entry 13, again hitting the FEC.  The FEC now sees the entry at 13,
> meanwhile the entry at 12 is still pending.

I've explored the option of providing a workaround for this observed
problem.  In the 2.6.35 kernel, I've used the BD_ENET_TX_INTR flag (marked
as TO2 in the RM), which was being set but never used by the software, to
help detect the skip.  In the interrupt handler I clear this bit, which
allows me to use it to know when the bd goes from ready -> clean.  In the
error situation above, you would end up with bd 12 marked with both READY
and INTR, and at some point you would have bd 13 marked as INTR only (as
it would have been transmitted but not cleaned up).  This allows me to
clean up bd 12 despite it not actually being transmitted.  This gets the
driver back in sync with the FEC and things continue on normally... until
it happens again.

Of course, this results in the occasional dropped packet, but for now
(until Freescale figures out what's going on) I feel this is better than
nothing.  At least the driver is able to recover somewhat from the
situation.

I'm not sure whether the mainline driver could benefit from a strategy
like this, especially since it manifested this problem differently (tx
transmit timeout).  Also, in my opinion this is an undesirable hack to
make things work acceptably.  There could also be some inherent problem
with this strategy that I'm unaware of, since dealing with Linux kernel
ethernet drivers is not exactly my area of expertise.

Any thoughts or new insights?

Thanks,

Robert Daniels

This email, and any document attached hereto, may contain
confidential and/or privileged information.  If you are not the
intended recipient (or have received this email in error) please
notify the sender immediately and destroy this email.  Any
unauthorized, direct or indirect, copying, disclosure, distribution
or other use of the material or parts thereof is strictly
forbidden.
Jaccon Bastiaansen April 29, 2014, 9:26 a.m. UTC | #44
Hello all,

I tried the FEC patches from Russell
(http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing),
but with the following test the FEC completely stops receiving frames:

     Linux PC <----->  Gigabit Ethernet switch  <------> SabreSD board

Both ethernet links run at 1 Gigabit, full duplex.

The SabreSD board runs the fec-testing kernel from Russell.

On the SabreSD board we run "iperf -s -u"

On the Linux PC we
- run the iperf client (command used: "while [ 1 ]; do iperf -c 'IP
address of SabreSD' -u -b 100m -t 300 -l 256; sleep 1; done")
- ping the SabreSD board every second

After a while (sometimes 10 seconds, sometimes a couple of minutes),
we see that the SabreSD board stops replying to the pings from the
Linux PC. Closer inspection shows that the FEC doesn't generate
receive frame interrupts anymore. Analysis with a debugger shows that
the RXF interrupt is enabled and that the FEC is still receiving
ethernet frames (because the IEEE_R_FRAME_OK event counter is
increasing), but no receive frame interrupts are occurring. We only
see MII interrupts occurring.

The RDAR register has the value 0 meaning that the FEC cannot write
the received frame in main memory because of a lack of available free
receive descriptors.

When the FEC has stopped generating receive interrupts, we used the
ping command on the Sabre board to send some frames from the Sabre
board to the Linux PC. The FEC then starts to generate receive
interrupts again. The driver source code shows that the receive buffer
is also emptied when a transmit interrupt occurs. So it seems that the
FEC completely stops generating receive interrupts when frames are
received and there are no more empty descriptors.

Any ideas or insights?

Regards,
  Jaccon

2014-04-04 22:21 GMT+02:00  <robert.daniels@vantagecontrols.com>:
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014
> 04:51:49 PM:
>
>> At initial glance, this is coherent with my idea of the FEC skipping a
>> ring entry on the initial pass around.  Then when a new entry is loaded,
>>
>> Let's say that the problem entry is number 12 that has been skipped.
>> When we get back around to entry 11, the FEC will transmit entries 11
>> and 12, as you rightly point out, and it will then look at entry 13
>> for the next packet.
>>
>> However, the driver loads the next packet into entry 12, and hits the
>> FEC to transmit it.  The FEC re-reads entry 13, finds no packet, so
>> does nothing.
>>
>> Then the next packet is submitted to the driver, and it enters it into
>> entry 13, again hitting the FEC.  The FEC now sees the entry at 13,
>> meanwhile the entry at 12 is still pending.
>
> I've explored the option of providing a work around for this observed
> problem.
> In the 2.6.35 kernel, I've used the BD_ENET_TX_INTR flag which is marked as
> TO2 in the RM and which was being set but never used by the software to
> help
> detect the skip.  In the interrupt handler I clear this bit which allows me
> to use it to know when the bd goes from ready -> clean.  In the error
> situation
> above, you would end up with 12 marked with both READY and INTR and at some
> point
> you would have bd 13 marked as INTR only (as it would have been transmitted
> but
> not cleaned up.)  This allows me to clean up bd 12 despite not actually
> being
> transmitted.  This gets the driver back in sync with the FEC and things
> continue
> on normally... until it happens again.
>
> Of course, this results in the occasional dropped packet but I feel like
> for now
> (until Freescale figures out what's going on) this is better than nothing.
> At
> least the driver is able to recover somewhat from the situation.
>
> I'm not sure if the mainline driver could benefit from a strategy like this
> or not,
> especially since it manifested this problem differently (tx transmit
> timeout).
> Also, in my opinion this is an undesirable hack to make things work
> acceptably.
> There could also be some inherent problem with this strategy that I'm
> unaware of,
> since dealing with linux kernel ethernet drivers is not exactly my area of
> expertise.
>
> Any thoughts or new insights?
>
> Thanks,
>
> Robert Daniels
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Russell King - ARM Linux May 2, 2014, 11:41 a.m. UTC | #45
On Tue, Apr 29, 2014 at 11:05:04AM +0200, Jaccon Bastiaansen wrote:
> Hello all,
> 
> I tried the FEC patches from Russell
> (http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing),
> but with the following test the FEC completely stops receiving frames:
> 
> 
>  Linux PC <----->  Gigabit Ethernet switch  <------> SabreSD board
> 
> Both ethernet links run at 1 Gigabit, full duplex.
> 
> The SabreSD board runs the fec-testing kernel from Russell.
> 
> On the SabreSD board we run "iperf -s -u"
> 
> On the Linux PC we
> - run the iperf client (command used: "while [ 1 ]; do iperf -c 'IP
> address of SabreSD' -u -b 100m -t 300 -l 256; sleep 1; done")
> - ping the SabreSD board every second
> 
> After a while (sometimes 10 seconds, sometimes a couple of minutes),
> we see that the SabreSD board stops replying to the pings from the
> Linux PC. Closer inspection shows that the FEC doesn't generate
> receive frame interrupts anymore. The attached debugger screenshots
> show the FEC registers when it is in this state. The RXF interrupt is
> enabled and we know that the FEC is still receiving ethernet frames
> (because the IEEE_R_FRAME_OK event counter is increasing), but no
> receive frame interrupts are occurring. We only see MII interrupts
> occurring.

If the RXF interrupt is enabled, it means that we must have received
less than NAPI_POLL_WEIGHT (64) frames from the ring during the previous
NAPI poll - this is the only circumstance when we will re-enable
interrupts.

There are only two conditions where we stop receiving frames from the
receive ring:
1. if we reach the NAPI poll weight number of frames received.
2. if we encounter a packet descriptor marked empty.

It can't be (1) because we would have left the TXF/RXF interrupts
disabled and waited for the next NAPI poll.  So, it can only be (2).

(2) implies that there is a ring descriptor which is marked as being
owned by the FEC, which means that there are descriptors free.  Whether
it's the one which the FEC is expecting to be free or not is something
that's impossible to tell (the FEC hardware doesn't tell us where it
is in its ring, which is a big minus point against debugging.)

It would be useful to see the state of the receive ring, as well as
the rx_next index.  That may be difficult to get though.
Jaccon Bastiaansen May 8, 2014, 9:23 a.m. UTC | #46
Hello all,

2014-05-02 13:41 GMT+02:00 Russell King - ARM Linux <linux@arm.linux.org.uk>:
> On Tue, Apr 29, 2014 at 11:05:04AM +0200, Jaccon Bastiaansen wrote:
>> Hello all,
>>
>> I tried the FEC patches from Russell
>> (http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing),
>> but with the following test the FEC completely stops receiving frames:
>>
>>
>>  Linux PC <----->  Gigabit Ethernet switch  <------> SabreSD board
>>
>> Both ethernet links run at 1 Gigabit, full duplex.
>>
>> The SabreSD board runs the fec-testing kernel from Russell.
>>
>> On the SabreSD board we run "iperf -s -u"
>>
>> On the Linux PC we
>> - run the iperf client (command used: "while [ 1 ]; do iperf -c 'IP
>> address of SabreSD' -u -b 100m -t 300 -l 256; sleep 1; done")
>> - ping the SabreSD board every second
>>
>> After a while (sometimes 10 seconds, sometimes a couple of minutes),
>> we see that the SabreSD board stops replying to the pings from the
>> Linux PC. Closer inspection shows that the FEC doesn't generate
>> receive frame interrupts anymore. The attached debugger screenshots
>> show the FEC registers when it is in this state. The RXF interrupt is
>> enabled and we know that the FEC is still receiving ethernet frames
>> (because the IEEE_R_FRAME_OK event counter is increasing), but no
>> receive frame interrupts are occurring. We only see MII interrupts
>> occurring.
>
> If the RXF interrupt is enabled, it means that we must have received
> less than NAPI_POLL_WEIGHT (64) frames from the ring during the previous
> NAPI poll - this is the only circumstance when we will re-enable
> interrupts.
>
> There are only two conditions where we stop receiving frames from the
> receive ring:
> 1. if we reach the NAPI poll weight number of frames received.
> 2. if we encounter a packet descriptor marked empty.
>
> It can't be (1) because we would have left the TXF/RXF interrupts
> disabled and waited for the next NAPI poll.  So, it can only be (2).
>
> (2) implies that there is a ring descriptor which is marked as being
> owned by the FEC, which means that there are descriptors free.  Whether
> it's the one which the FEC is expecting to be free or not is something
> that's impossible to tell (the FEC hardware doesn't tell us where it
> is in its ring, which is a big minus point against debugging.)
>
> It would be useful to see the state of the receive ring, as well as
> the rx_next index.  That may be difficult to get though.
>

I added some code to show the RX ring when the reception of frames has
stopped. See the patch below. Reading the fec_rx_ring file gives the
following output (I stripped the bit decoding here to reduce the
number of lines):

0: c0a0d000 0x0800 (== rx_bd_base)
1: c0a0d020 0x0800
2: c0a0d040 0x0800
3: c0a0d060 0x0800
4: c0a0d080 0x0800
5: c0a0d0a0 0x0800
6: c0a0d0c0 0x0800 (== rx_next)
7: c0a0d0e0 0x0800
8: c0a0d100 0x0800
9: c0a0d120 0x0800
10: c0a0d140 0x0800
11: c0a0d160 0x0800
12: c0a0d180 0x0800
13: c0a0d1a0 0x0800
14: c0a0d1c0 0x0800
15: c0a0d1e0 0x0800
16: c0a0d200 0x0800
17: c0a0d220 0x0800
18: c0a0d240 0x0800
19: c0a0d260 0x0800
20: c0a0d280 0x0800
21: c0a0d2a0 0x0800
22: c0a0d2c0 0x0800
23: c0a0d2e0 0x0800
24: c0a0d300 0x0800
25: c0a0d320 0x0800
26: c0a0d340 0x0800
27: c0a0d360 0x0800
28: c0a0d380 0x0800
29: c0a0d3a0 0x0800
30: c0a0d3c0 0x0800
31: c0a0d3e0 0x0800
32: c0a0d400 0x0800
33: c0a0d420 0x0800
34: c0a0d440 0x0800
35: c0a0d460 0x0800
36: c0a0d480 0x0800
37: c0a0d4a0 0x0800
38: c0a0d4c0 0x0800
39: c0a0d4e0 0x0800
40: c0a0d500 0x0800
41: c0a0d520 0x0800
42: c0a0d540 0x0800
43: c0a0d560 0x0800
44: c0a0d580 0x0800
45: c0a0d5a0 0x0800
46: c0a0d5c0 0x0800
47: c0a0d5e0 0x0800
48: c0a0d600 0x0800
49: c0a0d620 0x0800
50: c0a0d640 0x0800
51: c0a0d660 0x0800
52: c0a0d680 0x0800
53: c0a0d6a0 0x0800
54: c0a0d6c0 0x0800
55: c0a0d6e0 0x0800
56: c0a0d700 0x0800
57: c0a0d720 0x0800
58: c0a0d740 0x0800
59: c0a0d760 0x0800
60: c0a0d780 0x0800
61: c0a0d7a0 0x0800
62: c0a0d7c0 0x0800
63: c0a0d7e0 0x0800
64: c0a0d800 0x0800
65: c0a0d820 0x0800
66: c0a0d840 0x0800
67: c0a0d860 0x0800
68: c0a0d880 0x0800
69: c0a0d8a0 0x0800
70: c0a0d8c0 0x0800
71: c0a0d8e0 0x0800
72: c0a0d900 0x0800
73: c0a0d920 0x0800
74: c0a0d940 0x0800
75: c0a0d960 0x0800
76: c0a0d980 0x0800
77: c0a0d9a0 0x0800
78: c0a0d9c0 0x0800
79: c0a0d9e0 0x0800
80: c0a0da00 0x0800
81: c0a0da20 0x0800
82: c0a0da40 0x0800
83: c0a0da60 0x0800
84: c0a0da80 0x0800
85: c0a0daa0 0x0800
86: c0a0dac0 0x0800
87: c0a0dae0 0x0800
88: c0a0db00 0x0800
89: c0a0db20 0x0800
90: c0a0db40 0x0800
91: c0a0db60 0x0800
92: c0a0db80 0x0800
93: c0a0dba0 0x0800
94: c0a0dbc0 0x0800
95: c0a0dbe0 0x0800
96: c0a0dc00 0x0800
97: c0a0dc20 0x0800
98: c0a0dc40 0x0800
99: c0a0dc60 0x0800
100: c0a0dc80 0x0800
101: c0a0dca0 0x0800
102: c0a0dcc0 0x0800
103: c0a0dce0 0x0800
104: c0a0dd00 0x0800
105: c0a0dd20 0x0800
106: c0a0dd40 0x0800
107: c0a0dd60 0x0800
108: c0a0dd80 0x0800
109: c0a0dda0 0x0800
110: c0a0ddc0 0x0800
111: c0a0dde0 0x0800
112: c0a0de00 0x0800
113: c0a0de20 0x0800
114: c0a0de40 0x0800
115: c0a0de60 0x0800
116: c0a0de80 0x0800
117: c0a0dea0 0x0800
118: c0a0dec0 0x0800
119: c0a0dee0 0x0800
120: c0a0df00 0x0800
121: c0a0df20 0x0800
122: c0a0df40 0x0800
123: c0a0df60 0x0800
124: c0a0df80 0x0800
125: c0a0dfa0 0x0800
126: c0a0dfc0 0x0800
127: c0a0dfe0 0x2800

This shows that the complete receive ring is filled with received
frames, which matches with the value 0 in the RDAR register.

The patched FEC driver also logs "FEC receive ring full" when the
receive ring is full after the napi_complete() call and before the
enabling of the RXF interrupt.  The kernel log shows this message a
couple of times, so the RX ring has been filled completely between the
fec_enet_rx() call and the napi_complete() call.  In this test case
(100 Mbit/s, 256-byte frames) it takes less than 3 milliseconds to
completely fill the RX ring.  So a preemption between the fec_enet_rx()
call and the napi_complete() call, or a long execution time of the
fec_enet_tx() function (which is called between fec_enet_rx() and
napi_complete()), results in the RXF interrupt being enabled when
the RX ring is completely full.  It seems that in that case the RXF
interrupt will not be generated anymore.



@@ -23,17 +23,20 @@

 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/ptrace.h>
+#include <linux/debugfs.h>
 #include <linux/errno.h>
 #include <linux/ioport.h>
 #include <linux/slab.h>
 #include <linux/interrupt.h>
 #include <linux/delay.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
 #include <linux/skbuff.h>
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
 #include <linux/bitops.h>
 #include <linux/io.h>
@@ -1164,11 +1167,17 @@ static int fec_enet_rx_napi(struct napi_struct
*napi, int budget)

     fec_enet_tx(ndev);

     if (pkts < budget) {
         napi_complete(napi);
-        writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+        if (readl(fep->hwp + FEC_R_DES_ACTIVE) == 0) {
+            writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+            printk(KERN_ERR "%llu FEC: receive ring full!!!\n",
+                sched_clock());
+        } else {
+            writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+        }
     }
     return pkts;
 }

 /* ------------------------------------------------------------------------- */
@@ -2404,10 +2413,79 @@ static void fec_reset_phy(struct platform_device *pdev)
      * by machine code.
      */
 }
 #endif /* CONFIG_OF */

+#define BDRX_EMPTY    (1 << 15)
+#define BDRX_RO1    (1 << 14)
+#define BDRX_WRAP    (1 << 13)
+#define BDRX_RO2    (1 << 12)
+#define BDRX_LAST    (1 << 11)
+#define BDRX_UNDEF10    (1 << 10)
+#define BDRX_UNDEF9    (1 << 9)
+#define BDRX_MISS    (1 << 8)
+#define BDRX_BCAST    (1 << 7)
+#define BDRX_MCAST    (1 << 6)
+#define BDRX_LENERR    (1 << 5)
+#define BDRX_ALGNERR    (1 << 4)
+#define BDRX_UNDEF3    (1 << 3)
+#define BDRX_CRCERR    (1 << 2)
+#define BDRX_OVERRUN    (1 << 1)
+#define BDRX_TRUNC    (1 << 0)
+
+
+static int fec_rx_ring_show(struct seq_file *s, void *p)
+{
+    struct fec_enet_private *fep = s->private;
+    int nbds;
+    union bufdesc_u *bd;
+
+    bd = fep->rx_bd_base;
+    for (nbds = 0;nbds < fep->rx_ring_size;nbds++) {
+        seq_printf(s, "%d: %p 0x%.4x", nbds, bd, bd->ebd.desc.cbd_sc);
+        if (bd == fep->rx_bd_base)
+            seq_printf(s, " (== rx_bd_base)");
+        if (nbds == fep->rx_next)
+            seq_printf(s, " (== rx_next)");
+        seq_printf(s, "\n");
+        seq_printf(s,
+               "  [15]%s [14]%s [13]%s [12]%s [11]%s [10]%s [ 9]%s [ 8]%s"
+               "  [ 7]%s [ 6]%s [ 5]%s [ 4]%s [ 3]%s [ 2]%s [ 1]%s [ 0]%s\n",
+               bd->ebd.desc.cbd_sc & BDRX_EMPTY   ? "EMP" : "emp",
+               bd->ebd.desc.cbd_sc & BDRX_RO1     ? "RO1" : "ro1",
+               bd->ebd.desc.cbd_sc & BDRX_WRAP    ? "WRP" : "wrp",
+               bd->ebd.desc.cbd_sc & BDRX_RO2     ? "RO2" : "ro2",
+               bd->ebd.desc.cbd_sc & BDRX_LAST    ? "LST" : "lst",
+               bd->ebd.desc.cbd_sc & BDRX_UNDEF10 ? "UNA" : "una",
+               bd->ebd.desc.cbd_sc & BDRX_UNDEF9  ? "UN9" : "un9",
+               bd->ebd.desc.cbd_sc & BDRX_MISS    ? "MIS" : "mis",
+               bd->ebd.desc.cbd_sc & BDRX_BCAST   ? "BCS" : "bcs",
+               bd->ebd.desc.cbd_sc & BDRX_MCAST   ? "MCS" : "mcs",
+               bd->ebd.desc.cbd_sc & BDRX_LENERR  ? "LEN" : "len",
+               bd->ebd.desc.cbd_sc & BDRX_ALGNERR ? "ALN" : "aln",
+               bd->ebd.desc.cbd_sc & BDRX_UNDEF3  ? "UN3" : "un3",
+               bd->ebd.desc.cbd_sc & BDRX_CRCERR  ? "CRC" : "crc",
+               bd->ebd.desc.cbd_sc & BDRX_OVERRUN ? "OVR" : "ovr",
+               bd->ebd.desc.cbd_sc & BDRX_TRUNC   ? "TRC" : "trc");
+        bd += 1;
+    }
+    return 0;
+}
+
+static int fec_rx_ring_open(struct inode *inode, struct file *file)
+{
+    struct fec_enet_private *fep = inode->i_private;
+    return single_open(file, fec_rx_ring_show, fep);
+}
+
+static const struct file_operations fec_rx_ring_fops = {
+    .open        = fec_rx_ring_open,
+    .read        = seq_read,
+    .llseek        = seq_lseek,
+    .release    = single_release,
+};
+
 static int
 fec_probe(struct platform_device *pdev)
 {
     struct fec_enet_private *fep;
     struct fec_platform_data *pdata;
@@ -2558,10 +2636,18 @@ fec_probe(struct platform_device *pdev)

     if (fep->flags & FEC_FLAG_BUFDESC_EX && fep->ptp_clock)
         netdev_info(ndev, "registered PHC device %d\n", fep->dev_id);

     INIT_WORK(&fep->tx_timeout_work, fec_enet_timeout_work);
+
+    if (!debugfs_create_file("fec_rx_ring",
+                 S_IFREG | S_IRUGO,
+                 NULL,
+                 fep,
+                 &fec_rx_ring_fops))
+        goto failed_register;
+
     return 0;

 failed_register:
     fec_enet_mii_remove(fep);
 failed_mii_init:



Regards,
  Jaccon
diff mbox

Patch

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 3b8d6d19ff05..510580eeae4b 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -170,6 +170,11 @@  struct bufdesc_ex {
 	unsigned short res0[4];
 };
 
+union bufdesc_u {
+	struct bufdesc bd;
+	struct bufdesc_ex ebd;
+};
+
 /*
  *	The following definitions courtesy of commproc.h, which where
  *	Copyright (c) 1997 Dan Malek (dmalek@jlc.net).
@@ -240,14 +245,14 @@  struct bufdesc_ex {
  * the skbuffer directly.
  */
 
-#define FEC_ENET_RX_PAGES	8
+#define FEC_ENET_RX_PAGES	32
 #define FEC_ENET_RX_FRSIZE	2048
 #define FEC_ENET_RX_FRPPG	(PAGE_SIZE / FEC_ENET_RX_FRSIZE)
 #define RX_RING_SIZE		(FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
 #define FEC_ENET_TX_FRSIZE	2048
 #define FEC_ENET_TX_FRPPG	(PAGE_SIZE / FEC_ENET_TX_FRSIZE)
-#define TX_RING_SIZE		16	/* Must be power of two */
-#define TX_RING_MOD_MASK	15	/*   for this to work */
+#define TX_RING_SIZE		64	/* Must be power of two */
+#define TX_RING_MOD_MASK	63	/*   for this to work */
 
 #define BD_ENET_RX_INT          0x00800000
 #define BD_ENET_RX_PTP          ((ushort)0x0400)
@@ -289,12 +294,12 @@  struct fec_enet_private {
 	/* CPM dual port RAM relative addresses */
 	dma_addr_t	bd_dma;
 	/* Address of Rx and Tx buffers */
-	struct bufdesc	*rx_bd_base;
-	struct bufdesc	*tx_bd_base;
+	union bufdesc_u	*rx_bd_base;
+	union bufdesc_u	*tx_bd_base;
 	/* The next free ring entry */
-	struct bufdesc	*cur_rx, *cur_tx;
-	/* The ring entries to be free()ed */
-	struct bufdesc	*dirty_tx;
+	unsigned short tx_next;
+	unsigned short tx_dirty;
+	unsigned short rx_next;
 
 	unsigned short tx_ring_size;
 	unsigned short rx_ring_size;
@@ -335,6 +340,9 @@  struct fec_enet_private {
 	struct timer_list time_keep;
 	struct fec_enet_delayed_work delay_work;
 	struct regulator *reg_phy;
+	unsigned long quirks;
+
+
 };
 
 void fec_ptp_init(struct platform_device *pdev);
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 03a351300013..8105697d5a99 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -101,6 +101,8 @@  static void set_multicast_list(struct net_device *ndev);
  * ENET_TDAR[TDAR].
  */
 #define FEC_QUIRK_ERR006358            (1 << 7)
+/* Controller has ability to offset rx packets */
+#define FEC_QUIRK_RX_SHIFT16           (1 << 8)
 
 static struct platform_device_id fec_devtype[] = {
 	{
@@ -120,7 +122,8 @@  static struct platform_device_id fec_devtype[] = {
 		.name = "imx6q-fec",
 		.driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT |
 				FEC_QUIRK_HAS_BUFDESC_EX | FEC_QUIRK_HAS_CSUM |
-				FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358,
+				FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358 |
+				FEC_QUIRK_RX_SHIFT16,
 	}, {
 		.name = "mvf600-fec",
 		.driver_data = FEC_QUIRK_ENET_MAC,
@@ -200,6 +203,7 @@  MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 /* FEC receive acceleration */
 #define FEC_RACC_IPDIS		(1 << 1)
 #define FEC_RACC_PRODIS		(1 << 2)
+#define FEC_RACC_SHIFT16	BIT(7)
 #define FEC_RACC_OPTIONS	(FEC_RACC_IPDIS | FEC_RACC_PRODIS)
 
 /*
@@ -233,57 +237,54 @@  MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 
 static int mii_cnt;
 
-static inline
-struct bufdesc *fec_enet_get_nextdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
+static unsigned copybreak = 200;
+module_param(copybreak, uint, 0644);
+MODULE_PARM_DESC(copybreak,
+		 "Maximum size of packet that is copied to a new buffer on receive");
+
+
+
+
+
+static bool fec_enet_rx_zerocopy(struct fec_enet_private *fep, unsigned pktlen)
 {
-	struct bufdesc *new_bd = bdp + 1;
-	struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp + 1;
-	struct bufdesc_ex *ex_base;
-	struct bufdesc *base;
-	int ring_size;
-
-	if (bdp >= fep->tx_bd_base) {
-		base = fep->tx_bd_base;
-		ring_size = fep->tx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
-	} else {
-		base = fep->rx_bd_base;
-		ring_size = fep->rx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
-	}
+#ifndef CONFIG_M5272
+	if (fep->quirks & FEC_QUIRK_RX_SHIFT16 && pktlen >= copybreak)
+		return true;
+#endif
+	return false;
+}
+
+static union bufdesc_u *
+fec_enet_tx_get(unsigned index, struct fec_enet_private *fep)
+{
+	union bufdesc_u *base = fep->tx_bd_base;
+	union bufdesc_u *bdp;
+
+	index &= fep->tx_ring_size - 1;
 
 	if (fep->bufdesc_ex)
-		return (struct bufdesc *)((ex_new_bd >= (ex_base + ring_size)) ?
-			ex_base : ex_new_bd);
+		bdp = (union bufdesc_u *)(&base->ebd + index);
 	else
-		return (new_bd >= (base + ring_size)) ?
-			base : new_bd;
+		bdp = (union bufdesc_u *)(&base->bd + index);
+
+	return bdp;
 }
 
-static inline
-struct bufdesc *fec_enet_get_prevdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
+static union bufdesc_u *
+fec_enet_rx_get(unsigned index, struct fec_enet_private *fep)
 {
-	struct bufdesc *new_bd = bdp - 1;
-	struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp - 1;
-	struct bufdesc_ex *ex_base;
-	struct bufdesc *base;
-	int ring_size;
-
-	if (bdp >= fep->tx_bd_base) {
-		base = fep->tx_bd_base;
-		ring_size = fep->tx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
-	} else {
-		base = fep->rx_bd_base;
-		ring_size = fep->rx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
-	}
+	union bufdesc_u *base = fep->rx_bd_base;
+	union bufdesc_u *bdp;
+
+	index &= fep->rx_ring_size - 1;
 
 	if (fep->bufdesc_ex)
-		return (struct bufdesc *)((ex_new_bd < ex_base) ?
-			(ex_new_bd + ring_size) : ex_new_bd);
+		bdp = (union bufdesc_u *)(&base->ebd + index);
 	else
-		return (new_bd < base) ? (new_bd + ring_size) : new_bd;
+		bdp = (union bufdesc_u *)(&base->bd + index);
+
+	return bdp;
 }
 
 static void *swap_buffer(void *bufaddr, int len)
@@ -297,6 +298,26 @@  static void *swap_buffer(void *bufaddr, int len)
 	return bufaddr;
 }
 
+static void fec_dump(struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	unsigned index = 0;
+
+	netdev_info(ndev, "TX ring dump\n");
+	pr_info("Nr    SC     addr       len  SKB\n");
+
+	for (index = 0; index < fep->tx_ring_size; index++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(index, fep);
+
+		pr_info("%2u %c%c 0x%04x 0x%08lx %4u %p\n",
+			index,
+			index == fep->tx_next ? 'S' : ' ',
+			index == fep->tx_dirty ? 'H' : ' ',
+			bdp->bd.cbd_sc, bdp->bd.cbd_bufaddr, bdp->bd.cbd_datlen,
+			fep->tx_skbuff[index]);
+	}
+}
+
 static int
 fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 {
@@ -312,21 +333,42 @@  fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 	return 0;
 }
 
+static void
+fec_enet_tx_unmap(struct bufdesc *bdp, struct fec_enet_private *fep)
+{
+	dma_addr_t addr = bdp->cbd_bufaddr;
+	unsigned length = bdp->cbd_datlen;
+
+	bdp->cbd_bufaddr = 0;
+
+	dma_unmap_single(&fep->pdev->dev, addr, length, DMA_TO_DEVICE);
+}
+
 static netdev_tx_t
 fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
-	struct bufdesc *bdp, *bdp_pre;
+	union bufdesc_u *bdp, *bdp_pre;
 	void *bufaddr;
 	unsigned short	status;
-	unsigned int index;
+	unsigned index;
+	unsigned length;
+	dma_addr_t addr;
 
 	/* Fill in a Tx ring entry */
-	bdp = fep->cur_tx;
+	index = fep->tx_next;
 
-	status = bdp->cbd_sc;
+	bdp = fec_enet_tx_get(index, fep);
+	status = bdp->bd.cbd_sc;
 
 	if (status & BD_ENET_TX_READY) {
 		/* Ooops.  All transmit buffers are full.  Bail out.
@@ -347,21 +389,15 @@  fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 
 	/* Set buffer length and buffer pointer */
 	bufaddr = skb->data;
-	bdp->cbd_datlen = skb->len;
+	length = skb->len;
 
 	/*
 	 * On some FEC implementations data must be aligned on
 	 * 4-byte boundaries. Use bounce buffers to copy data
 	 * and get it aligned. Ugh.
 	 */
-	if (fep->bufdesc_ex)
-		index = (struct bufdesc_ex *)bdp -
-			(struct bufdesc_ex *)fep->tx_bd_base;
-	else
-		index = bdp - fep->tx_bd_base;
-
 	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
-		memcpy(fep->tx_bounce[index], skb->data, skb->len);
+		memcpy(fep->tx_bounce[index], skb->data, length);
 		bufaddr = fep->tx_bounce[index];
 	}
 
@@ -370,70 +406,72 @@  fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	 * the system that it's running on. As the result, driver has to
 	 * swap every frame going to and coming from the controller.
 	 */
-	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
-		swap_buffer(bufaddr, skb->len);
+	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(bufaddr, length);
 
-	/* Save skb pointer */
-	fep->tx_skbuff[index] = skb;
-
-	/* Push the data cache so the CPM does not get stale memory
-	 * data.
-	 */
-	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
-			skb->len, DMA_TO_DEVICE);
-	if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
-		bdp->cbd_bufaddr = 0;
-		fep->tx_skbuff[index] = NULL;
+	/* Push the data cache so the CPM does not get stale memory data. */
+	addr = dma_map_single(&fep->pdev->dev, bufaddr, length, DMA_TO_DEVICE);
+	if (dma_mapping_error(&fep->pdev->dev, addr)) {
 		dev_kfree_skb_any(skb);
 		if (net_ratelimit())
 			netdev_err(ndev, "Tx DMA memory map failed\n");
 		return NETDEV_TX_OK;
 	}
 
-	if (fep->bufdesc_ex) {
+	/* Save skb pointer */
+	fep->tx_skbuff[index] = skb;
 
-		struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-		ebdp->cbd_bdu = 0;
+	bdp->bd.cbd_datlen = length;
+	bdp->bd.cbd_bufaddr = addr;
+
+	if (fep->bufdesc_ex) {
+		bdp->ebd.cbd_bdu = 0;
 		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
 			fep->hwts_tx_en)) {
-			ebdp->cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
+			bdp->ebd.cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
 			skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 		} else {
-			ebdp->cbd_esc = BD_ENET_TX_INT;
+			bdp->ebd.cbd_esc = BD_ENET_TX_INT;
 
 			/* Enable protocol checksum flags
 			 * We do not bother with the IP Checksum bits as they
 			 * are done by the kernel
 			 */
 			if (skb->ip_summed == CHECKSUM_PARTIAL)
-				ebdp->cbd_esc |= BD_ENET_TX_PINS;
+				bdp->ebd.cbd_esc |= BD_ENET_TX_PINS;
 		}
 	}
 
+	/*
+	 * We need the preceding stores to the descriptor to complete
+	 * before updating the status field, which hands it over to the
+	 * hardware.  The corresponding rmb() is "in the hardware".
+	 */
+	wmb();
+
 	/* Send it on its way.  Tell FEC it's ready, interrupt when done,
 	 * it's the last BD of the frame, and to put the CRC on the end.
 	 */
 	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
 			| BD_ENET_TX_LAST | BD_ENET_TX_TC);
-	bdp->cbd_sc = status;
+	bdp->bd.cbd_sc = status;
 
-	bdp_pre = fec_enet_get_prevdesc(bdp, fep);
-	if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
-	    !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
+	bdp_pre = fec_enet_tx_get(index - 1, fep);
+	if ((fep->quirks & FEC_QUIRK_ERR006358) &&
+	    !(bdp_pre->bd.cbd_sc & BD_ENET_TX_READY)) {
 		fep->delay_work.trig_tx = true;
 		schedule_delayed_work(&(fep->delay_work.delay_work),
 					msecs_to_jiffies(1));
 	}
 
-	/* If this was the last BD in the ring, start at the beginning again. */
-	bdp = fec_enet_get_nextdesc(bdp, fep);
-
 	skb_tx_timestamp(skb);
 
-	fep->cur_tx = bdp;
+	fep->tx_next = (index + 1) & (fep->tx_ring_size - 1);
 
-	if (fep->cur_tx == fep->dirty_tx)
+	if (fep->tx_next == fep->tx_dirty) {
 		netif_stop_queue(ndev);
+	}
 
 	/* Trigger transmission start */
 	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
@@ -446,46 +484,43 @@  fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 static void fec_enet_bd_init(struct net_device *dev)
 {
 	struct fec_enet_private *fep = netdev_priv(dev);
-	struct bufdesc *bdp;
 	unsigned int i;
 
 	/* Initialize the receive buffer descriptors. */
-	bdp = fep->rx_bd_base;
 	for (i = 0; i < fep->rx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
 
 		/* Initialize the BD for every fragment in the page. */
-		if (bdp->cbd_bufaddr)
-			bdp->cbd_sc = BD_ENET_RX_EMPTY;
+		if (bdp->bd.cbd_bufaddr)
+			bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
 		else
-			bdp->cbd_sc = 0;
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+			bdp->bd.cbd_sc = 0;
+		if (i == fep->rx_ring_size - 1)
+			bdp->bd.cbd_sc |= BD_SC_WRAP;
 	}
 
-	/* Set the last buffer to wrap */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
-
-	fep->cur_rx = fep->rx_bd_base;
+	fep->rx_next = 0;
 
 	/* ...and the same for transmit */
-	bdp = fep->tx_bd_base;
-	fep->cur_tx = bdp;
 	for (i = 0; i < fep->tx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
 
 		/* Initialize the BD for every fragment in the page. */
-		bdp->cbd_sc = 0;
-		if (bdp->cbd_bufaddr && fep->tx_skbuff[i]) {
+		/* Set the last buffer to wrap */
+		if (i == fep->tx_ring_size - 1)
+			bdp->bd.cbd_sc = BD_SC_WRAP;
+		else
+			bdp->bd.cbd_sc = 0;
+		if (bdp->bd.cbd_bufaddr)
+			fec_enet_tx_unmap(&bdp->bd, fep);
+		if (fep->tx_skbuff[i]) {
 			dev_kfree_skb_any(fep->tx_skbuff[i]);
 			fep->tx_skbuff[i] = NULL;
 		}
-		bdp->cbd_bufaddr = 0;
-		bdp = fec_enet_get_nextdesc(bdp, fep);
 	}
 
-	/* Set the last buffer to wrap */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
-	fep->dirty_tx = bdp;
+	fep->tx_next = 0;
+	fep->tx_dirty = fep->tx_ring_size - 1;
 }
 
 /* This function is called to start or restart the FEC during a link
@@ -496,8 +531,6 @@  static void
 fec_restart(struct net_device *ndev, int duplex)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
 	int i;
 	u32 val;
 	u32 temp_mac[2];
@@ -519,7 +552,7 @@  fec_restart(struct net_device *ndev, int duplex)
 	 * enet-mac reset will reset mac address registers too,
 	 * so need to reconfigure it.
 	 */
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
+	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
 		memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN);
 		writel(cpu_to_be32(temp_mac[0]), fep->hwp + FEC_ADDR_LOW);
 		writel(cpu_to_be32(temp_mac[1]), fep->hwp + FEC_ADDR_HIGH);
@@ -568,6 +601,8 @@  fec_restart(struct net_device *ndev, int duplex)
 #if !defined(CONFIG_M5272)
 	/* set RX checksum */
 	val = readl(fep->hwp + FEC_RACC);
+	if (fep->quirks & FEC_QUIRK_RX_SHIFT16)
+		val |= FEC_RACC_SHIFT16;
 	if (fep->csum_flags & FLAG_RX_CSUM_ENABLED)
 		val |= FEC_RACC_OPTIONS;
 	else
@@ -579,7 +614,7 @@  fec_restart(struct net_device *ndev, int duplex)
 	 * The phy interface and speed need to get configured
 	 * differently on enet-mac.
 	 */
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
+	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
 		/* Enable flow control and length check */
 		rcntl |= 0x40000000 | 0x00000020;
 
@@ -602,7 +637,7 @@  fec_restart(struct net_device *ndev, int duplex)
 		}
 	} else {
 #ifdef FEC_MIIGSK_ENR
-		if (id_entry->driver_data & FEC_QUIRK_USE_GASKET) {
+		if (fep->quirks & FEC_QUIRK_USE_GASKET) {
 			u32 cfgr;
 			/* disable the gasket and wait */
 			writel(0, fep->hwp + FEC_MIIGSK_ENR);
@@ -655,7 +690,7 @@  fec_restart(struct net_device *ndev, int duplex)
 	writel(0, fep->hwp + FEC_HASH_TABLE_LOW);
 #endif
 
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
+	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
 		/* enable ENET endian swap */
 		ecntl |= (1 << 8);
 		/* enable ENET store and forward mode */
@@ -692,8 +727,6 @@  static void
 fec_stop(struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
 	u32 rmii_mode = readl(fep->hwp + FEC_R_CNTRL) & (1 << 8);
 
 	/* We cannot expect a graceful transmit stop without link !!! */
@@ -711,7 +744,7 @@  fec_stop(struct net_device *ndev)
 	writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
 
 	/* We have to keep ENET enabled to have MII interrupt stay working */
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
+	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
 		writel(2, fep->hwp + FEC_ECNTRL);
 		writel(rmii_mode, fep->hwp + FEC_R_CNTRL);
 	}
@@ -723,6 +756,8 @@  fec_timeout(struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
 
+	fec_dump(ndev);
+
 	ndev->stats.tx_errors++;
 
 	fep->delay_work.timeout = true;
@@ -751,34 +786,28 @@  static void fec_enet_work(struct work_struct *work)
 static void
 fec_enet_tx(struct net_device *ndev)
 {
-	struct	fec_enet_private *fep;
-	struct bufdesc *bdp;
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	union bufdesc_u *bdp;
 	unsigned short status;
 	struct	sk_buff	*skb;
-	int	index = 0;
-
-	fep = netdev_priv(ndev);
-	bdp = fep->dirty_tx;
+	unsigned index = fep->tx_dirty;
 
-	/* get next bdp of dirty_tx */
-	bdp = fec_enet_get_nextdesc(bdp, fep);
+	do {
+		index = (index + 1) & (fep->tx_ring_size - 1);
+		bdp = fec_enet_tx_get(index, fep);
 
-	while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
+		status = bdp->bd.cbd_sc;
+		if (status & BD_ENET_TX_READY)
+			break;
 
 		/* current queue is empty */
-		if (bdp == fep->cur_tx)
+		if (index == fep->tx_next)
 			break;
 
-		if (fep->bufdesc_ex)
-			index = (struct bufdesc_ex *)bdp -
-				(struct bufdesc_ex *)fep->tx_bd_base;
-		else
-			index = bdp - fep->tx_bd_base;
+		fec_enet_tx_unmap(&bdp->bd, fep);
 
 		skb = fep->tx_skbuff[index];
-		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
-				DMA_TO_DEVICE);
-		bdp->cbd_bufaddr = 0;
+		fep->tx_skbuff[index] = NULL;
 
 		/* Check for errors. */
 		if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
@@ -797,19 +826,18 @@  fec_enet_tx(struct net_device *ndev)
 				ndev->stats.tx_carrier_errors++;
 		} else {
 			ndev->stats.tx_packets++;
-			ndev->stats.tx_bytes += bdp->cbd_datlen;
+			ndev->stats.tx_bytes += bdp->bd.cbd_datlen;
 		}
 
 		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
 			fep->bufdesc_ex) {
 			struct skb_shared_hwtstamps shhwtstamps;
 			unsigned long flags;
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
 
 			memset(&shhwtstamps, 0, sizeof(shhwtstamps));
 			spin_lock_irqsave(&fep->tmreg_lock, flags);
 			shhwtstamps.hwtstamp = ns_to_ktime(
-				timecounter_cyc2time(&fep->tc, ebdp->ts));
+				timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
 			spin_unlock_irqrestore(&fep->tmreg_lock, flags);
 			skb_tstamp_tx(skb, &shhwtstamps);
 		}
@@ -825,45 +853,252 @@  fec_enet_tx(struct net_device *ndev)
 
 		/* Free the sk buffer associated with this last transmit */
 		dev_kfree_skb_any(skb);
-		fep->tx_skbuff[index] = NULL;
-
-		fep->dirty_tx = bdp;
-
-		/* Update pointer to next buffer descriptor to be transmitted */
-		bdp = fec_enet_get_nextdesc(bdp, fep);
 
 		/* Since we have freed up a buffer, the ring is no longer full
 		 */
-		if (fep->dirty_tx != fep->cur_tx) {
-			if (netif_queue_stopped(ndev))
-				netif_wake_queue(ndev);
+		if (netif_queue_stopped(ndev)) {
+			netif_wake_queue(ndev);
 		}
-	}
+
+		fep->tx_dirty = index;
+	} while (1);
 	return;
 }
 
 
-/* During a receive, the cur_rx points to the current incoming buffer.
+static void
+fec_enet_receive(struct sk_buff *skb, union bufdesc_u *bdp, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+
+	skb->protocol = eth_type_trans(skb, ndev);
+
+	/* Get receive timestamp from the skb */
+	if (fep->hwts_rx_en && fep->bufdesc_ex) {
+		struct skb_shared_hwtstamps *shhwtstamps =
+						    skb_hwtstamps(skb);
+		unsigned long flags;
+
+		memset(shhwtstamps, 0, sizeof(*shhwtstamps));
+
+		spin_lock_irqsave(&fep->tmreg_lock, flags);
+		shhwtstamps->hwtstamp = ns_to_ktime(
+		    timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
+		spin_unlock_irqrestore(&fep->tmreg_lock, flags);
+	}
+
+	if (fep->csum_flags & FLAG_RX_CSUM_ENABLED) {
+		if (!(bdp->ebd.cbd_esc & FLAG_RX_CSUM_ERROR)) {
+			/* don't check it */
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
+		} else {
+			skb_checksum_none_assert(skb);
+		}
+	}
+
+	napi_gro_receive(&fep->napi, skb);
+}
+
+static void noinline
+fec_enet_receive_copy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	struct sk_buff *skb;
+	unsigned char *data;
+	bool vlan_packet_rcvd = false;
+
+	/*
+	 * Detect the presence of the VLAN tag, and adjust
+	 * the packet length appropriately.
+	 */
+	if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX &&
+	    bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) {
+		pkt_len -= VLAN_HLEN;
+		vlan_packet_rcvd = true;
+	}
+
+	/* This does 16 byte alignment, exactly what we need. */
+	skb = netdev_alloc_skb(ndev, pkt_len + NET_IP_ALIGN);
+	if (unlikely(!skb)) {
+		ndev->stats.rx_dropped++;
+		return;
+	}
+
+	dma_sync_single_for_cpu(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
+				FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+
+	data = fep->rx_skbuff[index]->data;
+
+#ifndef CONFIG_M5272
+	/*
+	 * If we have enabled this feature, we need to discard
+	 * the two bytes at the beginning of the packet before
+	 * copying it.
+	 */
+	if (fep->quirks & FEC_QUIRK_RX_SHIFT16) {
+		pkt_len -= 2;
+		data += 2;
+	}
+#endif
+
+	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(data, pkt_len);
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, pkt_len);	/* Make room */
+
+	/* If this is a VLAN packet remove the VLAN Tag */
+	if (vlan_packet_rcvd) {
+		struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN);
+
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+				       ntohs(vlan->h_vlan_TCI));
+
+		/* Extract the frame data without the VLAN header. */
+		skb_copy_to_linear_data(skb, data, 2 * ETH_ALEN);
+		skb_copy_to_linear_data_offset(skb, 2 * ETH_ALEN,
+					       data + 2 * ETH_ALEN + VLAN_HLEN,
+					       pkt_len - 2 * ETH_ALEN);
+	} else {
+		skb_copy_to_linear_data(skb, data, pkt_len);
+	}
+
+	dma_sync_single_for_device(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
+				   FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+
+	fec_enet_receive(skb, bdp, ndev);
+}
+
+static void noinline
+fec_enet_receive_nocopy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	struct sk_buff *skb, *skb_new;
+	unsigned char *data;
+	dma_addr_t addr;
+
+#if 0
+	skb_new = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
+	if (!skb_new) {
+		ndev->stats.rx_dropped++;
+		return;
+	}
+
+	addr = dma_map_single(&fep->pdev->dev, skb_new->data,
+			      FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+	if (dma_mapping_error(&fep->pdev->dev, addr)) {
+		dev_kfree_skb(skb_new);
+		ndev->stats.rx_dropped++;
+		return;
+	}
+#else
+	skb_new = NULL;
+	addr = 0;
+#endif
+
+	/*
+	 * We have the new skb, so proceed to deal with the
+	 * received data.
+	 */
+	dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
+			 FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+
+	skb = fep->rx_skbuff[index];
+
+	/* Now substitute in the new skb */
+	fep->rx_skbuff[index] = skb_new;
+	bdp->bd.cbd_bufaddr = addr;
+
+	/*
+	 * Update the skb length according to the raw packet
+	 * length.  Then remove the two bytes of additional
+	 * padding.
+	 */
+	skb_put(skb, pkt_len);
+	data = skb_pull_inline(skb, 2);
+
+	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(data, skb->len);
+
+	/*
+	 * Now juggle things for the VLAN tag - if the hardware
+	 * flags this as present, we need to read the tag, and
+	 * then shuffle the ethernet addresses up.
+	 */
+	if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX &&
+	    bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) {
+		struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN);
+
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+				       ntohs(vlan->h_vlan_TCI));
+
+		memmove(data + VLAN_HLEN, data, 2 * ETH_ALEN);
+		skb_pull_inline(skb, VLAN_HLEN);
+	}
+
+	fec_enet_receive(skb, bdp, ndev);
+}
+
+static int
+fec_enet_refill_ring(unsigned first, unsigned last, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	unsigned i = first;
+
+	do {
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
+		struct sk_buff *skb;
+		dma_addr_t addr;
+
+		if (!fep->rx_skbuff[i]) {
+			skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
+			if (!skb)
+				return -ENOMEM;
+
+			addr = dma_map_single(&fep->pdev->dev, skb->data,
+					      FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+			if (dma_mapping_error(&fep->pdev->dev, addr)) {
+				dev_kfree_skb(skb);
+				return -ENOMEM;
+			}
+
+			fep->rx_skbuff[i] = skb;
+			bdp->bd.cbd_bufaddr = addr;
+		}
+
+		bdp->bd.cbd_sc = (bdp->bd.cbd_sc & BD_SC_WRAP) |
+				 BD_ENET_RX_EMPTY;
+
+		if (fep->bufdesc_ex) {
+			bdp->ebd.cbd_esc = BD_ENET_RX_INT;
+			bdp->ebd.cbd_prot = 0;
+			bdp->ebd.cbd_bdu = 0;
+		}
+		i = (i + 1) & (fep->rx_ring_size - 1);
+	} while (i != last);
+
+	return 0;
+}
+
+/* During a receive, the rx_next points to the current incoming buffer.
  * When we update through the ring, if the next incoming buffer has
  * not been given to the system, we just set the empty indicator,
  * effectively tossing the packet.
  */
-static int
+static int noinline
 fec_enet_rx(struct net_device *ndev, int budget)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
-	struct bufdesc *bdp;
 	unsigned short status;
-	struct	sk_buff	*skb;
-	ushort	pkt_len;
-	__u8 *data;
+	unsigned pkt_len;
 	int	pkt_received = 0;
-	struct	bufdesc_ex *ebdp = NULL;
-	bool	vlan_packet_rcvd = false;
-	u16	vlan_tag;
-	int	index = 0;
+	unsigned index = fep->rx_next;
 
 #ifdef CONFIG_M532x
 	flush_cache_all();
@@ -872,12 +1107,16 @@  fec_enet_rx(struct net_device *ndev, int budget)
 	/* First, grab all of the stats for the incoming packet.
 	 * These get messed up if we get called due to a busy condition.
 	 */
-	bdp = fep->cur_rx;
+	do {
+		union bufdesc_u *bdp = fec_enet_rx_get(index, fep);
 
-	while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
+		status = bdp->bd.cbd_sc;
+		if (status & BD_ENET_RX_EMPTY)
+			break;
 
 		if (pkt_received >= budget)
 			break;
+
 		pkt_received++;
 
 		/* Since we have allocated space to hold a complete frame,
@@ -917,124 +1156,33 @@  fec_enet_rx(struct net_device *ndev, int budget)
 
 		/* Process the incoming frame. */
 		ndev->stats.rx_packets++;
-		pkt_len = bdp->cbd_datlen;
-		ndev->stats.rx_bytes += pkt_len;
-
-		if (fep->bufdesc_ex)
-			index = (struct bufdesc_ex *)bdp -
-				(struct bufdesc_ex *)fep->rx_bd_base;
-		else
-			index = bdp - fep->rx_bd_base;
-		data = fep->rx_skbuff[index]->data;
-		dma_sync_single_for_cpu(&fep->pdev->dev, bdp->cbd_bufaddr,
-					FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
-
-		if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
-			swap_buffer(data, pkt_len);
-
-		/* Extract the enhanced buffer descriptor */
-		ebdp = NULL;
-		if (fep->bufdesc_ex)
-			ebdp = (struct bufdesc_ex *)bdp;
-
-		/* If this is a VLAN packet remove the VLAN Tag */
-		vlan_packet_rcvd = false;
-		if ((ndev->features & NETIF_F_HW_VLAN_CTAG_RX) &&
-		    fep->bufdesc_ex && (ebdp->cbd_esc & BD_ENET_RX_VLAN)) {
-			/* Push and remove the vlan tag */
-			struct vlan_hdr *vlan_header =
-					(struct vlan_hdr *) (data + ETH_HLEN);
-			vlan_tag = ntohs(vlan_header->h_vlan_TCI);
-			pkt_len -= VLAN_HLEN;
-
-			vlan_packet_rcvd = true;
-		}
 
-		/* This does 16 byte alignment, exactly what we need.
-		 * The packet length includes FCS, but we don't want to
-		 * include that when passing upstream as it messes up
-		 * bridging applications.
+		/*
+		 * The packet length includes FCS, but we don't want
+		 * to include that when passing upstream as it messes
+		 * up bridging applications.
 		 */
-		skb = netdev_alloc_skb(ndev, pkt_len - 4 + NET_IP_ALIGN);
+		pkt_len = bdp->bd.cbd_datlen - 4;
+		ndev->stats.rx_bytes += pkt_len;
 
-		if (unlikely(!skb)) {
-			ndev->stats.rx_dropped++;
+		if (fec_enet_rx_zerocopy(fep, pkt_len)) {
+			fec_enet_receive_nocopy(pkt_len, index, bdp, ndev);
 		} else {
-			int payload_offset = (2 * ETH_ALEN);
-			skb_reserve(skb, NET_IP_ALIGN);
-			skb_put(skb, pkt_len - 4);	/* Make room */
-
-			/* Extract the frame data without the VLAN header. */
-			skb_copy_to_linear_data(skb, data, (2 * ETH_ALEN));
-			if (vlan_packet_rcvd)
-				payload_offset = (2 * ETH_ALEN) + VLAN_HLEN;
-			skb_copy_to_linear_data_offset(skb, (2 * ETH_ALEN),
-						       data + payload_offset,
-						       pkt_len - 4 - (2 * ETH_ALEN));
-
-			skb->protocol = eth_type_trans(skb, ndev);
-
-			/* Get receive timestamp from the skb */
-			if (fep->hwts_rx_en && fep->bufdesc_ex) {
-				struct skb_shared_hwtstamps *shhwtstamps =
-							    skb_hwtstamps(skb);
-				unsigned long flags;
-
-				memset(shhwtstamps, 0, sizeof(*shhwtstamps));
-
-				spin_lock_irqsave(&fep->tmreg_lock, flags);
-				shhwtstamps->hwtstamp = ns_to_ktime(
-				    timecounter_cyc2time(&fep->tc, ebdp->ts));
-				spin_unlock_irqrestore(&fep->tmreg_lock, flags);
-			}
-
-			if (fep->bufdesc_ex &&
-			    (fep->csum_flags & FLAG_RX_CSUM_ENABLED)) {
-				if (!(ebdp->cbd_esc & FLAG_RX_CSUM_ERROR)) {
-					/* don't check it */
-					skb->ip_summed = CHECKSUM_UNNECESSARY;
-				} else {
-					skb_checksum_none_assert(skb);
-				}
-			}
-
-			/* Handle received VLAN packets */
-			if (vlan_packet_rcvd)
-				__vlan_hwaccel_put_tag(skb,
-						       htons(ETH_P_8021Q),
-						       vlan_tag);
-
-			napi_gro_receive(&fep->napi, skb);
+			fec_enet_receive_copy(pkt_len, index, bdp, ndev);
 		}
 
-		dma_sync_single_for_device(&fep->pdev->dev, bdp->cbd_bufaddr,
-					FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
 rx_processing_done:
-		/* Clear the status flags for this buffer */
-		status &= ~BD_ENET_RX_STATS;
-
-		/* Mark the buffer empty */
-		status |= BD_ENET_RX_EMPTY;
-		bdp->cbd_sc = status;
-
-		if (fep->bufdesc_ex) {
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-
-			ebdp->cbd_esc = BD_ENET_RX_INT;
-			ebdp->cbd_prot = 0;
-			ebdp->cbd_bdu = 0;
-		}
-
-		/* Update BD pointer to next entry */
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+		index = (index + 1) & (fep->rx_ring_size - 1);
+		if (index == fep->rx_next)
+			break;
+	} while (1);
 
-		/* Doing this here will keep the FEC running while we process
-		 * incoming frames.  On a heavily loaded network, we should be
-		 * able to keep up at the expense of system resources.
-		 */
+	if (pkt_received) {
+		fec_enet_refill_ring(fep->rx_next, index, ndev);
 		writel(0, fep->hwp + FEC_R_DES_ACTIVE);
 	}
-	fep->cur_rx = bdp;
+
+	fep->rx_next = index;
 
 	return pkt_received;
 }
@@ -1044,29 +1192,25 @@  fec_enet_interrupt(int irq, void *dev_id)
 {
 	struct net_device *ndev = dev_id;
 	struct fec_enet_private *fep = netdev_priv(ndev);
+	const unsigned napi_mask = FEC_ENET_RXF | FEC_ENET_TXF;
 	uint int_events;
 	irqreturn_t ret = IRQ_NONE;
 
-	do {
-		int_events = readl(fep->hwp + FEC_IEVENT);
-		writel(int_events, fep->hwp + FEC_IEVENT);
+	int_events = readl(fep->hwp + FEC_IEVENT);
+	writel(int_events & ~napi_mask, fep->hwp + FEC_IEVENT);
 
-		if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
-			ret = IRQ_HANDLED;
+	if (int_events & napi_mask) {
+		ret = IRQ_HANDLED;
 
-			/* Disable the RX interrupt */
-			if (napi_schedule_prep(&fep->napi)) {
-				writel(FEC_RX_DISABLED_IMASK,
-					fep->hwp + FEC_IMASK);
-				__napi_schedule(&fep->napi);
-			}
-		}
+		/* Disable the NAPI interrupts */
+		writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
+		napi_schedule(&fep->napi);
+	}
 
-		if (int_events & FEC_ENET_MII) {
-			ret = IRQ_HANDLED;
-			complete(&fep->mdio_done);
-		}
-	} while (int_events);
+	if (int_events & FEC_ENET_MII) {
+		ret = IRQ_HANDLED;
+		complete(&fep->mdio_done);
+	}
 
 	return ret;
 }
@@ -1074,10 +1218,24 @@  fec_enet_interrupt(int irq, void *dev_id)
 static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 {
 	struct net_device *ndev = napi->dev;
-	int pkts = fec_enet_rx(ndev, budget);
 	struct fec_enet_private *fep = netdev_priv(ndev);
+	unsigned status;
+	int pkts = 0;
+
+	status = readl(fep->hwp + FEC_IEVENT) & (FEC_ENET_RXF | FEC_ENET_TXF);
+	if (status) {
+		/*
+		 * Clear any pending transmit or receive interrupts before
+		 * processing the rings to avoid racing with the hardware.
+		 */
+		writel(status, fep->hwp + FEC_IEVENT);
 
-	fec_enet_tx(ndev);
+		if (status & FEC_ENET_RXF)
+			pkts = fec_enet_rx(ndev, budget);
+
+		if (status & FEC_ENET_TXF)
+			fec_enet_tx(ndev);
+	}
 
 	if (pkts < budget) {
 		napi_complete(napi);
@@ -1263,8 +1421,6 @@  static int fec_enet_mdio_reset(struct mii_bus *bus)
 static int fec_enet_mii_probe(struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
 	struct phy_device *phy_dev = NULL;
 	char mdio_bus_id[MII_BUS_ID_SIZE];
 	char phy_name[MII_BUS_ID_SIZE + 3];
@@ -1302,7 +1458,7 @@  static int fec_enet_mii_probe(struct net_device *ndev)
 	}
 
 	/* mask with MAC supported features */
-	if (id_entry->driver_data & FEC_QUIRK_HAS_GBIT) {
+	if (fep->quirks & FEC_QUIRK_HAS_GBIT) {
 		phy_dev->supported &= PHY_GBIT_FEATURES;
 #if !defined(CONFIG_M5272)
 		phy_dev->supported |= SUPPORTED_Pause;
@@ -1329,8 +1485,6 @@  static int fec_enet_mii_init(struct platform_device *pdev)
 	static struct mii_bus *fec0_mii_bus;
 	struct net_device *ndev = platform_get_drvdata(pdev);
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
 	int err = -ENXIO, i;
 
 	/*
@@ -1349,7 +1503,7 @@  static int fec_enet_mii_init(struct platform_device *pdev)
 	 * mdio interface in board design, and need to be configured by
 	 * fec0 mii_bus.
 	 */
-	if ((id_entry->driver_data & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
+	if ((fep->quirks & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
 		/* fec1 uses fec0 mii_bus */
 		if (mii_cnt && fec0_mii_bus) {
 			fep->mii_bus = fec0_mii_bus;
@@ -1370,7 +1524,7 @@  static int fec_enet_mii_init(struct platform_device *pdev)
 	 * document.
 	 */
 	fep->phy_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ahb), 5000000);
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
+	if (fep->quirks & FEC_QUIRK_ENET_MAC)
 		fep->phy_speed--;
 	fep->phy_speed <<= 1;
 	writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED);
@@ -1405,7 +1559,7 @@  static int fec_enet_mii_init(struct platform_device *pdev)
 	mii_cnt++;
 
 	/* save fec0 mii_bus */
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
+	if (fep->quirks & FEC_QUIRK_ENET_MAC)
 		fec0_mii_bus = fep->mii_bus;
 
 	return 0;
@@ -1694,23 +1848,24 @@  static void fec_enet_free_buffers(struct net_device *ndev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	unsigned int i;
 	struct sk_buff *skb;
-	struct bufdesc	*bdp;
 
-	bdp = fep->rx_bd_base;
 	for (i = 0; i < fep->rx_ring_size; i++) {
-		skb = fep->rx_skbuff[i];
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
 
-		if (bdp->cbd_bufaddr)
-			dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
+		skb = fep->rx_skbuff[i];
+		if (skb) {
+			dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
 					FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
-		if (skb)
 			dev_kfree_skb(skb);
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+		}
 	}
 
-	bdp = fep->tx_bd_base;
-	for (i = 0; i < fep->tx_ring_size; i++)
+	for (i = 0; i < fep->tx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
+		if (bdp->bd.cbd_bufaddr)
+			fec_enet_tx_unmap(&bdp->bd, fep);
 		kfree(fep->tx_bounce[i]);
+	}
 }
 
 static int fec_enet_alloc_buffers(struct net_device *ndev)
@@ -1718,58 +1873,54 @@  static int fec_enet_alloc_buffers(struct net_device *ndev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	unsigned int i;
 	struct sk_buff *skb;
-	struct bufdesc	*bdp;
 
-	bdp = fep->rx_bd_base;
 	for (i = 0; i < fep->rx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
+		dma_addr_t addr;
+
 		skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
 		if (!skb) {
 			fec_enet_free_buffers(ndev);
 			return -ENOMEM;
 		}
-		fep->rx_skbuff[i] = skb;
 
-		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, skb->data,
-				FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
-		if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
+		addr = dma_map_single(&fep->pdev->dev, skb->data,
+				      FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+		if (dma_mapping_error(&fep->pdev->dev, addr)) {
+			dev_kfree_skb(skb);
 			fec_enet_free_buffers(ndev);
 			if (net_ratelimit())
 				netdev_err(ndev, "Rx DMA memory map failed\n");
 			return -ENOMEM;
 		}
-		bdp->cbd_sc = BD_ENET_RX_EMPTY;
 
-		if (fep->bufdesc_ex) {
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-			ebdp->cbd_esc = BD_ENET_RX_INT;
-		}
+		fep->rx_skbuff[i] = skb;
 
-		bdp = fec_enet_get_nextdesc(bdp, fep);
-	}
+		bdp->bd.cbd_bufaddr = addr;
+		bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
+		/* Set the last buffer to wrap. */
+		if (i == fep->rx_ring_size - 1)
+			bdp->bd.cbd_sc |= BD_SC_WRAP;
 
-	/* Set the last buffer to wrap. */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
+		if (fep->bufdesc_ex)
+			bdp->ebd.cbd_esc = BD_ENET_RX_INT;
+	}
 
-	bdp = fep->tx_bd_base;
 	for (i = 0; i < fep->tx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
 		fep->tx_bounce[i] = kmalloc(FEC_ENET_TX_FRSIZE, GFP_KERNEL);
 
-		bdp->cbd_sc = 0;
-		bdp->cbd_bufaddr = 0;
-
-		if (fep->bufdesc_ex) {
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-			ebdp->cbd_esc = BD_ENET_TX_INT;
-		}
+		/* Set the last buffer to wrap. */
+		if (i == fep->tx_ring_size - 1)
+			bdp->bd.cbd_sc = BD_SC_WRAP;
+		else
+			bdp->bd.cbd_sc = 0;
+		bdp->bd.cbd_bufaddr = 0;
 
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+		if (fep->bufdesc_ex)
+			bdp->ebd.cbd_esc = BD_ENET_TX_INT;
 	}
 
-	/* Set the last buffer to wrap. */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
-
 	return 0;
 }
 
@@ -1990,9 +2141,7 @@  static const struct net_device_ops fec_netdev_ops = {
 static int fec_enet_init(struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
-	struct bufdesc *cbd_base;
+	union bufdesc_u *cbd_base;
 
 	/* Allocate memory for buffer descriptors. */
 	cbd_base = dma_alloc_coherent(NULL, PAGE_SIZE, &fep->bd_dma,
@@ -2014,10 +2163,11 @@  static int fec_enet_init(struct net_device *ndev)
 	/* Set receive and transmit descriptor base. */
 	fep->rx_bd_base = cbd_base;
 	if (fep->bufdesc_ex)
-		fep->tx_bd_base = (struct bufdesc *)
-			(((struct bufdesc_ex *)cbd_base) + fep->rx_ring_size);
+		fep->tx_bd_base = (union bufdesc_u *)
+			(&cbd_base->ebd + fep->rx_ring_size);
 	else
-		fep->tx_bd_base = cbd_base + fep->rx_ring_size;
+		fep->tx_bd_base = (union bufdesc_u *)
+			(&cbd_base->bd + fep->rx_ring_size);
 
 	/* The FEC Ethernet specific entries in the device structure */
 	ndev->watchdog_timeo = TX_TIMEOUT;
@@ -2027,19 +2177,24 @@  static int fec_enet_init(struct net_device *ndev)
 	writel(FEC_RX_DISABLED_IMASK, fep->hwp + FEC_IMASK);
 	netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT);
 
-	if (id_entry->driver_data & FEC_QUIRK_HAS_VLAN) {
-		/* enable hw VLAN support */
-		ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
-		ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX;
-	}
+	if (fep->bufdesc_ex) {
+		/* Features which require the enhanced buffer descriptors */
+		netdev_features_t features = 0;
 
-	if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
-		/* enable hw accelerator */
-		ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
-				| NETIF_F_RXCSUM);
-		ndev->hw_features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
-				| NETIF_F_RXCSUM);
-		fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
+		if (fep->quirks & FEC_QUIRK_HAS_VLAN) {
+			/* enable hw VLAN support */
+			features |= NETIF_F_HW_VLAN_CTAG_RX;
+		}
+
+		if (fep->quirks & FEC_QUIRK_HAS_CSUM) {
+			/* enable hw accelerator */
+			features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
+				    NETIF_F_RXCSUM;
+			fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
+		}
+
+		ndev->hw_features |= features;
+		ndev->features |= features;
 	}
 
 	fec_restart(ndev, 0);
@@ -2110,13 +2265,6 @@  fec_probe(struct platform_device *pdev)
 	/* setup board info structure */
 	fep = netdev_priv(ndev);
 
-#if !defined(CONFIG_M5272)
-	/* default enable pause frame auto negotiation */
-	if (pdev->id_entry &&
-	    (pdev->id_entry->driver_data & FEC_QUIRK_HAS_GBIT))
-		fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
-#endif
-
 	r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	fep->hwp = devm_ioremap_resource(&pdev->dev, r);
 	if (IS_ERR(fep->hwp)) {
@@ -2126,6 +2274,14 @@  fec_probe(struct platform_device *pdev)
 
 	fep->pdev = pdev;
 	fep->dev_id = dev_id++;
+	if (pdev->id_entry)
+		fep->quirks = pdev->id_entry->driver_data;
+
+#if !defined(CONFIG_M5272)
+	/* default enable pause frame auto negotiation */
+	if (fep->quirks & FEC_QUIRK_HAS_GBIT)
+		fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
+#endif
 
 	fep->bufdesc_ex = 0;
 
@@ -2160,8 +2316,7 @@  fec_probe(struct platform_device *pdev)
 		fep->clk_enet_out = NULL;
 
 	fep->clk_ptp = devm_clk_get(&pdev->dev, "ptp");
-	fep->bufdesc_ex =
-		pdev->id_entry->driver_data & FEC_QUIRK_HAS_BUFDESC_EX;
+	fep->bufdesc_ex = fep->quirks & FEC_QUIRK_HAS_BUFDESC_EX;
 	if (IS_ERR(fep->clk_ptp)) {
 		fep->clk_ptp = NULL;
 		fep->bufdesc_ex = 0;