PL310 errata workarounds

From: Russell King - ARM Linux <linux@arm.linux.org.uk>

On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:
> On Tuesday, March 18, 2014 at 06:26:15 PM, Russell King - ARM Linux wrote:
> > On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
> > > Setting prefetch enables and early BRESP could all be done
> > > unconditionally in the core code.
> > 
> > I think we can do a few things here, if we know that the CPUs we're
> > connected to are all Cortex-A9:
> > 
> > 1. Enable BRESP.
> > 
> > 2. Enable I+D prefetching - but we really need to tune the prefetch offset
> >    for this to be worthwhile.  The value depends on the L3 memory system
> >    latency, so isn't something that should be specified at the SoC level.
> >    It may also change with different operating points.
> > 
> > 3. Full line of zeros - I think this is a difficult one to achieve
> > properly. The required sequence:
> > 
> >    - enable FLZ in L2 cache
> >    - enable L2 cache
> >    - enable FLZ in Cortex A9
> > 
> >    I'd also assume that when we turn the L2 cache off, we need the reverse
> >    sequence too.  So this sequence can't be done entirely by the boot
> > loader.
> > 
> > With (1) enabled and (2) properly tuned, I see a performance increase of
> > around 60Mbps on transmission, bringing the Cubox-i4 up from 250Mbps to
> > 315Mbps transmit on its gigabit interface with cpufreq ondemand enabled.
> > With "performance", this goes up to [323, 323, 321, 325, 322]Mbps.  On
> > receive [446, 603, 605, 605, 601]Mbps, which hasn't really changed
> > very much (and still impressively exceeds the Freescale stated maximum
> > total bandwidth of the gigabit interface.)
> 
> Speaking of FEC and slightly off-topic, have you ever seen this on your box 
> [1]/[2]/[3] ? I wonder if this might be cache-related as well, since I saw 
> similar issue on MX6 with PCIe-connected ethernet. I cannot put a finger on
> this though.

I think I've seen something similar once or twice.

Let me just pull up the transmit function.  This is the code which
writes to the descriptors (which frankly is pretty horrid):

        bdp->cbd_datlen = skb->len;
        bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
                        skb->len, DMA_TO_DEVICE); <-- barrier inside
                ebdp->cbd_bdu = 0;
                        ebdp->cbd_esc = BD_ENET_TX_INT;
                                ebdp->cbd_esc |= BD_ENET_TX_PINS;
        bdp->cbd_sc = status;
        writel(0, fep->hwp + FEC_X_DES_ACTIVE); <-- barrier before write

A couple of points here:

1. The hardware operates in a ring - it can read the next ring if it
   sees the BD_ENET_TX_READY bit set if has just finished processing
   the previous descriptor - this can happen before the write to
   FEC_X_DES_ACTIVE.

2. The ARM can re-order writes.  The writes it can re-order are:

	bdp->cbd_bufaddr
	ebdp->cbd_bdu
	ebdp->cbd_esc
	ebdp->cbd_sc

Hence, it's entirely possible for the FEC to see the updated descriptor
status before the rest of the descriptor has been written.  What's missing
is a barrier between the descriptor writes, and the final write to
bdp->cbd_sc.

Had I not got distracted by the L2 issues, I'd have posted my FEC patches
by now... in any case, my current xmit function looks a little different -
I've organised it such that:

1. We don't modify anything until we're past the point where things can
   error out.

2. Writes to the descriptor are localised in one area.

3. There's a wmb() barrier between cbd_sc and the previous writes - I
   discussed this with Will Deacon, which resulted in this documentation
   for the barrier:

        /*
         * We need the preceding stores to the descriptor to complete
         * before updating the status field, which hands it over to the
         * hardware.  The corresponding rmb() is "in the hardware".
         */

The second thing that causes the transmit timeouts is the horrid way
NAPI has been added to the driver - it's racy.  NAPI itself isn't the
problem, it's this (compressed a bit to show only the relevant bits):

        do {
                int_events = readl(fep->hwp + FEC_IEVENT);
                writel(int_events, fep->hwp + FEC_IEVENT);
                if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
                        if (napi_schedule_prep(&fep->napi)) {
                                writel(FEC_RX_DISABLED_IMASK,
                                        fep->hwp + FEC_IMASK);
                                __napi_schedule(&fep->napi);
                        }
                }
                if (int_events & FEC_ENET_MII) {
                        complete(&fep->mdio_done);
                }
        } while (int_events);

Consider what happens here if:
- we talk in the MII bus and receive a MII interrupt
- we're just finishing NAPI processing but haven't quite got around to
  calling napi_complete()
- the ethernet has sent all packets, and has also raised a transmit
  interrupt

The result is the handler is entered, FEC_IEVENT contains TXF and MII
events.  Both these events are cleared down, (and thus no longer exist
as interrupt-causing events.)  napi_schedule_prep() returns false as
the NAPI rx function is still running, and doesn't mark it for a re-run.
We then do the MII interrupt.  Loop again, and int_events is zero,
we exit.

Meanwhile, the NAPI rx function calls napi_complete() and re-enables
the receive interrupt.  If you're unlucky enough that the RX ring is
also full... no RXF interrupt.  So no further interrupts except maybe
MII interrupts.

NAPI never gets scheduled.  RX ring never gets emptied.  TX ring never
gets reaped.  The result is a timeout with a completely full TX ring.

I think I've seen both cases: I've seen the case where the TX ring is
completely empty, but it hasn't been reaped.   I've also seen the case
where the TX ring contains packets to be transmitted but the hardware
isn't sending them.

That all said - with the patch below I haven't seen problems since.
(which is the quickest way I can think of to get you a copy of what
I'm presently running - I've killed a number of debug bits denoted by
all the blank lines - and this is against -rc7.)  You may notice that
I added some TX ring dumping code to the driver - always useful in
these situations. ;-)

This patch is of course the consolidated version: individually, this
would be at least 19 patches with nice commit messages describing
each change...

As far as the 600Mbps receive - you need the right conditions for that.
I select the performance cpufreq governor after boot, and let the boot
quiesce.  It doesn't take much for it to drop back to 460Mbps - another
running process other than iperf -s is sufficient to do that.

Let me know how you get on with this.

PL310 errata workarounds

Commit Message

Comments

Patch