diff mbox

[BUG] i2c-designware silently fails on long transfers

Message ID 20161118193542.GO1041@n2100.armlinux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Russell King (Oracle) Nov. 18, 2016, 7:35 p.m. UTC
With reference to this commit:

commit d39f77b06a712fcba6185a20bb209e357923d980
Author: Andrew Jackson <Andrew.Jackson@arm.com>
Date:   Fri Nov 7 12:10:44 2014 +0000

    i2c: designware: prevent early stop on TX FIFO empty

    If the Designware core is configured with IC_EMPTYFIFO_HOLD_MASTER_EN
    set to zero, allowing the TX FIFO to become empty causes a STOP
    condition to be generated on the I2C bus. If the transmit FIFO
    threshold is set too high, an erroneous STOP condition can be
    generated on long transfers - particularly where the interrupt
    latency is extended.

    Signed-off-by: Andrew Jackson <Andrew.Jackson@arm.com>
    Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
    Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
    Signed-off-by: Wolfram Sang <wsa@the-dreams.de>

The TDA998x driver issues long I2C transfers to read the EDID from the
device - and userspace can also issue large transfers too.  However,
if a DW core is configured with IC_EMPTYFIFO_HOLD_MASTER_EN set as
zero, the above commit doesn't seem to solve the problem.  During
boot, with the patch below, I see:

[    1.736549] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x10
[    1.736564] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x510
[    1.736608] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x504
[    1.736799] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x514
[    1.736819] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x510
...
[    1.737986] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x504
[    1.738010] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x514
[    1.738034] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x504
[    1.738039] random: fast init done
[    1.740120] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x714
[    1.740231] i2c_dw_xfer: ffffffc97657b770:1 -> ffffffc97657b770:1 (0:0) [0 0 3 0] 8 [tx:ffffffc976682380:47] [rx:ffffffc9766823c9:55]
[    1.740249] [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 93
[    1.746979] Raw EDID:
[    1.747934]          00 ff ff ff ff ff ff 00 34 a9 96 a2 01 01 01 01
[    1.752342]          00 17 01 03 80 80 48 78 0a da ff a3 58 4a a2 29
[    1.756748]          17 49 4b 21 08 00 31 40 45 40 61 40 81 80 01 01
[    1.761153]          01 01 01 01 01 01 02 3a 80 d0 72 38 2d 40 10 2c
[    1.765555]          45 80 ba 88 21 00 00 1e 02 00 d0 4e 30 09 12 54
[    1.769958]          01 08 02 00 23 36 01 40 01 05 00 80 a1 4c 4b 49
[    1.774361]          22 00 00 40 03 00 28 00 23 01 20 00 01 88 00 01
[    1.778762]          08 00 00 40 00 02 03 04 0a 00 80 00 02 00 00 40

The significant thing is the "i2c_dw_xfer" line, where I add a print of
the current state.  Here, we can see that the transfer is mid-way, but
a stop condition has been generated by the hardware, leaving 55 bytes
to be received.

Unfortunately, the i2c-designware driver ignores this, and believes that
the transfer completed both fully and successfully, but returns bogus
data to userspace or the kernel driver.  That's really _bad_ behaviour
by the driver - it should at least return an error.

This problem is _soo_ bad that on my Juno, I can't run Xorg (it hits
this every time we try to read the EDID) nor can I boot with the TV
connected (it hits this every boot as well.)

I'd go as far as to say that the i2c-designware hardware, when
configured with this option set to zero, is fundamentally broken for OS
which do not provide any guarantee for interrupt latency, such as Linux.

The commit above tries to mitigate this by reducing the Tx FIFO
threshold, so the interrupt is raised sooner, but that's clearly not
enough for reliable operation.

Another mitigation would be to lower the I2C bus frequency on Juno from
400kHz to 100kHz, so that there's 4x longer IRQ latency possible.
However, even that isn't going to be reliable - even going to 100kHz
isn't going to allow the above case to be solved - the interrupt is
delayed by around 2ms, and it takes about 1.4ms to send/receive 16 bytes
at 100kHz.  (9 * 16 / (100*10^3)).

So, I think all hope is lost for i2c-designware on Juno to cope with
reading the EDID from TDA998x reliably.

I have one patch which solves a problem in the accounting of bytes, and
another to ensure that we return an error for an incomplete transfer,
both will be sent threaded to this mail.

Comments

Mika Westerberg Nov. 21, 2016, 10:29 a.m. UTC | #1
On Fri, Nov 18, 2016 at 07:35:42PM +0000, Russell King - ARM Linux wrote:
> With reference to this commit:
> 
> commit d39f77b06a712fcba6185a20bb209e357923d980
> Author: Andrew Jackson <Andrew.Jackson@arm.com>
> Date:   Fri Nov 7 12:10:44 2014 +0000
> 
>     i2c: designware: prevent early stop on TX FIFO empty
> 
>     If the Designware core is configured with IC_EMPTYFIFO_HOLD_MASTER_EN
>     set to zero, allowing the TX FIFO to become empty causes a STOP
>     condition to be generated on the I2C bus. If the transmit FIFO
>     threshold is set too high, an erroneous STOP condition can be
>     generated on long transfers - particularly where the interrupt
>     latency is extended.
> 
>     Signed-off-by: Andrew Jackson <Andrew.Jackson@arm.com>
>     Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
>     Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
>     Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
> 
> The TDA998x driver issues long I2C transfers to read the EDID from the
> device - and userspace can also issue large transfers too.  However,
> if a DW core is configured with IC_EMPTYFIFO_HOLD_MASTER_EN set as
> zero, the above commit doesn't seem to solve the problem.  During
> boot, with the patch below, I see:
> 
> [    1.736549] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x10
> [    1.736564] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x510
> [    1.736608] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x504
> [    1.736799] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x514
> [    1.736819] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x510
> ...
> [    1.737986] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x504
> [    1.738010] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x514
> [    1.738034] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x504
> [    1.738039] random: fast init done
> [    1.740120] i2c_designware 7ffa0000.i2c: i2c_dw_isr: enabled=0x1 stat=0x714
> [    1.740231] i2c_dw_xfer: ffffffc97657b770:1 -> ffffffc97657b770:1 (0:0) [0 0 3 0] 8 [tx:ffffffc976682380:47] [rx:ffffffc9766823c9:55]
> [    1.740249] [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 93
> [    1.746979] Raw EDID:
> [    1.747934]          00 ff ff ff ff ff ff 00 34 a9 96 a2 01 01 01 01
> [    1.752342]          00 17 01 03 80 80 48 78 0a da ff a3 58 4a a2 29
> [    1.756748]          17 49 4b 21 08 00 31 40 45 40 61 40 81 80 01 01
> [    1.761153]          01 01 01 01 01 01 02 3a 80 d0 72 38 2d 40 10 2c
> [    1.765555]          45 80 ba 88 21 00 00 1e 02 00 d0 4e 30 09 12 54
> [    1.769958]          01 08 02 00 23 36 01 40 01 05 00 80 a1 4c 4b 49
> [    1.774361]          22 00 00 40 03 00 28 00 23 01 20 00 01 88 00 01
> [    1.778762]          08 00 00 40 00 02 03 04 0a 00 80 00 02 00 00 40
> 
> The significant thing is the "i2c_dw_xfer" line, where I add a print of
> the current state.  Here, we can see that the transfer is mid-way, but
> a stop condition has been generated by the hardware, leaving 55 bytes
> to be received.
> 
> Unfortunately, the i2c-designware driver ignores this, and believes that
> the transfer completed both fully and successfully, but returns bogus
> data to userspace or the kernel driver.  That's really _bad_ behaviour
> by the driver - it should at least return an error.

Totally agree.

> This problem is _soo_ bad that on my Juno, I can't run Xorg (it hits
> this every time we try to read the EDID) nor can I boot with the TV
> connected (it hits this every boot as well.)
> 
> I'd go as far as to say that the i2c-designware hardware, when
> configured with this option set to zero, is fundamentally broken for OS
> which do not provide any guarantee for interrupt latency, such as Linux.
> 
> The commit above tries to mitigate this by reducing the Tx FIFO
> threshold, so the interrupt is raised sooner, but that's clearly not
> enough for reliable operation.
> 
> Another mitigation would be to lower the I2C bus frequency on Juno from
> 400kHz to 100kHz, so that there's 4x longer IRQ latency possible.
> However, even that isn't going to be reliable - even going to 100kHz
> isn't going to allow the above case to be solved - the interrupt is
> delayed by around 2ms, and it takes about 1.4ms to send/receive 16 bytes
> at 100kHz.  (9 * 16 / (100*10^3)).
> 
> So, I think all hope is lost for i2c-designware on Juno to cope with
> reading the EDID from TDA998x reliably.

:-(

I wonder if we can get it work more reliably by using DMA (provided that
there are DMA channels available for I2C in Juno)? That would allow the
hardware to perform longer reads without relying on how fast the
interrupt handler is able to empty the Rx FIFO.
Russell King (Oracle) Nov. 21, 2016, 10:43 a.m. UTC | #2
On Mon, Nov 21, 2016 at 12:29:01PM +0200, Mika Westerberg wrote:
> On Fri, Nov 18, 2016 at 07:35:42PM +0000, Russell King - ARM Linux wrote:
> > Another mitigation would be to lower the I2C bus frequency on Juno from
> > 400kHz to 100kHz, so that there's 4x longer IRQ latency possible.
> > However, even that isn't going to be reliable - even going to 100kHz
> > isn't going to allow the above case to be solved - the interrupt is
> > delayed by around 2ms, and it takes about 1.4ms to send/receive 16 bytes
> > at 100kHz.  (9 * 16 / (100*10^3)).
> > 
> > So, I think all hope is lost for i2c-designware on Juno to cope with
> > reading the EDID from TDA998x reliably.
> 
> :-(
> 
> I wonder if we can get it work more reliably by using DMA (provided that
> there are DMA channels available for I2C in Juno)? That would allow the
> hardware to perform longer reads without relying on how fast the
> interrupt handler is able to empty the Rx FIFO.

It would need to DMA to the Tx FIFO to keep it filled - it triggers the
stop condition when the Tx FIFO empties.  From what I can see in the
driver, the Tx FIFO not only takes the data but also a "command" to tell
the hardware what to do.

The Rx FIFO would also need DMA to avoid it overflowing due to high
interrupt latency.

I don't know what state DMA is in on the Juno, or even whether it has
DMA - it has a PL330 DMA controller, but I see nothing in the DT files
making use of it.
Mika Westerberg Nov. 21, 2016, 11:09 a.m. UTC | #3
On Mon, Nov 21, 2016 at 10:43:29AM +0000, Russell King - ARM Linux wrote:
> On Mon, Nov 21, 2016 at 12:29:01PM +0200, Mika Westerberg wrote:
> > On Fri, Nov 18, 2016 at 07:35:42PM +0000, Russell King - ARM Linux wrote:
> > > Another mitigation would be to lower the I2C bus frequency on Juno from
> > > 400kHz to 100kHz, so that there's 4x longer IRQ latency possible.
> > > However, even that isn't going to be reliable - even going to 100kHz
> > > isn't going to allow the above case to be solved - the interrupt is
> > > delayed by around 2ms, and it takes about 1.4ms to send/receive 16 bytes
> > > at 100kHz.  (9 * 16 / (100*10^3)).
> > > 
> > > So, I think all hope is lost for i2c-designware on Juno to cope with
> > > reading the EDID from TDA998x reliably.
> > 
> > :-(
> > 
> > I wonder if we can get it work more reliably by using DMA (provided that
> > there are DMA channels available for I2C in Juno)? That would allow the
> > hardware to perform longer reads without relying on how fast the
> > interrupt handler is able to empty the Rx FIFO.
> 
> It would need to DMA to the Tx FIFO to keep it filled - it triggers the
> stop condition when the Tx FIFO empties.  From what I can see in the
> driver, the Tx FIFO not only takes the data but also a "command" to tell
> the hardware what to do.

Yes, the command is either "read" or "write" (meaning even for Rx a
write to the Tx FIFO is needed).

> The Rx FIFO would also need DMA to avoid it overflowing due to high
> interrupt latency.

I guess for Rx we would need to supply a dummy buffer of "read" commands
for the Tx FIFO in addition to reading bytes from Rx FIFO. But still
that might help to improve reliability in case of Juno.

> I don't know what state DMA is in on the Juno, or even whether it has
> DMA - it has a PL330 DMA controller, but I see nothing in the DT files
> making use of it.

OK
Liviu Dudau Nov. 21, 2016, 11:21 a.m. UTC | #4
On Mon, Nov 21, 2016 at 10:43:29AM +0000, Russell King - ARM Linux wrote:
> On Mon, Nov 21, 2016 at 12:29:01PM +0200, Mika Westerberg wrote:
> > On Fri, Nov 18, 2016 at 07:35:42PM +0000, Russell King - ARM Linux wrote:
> > > Another mitigation would be to lower the I2C bus frequency on Juno from
> > > 400kHz to 100kHz, so that there's 4x longer IRQ latency possible.
> > > However, even that isn't going to be reliable - even going to 100kHz
> > > isn't going to allow the above case to be solved - the interrupt is
> > > delayed by around 2ms, and it takes about 1.4ms to send/receive 16 bytes
> > > at 100kHz.  (9 * 16 / (100*10^3)).
> > > 
> > > So, I think all hope is lost for i2c-designware on Juno to cope with
> > > reading the EDID from TDA998x reliably.
> > 
> > :-(
> > 
> > I wonder if we can get it work more reliably by using DMA (provided that
> > there are DMA channels available for I2C in Juno)? That would allow the
> > hardware to perform longer reads without relying on how fast the
> > interrupt handler is able to empty the Rx FIFO.
> 
> It would need to DMA to the Tx FIFO to keep it filled - it triggers the
> stop condition when the Tx FIFO empties.  From what I can see in the
> driver, the Tx FIFO not only takes the data but also a "command" to tell
> the hardware what to do.
> 
> The Rx FIFO would also need DMA to avoid it overflowing due to high
> interrupt latency.
> 
> I don't know what state DMA is in on the Juno, or even whether it has
> DMA - it has a PL330 DMA controller, but I see nothing in the DT files
> making use of it.

The only thing we have tested PL330 with was UART and I2S. I'm not sure we
can really use it with I2C given the above hardware configuration issue.

The other thing we have tried in our private branches was to repeat the EDID
transfer a number of times until the checksum was correct. Andrew Jackson
sent a patch a year or so back to have some ridiculously high number of
retries (30) and that has been rightfully shut down in the upstream.

The unfortunate thing is that the monitors and TVs that we use inside ARM
for testing don't seem to trigger this issue on a regular basis. We had one
or two small 7" TVs that did that, and the brute force retry of EDID transfer
sorted them out for the limited use that we had for them.

Best regards,
Liviu

> 
> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.
Russell King (Oracle) Nov. 21, 2016, 11:36 a.m. UTC | #5
On Mon, Nov 21, 2016 at 11:21:35AM +0000, Liviu Dudau wrote:
> On Mon, Nov 21, 2016 at 10:43:29AM +0000, Russell King - ARM Linux wrote:
> > It would need to DMA to the Tx FIFO to keep it filled - it triggers the
> > stop condition when the Tx FIFO empties.  From what I can see in the
> > driver, the Tx FIFO not only takes the data but also a "command" to tell
> > the hardware what to do.
> > 
> > The Rx FIFO would also need DMA to avoid it overflowing due to high
> > interrupt latency.
> > 
> > I don't know what state DMA is in on the Juno, or even whether it has
> > DMA - it has a PL330 DMA controller, but I see nothing in the DT files
> > making use of it.
> 
> The only thing we have tested PL330 with was UART and I2S. I'm not sure we
> can really use it with I2C given the above hardware configuration issue.
> 
> The other thing we have tried in our private branches was to repeat the EDID
> transfer a number of times until the checksum was correct. Andrew Jackson
> sent a patch a year or so back to have some ridiculously high number of
> retries (30) and that has been rightfully shut down in the upstream.
> 
> The unfortunate thing is that the monitors and TVs that we use inside ARM
> for testing don't seem to trigger this issue on a regular basis. We had one
> or two small 7" TVs that did that, and the brute force retry of EDID transfer
> sorted them out for the limited use that we had for them.

Due to the way the TDA998x works, it doesn't have much to do with the TV.
The TDA998x is instructed to read up to 128 bytes of EDID data into its
own internal buffer.  It then does so, and raises an interrupt (or we
notice that the interrupt flag is set when there's no hardware interrupt
line), and we then read the EDID data out of the TDA998x.

At the point we're reading from the TDA998x, it has loaded the data from
the DDC bus and the DDC bus is idle... we don't have direct access to
the DDC bus.

The only reason I can think that the checksum would pass is if you were
somehow ending up with data in the buffer that did cause the basic
checksum to pass, even though you had an incomplete read.
Robin Murphy Nov. 21, 2016, 12:11 p.m. UTC | #6
On 21/11/16 10:43, Russell King - ARM Linux wrote:
> On Mon, Nov 21, 2016 at 12:29:01PM +0200, Mika Westerberg wrote:
>> On Fri, Nov 18, 2016 at 07:35:42PM +0000, Russell King - ARM Linux wrote:
>>> Another mitigation would be to lower the I2C bus frequency on Juno from
>>> 400kHz to 100kHz, so that there's 4x longer IRQ latency possible.
>>> However, even that isn't going to be reliable - even going to 100kHz
>>> isn't going to allow the above case to be solved - the interrupt is
>>> delayed by around 2ms, and it takes about 1.4ms to send/receive 16 bytes
>>> at 100kHz.  (9 * 16 / (100*10^3)).
>>>
>>> So, I think all hope is lost for i2c-designware on Juno to cope with
>>> reading the EDID from TDA998x reliably.
>>
>> :-(
>>
>> I wonder if we can get it work more reliably by using DMA (provided that
>> there are DMA channels available for I2C in Juno)? That would allow the
>> hardware to perform longer reads without relying on how fast the
>> interrupt handler is able to empty the Rx FIFO.
> 
> It would need to DMA to the Tx FIFO to keep it filled - it triggers the
> stop condition when the Tx FIFO empties.  From what I can see in the
> driver, the Tx FIFO not only takes the data but also a "command" to tell
> the hardware what to do.
> 
> The Rx FIFO would also need DMA to avoid it overflowing due to high
> interrupt latency.
> 
> I don't know what state DMA is in on the Juno, or even whether it has
> DMA - it has a PL330 DMA controller, but I see nothing in the DT files
> making use of it.

There's no DMA request wired up for I2C, unfortunately - only for the
UARTs, I2S, FPFGA tile and trace unit. I recall Liviu did have a patch
to lower the I2C clock as the mitigation you mention, which I thought
had made it into mainline, but apparently not.

Robin.
diff mbox

Patch

diff --git a/drivers/i2c/busses/i2c-designware-core.c b/drivers/i2c/busses/i2c-designware-core.c
index 11e866d05368..060ae9e5a916 100644
--- a/drivers/i2c/busses/i2c-designware-core.c
+++ b/drivers/i2c/busses/i2c-designware-core.c
@@ -752,6 +752,15 @@  i2c_dw_xfer(struct i2c_adapter *adap, struct i2c_msg msgs[], int num)
 		goto done;
 	}
 
+	printk(KERN_DEBUG "%s: %p:%d -> %p:%d (%d:%d) [%x %x %x %x] %d [tx:%p:%d] [rx:%p:%d]\n",
+		__func__, msgs, num,
+		dev->msgs, dev->msgs_num,
+		dev->msg_write_idx, dev->msg_read_idx,
+		dev->cmd_err, dev->msg_err, dev->status, dev->abort_source,
+		dev->rx_outstanding,
+		dev->tx_buf, dev->tx_buf_len,
+		dev->rx_buf, dev->rx_buf_len);
+
 	if (dev->msg_err) {
 		ret = dev->msg_err;
 		goto done;
@@ -857,7 +866,7 @@  static irqreturn_t i2c_dw_isr(int this_irq, void *dev_id)
 
 	enabled = dw_readl(dev, DW_IC_ENABLE);
 	stat = dw_readl(dev, DW_IC_RAW_INTR_STAT);
-	dev_dbg(dev->dev, "%s: enabled=%#x stat=%#x\n", __func__, enabled, stat);
+	dev_printk(KERN_DEBUG, dev->dev, "%s: enabled=%#x stat=%#x\n", __func__, enabled, stat);
 	if (!enabled || !(stat & ~DW_IC_INTR_ACTIVITY))
 		return IRQ_NONE;