Message ID | 54D9F3C7.5000809@freebox.fr (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 02/10/2015 07:04 AM, Nicolas Schichan wrote: > On 02/10/2015 12:34 AM, Peter Hurley wrote: >> Hi Nicolas, >> >> Thanks for the report. >> > [...] >>> When a caracter is received on the UART while the kernel is printing >>> the boot messages, as soon as the kernel configures the UART for >>> receiving (after root filesystem mount), it gets stuck printing the >>> following message repeatedly: >>> >>> serial8250: too much work for irq29 >>> >>> Once stuck, the reception of another character allows the boot process >>> to finish. >>> >>> From what I can gather, when we hit that, the UART_IIR_NO_INT is 0 (so the >>> interrupt is raised), but the UART_LSR_DR bit is 0 as well so the UART_RX >>> register is never read to clear the interrupt. >> >> The "too much work" message means serial8250_handle_irq() is returning 0, >> ie., not handled. Which in turn means IIR indicates no interrupt is pending >> (UART_IIR_NO_INT == 1). >> >> Can you log the register values for LSR and IIR at both patch locations >> in serial8250_do_startup()? >> >> (I can get you a debug patch, if necessary. Let me know) > > Hi Peter, > > Thanks for your reply. > > Here is what I have when the issue is triggered: > > [ 12.154877] lsr 0x60 / iir 0x01 > [ 12.158071] lsr 0x60 / iir 0x01 > [ 12.161438] serial8250: too much work for irq29 > [ 12.165982] lsr 0x60 / iir 0x0c > [ 12.169354] serial8250: too much work for irq29 > [ 12.173900] lsr 0x60 / iir 0x0c > (previous two messages are repeated and printk_ratelimited()) Thanks for this information; I see I was wrong about the cause of message. I think what happens during startup is that on this silicon clearing the rx fifo (by serial8250_clear_fifos()) clears data ready but not the rx timeout condition which causes a spurious rx interrupt when interrupts are enabled. So caught between two broken UARTs: one that underflows its rx fifo because of unsolicited rx reads and the other that generates spurious interrupt without unsolicited rx reads. > When the issue is not triggered: > > [ 10.784871] lsr 0x60 / iir 0x01 > [ 10.788066] lsr 0x60 / iir 0x01 > [ 10.794734] VFS: Mounted root (nfs filesystem) readonly on device 0:13. > [ 10.801654] devtmpfs: mounted > [ 10.805169] Freeing unused kernel memory: 184K (807be000 - 807ec000) > (userland takes over after that) > > I have also displayed the IIR and LSR registers when the "too much fork for > IRQ" condition is triggered. > > In the serial8250_do_startup(), before the interrupt are unmasked at the end, > the IIR looks sane and UART_IIR_NO_INT bit is set. When stuck > serial8250_interrupt(), UART_IIR_NO_INT is cleared and the interrupt ID is set > to 0xc which is not handled by the kernel at this time (the Kirkwood datasheet > indicates that it is some kind of timeout condition from what I can gather). Yes, IIR == UART_IIR_RX_TIMEOUT is to used indicate that data is in the rx fifo but has not reached the rx trigger level yet. ATM, I'm not exactly sure if there is a safe way to clear the spurious interrupt from the interrupt handler. I'm fairly certain the only way to clear the rx timeout interrupt is to read the rx fifo, but I think this would race with actual data arrival. IOW, there might not be a way to determine if the data read is spurious or not. Regards, Peter Hurley > Here is the corresponding debug patch, for reference (against a 3.19 kernel > this time): > > diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_ > index 11c6685..ed93741 100644 > --- a/drivers/tty/serial/8250/8250_core.c > +++ b/drivers/tty/serial/8250/8250_core.c > @@ -1661,6 +1661,12 @@ static int exar_handle_irq(struct uart_port *port) > return ret; > } > > +static inline void show_lsr_iir(struct uart_port *port) > +{ > + printk("lsr 0x%02x / iir 0x%02x\n", serial_port_in(port, UART_LSR), > + serial_port_in(port, UART_IIR)); > +} > + > /* > * This is the serial driver's interrupt routine. > * > @@ -1705,6 +1711,8 @@ static irqreturn_t serial8250_interrupt(int irq, void *dev > /* If we hit this, we're dead. */ > printk_ratelimited(KERN_ERR > "serial8250: too much work for irq%d\n", irq); > + if (printk_ratelimit()) > + show_lsr_iir(port); > break; > } > } while (l != end); > @@ -2107,6 +2115,7 @@ int serial8250_do_startup(struct uart_port *port) > /* > * Clear the interrupt registers. > */ > + show_lsr_iir(port); > if (serial_port_in(port, UART_LSR) & UART_LSR_DR) > serial_port_in(port, UART_RX); > serial_port_in(port, UART_IIR); > @@ -2269,6 +2278,7 @@ dont_test_tx_en: > * saved flags to avoid getting false values from polling > * routines or the previous session. > */ > + show_lsr_iir(port); > if (serial_port_in(port, UART_LSR) & UART_LSR_DR) > serial_port_in(port, UART_RX); > serial_port_in(port, UART_IIR); > > Regards, >
On 02/10/2015 12:46 PM, Peter Hurley wrote: > On 02/10/2015 07:04 AM, Nicolas Schichan wrote: >> On 02/10/2015 12:34 AM, Peter Hurley wrote: >>> Hi Nicolas, >>> >>> Thanks for the report. >>> >> [...] >>>> When a caracter is received on the UART while the kernel is printing >>>> the boot messages, as soon as the kernel configures the UART for >>>> receiving (after root filesystem mount), it gets stuck printing the >>>> following message repeatedly: >>>> >>>> serial8250: too much work for irq29 >>>> >>>> Once stuck, the reception of another character allows the boot process >>>> to finish. >>>> >>>> From what I can gather, when we hit that, the UART_IIR_NO_INT is 0 (so the >>>> interrupt is raised), but the UART_LSR_DR bit is 0 as well so the UART_RX >>>> register is never read to clear the interrupt. >>> >>> The "too much work" message means serial8250_handle_irq() is returning 0, >>> ie., not handled. Which in turn means IIR indicates no interrupt is pending >>> (UART_IIR_NO_INT == 1). >>> >>> Can you log the register values for LSR and IIR at both patch locations >>> in serial8250_do_startup()? >>> >>> (I can get you a debug patch, if necessary. Let me know) >> >> Hi Peter, >> >> Thanks for your reply. >> >> Here is what I have when the issue is triggered: >> >> [ 12.154877] lsr 0x60 / iir 0x01 >> [ 12.158071] lsr 0x60 / iir 0x01 >> [ 12.161438] serial8250: too much work for irq29 >> [ 12.165982] lsr 0x60 / iir 0x0c >> [ 12.169354] serial8250: too much work for irq29 >> [ 12.173900] lsr 0x60 / iir 0x0c >> (previous two messages are repeated and printk_ratelimited()) > > Thanks for this information; I see I was wrong about the cause of message. > > I think what happens during startup is that on this silicon clearing > the rx fifo (by serial8250_clear_fifos()) clears data ready but not > the rx timeout condition which causes a spurious rx interrupt when > interrupts are enabled. > > So caught between two broken UARTs: one that underflows its rx fifo because > of unsolicited rx reads and the other that generates spurious interrupt > without unsolicited rx reads. > > >> When the issue is not triggered: >> >> [ 10.784871] lsr 0x60 / iir 0x01 >> [ 10.788066] lsr 0x60 / iir 0x01 >> [ 10.794734] VFS: Mounted root (nfs filesystem) readonly on device 0:13. >> [ 10.801654] devtmpfs: mounted >> [ 10.805169] Freeing unused kernel memory: 184K (807be000 - 807ec000) >> (userland takes over after that) >> >> I have also displayed the IIR and LSR registers when the "too much fork for >> IRQ" condition is triggered. >> >> In the serial8250_do_startup(), before the interrupt are unmasked at the end, >> the IIR looks sane and UART_IIR_NO_INT bit is set. When stuck >> serial8250_interrupt(), UART_IIR_NO_INT is cleared and the interrupt ID is set >> to 0xc which is not handled by the kernel at this time (the Kirkwood datasheet >> indicates that it is some kind of timeout condition from what I can gather). > > Yes, IIR == UART_IIR_RX_TIMEOUT is to used indicate that data is in the rx fifo > but has not reached the rx trigger level yet. > > ATM, I'm not exactly sure if there is a safe way to clear the spurious interrupt > from the interrupt handler. > > I'm fairly certain the only way to clear the rx timeout interrupt is to read > the rx fifo, but I think this would race with actual data arrival. IOW, there > might not be a way to determine if the data read is spurious or not. Yep, I see no safe way to clear the spurious interrupt [1] and no idea how to keep it from happening (other than via the unsolicited RX reads in serial8250_do_startup). Unfortunately, I think this means we'll have to revert Sebastian's commit: commit 0aa525d11859c1a4d5b78fdc704148e2ae03ae13 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Wed Sep 10 21:29:58 2014 +0200 tty: serial: 8250_core: read only RX if there is something in the FIFO which just means OMAP3630 will be limited to using the omap_serial driver. Regards, Peter Hurley [1] To clear the RX timeout interrupt requires reading the rx fifo even though LSR[data ready] indicates no data. However, this could result in dropped data if the data became available just before clearing the RX timeout. For example, CPU | Device | irq handler (simplified) | | read IIR | is interrupt? yes | read LSR | is data ready? no | is IIR == Rx timeout? yes | new data arrives | rx_fifo[0] = new data | lsr[data ready] = 1 read RX and discard | |
* Peter Hurley <peter@hurleysoftware.com> [150211 12:05]: > On 02/10/2015 12:46 PM, Peter Hurley wrote: > > On 02/10/2015 07:04 AM, Nicolas Schichan wrote: > >> On 02/10/2015 12:34 AM, Peter Hurley wrote: > >>> Hi Nicolas, > >>> > >>> Thanks for the report. > >>> > >> [...] > >>>> When a caracter is received on the UART while the kernel is printing > >>>> the boot messages, as soon as the kernel configures the UART for > >>>> receiving (after root filesystem mount), it gets stuck printing the > >>>> following message repeatedly: > >>>> > >>>> serial8250: too much work for irq29 > >>>> > >>>> Once stuck, the reception of another character allows the boot process > >>>> to finish. > >>>> > >>>> From what I can gather, when we hit that, the UART_IIR_NO_INT is 0 (so the > >>>> interrupt is raised), but the UART_LSR_DR bit is 0 as well so the UART_RX > >>>> register is never read to clear the interrupt. > >>> > >>> The "too much work" message means serial8250_handle_irq() is returning 0, > >>> ie., not handled. Which in turn means IIR indicates no interrupt is pending > >>> (UART_IIR_NO_INT == 1). > >>> > >>> Can you log the register values for LSR and IIR at both patch locations > >>> in serial8250_do_startup()? > >>> > >>> (I can get you a debug patch, if necessary. Let me know) > >> > >> Hi Peter, > >> > >> Thanks for your reply. > >> > >> Here is what I have when the issue is triggered: > >> > >> [ 12.154877] lsr 0x60 / iir 0x01 > >> [ 12.158071] lsr 0x60 / iir 0x01 > >> [ 12.161438] serial8250: too much work for irq29 > >> [ 12.165982] lsr 0x60 / iir 0x0c > >> [ 12.169354] serial8250: too much work for irq29 > >> [ 12.173900] lsr 0x60 / iir 0x0c > >> (previous two messages are repeated and printk_ratelimited()) > > > > Thanks for this information; I see I was wrong about the cause of message. > > > > I think what happens during startup is that on this silicon clearing > > the rx fifo (by serial8250_clear_fifos()) clears data ready but not > > the rx timeout condition which causes a spurious rx interrupt when > > interrupts are enabled. > > > > So caught between two broken UARTs: one that underflows its rx fifo because > > of unsolicited rx reads and the other that generates spurious interrupt > > without unsolicited rx reads. > > > > > >> When the issue is not triggered: > >> > >> [ 10.784871] lsr 0x60 / iir 0x01 > >> [ 10.788066] lsr 0x60 / iir 0x01 > >> [ 10.794734] VFS: Mounted root (nfs filesystem) readonly on device 0:13. > >> [ 10.801654] devtmpfs: mounted > >> [ 10.805169] Freeing unused kernel memory: 184K (807be000 - 807ec000) > >> (userland takes over after that) > >> > >> I have also displayed the IIR and LSR registers when the "too much fork for > >> IRQ" condition is triggered. > >> > >> In the serial8250_do_startup(), before the interrupt are unmasked at the end, > >> the IIR looks sane and UART_IIR_NO_INT bit is set. When stuck > >> serial8250_interrupt(), UART_IIR_NO_INT is cleared and the interrupt ID is set > >> to 0xc which is not handled by the kernel at this time (the Kirkwood datasheet > >> indicates that it is some kind of timeout condition from what I can gather). > > > > Yes, IIR == UART_IIR_RX_TIMEOUT is to used indicate that data is in the rx fifo > > but has not reached the rx trigger level yet. > > > > ATM, I'm not exactly sure if there is a safe way to clear the spurious interrupt > > from the interrupt handler. > > > > I'm fairly certain the only way to clear the rx timeout interrupt is to read > > the rx fifo, but I think this would race with actual data arrival. IOW, there > > might not be a way to determine if the data read is spurious or not. > > Yep, I see no safe way to clear the spurious interrupt [1] and no idea how to > keep it from happening (other than via the unsolicited RX reads in > serial8250_do_startup). > > Unfortunately, I think this means we'll have to revert Sebastian's commit: > > commit 0aa525d11859c1a4d5b78fdc704148e2ae03ae13 > Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > Date: Wed Sep 10 21:29:58 2014 +0200 > > tty: serial: 8250_core: read only RX if there is something in the FIFO > > which just means OMAP3630 will be limited to using the omap_serial driver. Reverting makes sense to me if it has caused a regression. Maybe Sebastian can update his patch to do this based on some quirk flag instead? Regards, Tony > [1] To clear the RX timeout interrupt requires reading the rx fifo even though > LSR[data ready] indicates no data. However, this could result in dropped data > if the data became available just before clearing the RX timeout. For example, > > CPU | Device > | > irq handler (simplified) | > | > read IIR | > is interrupt? yes | > read LSR | > is data ready? no | > is IIR == Rx timeout? yes | new data arrives > | rx_fifo[0] = new data > | lsr[data ready] = 1 > read RX and discard | > | >
On 02/11/2015 03:03 PM, Tony Lindgren wrote: > * Peter Hurley <peter@hurleysoftware.com> [150211 12:05]: >> On 02/10/2015 12:46 PM, Peter Hurley wrote: >>> On 02/10/2015 07:04 AM, Nicolas Schichan wrote: >>>> On 02/10/2015 12:34 AM, Peter Hurley wrote: >>>>> Hi Nicolas, >>>>> >>>>> Thanks for the report. >>>>> >>>> [...] >>>>>> When a caracter is received on the UART while the kernel is printing >>>>>> the boot messages, as soon as the kernel configures the UART for >>>>>> receiving (after root filesystem mount), it gets stuck printing the >>>>>> following message repeatedly: >>>>>> >>>>>> serial8250: too much work for irq29 >>>>>> >>>>>> Once stuck, the reception of another character allows the boot process >>>>>> to finish. >>>>>> >>>>>> From what I can gather, when we hit that, the UART_IIR_NO_INT is 0 (so the >>>>>> interrupt is raised), but the UART_LSR_DR bit is 0 as well so the UART_RX >>>>>> register is never read to clear the interrupt. >>>>> >>>>> The "too much work" message means serial8250_handle_irq() is returning 0, >>>>> ie., not handled. Which in turn means IIR indicates no interrupt is pending >>>>> (UART_IIR_NO_INT == 1). >>>>> >>>>> Can you log the register values for LSR and IIR at both patch locations >>>>> in serial8250_do_startup()? >>>>> >>>>> (I can get you a debug patch, if necessary. Let me know) >>>> >>>> Hi Peter, >>>> >>>> Thanks for your reply. >>>> >>>> Here is what I have when the issue is triggered: >>>> >>>> [ 12.154877] lsr 0x60 / iir 0x01 >>>> [ 12.158071] lsr 0x60 / iir 0x01 >>>> [ 12.161438] serial8250: too much work for irq29 >>>> [ 12.165982] lsr 0x60 / iir 0x0c >>>> [ 12.169354] serial8250: too much work for irq29 >>>> [ 12.173900] lsr 0x60 / iir 0x0c >>>> (previous two messages are repeated and printk_ratelimited()) >>> >>> Thanks for this information; I see I was wrong about the cause of message. >>> >>> I think what happens during startup is that on this silicon clearing >>> the rx fifo (by serial8250_clear_fifos()) clears data ready but not >>> the rx timeout condition which causes a spurious rx interrupt when >>> interrupts are enabled. >>> >>> So caught between two broken UARTs: one that underflows its rx fifo because >>> of unsolicited rx reads and the other that generates spurious interrupt >>> without unsolicited rx reads. >>> >>> >>>> When the issue is not triggered: >>>> >>>> [ 10.784871] lsr 0x60 / iir 0x01 >>>> [ 10.788066] lsr 0x60 / iir 0x01 >>>> [ 10.794734] VFS: Mounted root (nfs filesystem) readonly on device 0:13. >>>> [ 10.801654] devtmpfs: mounted >>>> [ 10.805169] Freeing unused kernel memory: 184K (807be000 - 807ec000) >>>> (userland takes over after that) >>>> >>>> I have also displayed the IIR and LSR registers when the "too much fork for >>>> IRQ" condition is triggered. >>>> >>>> In the serial8250_do_startup(), before the interrupt are unmasked at the end, >>>> the IIR looks sane and UART_IIR_NO_INT bit is set. When stuck >>>> serial8250_interrupt(), UART_IIR_NO_INT is cleared and the interrupt ID is set >>>> to 0xc which is not handled by the kernel at this time (the Kirkwood datasheet >>>> indicates that it is some kind of timeout condition from what I can gather). >>> >>> Yes, IIR == UART_IIR_RX_TIMEOUT is to used indicate that data is in the rx fifo >>> but has not reached the rx trigger level yet. >>> >>> ATM, I'm not exactly sure if there is a safe way to clear the spurious interrupt >>> from the interrupt handler. >>> >>> I'm fairly certain the only way to clear the rx timeout interrupt is to read >>> the rx fifo, but I think this would race with actual data arrival. IOW, there >>> might not be a way to determine if the data read is spurious or not. >> >> Yep, I see no safe way to clear the spurious interrupt [1] and no idea how to >> keep it from happening (other than via the unsolicited RX reads in >> serial8250_do_startup). >> >> Unfortunately, I think this means we'll have to revert Sebastian's commit: >> >> commit 0aa525d11859c1a4d5b78fdc704148e2ae03ae13 >> Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> >> Date: Wed Sep 10 21:29:58 2014 +0200 >> >> tty: serial: 8250_core: read only RX if there is something in the FIFO >> >> which just means OMAP3630 will be limited to using the omap_serial driver. > > Reverting makes sense to me if it has caused a regression. Maybe Sebastian > can update his patch to do this based on some quirk flag instead? That's fine with me. There's a 'bugs' field in struct 8250_uart_port and UART_BUG_* defines in 8250/8250.h for that purpose. Regards, Peter Hurley
On 02/11/2015 09:42 PM, Peter Hurley wrote: >> Reverting makes sense to me if it has caused a regression. Maybe Sebastian >> can update his patch to do this based on some quirk flag instead? > > That's fine with me. There's a 'bugs' field in struct 8250_uart_port and > UART_BUG_* defines in 8250/8250.h for that purpose. Makes sense. Reading an empty FIFO does not look right. Maybe we should do the bug flag the other way around? But I can do what I am told to so if there is more fallout than just this Marvell UART I could come around with a patch to the bug field for the older OMAP. > Regards, > Peter Hurley Sebastian
On Thu, Feb 12, 2015 at 09:45:38AM +0100, Sebastian Andrzej Siewior wrote: > On 02/11/2015 09:42 PM, Peter Hurley wrote: > > >> Reverting makes sense to me if it has caused a regression. Maybe Sebastian > >> can update his patch to do this based on some quirk flag instead? > > > > That's fine with me. There's a 'bugs' field in struct 8250_uart_port and > > UART_BUG_* defines in 8250/8250.h for that purpose. > > Makes sense. Reading an empty FIFO does not look right. Maybe we should > do the bug flag the other way around? But I can do what I am told to so > if there is more fallout than just this Marvell UART I could come > around with a patch to the bug field for the older OMAP. Reading the RX FIFO is something that goes back a long time in the 8250 driver, to the time when Ted Ts'o wrote the driver. I would suggest that it remains there as the default (it was obviously found to be necessary for x86 all those years ago when Linux was in its infancy) and has survived for getting on 20 years without issue.
On 02/12/2015 03:45 AM, Sebastian Andrzej Siewior wrote: > On 02/11/2015 09:42 PM, Peter Hurley wrote: > >>> Reverting makes sense to me if it has caused a regression. Maybe Sebastian >>> can update his patch to do this based on some quirk flag instead? >> >> That's fine with me. There's a 'bugs' field in struct 8250_uart_port and >> UART_BUG_* defines in 8250/8250.h for that purpose. > > Makes sense. Reading an empty FIFO does not look right. Maybe we should > do the bug flag the other way around? But I can do what I am told to so > if there is more fallout than just this Marvell UART I could come > around with a patch to the bug field for the older OMAP. I agree with Russell on this; better to stick with the rx read that's been running on 20 years of hardware. That said, I don't think serial8250_do_startup() is really doing that much for OMAP h/w startup; open-coding what omap_8250 really needs is probably < 10 loc. Regards, Peter Hurley
diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_ index 11c6685..ed93741 100644 --- a/drivers/tty/serial/8250/8250_core.c +++ b/drivers/tty/serial/8250/8250_core.c @@ -1661,6 +1661,12 @@ static int exar_handle_irq(struct uart_port *port) return ret; } +static inline void show_lsr_iir(struct uart_port *port) +{ + printk("lsr 0x%02x / iir 0x%02x\n", serial_port_in(port, UART_LSR), + serial_port_in(port, UART_IIR)); +} + /* * This is the serial driver's interrupt routine. * @@ -1705,6 +1711,8 @@ static irqreturn_t serial8250_interrupt(int irq, void *dev /* If we hit this, we're dead. */ printk_ratelimited(KERN_ERR "serial8250: too much work for irq%d\n", irq); + if (printk_ratelimit()) + show_lsr_iir(port); break; } } while (l != end); @@ -2107,6 +2115,7 @@ int serial8250_do_startup(struct uart_port *port) /* * Clear the interrupt registers. */ + show_lsr_iir(port); if (serial_port_in(port, UART_LSR) & UART_LSR_DR) serial_port_in(port, UART_RX); serial_port_in(port, UART_IIR); @@ -2269,6 +2278,7 @@ dont_test_tx_en: * saved flags to avoid getting false values from polling * routines or the previous session. */ + show_lsr_iir(port); if (serial_port_in(port, UART_LSR) & UART_LSR_DR) serial_port_in(port, UART_RX); serial_port_in(port, UART_IIR);