mtd: pxa3xx_nand: issue with command time out

Message ID	87a8o9ex9r.fsf@belgarion.home (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Robert Jarzmik <robert.jarzmik@free.fr> To: Michael Wang <Michael.Wang@alliedtelesis.co.nz> Subject: Re: mtd: pxa3xx_nand: issue with command time out References: <5695CEF0.1080100@alliedtelesis.co.nz> Date: Wed, 13 Jan 2016 08:47:28 +0100 In-Reply-To: <5695CEF0.1080100@alliedtelesis.co.nz> (Michael Wang's message of "Wed, 13 Jan 2016 04:13:37 +0000") Message-ID: <87a8o9ex9r.fsf@belgarion.home> User-Agent: Gnus/5.130008 (Ma Gnus v0.8) Emacs/24.4 (gnu/linux) MIME-Version: 1.0 Precedence: list Cc: "computersforpeace@gmail.com" <computersforpeace@gmail.com>, "linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>, "ezequiel@vanguardiasur.com.ar" <ezequiel@vanguardiasur.com.ar>, "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Robert Jarzmik Jan. 13, 2016, 7:47 a.m. UTC

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> Hi Robert,
>
> I am currently working with Marvell 98DX4251 switchchip with control and
> management subsystem SOC based on ARMADA XP, and after upgrading kernel from
...zip...

> 3.16.7 to 4.3, the nand controller command have a high chance of timing out.
> -#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(200)
> +#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(300)
>   #define NAND_STOP_DELAY                msecs_to_jiffies(40)
>   #define PAGE_CHUNK_SIZE                (2048)
>
> So I guess my question is whether the CHIP_DELAY_TIMEOUT should be increased
> for 98DX4251, or is this likely to be caused by some other issue?

Hi Michael,

> Any suggestion, help or hint will be very much appreciated.
Actually I have one.
Before we change the timeout, I'd like to see the sequence of register accesses.

As I owe a patch to Brian on this topic, I'd like to have you as a tester. In
return, I'll analyse the output of debug traces it generates for you.

The patch is at the bottom of this mail, and a "git am" on the mail should apply
it to your kernel. It leaves up to you to add the "#define VERBOSE_DEBUG 1" in
pxa3xx-nand.c, recompile, and take the traces.

Once I have the traces, I will be able to compare with my traces on pxa3xx
platforms and spot if there is a difference. Then we'll be able to see if
increasing the timeout is the only option.

Cheers.

Robert Jarzmik Jan. 16, 2016, 11:53 p.m. UTC | #1

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> The complete output of debug traces and boot log from your debug patch
> is in the attachment.

Okay, thanks, that's great.
Might a ask another capture, but with :
 1) the timestamps this time
    (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
    in the dmesg you will provide me.
 2) on the first line of drain_fifo(), add a single :
    nand_readl(info, NDSR)

This will enable me to see how your board behaves on the temporal aspect, as
today I only have this, which looks correct AFAICT:

pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1, NDSR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():720 nand_writel(0x100d3000, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():721 nand_writel(0x4ff0000, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x2, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():726 nand_writel(0x820, NDCB0)
  => RJK: here the read command is submitted

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1800
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1800, NDSR)
  => RJK: here the read command is completed

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x2
pxa3xx-nand f10d0000.nand: Wait time out!!!
  => RJK: here data is available in NAND controller, and yet it's too late

pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(0x0000): 0xd104b000
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():636 nand_writel(0x6, 0x0014)
  => RJK: here we finish the read data transfer operation, too late

Cheers.

--
Robert

Robert Jarzmik Jan. 18, 2016, 9:13 p.m. UTC | #2

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> On 01/17/2016 12:53 PM, Robert Jarzmik wrote:
>> Might a ask another capture, but with :
>>   1) the timestamps this time
>>      (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
>>      in the dmesg you will provide me.
>>   2) on the first line of drain_fifo(), add a single :
>>      nand_readl(info, NDSR)
>>
>> This will enable me to see how your board behaves on the temporal aspect, as
>> today I only have this, which looks correct AFAICT
> Hi Robert,
>
> Thank you for your analysis.
>
> The attachment file contains the complete debug output with the 
> additional info you requested.
>
> I just want to mention that I get slightly different results on my board 
> with debugs enabled/disabled.
>
> Without your debugs, the kernel will fail to boot up properly with 
> command timing out and nand
> controller in invalid state:
>  >pxa3xx-nand f10d0000.nand: Wait time out!!!
>  >pxa3xx-nand f10d0000.nand: handle_data_pio: invalid state 0
>
> With your debugs, commands still time out a few times, but 
> handle_data_pio() doesn't get into
> invalid state and the kernel boots up fine.

Ok, I analysed with the timings, and now I'm pretty against the timeout change
if the analysis below is correct.

See in the log extract in [1], you have a contiguous sequence of one working
read followed by one broken read :
 - in the working read, the irq thread is scheduled in 8ms, which is pretty
   decent.
 - in the broken read, the irq thread is scheduled in 160ms, which is just
   unbearable.

From this analysis I think you have something broken in your platform, a driver
taking a lock and blocking the scheduling or something like that. Your timeouts
seem to come from there.

If you want to add 2 prinks to validate the theory (and remove my debug logs
which change the realtime behavior) :
 - put one just before the "ret = IRQ_WAKE_THREAD";
 - put the other on the first line of pxa3xx_nand_irq_thread()
 - and calculate the difference. 

Cheers.

--
Robert

[1] Extract of your log
	=> RJK: begin of the working read
[ 2172.409771] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
[ 2172.417844] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
[ 2172.426084] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
[ 2172.434147] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
[ 2172.442823] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1
[ 2172.450608] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1, NDSR)
[ 2172.458479] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x300d3000, NDCB0)
[ 2172.466961] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():723 nand_writel(0x7fcb0000, NDCB0)
[ 2172.475442] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():724 nand_writel(0x3, NDCB0)
[ 2172.483312] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():728 nand_writel(0x820, NDCB0)
[ 2172.491450] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x2
[ 2172.499259] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
	=> RJK: here the irq thread was scheduled in 8ms
[ 2172.506741] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
[ 2172.514106] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():638 nand_writel(0x6, NDSR)
[ 2172.522606] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x500
[ 2172.530565] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x500, NDSR)
	=> RJK: begin of the broken read
[ 2172.538671] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
[ 2172.546753] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
[ 2172.554991] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
[ 2172.563050] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
[ 2172.571722] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1
[ 2172.579508] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1, NDSR)
[ 2172.587379] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x100d3000, NDCB0)
[ 2172.595860] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():723 nand_writel(0xfefd0000, NDCB0)
[ 2172.604341] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():724 nand_writel(0x2, NDCB0)
[ 2172.612212] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():728 nand_writel(0x820, NDCB0)
[ 2172.620319] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1800
[ 2172.628367] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1800, NDSR)
[ 2172.636519] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x2
[ 2172.798485] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
	=> RJK: here the irq thread was scheduled in 160ms
[ 2172.805996] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
[ 2172.813379] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():638 nand_writel(0x6, NDSR)
[ 2172.821017] pxa3xx-nand f10d0000.nand: Wait time out!!!
[ 2172.821029] pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(NDCR): 0xc104b000
[ 2172.821038] pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():487 nand_writel(0xfff, NDSR)

Michael Wang Jan. 19, 2016, 10:33 p.m. UTC | #3

On 01/19/2016 10:13 AM, Robert Jarzmik wrote:
>  From this analysis I think you have something broken in your platform, a driver
> taking a lock and blocking the scheduling or something like that. Your timeouts
> seem to come from there.
>
> If you want to add 2 prinks to validate the theory (and remove my debug logs
> which change the realtime behavior) :
>   - put one just before the "ret = IRQ_WAKE_THREAD";
>   - put the other on the first line of pxa3xx_nand_irq_thread()
>   - and calculate the difference.
Hi Robert,

Thank you very much for your help.

I have added the 2 printks, and I think you are right that there is 
something
blocking the scheduling.

[  146.372745] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.378106] pxa3xx_nand_irq_thread:643
[  146.411117] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.537124] pxa3xx_nand_irq_thread:643
       => irq thread was scheduled in 0.126 ms
[  146.541245] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.546603] pxa3xx_nand_irq_thread:643

I will continue to investigate the cause of this delay.

Thanks,
Michael

Brian Norris Jan. 25, 2016, 6:30 p.m. UTC | #4

Hi Robert,

On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
> Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:
> 
> > The complete output of debug traces and boot log from your debug patch
> > is in the attachment.
> 
> Okay, thanks, that's great.
> Might a ask another capture, but with :
>  1) the timestamps this time
>     (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
>     in the dmesg you will provide me.
>  2) on the first line of drain_fifo(), add a single :
>     nand_readl(info, NDSR)
> 
> This will enable me to see how your board behaves on the temporal aspect, as
> today I only have this, which looks correct AFAICT:
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1, NDSR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():720 nand_writel(0x100d3000, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():721 nand_writel(0x4ff0000, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x2, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():726 nand_writel(0x820, NDCB0)
>   => RJK: here the read command is submitted
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1800
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1800, NDSR)
>   => RJK: here the read command is completed
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x2
> pxa3xx-nand f10d0000.nand: Wait time out!!!
>   => RJK: here data is available in NAND controller, and yet it's too late
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(0x0000): 0xd104b000
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():636 nand_writel(0x6, 0x0014)
>   => RJK: here we finish the read data transfer operation, too late

I'm not extremely familiar with this driver, but you've caused me to
take a second look at this commit:

commit 24542257a3b987025d4b998ec2d15e556c98ad3f
Author: Robert Jarzmik <robert.jarzmik@free.fr>
Date:   Fri Feb 20 19:36:43 2015 +0100

    mtd: pxa3xx-nand: handle PIO in threaded interrupt


And now I'm wondering: when does the completion get triggered? i.e.:

	complete(&info->cmd_complete);

?

It seems to me like you've short-circuited some of the IRQ handling
code, so that the threaded handler is buggy. AIUI, if the completion
event ever happens, it's actually happening *before* the full (threaded)
handler is actually finished, since it occurs in pxa3xx_nand_irq(),
before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
causes a real problem in practice, since you used IRQF_ONESHOT, but I
would think that's also suspect.

Brian

Robert Jarzmik Jan. 25, 2016, 8:48 p.m. UTC | #5

Brian Norris <computersforpeace@gmail.com> writes:

> Hi Robert,
>
> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
> Author: Robert Jarzmik <robert.jarzmik@free.fr>
> Date:   Fri Feb 20 19:36:43 2015 +0100
>
>     mtd: pxa3xx-nand: handle PIO in threaded interrupt
>
>
> And now I'm wondering: when does the completion get triggered? i.e.:
>
> 	complete(&info->cmd_complete);

> It seems to me like you've short-circuited some of the IRQ handling
> code, so that the threaded handler is buggy. AIUI, if the completion
> event ever happens, it's actually happening *before* the full (threaded)
> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
> causes a real problem in practice, since you used IRQF_ONESHOT, but I
> would think that's also suspect.

I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
short-cirtcuit. That makes me think that I should :
 - add
static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
{
	unsigned int done;

 	cmd_done = info->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
	nand_writel(info, NDSR, cmd_done);
        complete(&inf0->cmd_complete);
}

 - in pxa3xx_nand_irq_thread(), I should add at the end:
	pxa3xx_nand_data_finished(info);

 - in pxa3xx_nand_data_dma_irq(), I should add at the end:
	pxa3xx_nand_data_finished(info);

Ezequiel, would you have a look at that code snippet and tell me if I'm missing
something in the driver's structure ?

Cheers.

--
Robert

Ezequiel Garcia Jan. 26, 2016, 1:03 p.m. UTC | #6

On 25 January 2016 at 17:48, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> Brian Norris <computersforpeace@gmail.com> writes:
>
>> Hi Robert,
>>
>> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
>> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
>> Author: Robert Jarzmik <robert.jarzmik@free.fr>
>> Date:   Fri Feb 20 19:36:43 2015 +0100
>>
>>     mtd: pxa3xx-nand: handle PIO in threaded interrupt
>>
>>
>> And now I'm wondering: when does the completion get triggered? i.e.:
>>
>>       complete(&info->cmd_complete);
>
>> It seems to me like you've short-circuited some of the IRQ handling
>> code, so that the threaded handler is buggy. AIUI, if the completion
>> event ever happens, it's actually happening *before* the full (threaded)
>> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
>> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
>> causes a real problem in practice, since you used IRQF_ONESHOT, but I
>> would think that's also suspect.
>
> I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
> short-cirtcuit. That makes me think that I should :
>  - add
> static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
> {
>         unsigned int done;
>
>         cmd_done = info->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
>         nand_writel(info, NDSR, cmd_done);
>         complete(&inf0->cmd_complete);
> }
>
>  - in pxa3xx_nand_irq_thread(), I should add at the end:
>         pxa3xx_nand_data_finished(info);
>
>  - in pxa3xx_nand_data_dma_irq(), I should add at the end:
>         pxa3xx_nand_data_finished(info);
>
> Ezequiel, would you have a look at that code snippet and tell me if I'm missing
> something in the driver's structure ?
>

Although it seems Brian is right, I'm not so sure about it. As far as
I can recall, the IRQ sequence is (please correct me if I'm wrong!):

1. Write command request (WRCMDREQ), the IRQ handler writes the command.
2. Read/write data request (RDDREQ | WRDREQ). The IRQ handler fires
the threaded handler, which does data transfer.
3. Command done IRQ (CMDD), the IRQ handler completes the completion.

I don't think data request and command done can be found
simultaneously, so I don't think this is a real issue. Ideas?

Regarding the fixed proposed up there, it seems wrong: the completion
should be completed only after a command done IRQ is found, and not as
the result of data transfer being finished.

Hamish Martin March 1, 2016, 10:17 p.m. UTC | #7

Hi Robert, Ezequiel, and Brian,

I have been working with Michael Wang on this issue and wanted to update 
you on our results as we think there is a wider issue to resolve here.

I believe Ezequiel's analysis of the IRQ sequence is correct. The 
proposed patch from Brian is not needed, not does it resolve our issue.

Further, we have debugged in more detail with kernel tracing and see 
that the problem occurs because nothing calls 'schedule()' in the kernel 
within the timeout period (200ms) to actually allow the IRQ bottom-half 
thread ((pxa3xx_nand_irq_thread()) to run.

With the trace enabled we were able to see long delays between the IRQ 
thread being queued for run as a result of the return with 
IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually 
selecting the thread for execution. This delay was tied to whatever else 
was happening in our system during the boot up when this issue occurs. 
For instance, if a mutex got unlocked this would ultimately lead to a 
call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run. 
If some file system ops happened schedule() would get called. These 
unrelated activities were required so that schedule() got called and the 
scheduler would let the bottom-half execute.
In our tracing we saw the issue occur whenever the call to schedule() 
didn't occur before the timeout expired. In short, if we got lucky and 
something completely unrelated happened that led to a call to 
'schedule()', then the issue didn't occur. If we got unlucky and nothing 
called schedule(), the issue was observed.

We have enabled CONFIG_PREEMPT in our system to resolve this. This has 
the effect of leading to a triggering of 'schedule()' at the end of 
processing interrupts. This means that we now see a nice tight execution 
of pxa3xx_nand threaded IRQ such that we see all ops completing with 
less that 20ms delay.

I wanted to highlight this to you because I think it shows that the 
change to a threaded IRQ is probably only safe if we can guarantee 
scheduling of the bottom-half thread within the time out period (200ms). 
I think this is not safe for systems with CONFIG_PREEMPT not configured. 
Or perhaps there is a better way to resolve this. Is this a deficiency 
in the Armada XP code?

What do you think?

Thanks,
Hamish Martin

On 01/27/2016 02:03 AM, Ezequiel Garcia wrote:
> On 25 January 2016 at 17:48, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
>> Brian Norris <computersforpeace@gmail.com> writes:
>>
>>> Hi Robert,
>>>
>>> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
>>> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
>>> Author: Robert Jarzmik <robert.jarzmik@free.fr>
>>> Date:   Fri Feb 20 19:36:43 2015 +0100
>>>
>>>      mtd: pxa3xx-nand: handle PIO in threaded interrupt
>>>
>>>
>>> And now I'm wondering: when does the completion get triggered? i.e.:
>>>
>>>        complete(&info->cmd_complete);
>>> It seems to me like you've short-circuited some of the IRQ handling
>>> code, so that the threaded handler is buggy. AIUI, if the completion
>>> event ever happens, it's actually happening *before* the full (threaded)
>>> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
>>> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
>>> causes a real problem in practice, since you used IRQF_ONESHOT, but I
>>> would think that's also suspect.
>> I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
>> short-cirtcuit. That makes me think that I should :
>>   - add
>> static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
>> {
>>          unsigned int done;
>>
>>          cmd_done =nfo->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
>>          nand_writel(info, NDSR, cmd_done);
>>          complete(&inf0->cmd_complete);
>> }
>>
>>   - in pxa3xx_nand_irq_thread(), I should add at the end:
>>          pxa3xx_nand_data_finished(info);
>>
>>   - in pxa3xx_nand_data_dma_irq(), I should add at the end:
>>          pxa3xx_nand_data_finished(info);
>>
>> Ezequiel, would you have a look at that code snippet and tell me if I'm missing
>> something in the driver's structure ?
>>
> Although it seems Brian is right, I'm not so sure about it. As far as
> I can recall, the IRQ sequence is (please correct me if I'm wrong!):
>
> 1. Write command request (WRCMDREQ), the IRQ handler writes the command.
> 2. Read/write data request (RDDREQ | WRDREQ). The IRQ handler fires
> the threaded handler, which does data transfer.
> 3. Command done IRQ (CMDD), the IRQ handler completes the completion.
>
> I don't think data request and command done can be found
> simultaneously, so I don't think this is a real issue. Ideas?
>
> Regarding the fixed proposed up there, it seems wrong: the completion
> should be completed only after a command done IRQ is found, and not as
> the result of data transfer being finished.

Ezequiel Garcia March 3, 2016, 7:24 p.m. UTC | #8

+Richard, Boris and Thomas

On 1 March 2016 at 19:17, Hamish Martin
<Hamish.Martin@alliedtelesis.co.nz> wrote:
> Hi Robert, Ezequiel, and Brian,
>
> I have been working with Michael Wang on this issue and wanted to update
> you on our results as we think there is a wider issue to resolve here.
>
> I believe Ezequiel's analysis of the IRQ sequence is correct. The
> proposed patch from Brian is not needed, not does it resolve our issue.
>
> Further, we have debugged in more detail with kernel tracing and see
> that the problem occurs because nothing calls 'schedule()' in the kernel
> within the timeout period (200ms) to actually allow the IRQ bottom-half
> thread ((pxa3xx_nand_irq_thread()) to run.
>
> With the trace enabled we were able to see long delays between the IRQ
> thread being queued for run as a result of the return with
> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
> selecting the thread for execution. This delay was tied to whatever else
> was happening in our system during the boot up when this issue occurs.
> For instance, if a mutex got unlocked this would ultimately lead to a
> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
> If some file system ops happened schedule() would get called. These
> unrelated activities were required so that schedule() got called and the
> scheduler would let the bottom-half execute.
> In our tracing we saw the issue occur whenever the call to schedule()
> didn't occur before the timeout expired. In short, if we got lucky and
> something completely unrelated happened that led to a call to
> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
> called schedule(), the issue was observed.
>
> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
> the effect of leading to a triggering of 'schedule()' at the end of
> processing interrupts. This means that we now see a nice tight execution
> of pxa3xx_nand threaded IRQ such that we see all ops completing with
> less that 20ms delay.
>
> I wanted to highlight this to you because I think it shows that the
> change to a threaded IRQ is probably only safe if we can guarantee
> scheduling of the bottom-half thread within the time out period (200ms).
> I think this is not safe for systems with CONFIG_PREEMPT not configured.
> Or perhaps there is a better way to resolve this. Is this a deficiency
> in the Armada XP code?
>
> What do you think?
>

Thanks a lot for updating us. Richard, Boris and I have
discussed about this on IRC.

We came to the following conclusions:

1. There's no way to guarantee that the thread will be scheduled
soon enough to always do the data transfer in time.

Using CONFIG_PREEMPT could prevent a timeout in many (most?)
cases, but it's not guaranteed to work.

2. The pxa3xx-nand driver is poorly designed, and it shouldn't
do any data transfer in the IRQ (threaded or not) handler.

Instead, this should be done in {read,write}_buf, i.e. in the
context of the calling process that asks data to be read/write.
This is potentially a big rework.

3. Richard proposed to get rid of the timeout
(using wait_for_completion_interruptible instead?).

I'm wondering if that could avoid these kind of issues (false
timeouts), as a less invasive solution, until someone steps
up to rework the driver.

How does it sound?

Hamish Martin March 3, 2016, 8:16 p.m. UTC | #9

Hi Ezequiel,

I'm glad my analysis was correct and we more fully understand the 
problem now. CONFIG_PREEMPT seems to work for us now, but it does seem 
like the wrong fix, and if you guys think it will not guarantee 
preventing the issue then we'd rather back that change out.

I had contemplated using wait_for_completion_interruptible() but was a 
bit unsure if it was safe. Do you have a proposed patch? We would be 
happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it 
goes. Out issue is highly reproducible so I think it would be a good 
first test for a change?

Thanks,
Hamish M.

On 03/04/2016 08:24 AM, Ezequiel Garcia wrote:
> +Richard, Boris and Thomas
>
> On 1 March 2016 at 19:17, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Robert, Ezequiel, and Brian,
>>
>> I have been working with Michael Wang on this issue and wanted to update
>> you on our results as we think there is a wider issue to resolve here.
>>
>> I believe Ezequiel's analysis of the IRQ sequence is correct. The
>> proposed patch from Brian is not needed, not does it resolve our issue.
>>
>> Further, we have debugged in more detail with kernel tracing and see
>> that the problem occurs because nothing calls 'schedule()' in the kernel
>> within the timeout period (200ms) to actually allow the IRQ bottom-half
>> thread ((pxa3xx_nand_irq_thread()) to run.
>>
>> With the trace enabled we were able to see long delays between the IRQ
>> thread being queued for run as a result of the return with
>> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
>> selecting the thread for execution. This delay was tied to whatever else
>> was happening in our system during the boot up when this issue occurs.
>> For instance, if a mutex got unlocked this would ultimately lead to a
>> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
>> If some file system ops happened schedule() would get called. These
>> unrelated activities were required so that schedule() got called and the
>> scheduler would let the bottom-half execute.
>> In our tracing we saw the issue occur whenever the call to schedule()
>> didn't occur before the timeout expired. In short, if we got lucky and
>> something completely unrelated happened that led to a call to
>> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
>> called schedule(), the issue was observed.
>>
>> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
>> the effect of leading to a triggering of 'schedule()' at the end of
>> processing interrupts. This means that we now see a nice tight execution
>> of pxa3xx_nand threaded IRQ such that we see all ops completing with
>> less that 20ms delay.
>>
>> I wanted to highlight this to you because I think it shows that the
>> change to a threaded IRQ is probably only safe if we can guarantee
>> scheduling of the bottom-half thread within the time out period (200ms).
>> I think this is not safe for systems with CONFIG_PREEMPT not configured.
>> Or perhaps there is a better way to resolve this. Is this a deficiency
>> in the Armada XP code?
>>
>> What do you think?
>>
> Thanks a lot for updating us. Richard, Boris and I have
> discussed about this on IRC.
>
> We came to the following conclusions:
>
> 1. There's no way to guarantee that the thread will be scheduled
> soon enough to always do the data transfer in time.
>
> Using CONFIG_PREEMPT could prevent a timeout in many (most?)
> cases, but it's not guaranteed to work.
>
> 2. The pxa3xx-nand driver is poorly designed, and it shouldn't
> do any data transfer in the IRQ (threaded or not) handler.
>
> Instead, this should be done in {read,write}_buf, i.e. in the
> context of the calling process that asks data to be read/write.
> This is potentially a big rework.
>
> 3. Richard proposed to get rid of the timeout
> (using wait_for_completion_interruptible instead?).
>
> I'm wondering if that could avoid these kind of issues (false
> timeouts), as a less invasive solution, until someone steps
> up to rework the driver.
>
> How does it sound?

Richard Weinberger March 3, 2016, 11:18 p.m. UTC | #10

Am 03.03.2016 um 20:24 schrieb Ezequiel Garcia:
> +Richard, Boris and Thomas
> 
> On 1 March 2016 at 19:17, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Robert, Ezequiel, and Brian,
>>
>> I have been working with Michael Wang on this issue and wanted to update
>> you on our results as we think there is a wider issue to resolve here.
>>
>> I believe Ezequiel's analysis of the IRQ sequence is correct. The
>> proposed patch from Brian is not needed, not does it resolve our issue.
>>
>> Further, we have debugged in more detail with kernel tracing and see
>> that the problem occurs because nothing calls 'schedule()' in the kernel
>> within the timeout period (200ms) to actually allow the IRQ bottom-half
>> thread ((pxa3xx_nand_irq_thread()) to run.
>>
>> With the trace enabled we were able to see long delays between the IRQ
>> thread being queued for run as a result of the return with
>> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
>> selecting the thread for execution. This delay was tied to whatever else
>> was happening in our system during the boot up when this issue occurs.
>> For instance, if a mutex got unlocked this would ultimately lead to a
>> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
>> If some file system ops happened schedule() would get called. These
>> unrelated activities were required so that schedule() got called and the
>> scheduler would let the bottom-half execute.
>> In our tracing we saw the issue occur whenever the call to schedule()
>> didn't occur before the timeout expired. In short, if we got lucky and
>> something completely unrelated happened that led to a call to
>> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
>> called schedule(), the issue was observed.
>>
>> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
>> the effect of leading to a triggering of 'schedule()' at the end of
>> processing interrupts. This means that we now see a nice tight execution
>> of pxa3xx_nand threaded IRQ such that we see all ops completing with
>> less that 20ms delay.
>>
>> I wanted to highlight this to you because I think it shows that the
>> change to a threaded IRQ is probably only safe if we can guarantee
>> scheduling of the bottom-half thread within the time out period (200ms).
>> I think this is not safe for systems with CONFIG_PREEMPT not configured.
>> Or perhaps there is a better way to resolve this. Is this a deficiency
>> in the Armada XP code?
>>
>> What do you think?
>>
> 
> Thanks a lot for updating us. Richard, Boris and I have
> discussed about this on IRC.
> 
> We came to the following conclusions:
> 
> 1. There's no way to guarantee that the thread will be scheduled
> soon enough to always do the data transfer in time.
> 
> Using CONFIG_PREEMPT could prevent a timeout in many (most?)
> cases, but it's not guaranteed to work.
> 
> 2. The pxa3xx-nand driver is poorly designed, and it shouldn't
> do any data transfer in the IRQ (threaded or not) handler.
> 
> Instead, this should be done in {read,write}_buf, i.e. in the
> context of the calling process that asks data to be read/write.
> This is potentially a big rework.
> 
> 3. Richard proposed to get rid of the timeout
> (using wait_for_completion_interruptible instead?).

Getting rid of the timeout can work but definitely needs more
clarification.
Before commit 24542257a3b987025d4b998ec2d15e556c98ad3f it made sense
but now the thread can be delayed for unknown time.

Thanks,
//richard

Ezequiel Garcia March 4, 2016, 12:03 a.m. UTC | #11

On 3 March 2016 at 17:16, Hamish Martin
<Hamish.Martin@alliedtelesis.co.nz> wrote:
> Hi Ezequiel,
>
> I'm glad my analysis was correct and we more fully understand the
> problem now. CONFIG_PREEMPT seems to work for us now, but it does seem
> like the wrong fix, and if you guys think it will not guarantee
> preventing the issue then we'd rather back that change out.
>

Do you have any good reason for NOT enabling CONFIG_PREEMPT?

I've had very bad experiences on embedded systems when not
using CONFIG_PREEMPT, and came to the conclusion that
enable full preemption, unless there's a good reason not to.

Enabling CONFIG_PREEMPT would result in slightly lower throughput,
but maximum latency will be greatly reduced.

For instance, in a kernel without preemption, a UBI static volume check
(which happens on the first UBI static volume open call), can take
several seconds.

> I had contemplated using wait_for_completion_interruptible() but was a
> bit unsure if it was safe. Do you have a proposed patch? We would be
> happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it
> goes. Out issue is highly reproducible so I think it would be a good
> first test for a change?
>

Sure, I can try to cook a patch for you.

Hamish Martin March 7, 2016, 8:04 p.m. UTC | #12

Hi Ezequiel,

On 03/04/2016 01:03 PM, Ezequiel Garcia wrote:
> On 3 March 2016 at 17:16, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Ezequiel,
>>
>> I'm glad my analysis was correct and we more fully understand the
>> problem now. CONFIG_PREEMPT seems to work for us now, but it does seem
>> like the wrong fix, and if you guys think it will not guarantee
>> preventing the issue then we'd rather back that change out.
>>
> Do you have any good reason for NOT enabling CONFIG_PREEMPT?
>
> I've had very bad experiences on embedded systems when not
> using CONFIG_PREEMPT, and came to the conclusion that
> enable full preemption, unless there's a good reason not to.
>
> Enabling CONFIG_PREEMPT would result in slightly lower throughput,
> but maximum latency will be greatly reduced.
>
> For instance, in a kernel without preemption, a UBI static volume check
> (which happens on the first UBI static volume open call), can take
> several seconds.
We don't have a good reason, it is mainly an historic artifact. We are 
happy to go with that as a fix and we are investigating it for other 
platforms too. I think it is an oversight that hasn't been given the 
thought it should.
I think we still need to consider operation of the driver without 
CONFIG_PREEMPT enabled.

>
>> I had contemplated using wait_for_completion_interruptible() but was a
>> bit unsure if it was safe. Do you have a proposed patch? We would be
>> happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it
>> goes. Out issue is highly reproducible so I think it would be a good
>> first test for a change?
>>
> Sure, I can try to cook a patch for you.
Thanks for your help with this guys. Send me the patch when you have one 
you want me to test out for you.

Cheers,
Hamish M.

mtd: pxa3xx_nand: issue with command time out

Commit Message

Comments

Patch