diff mbox

arm64: Boot failure on m400 with new cont PTEs

Message ID 1447858999-26665-1-git-send-email-jeremy.linton@arm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jeremy Linton Nov. 18, 2015, 3:03 p.m. UTC
The HP m400 fails to boot the linux 4.4rc1 kernel. It usually
hangs or sometimes takes an unhanded exception around the DMA
zone messages. This was bisected to the new CONT PTE changes.

Adding an extra flush_tlb_all() in the code path which is
changing the kernel permissions allows the machine to boot
consistently.

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
---
 arch/arm64/mm/mmu.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Mark Rutland Nov. 18, 2015, 3:20 p.m. UTC | #1
Hi Jeremy,

On Wed, Nov 18, 2015 at 09:03:19AM -0600, Jeremy Linton wrote:
> The HP m400 fails to boot the linux 4.4rc1 kernel. 

Are you using defconfig? If not, can you share your config?

> It usually hangs or sometimes takes an unhanded exception around the
> DMA zone messages. This was bisected to the new CONT PTE changes.

Do you have any examples of the unhandled exception cases? Are they a
mixed bag, or a consistent exception class?

> Adding an extra flush_tlb_all() in the code path which is
> changing the kernel permissions allows the machine to boot
> consistently.

As you mention changing permissions, I take it you're using
CONFIG_DEBUG_RODATA?

Thanks,
Mark.

> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
> ---
>  arch/arm64/mm/mmu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e3f563c..e92fe77 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -121,6 +121,7 @@ static void __populate_init_pte(pte_t *pte, unsigned long addr,
>  		pfn++;
>  		addr += PAGE_SIZE;
>  	} while (addr != end);
> +	flush_tlb_all();
>  }
>  
>  static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
> -- 
> 2.4.3
>
Jeremy Linton Nov. 18, 2015, 4:08 p.m. UTC | #2
On 11/18/2015 09:20 AM, Mark Rutland wrote:
> Hi Jeremy,
>
> On Wed, Nov 18, 2015 at 09:03:19AM -0600, Jeremy Linton wrote:
>> The HP m400 fails to boot the linux 4.4rc1 kernel.
>
> Are you using defconfig? If not, can you share your config?
	No, its not defconfig, its roughly the RHELSA config tossed into a 
mainline 4.4 tree and all the default options selected. AFAIK RHELSA is 
still limited access.

>
>> It usually hangs or sometimes takes an unhanded exception around the
>> DMA zone messages. This was bisected to the new CONT PTE changes.
>
> Do you have any examples of the unhandled exception cases? Are they a
> mixed bag, or a consistent exception class?

I'm guessing about 90% of the time its a dead hang, the remaining are 
the faults of which there is one that happens more frequently than the 
others. Here is one i found in my notes..

[    0.000000] On node 0 totalpages: 1048512
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 65472 pages, LIFO batch:1
[    0.000000] Unhandled fault: unknown 48 (0x96000070) at 
0xfffffe0000d60588

>> Adding an extra flush_tlb_all() in the code path which is
>> changing the kernel permissions allows the machine to boot
>> consistently.
>
> As you mention changing permissions, I take it you're using
> CONFIG_DEBUG_RODATA?

The failing configuration doesn't have DEBUG_RODATA set, I might have 
been pretty loose with my terminology.

Frankly, I wondered originally how config RODATA was working reliably 
because the flushes were only around the directories getting split, 
fixup_init() (and basically anything calling create_mapping_late()) 
looked like there were paths that could avoid flushing. When I added the 
CONT changes I didn't add flushes to paths that didn't previously have 
them (except in the split cont range case, which matched the spit p[mu]d 
case). I made the mistake of assuming someone knew about some edge case 
that avoided the need for the flush.

Once I find/fix the console issue on that machine with 4.4rc1 (there are 
a small handful of issues that keep mainline from working on it, 
including the sata patch that was posted, and rejected), I will focus on 
hoisting the tlb flush into create_mapping_late() and removing the 
splattering of flushes in those code paths. That is unless there is a 
reason to be preforming them as soon as the directories are split.
Mark Rutland Nov. 18, 2015, 4:29 p.m. UTC | #3
On Wed, Nov 18, 2015 at 10:08:58AM -0600, Jeremy Linton wrote:
> On 11/18/2015 09:20 AM, Mark Rutland wrote:
> >Hi Jeremy,
> >
> >On Wed, Nov 18, 2015 at 09:03:19AM -0600, Jeremy Linton wrote:
> >>The HP m400 fails to boot the linux 4.4rc1 kernel.
> >
> >Are you using defconfig? If not, can you share your config?
> 	No, its not defconfig, its roughly the RHELSA config tossed into a
> mainline 4.4 tree and all the default options selected. AFAIK RHELSA
> is still limited access.

That renders this extremely difficult for anyone else to reproduce...

> >>It usually hangs or sometimes takes an unhanded exception around the
> >>DMA zone messages. This was bisected to the new CONT PTE changes.
> >
> >Do you have any examples of the unhandled exception cases? Are they a
> >mixed bag, or a consistent exception class?
> 
> I'm guessing about 90% of the time its a dead hang, the remaining
> are the faults of which there is one that happens more frequently
> than the others. Here is one i found in my notes..

Ok. In future please provide a sample with any bug report.

> [    0.000000] On node 0 totalpages: 1048512
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 0 pages reserved
> [    0.000000]   DMA zone: 65472 pages, LIFO batch:1
> [    0.000000] Unhandled fault: unknown 48 (0x96000070) at
> 0xfffffe0000d60588

From a quick grep that's from do_mem_abort, where the "unknown 48" is
the DFSC, the bit in brackets is the ESR, and the address is the
faulting address from FAR_EL1.

That 48 / 0b110000 for the DFSC decodes as "TLB conflict abort" per the
ARM ARM. Other than that, the WnR bit is set in the ISS.

So this is probably a break-before-make issue.

Can you figure out where 0xfffffe0000d60588 pointed to, and where in the
kernel the access was performed? It would be nice to know if this is
consistently happening at some edge of the kernel address space.

FWIW, Will had a patch [1] for detecting PTE level break-before-make
violations. I gave this a go on Juno with v4.4-rc1, and saw an issue in
the EFI virtmap code that I'm currently investigating.

> >>Adding an extra flush_tlb_all() in the code path which is
> >>changing the kernel permissions allows the machine to boot
> >>consistently.
> >
> >As you mention changing permissions, I take it you're using
> >CONFIG_DEBUG_RODATA?
> 
> The failing configuration doesn't have DEBUG_RODATA set, I might
> have been pretty loose with my terminology.

Ok, good to know.

> Frankly, I wondered originally how config RODATA was working
> reliably because the flushes were only around the directories
> getting split, fixup_init() (and basically anything calling
> create_mapping_late()) looked like there were paths that could avoid
> flushing. When I added the CONT changes I didn't add flushes to
> paths that didn't previously have them (except in the split cont
> range case, which matched the spit p[mu]d case). I made the mistake
> of assuming someone knew about some edge case that avoided the need
> for the flush.

I'll need to page the code back into my head, but I recall I had
concerns about break-before-make, so there's some auditing to be done.

> Once I find/fix the console issue on that machine with 4.4rc1 (there
> are a small handful of issues that keep mainline from working on it,
> including the sata patch that was posted, and rejected), I will
> focus on hoisting the tlb flush into create_mapping_late() and
> removing the splattering of flushes in those code paths. That is
> unless there is a reason to be preforming them as soon as the
> directories are split.

We need to figure out exactly what maintenance we actually need.

Hoisting the TLB flush isn't necessarily possible if we need to perform
break-before-make at the PTE level, and even that may not be possible
for the kernel page tables; we might need to do something more
drastic like using ASIDs and double-buffering them...

We also need to figure out what's happening with the code as it is.

Thanks,
Mark.

[1] https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=aarch64/devel&id=372f39220ad35fa39a75419f2221ffeb6ffd78d3
Jeremy Linton Nov. 18, 2015, 5:14 p.m. UTC | #4
On 11/18/2015 10:29 AM, Mark Rutland wrote:
> On Wed, Nov 18, 2015 at 10:08:58AM -0600, Jeremy Linton wrote:
>> 	No, its not defconfig, its roughly the RHELSA config tossed into a
>> mainline 4.4 tree and all the default options selected. AFAIK RHELSA
>> is still limited access.
>
> That renders this extremely difficult for anyone else to reproduce...

Well the kernel in question boots fine on a Juno. I haven't tried any 
other APM based machines. And given whats happening I doubt its config 
related.

> That 48 / 0b110000 for the DFSC decodes as "TLB conflict abort" per the
> ARM ARM. Other than that, the WnR bit is set in the ISS.
>
> So this is probably a break-before-make issue.
>
> Can you figure out where 0xfffffe0000d60588 pointed to, and where in the
> kernel the access was performed? It would be nice to know if this is
> consistently happening at some edge of the kernel address space.

I decoded everything when I initially saw it, but it didn't make a lick 
of sense related to what I was attempting to accomplish so I didn't keep 
any of it. Only later when I found out it wasn't related to the patches 
I was applying did I start trying to track down the regression. Even so, 
given some other patches that went in, it wasn't blindingly obvious 
where the problem was until I was sure that it was related to the linear 
mapping changes. AKA I didn't think anyone would be able to debug the 
failure with that little information, maybe i'm wrong on that point... 
Anyway, the kernel that produced that failure is long gone, I can in the 
near future attempt to reproduce the message.


>> Once I find/fix the console issue on that machine with 4.4rc1 (there
>> are a small handful of issues that keep mainline from working on it,
>> including the sata patch that was posted, and rejected), I will
>> focus on hoisting the tlb flush into create_mapping_late() and
>> removing the splattering of flushes in those code paths. That is
>> unless there is a reason to be preforming them as soon as the
>> directories are split.
>
> We need to figure out exactly what maintenance we actually need.
>
> Hoisting the TLB flush isn't necessarily possible if we need to perform
> break-before-make at the PTE level, and even that may not be possible
> for the kernel page tables; we might need to do something more
> drastic like using ASIDs and double-buffering them...
>
> We also need to figure out what's happening with the code as it is.

Well, I'm suspect what is happening is that there are conflicting TLB's 
hanging around, one for a cont range that is overlapping a stale non 
cont one. This sort of implies that this has been happening all along, 
AKA RO regions were being "lazy" activated if you will. Its only on a 
core that aborts when it detects that (which i assume requires differing 
size entries for this core) does it cause problems. The 
break-before-make issue, seems like it won't cause a big problem here as 
long as there is some way to assure valid TLBs before the update, and 
then assure they are cleared following it. Hence the overly aggressive 
change works because it flushes following every cont block update. Which 
would bother me more if the code were run more than once per boot (or in 
the future per module load/unload if someone gets around to updating the 
no execute reliably).
Mark Rutland Nov. 18, 2015, 6:04 p.m. UTC | #5
> >We also need to figure out what's happening with the code as it is.
> 
> Well, I'm suspect what is happening is that there are conflicting
> TLB's hanging around, one for a cont range that is overlapping a stale
> non cont one. This sort of implies that this has been happening all
> along, AKA RO regions were being "lazy" activated if you will. Its
> only on a core that aborts when it detects that (which i assume
> requires differing size entries for this core) does it cause problems.

I suspect that we may have believed that the TLB maintenance at the end
of paging_init was sufficient, as we evidently  id not consider TLB
conflicts.

> The break-before-make issue, seems like it won't cause a big problem
> here as long as there is some way to assure valid TLBs before the
> update, and then assure they are cleared following it.

If at any instant in time you have a valid TLB entry for an address, and
the page tables hold a value that would give rise to a conflicting TLB
entry, you can encounter a TLB conflict abort.

It's a race against the hardware.

> Hence the overly aggressive change works because it
> flushes following every cont block update. Which would bother me
> more if the code were run more than once per boot (or in the future
> per module load/unload if someone gets around to updating the no
> execute reliably).

You're racing against other parts of the CPU (the page table walker(s),
I-caches, etc). The flushing only minimises the window for a race, and
does not prevent the race from being possible.

Given that the envelope is constantly pushing forward w.r.t. how
aggressive CPUs may be in this area, we need to fix the issue by
reasoning against what the architecture guarantees us.

We're almost certainly going to have to revisit this code in future, and
sprinkling TLB maintenance over it will only make it harder to reason
about in future.

Mark.
Jeremy Linton Nov. 18, 2015, 7:31 p.m. UTC | #6
On 11/18/2015 12:04 PM, Mark Rutland wrote:

> You're racing against other parts of the CPU (the page table walker(s),
> I-caches, etc). The flushing only minimises the window for a race, and
> does not prevent the race from being possible.
>
> Given that the envelope is constantly pushing forward w.r.t. how
> aggressive CPUs may be in this area, we need to fix the issue by
> reasoning against what the architecture guarantees us.
	Its also not suppose to fault on speculative access, and to me that 
means page table walks/etc that are the result of speculative access. 
Which AFAIK, closes the window significantly. I would only really worry 
about interrupt activity, and updates to the memory containing the PTE's 
themselves. Either way the simple change (rather than rewriting the 
whole code path) is probably to flag the fault handler to simply resume 
from these kinds of faults during create_mapping_late().

	But that isn't what is happening here AFAIK, the faults are long after 
the PTE's have been updated, and are the result of failure to flush the 
TLB..
Mark Rutland Nov. 19, 2015, 11:31 a.m. UTC | #7
On Wed, Nov 18, 2015 at 01:31:18PM -0600, Jeremy Linton wrote:
> On 11/18/2015 12:04 PM, Mark Rutland wrote:
> 
> >You're racing against other parts of the CPU (the page table walker(s),
> >I-caches, etc). The flushing only minimises the window for a race, and
> >does not prevent the race from being possible.
> >
> >Given that the envelope is constantly pushing forward w.r.t. how
> >aggressive CPUs may be in this area, we need to fix the issue by
> >reasoning against what the architecture guarantees us.
> 	Its also not suppose to fault on speculative access, and to me that
> means page table walks/etc that are the result of speculative
> access.

I was under the impression that TLB conflict abort could be delivered
for asynchronous events (e.g. speculative I-cache fetches rather than
for speculative execution of already fetched instructions).

Having looked at the ARM ARM, I appear to have been mistaken. As you
say, it appears that TLB conflict aborts are always delivered
synchronously.

> Which AFAIK, closes the window significantly. I would only
> really worry about interrupt activity, and updates to the memory
> containing the PTE's themselves. Either way the simple change
> (rather than rewriting the whole code path) is probably to flag the
> fault handler to simply resume from these kinds of faults during
> create_mapping_late().

Unfortunately that may not be sufficient. The conflicting address range
might cover the current stack or the text of the exception handler, and
in those cases trying to handle the exception would result in taking
another TLB conflict abort recursively.

> 	But that isn't what is happening here AFAIK, the faults are long
> after the PTE's have been updated, and are the result of failure to
> flush the TLB..

It's not quite the same case, certainly.

It's also possible that the faults you are seeing see are also possible
earlier, but simply less likely, and that we get away without seeing the
other potential issues because of things that may change (i.e. the way
the compiler lays out the text).

I think that if we need to do something more drastic to account for the
other issues above (e.g. by ensuring that we can never allocate
conflicting TLB entries in the first place), and that said strategy
would also fix this problem, that would be preferable, given that we're
going to have to do that eventually anyway.

Mark.
Mark Rutland Nov. 20, 2015, 7:52 p.m. UTC | #8
On Thu, Nov 19, 2015 at 11:31:34AM +0000, Mark Rutland wrote:
> On Wed, Nov 18, 2015 at 01:31:18PM -0600, Jeremy Linton wrote:
> > On 11/18/2015 12:04 PM, Mark Rutland wrote:
> > 
> > >You're racing against other parts of the CPU (the page table walker(s),
> > >I-caches, etc). The flushing only minimises the window for a race, and
> > >does not prevent the race from being possible.
> > >
> > >Given that the envelope is constantly pushing forward w.r.t. how
> > >aggressive CPUs may be in this area, we need to fix the issue by
> > >reasoning against what the architecture guarantees us.
> > 	Its also not suppose to fault on speculative access, and to me that
> > means page table walks/etc that are the result of speculative
> > access.
> 
> I was under the impression that TLB conflict abort could be delivered
> for asynchronous events (e.g. speculative I-cache fetches rather than
> for speculative execution of already fetched instructions).
> 
> Having looked at the ARM ARM, I appear to have been mistaken. As you
> say, it appears that TLB conflict aborts are always delivered
> synchronously.
> 
> > Which AFAIK, closes the window significantly. I would only
> > really worry about interrupt activity, and updates to the memory
> > containing the PTE's themselves. Either way the simple change
> > (rather than rewriting the whole code path) is probably to flag the
> > fault handler to simply resume from these kinds of faults during
> > create_mapping_late().

> > 	But that isn't what is happening here AFAIK, the faults are long
> > after the PTE's have been updated, and are the result of failure to
> > flush the TLB..

> I think that if we need to do something more drastic to account for the
> other issues above (e.g. by ensuring that we can never allocate
> conflicting TLB entries in the first place), and that said strategy
> would also fix this problem, that would be preferable, given that we're
> going to have to do that eventually anyway.

Having looked into this further, we also have the same issue with the
kasan init code.

I believe that the issue is restricted to one-off init code, as I don't
think that we do anything at runtime which would be problematic. If
anyone knows of a counter-example, please let me know!

Given that, we can restrict the problem to an early UP environment, and
it won't matter if therre's some large(ish) fixed cost associated with
updating the kernel page tables. I think that we can avoid the issue
entirely by modifying a copy of the kernel page tables, which we can
later install via some idmap code (going via a reserved table to clear
the TLBs).

I'm working on patches to implement the above, which I'll try to get
somewhere with next week.

Thanks,
Mark
Mark Rutland Nov. 20, 2015, 8:15 p.m. UTC | #9
On Thu, Nov 19, 2015 at 11:31:34AM +0000, Mark Rutland wrote:
> On Wed, Nov 18, 2015 at 01:31:18PM -0600, Jeremy Linton wrote:
> > On 11/18/2015 12:04 PM, Mark Rutland wrote:
> > 
> > >You're racing against other parts of the CPU (the page table walker(s),
> > >I-caches, etc). The flushing only minimises the window for a race, and
> > >does not prevent the race from being possible.
> > >
> > >Given that the envelope is constantly pushing forward w.r.t. how
> > >aggressive CPUs may be in this area, we need to fix the issue by
> > >reasoning against what the architecture guarantees us.
> > 	Its also not suppose to fault on speculative access, and to me that
> > means page table walks/etc that are the result of speculative
> > access.
> 
> I was under the impression that TLB conflict abort could be delivered
> for asynchronous events (e.g. speculative I-cache fetches rather than
> for speculative execution of already fetched instructions).
> 
> Having looked at the ARM ARM, I appear to have been mistaken. As you
> say, it appears that TLB conflict aborts are always delivered
> synchronously.

Having invesitgated further, while we may not encounter (synchronous)
TLB conflict aborts, we may still encounter (asynchronous) issues from
conflicting TLB entries.

Per the ARM ARM, if the TLB contains multiple entries for the same
address, the result of a translation may be some amalgamation of said
entries (where the amalgamation could be an arbitrary function of all of
said matching entries).

Thus page table walks and *-cache fetches may use completely erroneous
addresses and/or attributes, asynchronous to the instruction stream, and
as a result of this may change the state of MMIO peripherals, trigger
SError, etc.

This is a much scarier proposition than the TLB conflict aborts.

Thanks,
Mark.
Catalin Marinas Nov. 23, 2015, 12:15 p.m. UTC | #10
On Fri, Nov 20, 2015 at 07:52:44PM +0000, Mark Rutland wrote:
> On Thu, Nov 19, 2015 at 11:31:34AM +0000, Mark Rutland wrote:
> > I think that if we need to do something more drastic to account for the
> > other issues above (e.g. by ensuring that we can never allocate
> > conflicting TLB entries in the first place), and that said strategy
> > would also fix this problem, that would be preferable, given that we're
> > going to have to do that eventually anyway.
> 
> Having looked into this further, we also have the same issue with the
> kasan init code.

I don't think the kasan_init() problem is that bad. We are preserving
the same size mappings (PAGE_SIZE) and just changing the physical
address they point at without a break-before-make (just a TTBR1 switch).
I don't know how clear the ARM ARM is around this but at least so far we
haven't hit any problems.

The problem with the contiguous bit is that we switch from e.g. a 4KB
mapping to a 64KB one and it's very likely that we would get a TLB
conflict.

With CONFIG_DEBUG_RODATA, we go from bigger block to a smaller one, so
less chance of a TLB conflict but still present. I need to read the ARM
ARM some more in this area (and maybe ask for clarification).

> I believe that the issue is restricted to one-off init code, as I don't
> think that we do anything at runtime which would be problematic. If
> anyone knows of a counter-example, please let me know!
> 
> Given that, we can restrict the problem to an early UP environment, and
> it won't matter if therre's some large(ish) fixed cost associated with
> updating the kernel page tables. I think that we can avoid the issue
> entirely by modifying a copy of the kernel page tables, which we can
> later install via some idmap code (going via a reserved table to clear
> the TLBs).
> 
> I'm working on patches to implement the above, which I'll try to get
> somewhere with next week.

That's a complete fix indeed but it would require some more testing and
I don't think it's feasible for 4.4-rc. In the meantime, I propose that
we revert the contiguous PTE patches and push them again once we fix the
TLB conflict problems.
Mark Rutland Nov. 23, 2015, 1:49 p.m. UTC | #11
On Mon, Nov 23, 2015 at 12:15:15PM +0000, Catalin Marinas wrote:
> On Fri, Nov 20, 2015 at 07:52:44PM +0000, Mark Rutland wrote:
> > On Thu, Nov 19, 2015 at 11:31:34AM +0000, Mark Rutland wrote:
> > > I think that if we need to do something more drastic to account for the
> > > other issues above (e.g. by ensuring that we can never allocate
> > > conflicting TLB entries in the first place), and that said strategy
> > > would also fix this problem, that would be preferable, given that we're
> > > going to have to do that eventually anyway.
> > 
> > Having looked into this further, we also have the same issue with the
> > kasan init code.
> 
> I don't think the kasan_init() problem is that bad. We are preserving
> the same size mappings (PAGE_SIZE) and just changing the physical
> address they point at without a break-before-make (just a TTBR1 switch).

Per the ARM ARM, "CONSTRAINED UNPREDICTABLE behaviors due to caching of
control or data values", the result of a translation could be "an
amalgamation" of the values. I believe that we have to read
"amalgamation" as "arbitrary function of" here.

I don't think that we're safe because we only changed the output
addresses of entries.

> I don't know how clear the ARM ARM is around this but at least so far we
> haven't hit any problems.

I assume you're talking generally here, rather than specifically about
kasan. I agree that we haven't spotted any issues so far.

Given that kasan itself is new and requires a relatively new compiler,
it may not yet have been tested on a platform where it would fail on.

Jeremy, for reference, have you tried kasan on m400? Or DEBUG_RODATA?

> The problem with the contiguous bit is that we switch from e.g. a 4KB
> mapping to a 64KB one and it's very likely that we would get a TLB
> conflict.
> 
> With CONFIG_DEBUG_RODATA, we go from bigger block to a smaller one, so
> less chance of a TLB conflict but still present. I need to read the ARM
> ARM some more in this area (and maybe ask for clarification).

We should certainly try to get clarification here.

> > I believe that the issue is restricted to one-off init code, as I don't
> > think that we do anything at runtime which would be problematic. If
> > anyone knows of a counter-example, please let me know!
> > 
> > Given that, we can restrict the problem to an early UP environment, and
> > it won't matter if therre's some large(ish) fixed cost associated with
> > updating the kernel page tables. I think that we can avoid the issue
> > entirely by modifying a copy of the kernel page tables, which we can
> > later install via some idmap code (going via a reserved table to clear
> > the TLBs).
> > 
> > I'm working on patches to implement the above, which I'll try to get
> > somewhere with next week.
> 
> That's a complete fix indeed but it would require some more testing and
> I don't think it's feasible for 4.4-rc. In the meantime, I propose that
> we revert the contiguous PTE patches and push them again once we fix the
> TLB conflict problems.

I agree that this would be too late for v4.4-rc*.

In the meantime, I guess that reverting the patches is the best thing to
do given we're already at rc2.

Thanks,
Mark.
Jeremy Linton Nov. 23, 2015, 2:31 p.m. UTC | #12
On 11/23/2015 06:15 AM, Catalin Marinas wrote:

> That's a complete fix indeed but it would require some more testing and
> I don't think it's feasible for 4.4-rc. In the meantime, I propose that
> we revert the contiguous PTE patches and push them again once we fix the
> TLB conflict problems.


The problem as you say exists with the block mappings as well, and the 
explicit TLB flush there is apparently sufficient to keep this problem 
from occurring. The 1 line explicit flush is sufficient to keep it from 
happening in the late mapping case as well, and actually fixes a problem 
that was in the kernel from before these patches (the complete lack of 
flush in paths calling create_mapping_late()). So, I fail to see how 
reverting the patch fixes anything, it only returns to the previously 
(now known) to be bad code path.

BTW: The M400, has other issues that keep it from booting with ACPI 
firmware (the 4.3 EOI changes irritate a 'bug' in the MADT table).
Jeremy Linton Nov. 23, 2015, 2:48 p.m. UTC | #13
On 11/23/2015 07:49 AM, Mark Rutland wrote:
> Jeremy, for reference, have you tried kasan on m400? Or DEBUG_RODATA?


No, because the machine has a list of issues that keep it from booting a 
mainline kernel in a functional state. Once those are cleared up I will 
revisit this patch.

The goal was to create a conceptually safe fix for the problem, which 
isn't all this hypothetical stuff being discussed, but the fact that the 
TLBs are not being flushed properly (with or without the CONT bit stuff) 
resulting in a tlb conflict fault long after this code path has finished 
executing.
Will Deacon Nov. 23, 2015, 3:41 p.m. UTC | #14
Jeremy,

On Mon, Nov 23, 2015 at 08:48:40AM -0600, Jeremy Linton wrote:
> On 11/23/2015 07:49 AM, Mark Rutland wrote:
> >Jeremy, for reference, have you tried kasan on m400? Or DEBUG_RODATA?
> 
> No, because the machine has a list of issues that keep it from booting a
> mainline kernel in a functional state. Once those are cleared up I will
> revisit this patch.
> 
> The goal was to create a conceptually safe fix for the problem, which isn't
> all this hypothetical stuff being discussed, but the fact that the TLBs are
> not being flushed properly (with or without the CONT bit stuff) resulting in
> a tlb conflict fault long after this code path has finished executing.

I appreciate that you just want to get something working, and we've
established that a TLBIALL probably does solve the problem, however there's
more to it than that.

If we start adding TLBI/DSB/ISB/NOP/cache flush/read from config space
type operations in arbitrary places, we run a very real risk of masking
future bugs. Then, when the original problem that prompted the temporary
bodge is fixed properly, we uncover a whole raft of problems that we didn't
even realise were there. Consequently, we get stuck with the bodge forever
and, crucially, we lose the ability to reason about the state of the CPU
under Linux. That reasoning is incredibly useful when developing new
architectural features, debugging, profiling or assessing whether or
not Linux is susceptible to hardware errata and is precisely why the
"hypothetical stuff" matters.

For these reasons, we have no viable option other than reverting the
offending series until the underlying problem is fixed properly. With
any luck, that's the 4.5 timeframe so we really only lose a single
release (providing that you have time to rebase and repost).

Sorry about this; I hope it doesn't dissuade you from reporting bugs in
the future.

Will
Jeremy Linton Nov. 23, 2015, 3:46 p.m. UTC | #15
On 11/23/2015 09:41 AM, Will Deacon wrote:

> For these reasons, we have no viable option other than reverting the
> offending series until the underlying problem is fixed properly. With
> any luck, that's the 4.5 timeframe so we really only lose a single
> release (providing that you have time to rebase and repost).


That is fine but, I think you need to take this fix anyway (or something 
similar), because there _IS_ a TLB flush bug in that code path. I saw it 
when I wrote the original path, but discounted it assuming that the 
original authors had more insight on the hardware than I did.

Whether the fix takes the suggest form or another, the cont bit changes 
are only irritating an existing bug, not causing it.
diff mbox

Patch

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e3f563c..e92fe77 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -121,6 +121,7 @@  static void __populate_init_pte(pte_t *pte, unsigned long addr,
 		pfn++;
 		addr += PAGE_SIZE;
 	} while (addr != end);
+	flush_tlb_all();
 }
 
 static void alloc_init_pte(pmd_t *pmd, unsigned long addr,