
CONFIG_SCRUB_DEBUG=y + arm64 + livepatch = Xen BUG at page_alloc.c:738

Message ID 20170913153242.GA11299@char.us.oracle.com (mailing list archive)
State New, archived

Commit Message

Konrad Rzeszutek Wilk Sept. 13, 2017, 3:32 p.m. UTC
On Tue, Sep 12, 2017 at 09:19:23PM -0400, Boris Ostrovsky wrote:
> 
> 
> On 09/12/2017 08:01 PM, Konrad Rzeszutek Wilk wrote:
> > On Mon, Sep 11, 2017 at 08:45:02PM -0400, Boris Ostrovsky wrote:
> > > 
> > > 
> > > On 09/11/2017 07:55 PM, Konrad Rzeszutek Wilk wrote:
> > > > Hey,
> > > > 
> > > > I've only been able to reproduce this on ARM64 (trying right now ARM32
> > > > as well), and not on x86.
> > > > 
> > > > If I compile Xen without CONFIG_SCRUB_DEBUG it works great. But if I
> > > > enable it and try to load a livepatch it blows up in page_alloc.c:738
> > > > 
> > > > This is with origin/staging (d0291f3391)
> > > 
> > > Can you still reproduce this if you revert 307c3be?
> > 
> > Sadly yes - it still crashes. I didn't capture the serial output.
> > 
> > I honestly think the issue is that on ARM64 the "sleep" loop does not
> > wake up as often as on x86 (CC-ing Dariof who I believe observed this
> > with Credit2 and the wakeup.. something) - maybe he remembers the
> > details. Anyhow my theory is that the pages are not scrubbed at all
> > when they go in the idle loop as once it goes to sleep - it stays there.
> 
> 
> There are no (well, there should not be any) timing dependencies in how/whether
> pages are scrubbed. If a page doesn't get scrubbed because someone didn't
> wake up then it should be scrubbed in alloc_heap_pages(). So in this case
> the page is thought to be clean (_PGC_need_scrub is not set), but it is not.
> 
> Have you tried running a guest (or two), rebooting in a loop?

No. I just cold-booted it and tried to livepatch.
> 
> Another thing to try is to set need_scrub to true in free_heap_pages().

Magic!

Fixes it! :-)


> 
> -boris
> 
> 
> > 
> > Ah, see commit 05c52278a7c92bc753d9fe32017e4961012b9f23
> > 
> > Maybe this is related?
> > > 
> > > 
> > > -boris

Comments

Boris Ostrovsky Sept. 13, 2017, 6:05 p.m. UTC | #1
On 09/13/2017 11:32 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 12, 2017 at 09:19:23PM -0400, Boris Ostrovsky wrote:
>>
>> On 09/12/2017 08:01 PM, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Sep 11, 2017 at 08:45:02PM -0400, Boris Ostrovsky wrote:
>>>>
>>>> On 09/11/2017 07:55 PM, Konrad Rzeszutek Wilk wrote:
>>>>> Hey,
>>>>>
>>>>> I've only been able to reproduce this on ARM64 (trying right now ARM32
>>>>> as well), and not on x86.
>>>>>
>>>>> If I compile Xen without CONFIG_SCRUB_DEBUG it works great. But if I
>>>>> enable it and try to load a livepatch it blows up in page_alloc.c:738
>>>>>
>>>>> This is with origin/staging (d0291f3391)
>>>> Can you still reproduce this if you revert 307c3be?
>>> Sadly yes - it still crashes. I didn't capture the serial output.
>>>
>>> I honestly think the issue is that on ARM64 the "sleep" loop does not
>>> wake up as often as on x86 (CC-ing Dariof who I believe observed this
>>> with Credit2 and the wakeup.. something) - maybe he remembers the
>>> details. Anyhow my theory is that the pages are not scrubbed at all
>>> when they go in the idle loop as once it goes to sleep - it stays there.
>>
>> There are no (well, there should not be any) timing dependencies in how/whether
>> pages are scrubbed. If a page doesn't get scrubbed because someone didn't
>> wake up then it should be scrubbed in alloc_heap_pages(). So in this case
>> the page is thought to be clean (_PGC_need_scrub is not set), but it is not.
>>
>> Have you tried running a guest (or two), rebooting in a loop?
> No. I just cold-booted it and tried to livepatch.
>> Another thing to try is to set need_scrub to true in free_heap_pages().
> Magic!
>
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index dbad1e1ca0..9303eb4517 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1308,6 +1308,7 @@ static void free_heap_pages(
>      ASSERT(node >= 0);
>  
>      spin_lock(&heap_lock);
> +    need_scrub = true;
>  
>      for ( i = 0; i < (1 << order); i++ )
>      {
>
> Fixes it ! :-)


Well, that's not a fix. This eliminates the case that something in
ARM-specific code (which I haven't tested) accidentally clears
_PGC_need_scrub.

OK, I think I know what the problem is. You are using
CONFIG_SEPARATE_XENHEAP, aren't you?


-boris
Julien Grall Sept. 13, 2017, 6:25 p.m. UTC | #2
Hi,

On 09/13/2017 07:05 PM, Boris Ostrovsky wrote:
> On 09/13/2017 11:32 AM, Konrad Rzeszutek Wilk wrote:
> Well, that's not a fix. This eliminates the case that something in
> ARM-specific code (which I haven't tested) accidentally clears
> _PGC_need_scrub.
> 
> OK, I think I know what the problem is. You are using
> CONFIG_SEPARATE_XENHEAP, aren't you?

It seems the bug appears on Arm64, so CONFIG_SEPARATE_XENHEAP is not set.

Note that Arm32 is using separate heap.

Cheers,

Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index dbad1e1ca0..9303eb4517 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1308,6 +1308,7 @@  static void free_heap_pages(
     ASSERT(node >= 0);
 
     spin_lock(&heap_lock);
+    need_scrub = true;
 
     for ( i = 0; i < (1 << order); i++ )
     {
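
For readers following the thread, here is a toy model of the invariant being
discussed: a page freed with its scrub flag set gets scrubbed lazily at
allocation time, while a page that is dirty but whose flag was lost is handed
out unscrubbed and trips a debug check. This is only a sketch, not Xen code.
Every name in it (toy_page, toy_free, toy_alloc, TOY_SCRUB_BYTE) is
hypothetical, and the need_scrub field merely stands in for _PGC_need_scrub.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE  64     /* tiny "page", for illustration only */
#define TOY_SCRUB_BYTE 0xc2   /* arbitrary fill pattern (an assumption) */

/* Hypothetical stand-in for a page plus its per-page scrub flag. */
struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    bool need_scrub;          /* plays the role of _PGC_need_scrub */
};

/* Fill the page with the scrub pattern and clear the flag. */
static void toy_scrub(struct toy_page *pg)
{
    memset(pg->data, TOY_SCRUB_BYTE, TOY_PAGE_SIZE);
    pg->need_scrub = false;
}

/* Debug-style check: true only if every byte holds the scrub pattern. */
static bool toy_is_scrubbed(const struct toy_page *pg)
{
    for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
        if (pg->data[i] != TOY_SCRUB_BYTE)
            return false;
    return true;
}

/* Free path: only records whether the page still needs scrubbing. */
static void toy_free(struct toy_page *pg, bool need_scrub)
{
    pg->need_scrub = need_scrub;
}

/*
 * Alloc path: scrub lazily if the flag says so, then report whether the
 * debug check passes. A real debug build would crash instead of returning
 * false.
 */
static bool toy_alloc(struct toy_page *pg)
{
    if (pg->need_scrub)
        toy_scrub(pg);
    return toy_is_scrubbed(pg);
}
```

In this model, calling toy_free(&pg, true) on a dirty page always makes
toy_alloc(&pg) pass, which is what the one-line hack above does by forcing
need_scrub. Calling toy_free(&pg, false) on a dirty page makes the check
fail, which matches the situation Boris describes: the page is believed to be
clean, but it is not. Forcing the flag hides that mismatch rather than
explaining where the flag was lost.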