
[v6,1/8] mm: Place unscrubbed pages at the end of pagelist

Message ID 1501866346-9774-2-git-send-email-boris.ostrovsky@oracle.com (mailing list archive)
State New, archived

Commit Message

Boris Ostrovsky Aug. 4, 2017, 5:05 p.m. UTC
. so that it's easy to find pages that need to be scrubbed (those pages are
now marked with _PGC_need_scrub bit).

We keep track of the first unscrubbed page in a page buddy using the first_dirty
field. For now it can have only two values: 0 (the whole buddy needs scrubbing) or
INVALID_DIRTY_IDX (the buddy does not need to be scrubbed). Subsequent patches
will allow scrubbing to be interrupted, resulting in first_dirty taking any
value.
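
To illustrate, the list placement rule this introduces, condensed from the
page_list_add_scrub() helper added below:

    /* Condensed sketch: buddies with possibly-dirty pages go to the tail of
     * the free list, clean ones to the head, so the scrubber can find work
     * by walking backwards from the tail. */
    if ( pg->u.free.first_dirty != INVALID_DIRTY_IDX )
        page_list_add_tail(pg, &heap(node, zone, order));  /* needs scrubbing */
    else
        page_list_add(pg, &heap(node, zone, order));        /* already clean */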

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
Changes in v6:
* Track indexes and not pointers in alloc_heap_pages() and reserve_offlined_page()
  when breaking a potentially dirty buddy and merging the fragments.
* Reduced (by one bit) width of INVALID_DIRTY_IDX
* Added a couple of ASSERT()s

 xen/common/page_alloc.c  | 194 ++++++++++++++++++++++++++++++++++++++---------
 xen/include/asm-arm/mm.h |  18 ++++-
 xen/include/asm-x86/mm.h |  17 ++++-
 3 files changed, 193 insertions(+), 36 deletions(-)

Comments

Jan Beulich Aug. 6, 2017, 5:41 p.m. UTC | #1
>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/04/17 7:03 PM >>>
>@@ -873,6 +916,8 @@ static int reserve_offlined_page(struct page_info *head)
 >
>while ( cur_order < head_order )
>{
>+            unsigned int idx = INVALID_DIRTY_IDX;

Is it correct for the variable to live in this scope, rather than ...

>@@ -892,8 +937,28 @@ static int reserve_offlined_page(struct page_info *head)
>{
>merge:

... in this one? Of course it's less the variable scope itself, but the initial
value at the point here.

>/* We don't consider merging outside the head_order. */
>-                page_list_add_tail(cur_head, &heap(node, zone, cur_order));
>-                PFN_ORDER(cur_head) = cur_order;
>+
>+                /* See if any of the pages indeed need scrubbing. */
>+                if ( first_dirty != INVALID_DIRTY_IDX )
>+                {
>+                    if ( (1U << cur_order) > first_dirty )
>+                    {
>+                        for ( i = first_dirty ; i < (1U << cur_order); i++ )
>+                            if ( test_bit(_PGC_need_scrub,
>+                                          &cur_head[i].count_info) )
>+                            {
>+                                idx = i;
>+                                break;
>+                            }

Why again do you need to look through all the pages here, rather than
simply marking the chunks as needing scrubbing simply based on first_dirty?
It seems to me that I've asked this before, which is a good indication that
such special behavior would better have a comment attached.

>@@ -977,35 +1096,49 @@ static void free_heap_pages(
 >
>if ( (page_to_mfn(pg) & mask) )
>{
>+            struct page_info *predecessor = pg - mask;

For this and ...

>}
>else
>{
>+            struct page_info *successor = pg + mask;

... this, it would certainly help readability of the patch here if the introduction
of the new local variables was broken out in a prereq patch. But yes, I should
have asked for this earlier on, so I'm not going to insist.

>--- a/xen/include/asm-x86/mm.h
>+++ b/xen/include/asm-x86/mm.h
>@@ -88,7 +88,15 @@ struct page_info
>/* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
>struct {
>/* Do TLBs need flushing for safety before next page use? */
>-            bool_t need_tlbflush;
>+            unsigned long need_tlbflush:1;
>+
>+            /*
>+             * Index of the first *possibly* unscrubbed page in the buddy.
>+             * One more than maximum possible order to accommodate

"One more bit than ..." (I think I did point out the ambiguity of the wording
here before).

Jan
Julien Grall Aug. 7, 2017, 10:45 a.m. UTC | #2
Hi Boris,

I would have appreciated to be CCed as maintainer of the ARM bits... 
Please use scripts/get_maintainers.pl in the future.

On 04/08/17 18:05, Boris Ostrovsky wrote:
> . so that it's easy to find pages that need to be scrubbed (those pages are

Pointless .

> diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
> index ef84b72..d26b232 100644
> --- a/xen/include/asm-arm/mm.h
> +++ b/xen/include/asm-arm/mm.h
> @@ -44,7 +44,16 @@ struct page_info
>          /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
>          struct {
>              /* Do TLBs need flushing for safety before next page use? */
> -            bool_t need_tlbflush;
> +            unsigned long need_tlbflush:1;

You've turned need_tlbflush from bool to unsigned long : 1. But some of 
the users use true/false or may rely on the boolean property.  So it 
sounds like to me you want to use bool bitfields here (and in the x86 part).

> +
> +            /*
> +             * Index of the first *possibly* unscrubbed page in the buddy.
> +             * One more than maximum possible order to accommodate
> +             * INVALID_DIRTY_IDX.
> +             */
> +#define INVALID_DIRTY_IDX ((1UL << (MAX_ORDER + 1)) - 1)
> +            unsigned long first_dirty:MAX_ORDER + 1;

We need to make sure that this union will not be bigger than unsigned 
long. Otherwise this will limit lower down the maximum amount of memory 
we support.
So this likely means a BUILD_BUG_ON(....).

Cheers,
Boris Ostrovsky Aug. 7, 2017, 2:12 p.m. UTC | #3
On 08/06/2017 01:41 PM, Jan Beulich wrote:
>>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/04/17 7:03 PM >>>
>> @@ -873,6 +916,8 @@ static int reserve_offlined_page(struct page_info *head)
>  >
>> while ( cur_order < head_order )
>> {
>> +            unsigned int idx = INVALID_DIRTY_IDX;
> Is it correct for the variable to live in this scope, rather than ...
>
>> @@ -892,8 +937,28 @@ static int reserve_offlined_page(struct page_info *head)
>> {
>> merge:
> ... in this one? Of course it's less the variable scope itself, but the initial
> value at the point here.

I should move it to the inner scope --- I actually *want* to
reinitialize it on each iteration.

>
>> /* We don't consider merging outside the head_order. */
>> -                page_list_add_tail(cur_head, &heap(node, zone, cur_order));
>> -                PFN_ORDER(cur_head) = cur_order;
>> +
>> +                /* See if any of the pages indeed need scrubbing. */
>> +                if ( first_dirty != INVALID_DIRTY_IDX )
>> +                {
>> +                    if ( (1U << cur_order) > first_dirty )
>> +                    {
>> +                        for ( i = first_dirty ; i < (1U << cur_order); i++ )
>> +                            if ( test_bit(_PGC_need_scrub,
>> +                                          &cur_head[i].count_info) )
>> +                            {
>> +                                idx = i;
>> +                                break;
>> +                            }
> Why again do you need to look through all the pages here, rather than
> simply marking the chunks as needing scrubbing simply based on first_dirty?
> It seems to me that I've asked this before, which is a good indication that
> such special behavior would better have a comment attached.

We want to make sure that there are in fact dirty pages in the
(sub-)buddy: first_dirty is only a hint there *may be* some. That's why
I have the word "indeed" in the comment there but it's probably worth
expanding on that.
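
To illustrate where the imprecision comes from (condensed from the
allocation-time split in alloc_heap_pages() further down in the patch): once
the split walks past the original first_dirty, the remaining fragments are
conservatively marked dirty from index 0, even though only some (or none) of
their pages may actually have _PGC_need_scrub set:

    /* Condensed from alloc_heap_pages() in this patch. */
    if ( first_dirty >= 1U << j )
        first_dirty -= 1U << j;
    else
        first_dirty = 0; /* may overstate how much of the fragment is dirty */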

>
>> @@ -977,35 +1096,49 @@ static void free_heap_pages(
>  >
>> if ( (page_to_mfn(pg) & mask) )
>> {
>> +            struct page_info *predecessor = pg - mask;
> For this and ...
>
>> }
>> else
>> {
>> +            struct page_info *successor = pg + mask;
> ... this, it would certainly help readability of the patch here if the introduction
> of the new local variables was broken out in a prereq patch. But yes, I should
> have asked for this earlier on, so I'm not going to insist.

Since I will be re-spinning this patch I will split this out.

-boris
Jan Beulich Aug. 7, 2017, 2:37 p.m. UTC | #4
>>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/07/17 4:16 PM >>>
>On 08/06/2017 01:41 PM, Jan Beulich wrote:
>>>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/04/17 7:03 PM >>>
>>> +                /* See if any of the pages indeed need scrubbing. */
>>> +                if ( first_dirty != INVALID_DIRTY_IDX )
>>> +                {
>>> +                    if ( (1U << cur_order) > first_dirty )
>>> +                    {
>>> +                        for ( i = first_dirty ; i < (1U << cur_order); i++ )
>>> +                            if ( test_bit(_PGC_need_scrub,
>>> +                                          &cur_head[i].count_info) )
>>> +                            {
>>> +                                idx = i;
>>> +                                break;
>>> +                            }
>> Why again do you need to look through all the pages here, rather than
>> simply marking the chunks as needing scrubbing simply based on first_dirty?
>> It seems to me that I've asked this before, which is a good indication that
>> such special behavior would better have a comment attached.
>
>We want to make sure that there are in fact dirty pages in the
>(sub-)buddy: first_dirty is only a hint there *may be* some.

But why is that needed? Iirc you don't go to such length when splitting a
buddy during allocation.

Jan
Boris Ostrovsky Aug. 7, 2017, 2:46 p.m. UTC | #5
On 08/07/2017 06:45 AM, Julien Grall wrote:
> Hi Boris,
>
> I would have appreciated to be CCed as maintainer of the ARM bits...
> Please use scripts/get_maintainers.pl in the future.

Ugh, sorry about that. (I did test builds for both ARM64 and ARM32, if
this makes my transgression any less serious ;-))

>
> On 04/08/17 18:05, Boris Ostrovsky wrote:
>> . so that it's easy to find pages that need to be scrubbed (those
>> pages are
>
> Pointless .
>
>> diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
>> index ef84b72..d26b232 100644
>> --- a/xen/include/asm-arm/mm.h
>> +++ b/xen/include/asm-arm/mm.h
>> @@ -44,7 +44,16 @@ struct page_info
>>          /* Page is on a free list: ((count_info & PGC_count_mask) ==
>> 0). */
>>          struct {
>>              /* Do TLBs need flushing for safety before next page
>> use? */
>> -            bool_t need_tlbflush;
>> +            unsigned long need_tlbflush:1;
>
> You've turned need_tlbflush from bool to unsigned long : 1. But some
> of the users use true/false or may rely on the boolean property.  So
> it sounds like to me you want to use bool bitfields here (and in the
> x86 part).

This is what we have (with this series applied):

root@ovs104> git grep need_tlbflush .
common/memory.c:    bool need_tlbflush = false;
common/memory.c:                       
accumulate_tlbflush(&need_tlbflush, &page[j],
common/memory.c:    if ( need_tlbflush )
common/page_alloc.c:    bool need_tlbflush = false;
common/page_alloc.c:            accumulate_tlbflush(&need_tlbflush, &pg[i],
common/page_alloc.c:    if ( need_tlbflush )
* common/page_alloc.c:        pg[i].u.free.need_tlbflush =
(page_get_owner(&pg[i]) != NULL);
common/page_alloc.c:        if ( pg[i].u.free.need_tlbflush )
include/asm-arm/mm.h:                unsigned long need_tlbflush:1;
include/asm-x86/mm.h:           unsigned long need_tlbflush:1;
include/xen/mm.h:static inline void accumulate_tlbflush(bool *need_tlbflush,
include/xen/mm.h:    if ( page->u.free.need_tlbflush &&
include/xen/mm.h:         (!*need_tlbflush ||
include/xen/mm.h:        *need_tlbflush = true;
root@ovs104>

The only possibly boolean-style use is '*' and even that I think is
allowed.

Stand-alone need_tlbflush variables above have nothing to do with this
change.


>
>> +
>> +            /*
>> +             * Index of the first *possibly* unscrubbed page in the
>> buddy.
>> +             * One more than maximum possible order to accommodate
>> +             * INVALID_DIRTY_IDX.
>> +             */
>> +#define INVALID_DIRTY_IDX ((1UL << (MAX_ORDER + 1)) - 1)
>> +            unsigned long first_dirty:MAX_ORDER + 1;
>
> We need to make sure that this union will not be bigger than unsigned
> long. Otherwise this will limit lower down the maximum amount of
> memory we support.
> So this likely means a BUILD_BUG_ON(....).


Are you concerned that (MAX_ORDER+1) will be larger than sizeof(unsigned
long)? If yes, the compiler should complain anyway, shouldn't it?


-boris
Boris Ostrovsky Aug. 7, 2017, 2:55 p.m. UTC | #6
On 08/07/2017 10:37 AM, Jan Beulich wrote:
>>>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/07/17 4:16 PM >>>
>> On 08/06/2017 01:41 PM, Jan Beulich wrote:
>>>>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/04/17 7:03 PM >>>
>>>> +                /* See if any of the pages indeed need scrubbing. */
>>>> +                if ( first_dirty != INVALID_DIRTY_IDX )
>>>> +                {
>>>> +                    if ( (1U << cur_order) > first_dirty )
>>>> +                    {
>>>> +                        for ( i = first_dirty ; i < (1U << cur_order); i++ )
>>>> +                            if ( test_bit(_PGC_need_scrub,
>>>> +                                          &cur_head[i].count_info) )
>>>> +                            {
>>>> +                                idx = i;
>>>> +                                break;
>>>> +                            }
>>> Why again do you need to look through all the pages here, rather than
>>> simply marking the chunks as needing scrubbing simply based on first_dirty?
>>> It seems to me that I've asked this before, which is a good indication that
>>> such special behavior would better have a comment attached.
>> We want to make sure that there are in fact dirty pages in the
>> (sub-)buddy: first_dirty is only a hint there *may be* some.
> But why is that needed? Iirc you don't go to such length when splitting a
> buddy during allocation.

I felt that allocation is more time-critical and so I decided against
trying to be "neat" as far as tracking dirty pages exactly.

-boris
Julien Grall Aug. 7, 2017, 3:23 p.m. UTC | #7
Hi,

On 07/08/17 15:46, Boris Ostrovsky wrote:
> On 08/07/2017 06:45 AM, Julien Grall wrote:
>> Hi Boris,
>>
>> I would have appreciated to be CCed as maintainer of the ARM bits...
>> Please use scripts/get_maintainers.pl in the future.
>
> Ugh, sorry about that. (I did test builds for both ARM64 and ARM32, if
> this makes my transgression any less serious ;-))
>
>>
>> On 04/08/17 18:05, Boris Ostrovsky wrote:
>>> . so that it's easy to find pages that need to be scrubbed (those
>>> pages are
>>
>> Pointless .
>>
>>> diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
>>> index ef84b72..d26b232 100644
>>> --- a/xen/include/asm-arm/mm.h
>>> +++ b/xen/include/asm-arm/mm.h
>>> @@ -44,7 +44,16 @@ struct page_info
>>>          /* Page is on a free list: ((count_info & PGC_count_mask) ==
>>> 0). */
>>>          struct {
>>>              /* Do TLBs need flushing for safety before next page
>>> use? */
>>> -            bool_t need_tlbflush;
>>> +            unsigned long need_tlbflush:1;
>>
>> You've turned need_tlbflush from bool to unsigned long : 1. But some
>> of the users use true/false or may rely on the boolean property.  So
>> it sounds like to me you want to use bool bitfields here (and in the
>> x86 part).
>
> This is what we have (with this series applied):
>
> root@ovs104> git grep need_tlbflush .
> common/memory.c:    bool need_tlbflush = false;
> common/memory.c:
> accumulate_tlbflush(&need_tlbflush, &page[j],
> common/memory.c:    if ( need_tlbflush )
> common/page_alloc.c:    bool need_tlbflush = false;
> common/page_alloc.c:            accumulate_tlbflush(&need_tlbflush, &pg[i],
> common/page_alloc.c:    if ( need_tlbflush )
> * common/page_alloc.c:        pg[i].u.free.need_tlbflush =
> (page_get_owner(&pg[i]) != NULL);
> common/page_alloc.c:        if ( pg[i].u.free.need_tlbflush )
> include/asm-arm/mm.h:                unsigned long need_tlbflush:1;
> include/asm-x86/mm.h:           unsigned long need_tlbflush:1;
> include/xen/mm.h:static inline void accumulate_tlbflush(bool *need_tlbflush,
> include/xen/mm.h:    if ( page->u.free.need_tlbflush &&
> include/xen/mm.h:         (!*need_tlbflush ||
> include/xen/mm.h:        *need_tlbflush = true;
> root@ovs104>
>
> The only possibly boolean-style use is '*' and even that I think is
> allowed.
>
> Stand-alone need_tlbflush variables above have nothing to do with this
> change.

If you look at it, all of them use bool semantic. Now you mix bool and 
int. We are trying to remove that in the new code, so please don't 
introduce new one.

>
>
>>
>>> +
>>> +            /*
>>> +             * Index of the first *possibly* unscrubbed page in the
>>> buddy.
>>> +             * One more than maximum possible order to accommodate
>>> +             * INVALID_DIRTY_IDX.
>>> +             */
>>> +#define INVALID_DIRTY_IDX ((1UL << (MAX_ORDER + 1)) - 1)
>>> +            unsigned long first_dirty:MAX_ORDER + 1;
>>
>> We need to make sure that this union will not be bigger than unsigned
>> long. Otherwise this will limit lower down the maximum amount of
>> memory we support.
>> So this likely means a BUILD_BUG_ON(....).
>
>
> Are you concerned that (MAX_ORDER+1) will be larger than sizeof(unsigned
> long)? If yes, the compiler should complain anyway, shouldn't it?

I am more concerned about the sizeof of the union u to be larger than 
unsigned long.

first_dirty should not be greater than 63 bits (or 31 bits for 32-bits 
architecture). Otherwise likely the compiler will add a padding between 
need_tlbflush and first_dirty. This would increase the page_info by 4/8 
byte.

The BUILD_BUG_ON(...) would be here to catch such error.
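
Something along these lines (just a sketch; the exact expression and
placement are of course up to you):

    /* Fail the build if the free-list bookkeeping no longer fits in one
     * word, i.e. if the bitfields forced the compiler to grow the union. */
    BUILD_BUG_ON(sizeof(((struct page_info *)0)->u.free) > sizeof(unsigned long));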

Cheers,
Jan Beulich Aug. 7, 2017, 3:28 p.m. UTC | #8
>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/07/17 4:56 PM >>>
>On 08/07/2017 10:37 AM, Jan Beulich wrote:
>>>>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/07/17 4:16 PM >>>
>>> On 08/06/2017 01:41 PM, Jan Beulich wrote:
>>>>>>> Boris Ostrovsky <boris.ostrovsky@oracle.com> 08/04/17 7:03 PM >>>
>>>>> +                /* See if any of the pages indeed need scrubbing. */
>>>>> +                if ( first_dirty != INVALID_DIRTY_IDX )
>>>>> +                {
>>>>> +                    if ( (1U << cur_order) > first_dirty )
>>>>> +                    {
>>>>> +                        for ( i = first_dirty ; i < (1U << cur_order); i++ )
>>>>> +                            if ( test_bit(_PGC_need_scrub,
>>>>> +                                          &cur_head[i].count_info) )
>>>>> +                            {
>>>>> +                                idx = i;
>>>>> +                                break;
>>>>> +                            }
>>>> Why again do you need to look through all the pages here, rather than
>>>> simply marking the chunks as needing scrubbing simply based on first_dirty?
>>>> It seems to me that I've asked this before, which is a good indication that
>>>> such special behavior would better have a comment attached.
>>> We want to make sure that there are in fact dirty pages in the
>>> (sub-)buddy: first_dirty is only a hint there *may be* some.
>> But why is that needed? Iirc you don't go to such length when splitting a
>> buddy during allocation.
>
>I felt that allocation is more time-critical and so I decided against
>trying to be "neat" as far as tracking dirty pages exactly.

I'd suggest to use the simpler approach here too, if the more involved one isn't
needed for correctness.
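
Roughly along these lines, i.e. just a sketch (assuming idx has been
initialised to INVALID_DIRTY_IDX as in your patch), trusting the first_dirty
hint instead of scanning for _PGC_need_scrub:

    if ( first_dirty != INVALID_DIRTY_IDX )
    {
        if ( (1U << cur_order) > first_dirty )
        {
            idx = first_dirty;      /* possibly conservative, but safe */
            first_dirty = 0;
        }
        else
            first_dirty -= 1U << cur_order;
    }
    page_list_add_scrub(cur_head, node, zone, cur_order, idx);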

Jan
Boris Ostrovsky Aug. 7, 2017, 4:57 p.m. UTC | #9
>>>
>>>> diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
>>>> index ef84b72..d26b232 100644
>>>> --- a/xen/include/asm-arm/mm.h
>>>> +++ b/xen/include/asm-arm/mm.h
>>>> @@ -44,7 +44,16 @@ struct page_info
>>>>          /* Page is on a free list: ((count_info & PGC_count_mask) ==
>>>> 0). */
>>>>          struct {
>>>>              /* Do TLBs need flushing for safety before next page
>>>> use? */
>>>> -            bool_t need_tlbflush;
>>>> +            unsigned long need_tlbflush:1;
>>>
>>> You've turned need_tlbflush from bool to unsigned long : 1. But some
>>> of the users use true/false or may rely on the boolean property.  So
>>> it sounds like to me you want to use bool bitfields here (and in the
>>> x86 part).
>>
>> This is what we have (with this series applied):
>>
>> root@ovs104> git grep need_tlbflush .
>> common/memory.c:    bool need_tlbflush = false;
>> common/memory.c:
>> accumulate_tlbflush(&need_tlbflush, &page[j],
>> common/memory.c:    if ( need_tlbflush )
>> common/page_alloc.c:    bool need_tlbflush = false;
>> common/page_alloc.c:            accumulate_tlbflush(&need_tlbflush,
>> &pg[i],
>> common/page_alloc.c:    if ( need_tlbflush )
>> * common/page_alloc.c:        pg[i].u.free.need_tlbflush =
>> (page_get_owner(&pg[i]) != NULL);
>> common/page_alloc.c:        if ( pg[i].u.free.need_tlbflush )
>> include/asm-arm/mm.h:                unsigned long need_tlbflush:1;
>> include/asm-x86/mm.h:           unsigned long need_tlbflush:1;
>> include/xen/mm.h:static inline void accumulate_tlbflush(bool
>> *need_tlbflush,
>> include/xen/mm.h:    if ( page->u.free.need_tlbflush &&
>> include/xen/mm.h:         (!*need_tlbflush ||
>> include/xen/mm.h:        *need_tlbflush = true;
>> root@ovs104>
>>
>> The only possibly boolean-style use is '*' and even that I think is
>> allowed.
>>
>> Stand-alone need_tlbflush variables above have nothing to do with this
>> change.
>
> If you look at it, all of them use bool semantic. Now you mix bool and
> int. We are trying to remove that in the new code, so please don't
> introduce new one.


I am not sure I see how we are mixing the semantics here. <datatype>:1
is really a tightly-packed bool.

Switching to bitfields was, btw, suggested by Jan at some point so if
the two of you agree on how to proceed I can go either way (but my
preference is to keep it as a single-bit bitfield).


>
>>
>>
>>>
>>>> +
>>>> +            /*
>>>> +             * Index of the first *possibly* unscrubbed page in the
>>>> buddy.
>>>> +             * One more than maximum possible order to accommodate
>>>> +             * INVALID_DIRTY_IDX.
>>>> +             */
>>>> +#define INVALID_DIRTY_IDX ((1UL << (MAX_ORDER + 1)) - 1)
>>>> +            unsigned long first_dirty:MAX_ORDER + 1;
>>>
>>> We need to make sure that this union will not be bigger than unsigned
>>> long. Otherwise this will limit lower down the maximum amount of
>>> memory we support.
>>> So this likely means a BUILD_BUG_ON(....).
>>
>>
>> Are you concerned that (MAX_ORDER+1) will be larger than sizeof(unsigned
>> long)? If yes, the compiler should complain anyway, shouldn't it?
>
> I am more concerned about the sizeof of the union u to be larger than
> unsigned long.
>
> first_dirty should not be greater than 63 bits (or 31 bits for 32-bits
> architecture). Otherwise likely the compiler will add a padding
> between need_tlbflush and first_dirty. This would increase the
> page_info by 4/8 byte.
>
> The BUILD_BUG_ON(...) would be here to catch such error.

Oh, I see. Sure, I'll add it.

-boris
Julien Grall Aug. 7, 2017, 5:01 p.m. UTC | #10
On 07/08/17 17:57, Boris Ostrovsky wrote:
>
>>>>
>>>>> diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
>>>>> index ef84b72..d26b232 100644
>>>>> --- a/xen/include/asm-arm/mm.h
>>>>> +++ b/xen/include/asm-arm/mm.h
>>>>> @@ -44,7 +44,16 @@ struct page_info
>>>>>          /* Page is on a free list: ((count_info & PGC_count_mask) ==
>>>>> 0). */
>>>>>          struct {
>>>>>              /* Do TLBs need flushing for safety before next page
>>>>> use? */
>>>>> -            bool_t need_tlbflush;
>>>>> +            unsigned long need_tlbflush:1;
>>>>
>>>> You've turned need_tlbflush from bool to unsigned long : 1. But some
>>>> of the users use true/false or may rely on the boolean property.  So
>>>> it sounds like to me you want to use bool bitfields here (and in the
>>>> x86 part).
>>>
>>> This is what we have (with this series applied):
>>>
>>> root@ovs104> git grep need_tlbflush .
>>> common/memory.c:    bool need_tlbflush = false;
>>> common/memory.c:
>>> accumulate_tlbflush(&need_tlbflush, &page[j],
>>> common/memory.c:    if ( need_tlbflush )
>>> common/page_alloc.c:    bool need_tlbflush = false;
>>> common/page_alloc.c:            accumulate_tlbflush(&need_tlbflush,
>>> &pg[i],
>>> common/page_alloc.c:    if ( need_tlbflush )
>>> * common/page_alloc.c:        pg[i].u.free.need_tlbflush =
>>> (page_get_owner(&pg[i]) != NULL);
>>> common/page_alloc.c:        if ( pg[i].u.free.need_tlbflush )
>>> include/asm-arm/mm.h:                unsigned long need_tlbflush:1;
>>> include/asm-x86/mm.h:           unsigned long need_tlbflush:1;
>>> include/xen/mm.h:static inline void accumulate_tlbflush(bool
>>> *need_tlbflush,
>>> include/xen/mm.h:    if ( page->u.free.need_tlbflush &&
>>> include/xen/mm.h:         (!*need_tlbflush ||
>>> include/xen/mm.h:        *need_tlbflush = true;
>>> root@ovs104>
>>>
>>> The only possibly boolean-style use is '*' and event that I think is
>>> allowed.
>>>
>>> Stand-alone need_tlbflush variables above have nothing to do with this
>>> change.
>>
>> If you look at it, all of them use bool semantic. Now you mix bool and
>> int. We are trying to remove that in the new code, so please don't
>> introduce new one.
>
>
> I am not sure I see how we are mixing the semantics here. <datatype>:1
> is really a tightly-packed bool.

It is not a bool. It is an unsigned int. So if I am not mistaken, if you 
assign 0x2, the variable will be 0 unless you do !!(...).

Whilst with bool you would get 1.

>
> Switching to bitfields was, btw, suggested by Jan at some point so if
> the two of you agree on how to proceed I can go either way (but my
> preference is to keep it as a single-bit bitfield).

If you use a single-bit bitfield of bool (i.e bool need_flush : 1) you 
would address both Jan's comment and mine.
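
I.e. something like this (abridged sketch; whether the two bitfields still
pack into a single word is exactly what the BUILD_BUG_ON mentioned earlier
would catch):

    struct {
        /* Keeps bool semantics while still occupying a single bit. */
        bool need_tlbflush:1;

        /* Index of the first *possibly* unscrubbed page in the buddy. */
        unsigned long first_dirty:MAX_ORDER + 1;
    } free;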

Cheers,
Boris Ostrovsky Aug. 7, 2017, 5:20 p.m. UTC | #11
>>
>> Switching to bitfields was, btw, suggested by Jan at some point so if
>> the two of you agree on how to proceed I can go either way (but my
>> preference is to keep it as a single-bit bitfield).
>
> If you use a single-bit bitfield of bool (i.e bool need_flush : 1) you
> would address both Jan's comment and mine.

Haven't considered this. I thought you meant a fully-sized bool. I'll do
this, thanks.

-boris

Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 8bcef6a..9e787e0 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -383,6 +383,8 @@  typedef struct page_list_head heap_by_zone_and_order_t[NR_ZONES][MAX_ORDER+1];
 static heap_by_zone_and_order_t *_heap[MAX_NUMNODES];
 #define heap(node, zone, order) ((*_heap[node])[zone][order])
 
+static unsigned long node_need_scrub[MAX_NUMNODES];
+
 static unsigned long *avail[MAX_NUMNODES];
 static long total_avail_pages;
 
@@ -678,13 +680,30 @@  static void check_low_mem_virq(void)
     }
 }
 
+/* Pages that need a scrub are added to tail, otherwise to head. */
+static void page_list_add_scrub(struct page_info *pg, unsigned int node,
+                                unsigned int zone, unsigned int order,
+                                unsigned int first_dirty)
+{
+    PFN_ORDER(pg) = order;
+    pg->u.free.first_dirty = first_dirty;
+
+    if ( first_dirty != INVALID_DIRTY_IDX )
+    {
+        ASSERT(first_dirty < (1U << order));
+        page_list_add_tail(pg, &heap(node, zone, order));
+    }
+    else
+        page_list_add(pg, &heap(node, zone, order));
+}
+
 /* Allocate 2^@order contiguous pages. */
 static struct page_info *alloc_heap_pages(
     unsigned int zone_lo, unsigned int zone_hi,
     unsigned int order, unsigned int memflags,
     struct domain *d)
 {
-    unsigned int i, j, zone = 0, nodemask_retry = 0;
+    unsigned int i, j, zone = 0, nodemask_retry = 0, first_dirty;
     nodeid_t first_node, node = MEMF_get_node(memflags), req_node = node;
     unsigned long request = 1UL << order;
     struct page_info *pg;
@@ -798,12 +817,26 @@  static struct page_info *alloc_heap_pages(
     return NULL;
 
  found: 
+
+    first_dirty = pg->u.free.first_dirty;
+
     /* We may have to halve the chunk a number of times. */
     while ( j != order )
     {
-        PFN_ORDER(pg) = --j;
-        page_list_add_tail(pg, &heap(node, zone, j));
-        pg += 1 << j;
+        j--;
+        page_list_add_scrub(pg, node, zone, j,
+                            (1U << j) > first_dirty ?
+                            first_dirty : INVALID_DIRTY_IDX);
+        pg += 1U << j;
+
+        if ( first_dirty != INVALID_DIRTY_IDX )
+        {
+            /* Adjust first_dirty */
+            if ( first_dirty >= 1U << j )
+                first_dirty -= 1U << j;
+            else
+                first_dirty = 0; /* We've moved past original first_dirty */
+        }
     }
 
     ASSERT(avail[node][zone] >= request);
@@ -850,12 +883,20 @@  static int reserve_offlined_page(struct page_info *head)
     unsigned int node = phys_to_nid(page_to_maddr(head));
     int zone = page_to_zone(head), i, head_order = PFN_ORDER(head), count = 0;
     struct page_info *cur_head;
-    int cur_order;
+    unsigned int cur_order, first_dirty;
 
     ASSERT(spin_is_locked(&heap_lock));
 
     cur_head = head;
 
+    /*
+     * We may break the buddy so let's mark the head as clean. Then, when
+     * merging chunks back into the heap, we will see whether the chunk has
+     * unscrubbed pages and set its first_dirty properly.
+     */
+    first_dirty = head->u.free.first_dirty;
+    head->u.free.first_dirty = INVALID_DIRTY_IDX;
+
     page_list_del(head, &heap(node, zone, head_order));
 
     while ( cur_head < (head + (1 << head_order)) )
@@ -866,6 +907,8 @@  static int reserve_offlined_page(struct page_info *head)
         if ( page_state_is(cur_head, offlined) )
         {
             cur_head++;
+            if ( first_dirty != INVALID_DIRTY_IDX && first_dirty )
+                first_dirty--;
             continue;
         }
 
@@ -873,6 +916,8 @@  static int reserve_offlined_page(struct page_info *head)
 
         while ( cur_order < head_order )
         {
+            unsigned int idx = INVALID_DIRTY_IDX;
+
             next_order = cur_order + 1;
 
             if ( (cur_head + (1 << next_order)) >= (head + ( 1 << head_order)) )
@@ -892,8 +937,28 @@  static int reserve_offlined_page(struct page_info *head)
             {
             merge:
                 /* We don't consider merging outside the head_order. */
-                page_list_add_tail(cur_head, &heap(node, zone, cur_order));
-                PFN_ORDER(cur_head) = cur_order;
+
+                /* See if any of the pages indeed need scrubbing. */
+                if ( first_dirty != INVALID_DIRTY_IDX )
+                {
+                    if ( (1U << cur_order) > first_dirty )
+                    {
+                        for ( i = first_dirty ; i < (1U << cur_order); i++ )
+                            if ( test_bit(_PGC_need_scrub,
+                                          &cur_head[i].count_info) )
+                            {
+                                idx = i;
+                                break;
+                            }
+                        ASSERT(idx != INVALID_DIRTY_IDX);
+                        first_dirty = 0;
+                    }
+                    else
+                        first_dirty -= 1U << cur_order;
+                }
+
+                page_list_add_scrub(cur_head, node, zone,
+                                    cur_order, idx);
                 cur_head += (1 << cur_order);
                 break;
             }
@@ -919,9 +984,53 @@  static int reserve_offlined_page(struct page_info *head)
     return count;
 }
 
+static void scrub_free_pages(unsigned int node)
+{
+    struct page_info *pg;
+    unsigned int zone;
+
+    ASSERT(spin_is_locked(&heap_lock));
+
+    if ( !node_need_scrub[node] )
+        return;
+
+    for ( zone = 0; zone < NR_ZONES; zone++ )
+    {
+        unsigned int order = MAX_ORDER;
+
+        do {
+            while ( !page_list_empty(&heap(node, zone, order)) )
+            {
+                unsigned int i;
+
+                /* Unscrubbed pages are always at the end of the list. */
+                pg = page_list_last(&heap(node, zone, order));
+                if ( pg->u.free.first_dirty == INVALID_DIRTY_IDX )
+                    break;
+
+                for ( i = pg->u.free.first_dirty; i < (1U << order); i++)
+                {
+                    if ( test_bit(_PGC_need_scrub, &pg[i].count_info) )
+                    {
+                        scrub_one_page(&pg[i]);
+                        pg[i].count_info &= ~PGC_need_scrub;
+                        node_need_scrub[node]--;
+                    }
+                }
+
+                page_list_del(pg, &heap(node, zone, order));
+                page_list_add_scrub(pg, node, zone, order, INVALID_DIRTY_IDX);
+
+                if ( node_need_scrub[node] == 0 )
+                    return;
+            }
+        } while ( order-- != 0 );
+    }
+}
+
 /* Free 2^@order set of pages. */
 static void free_heap_pages(
-    struct page_info *pg, unsigned int order)
+    struct page_info *pg, unsigned int order, bool need_scrub)
 {
     unsigned long mask, mfn = page_to_mfn(pg);
     unsigned int i, node = phys_to_nid(page_to_maddr(pg)), tainted = 0;
@@ -961,10 +1070,20 @@  static void free_heap_pages(
         /* This page is not a guest frame any more. */
         page_set_owner(&pg[i], NULL); /* set_gpfn_from_mfn snoops pg owner */
         set_gpfn_from_mfn(mfn + i, INVALID_M2P_ENTRY);
+
+        if ( need_scrub )
+            pg[i].count_info |= PGC_need_scrub;
     }
 
     avail[node][zone] += 1 << order;
     total_avail_pages += 1 << order;
+    if ( need_scrub )
+    {
+        node_need_scrub[node] += 1 << order;
+        pg->u.free.first_dirty = 0;
+    }
+    else
+        pg->u.free.first_dirty = INVALID_DIRTY_IDX;
 
     if ( tmem_enabled() )
         midsize_alloc_zone_pages = max(
@@ -977,35 +1096,49 @@  static void free_heap_pages(
 
         if ( (page_to_mfn(pg) & mask) )
         {
+            struct page_info *predecessor = pg - mask;
+
             /* Merge with predecessor block? */
-            if ( !mfn_valid(_mfn(page_to_mfn(pg-mask))) ||
-                 !page_state_is(pg-mask, free) ||
-                 (PFN_ORDER(pg-mask) != order) ||
-                 (phys_to_nid(page_to_maddr(pg-mask)) != node) )
+            if ( !mfn_valid(_mfn(page_to_mfn(predecessor))) ||
+                 !page_state_is(predecessor, free) ||
+                 (PFN_ORDER(predecessor) != order) ||
+                 (phys_to_nid(page_to_maddr(predecessor)) != node) )
                 break;
-            pg -= mask;
-            page_list_del(pg, &heap(node, zone, order));
+
+            page_list_del(predecessor, &heap(node, zone, order));
+
+            /* Keep predecessor's first_dirty if it is already set. */
+            if ( predecessor->u.free.first_dirty == INVALID_DIRTY_IDX &&
+                 pg->u.free.first_dirty != INVALID_DIRTY_IDX )
+                predecessor->u.free.first_dirty = (1U << order) +
+                                                  pg->u.free.first_dirty;
+
+            pg = predecessor;
         }
         else
         {
+            struct page_info *successor = pg + mask;
+
             /* Merge with successor block? */
-            if ( !mfn_valid(_mfn(page_to_mfn(pg+mask))) ||
-                 !page_state_is(pg+mask, free) ||
-                 (PFN_ORDER(pg+mask) != order) ||
-                 (phys_to_nid(page_to_maddr(pg+mask)) != node) )
+            if ( !mfn_valid(_mfn(page_to_mfn(successor))) ||
+                 !page_state_is(successor, free) ||
+                 (PFN_ORDER(successor) != order) ||
+                 (phys_to_nid(page_to_maddr(successor)) != node) )
                 break;
-            page_list_del(pg + mask, &heap(node, zone, order));
+            page_list_del(successor, &heap(node, zone, order));
         }
 
         order++;
     }
 
-    PFN_ORDER(pg) = order;
-    page_list_add_tail(pg, &heap(node, zone, order));
+    page_list_add_scrub(pg, node, zone, order, pg->u.free.first_dirty);
 
     if ( tainted )
         reserve_offlined_page(pg);
 
+    if ( need_scrub )
+        scrub_free_pages(node);
+
     spin_unlock(&heap_lock);
 }
 
@@ -1226,7 +1359,7 @@  unsigned int online_page(unsigned long mfn, uint32_t *status)
     spin_unlock(&heap_lock);
 
     if ( (y & PGC_state) == PGC_state_offlined )
-        free_heap_pages(pg, 0);
+        free_heap_pages(pg, 0, false);
 
     return ret;
 }
@@ -1295,7 +1428,7 @@  static void init_heap_pages(
             nr_pages -= n;
         }
 
-        free_heap_pages(pg+i, 0);
+        free_heap_pages(pg + i, 0, false);
     }
 }
 
@@ -1622,7 +1755,7 @@  void free_xenheap_pages(void *v, unsigned int order)
 
     memguard_guard_range(v, 1 << (order + PAGE_SHIFT));
 
-    free_heap_pages(virt_to_page(v), order);
+    free_heap_pages(virt_to_page(v), order, false);
 }
 
 #else
@@ -1676,12 +1809,9 @@  void free_xenheap_pages(void *v, unsigned int order)
     pg = virt_to_page(v);
 
     for ( i = 0; i < (1u << order); i++ )
-    {
-        scrub_one_page(&pg[i]);
         pg[i].count_info &= ~PGC_xen_heap;
-    }
 
-    free_heap_pages(pg, order);
+    free_heap_pages(pg, order, true);
 }
 
 #endif
@@ -1790,7 +1920,7 @@  struct page_info *alloc_domheap_pages(
     if ( d && !(memflags & MEMF_no_owner) &&
          assign_pages(d, pg, order, memflags) )
     {
-        free_heap_pages(pg, order);
+        free_heap_pages(pg, order, false);
         return NULL;
     }
     
@@ -1858,11 +1988,7 @@  void free_domheap_pages(struct page_info *pg, unsigned int order)
             scrub = 1;
         }
 
-        if ( unlikely(scrub) )
-            for ( i = 0; i < (1 << order); i++ )
-                scrub_one_page(&pg[i]);
-
-        free_heap_pages(pg, order);
+        free_heap_pages(pg, order, scrub);
     }
 
     if ( drop_dom_ref )
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index ef84b72..d26b232 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -44,7 +44,16 @@  struct page_info
         /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
         struct {
             /* Do TLBs need flushing for safety before next page use? */
-            bool_t need_tlbflush;
+            unsigned long need_tlbflush:1;
+
+            /*
+             * Index of the first *possibly* unscrubbed page in the buddy.
+             * One more than maximum possible order to accommodate
+             * INVALID_DIRTY_IDX.
+             */
+#define INVALID_DIRTY_IDX ((1UL << (MAX_ORDER + 1)) - 1)
+            unsigned long first_dirty:MAX_ORDER + 1;
+
         } free;
 
     } u;
@@ -107,6 +116,13 @@  struct page_info
 #define PGC_count_width   PG_shift(9)
 #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
 
+/*
+ * Page needs to be scrubbed. Since this bit can only be set on a page that is
+ * free (i.e. in PGC_state_free) we can reuse PGC_allocated bit.
+ */
+#define _PGC_need_scrub   _PGC_allocated
+#define PGC_need_scrub    PGC_allocated
+
 extern mfn_t xenheap_mfn_start, xenheap_mfn_end;
 extern vaddr_t xenheap_virt_end;
 #ifdef CONFIG_ARM_64
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 2bf3f33..9b7fd05 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -88,7 +88,15 @@  struct page_info
         /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
         struct {
             /* Do TLBs need flushing for safety before next page use? */
-            bool_t need_tlbflush;
+            unsigned long need_tlbflush:1;
+
+            /*
+             * Index of the first *possibly* unscrubbed page in the buddy.
+             * One more than maximum possible order to accommodate
+             * INVALID_DIRTY_IDX.
+             */
+#define INVALID_DIRTY_IDX ((1UL << (MAX_ORDER + 1)) - 1)
+            unsigned long first_dirty:MAX_ORDER + 1;
         } free;
 
     } u;
@@ -233,6 +241,13 @@  struct page_info
 #define PGC_count_width   PG_shift(9)
 #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
 
+/*
+ * Page needs to be scrubbed. Since this bit can only be set on a page that is
+ * free (i.e. in PGC_state_free) we can reuse PGC_allocated bit.
+ */
+#define _PGC_need_scrub   _PGC_allocated
+#define PGC_need_scrub    PGC_allocated
+
 #define is_xen_heap_page(page) ((page)->count_info & PGC_xen_heap)
 #define is_xen_heap_mfn(mfn) \
     (__mfn_valid(mfn) && is_xen_heap_page(__mfn_to_page(mfn)))