
[v2,1/2] x86/mm: fix a potential race condition in map_pages_to_xen().

Message ID 1510298286-30952-1-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Yu Zhang Nov. 10, 2017, 7:18 a.m. UTC
From: Min He <min.he@intel.com>

In map_pages_to_xen(), an L2 page table entry may be reset to point to
a superpage, and the corresponding L1 page table needs to be freed in
that scenario, when all of its L1 page table entries map consecutive
page frames and carry the same mapping flags.

However, the variable `pl1e` is not protected by the lock before the L1
page table is enumerated. A race condition may occur if this code path
is invoked simultaneously on different CPUs.

For example, `pl1e` on CPU0 may hold an obsolete value, pointing to a
page which has just been freed on CPU1. Besides, before this page is
reused, it will still hold the old PTEs, referencing consecutive page
frames. Consequently, `free_xen_pagetable(l2e_to_l1e(ol2e))` will be
triggered on CPU0, resulting in the unexpected freeing of a normal page.

This patch fixes the potential race condition by protecting `pl1e` with
the lock and by checking the PSE flag of `*pl2e`.

Note: the PSE flag of `*pl3e` is also checked before its
re-consolidation, for the same reason as for `*pl2e` - we cannot presume
the contents of the target superpage.
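The fix amounts to re-checking shared state after acquiring the lock:
if another CPU already performed the re-consolidation (visible via the
PSE flag), skip it. A minimal standalone sketch of that pattern follows;
all names here (try_consolidate, entry_is_superpage, tables_freed) are
illustrative stand-ins, not Xen code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-ins for Xen's map_pgdir_lock, the _PAGE_PSE flag and the
 * free_xen_pagetable() call; an illustrative sketch, not Xen code. */
static pthread_mutex_t map_pgdir_lock = PTHREAD_MUTEX_INITIALIZER;
static bool entry_is_superpage;   /* models the _PAGE_PSE flag in *pl2e */
static int tables_freed;          /* counts the "free the L1 table" step */

/* Each call models one CPU reaching the re-consolidation path.
 * Returns 1 if this caller performed the consolidation, 0 if it skipped. */
int try_consolidate(void)
{
    pthread_mutex_lock(&map_pgdir_lock);

    /* Re-check the shared entry AFTER taking the lock: another CPU may
     * already have re-consolidated and freed the L1 table. This is the
     * check the patch adds, mirroring its "goto check_l3". */
    if ( entry_is_superpage )
    {
        pthread_mutex_unlock(&map_pgdir_lock);
        return 0;
    }

    entry_is_superpage = true;
    tables_freed++;    /* models free_xen_pagetable(l2e_to_l1e(ol2e)) */
    pthread_mutex_unlock(&map_pgdir_lock);
    return 1;
}

int tables_freed_count(void)
{
    return tables_freed;
}
```

Without the re-check, both callers would reach the free step and the
second would free a page it no longer owns.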

Signed-off-by: Min He <min.he@intel.com>
Signed-off-by: Yi Zhang <yi.z.zhang@intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes in v2:
According to comments from Jan Beulich,
  - check PSE of pl2e and pl3e, and skip the re-consolidation if set.
  - commit message changes, e.g. add the "From:" tag etc.
  - code style changes.
  - introduce a separate patch to resolve a similar issue in
    modify_xen_mappings().
---
 xen/arch/x86/mm.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Comments

Jan Beulich Nov. 10, 2017, 9:49 a.m. UTC | #1
>>> On 10.11.17 at 08:18, <yu.c.zhang@linux.intel.com> wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4844,9 +4844,19 @@ int map_pages_to_xen(
>              {
>                  unsigned long base_mfn;
>  
> -                pl1e = l2e_to_l1e(*pl2e);
>                  if ( locking )
>                      spin_lock(&map_pgdir_lock);
> +
> +                /* Skip the re-consolidation if it's done on another CPU. */
> +                if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
> +                {
> +                    if ( locking )
> +                        spin_unlock(&map_pgdir_lock);
> +                    goto check_l3;
> +                }
> +
> +                ol2e = *pl2e;

This wants moving up ahead of the if(), so you can use the local
variable inside that if().

> @@ -4880,6 +4889,15 @@ int map_pages_to_xen(
>  
>              if ( locking )
>                  spin_lock(&map_pgdir_lock);
> +
> +            /* Skip the re-consolidation if it's done on another CPU. */
> +            if ( l3e_get_flags(*pl3e) & _PAGE_PSE )
> +            {
> +                if ( locking )
> +                    spin_unlock(&map_pgdir_lock);
> +                continue;
> +            }
> +
>              ol3e = *pl3e;

Same here - move the if() below here and use ol3e in there.

With that
Reviewed-by: Jan Beulich <jbeulich@suse.com>

I'm not certain this is important enough a fix to consider for 4.10,
and you seem to think it's good enough if this gets applied only
after the tree is branched, as you didn't Cc Julien. Please
indicate whether you simply weren't aware, or whether there's indeed
an important aspect to this that I'm overlooking.

Jan
Yu Zhang Nov. 10, 2017, 2:05 p.m. UTC | #2
On 11/10/2017 5:49 PM, Jan Beulich wrote:
>>>> On 10.11.17 at 08:18, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -4844,9 +4844,19 @@ int map_pages_to_xen(
>>               {
>>                   unsigned long base_mfn;
>>   
>> -                pl1e = l2e_to_l1e(*pl2e);
>>                   if ( locking )
>>                       spin_lock(&map_pgdir_lock);
>> +
>> +                /* Skip the re-consolidation if it's done on another CPU. */
>> +                if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
>> +                {
>> +                    if ( locking )
>> +                        spin_unlock(&map_pgdir_lock);
>> +                    goto check_l3;
>> +                }
>> +
>> +                ol2e = *pl2e;
> This wants moving up ahead of the if(), so you can use the local
> variable inside that if().

Got it. Thanks.

>> @@ -4880,6 +4889,15 @@ int map_pages_to_xen(
>>   
>>               if ( locking )
>>                   spin_lock(&map_pgdir_lock);
>> +
>> +            /* Skip the re-consolidation if it's done on another CPU. */
>> +            if ( l3e_get_flags(*pl3e) & _PAGE_PSE )
>> +            {
>> +                if ( locking )
>> +                    spin_unlock(&map_pgdir_lock);
>> +                continue;
>> +            }
>> +
>>               ol3e = *pl3e;
> Same here - move the if() below here and use ol3e in there.
>
> With that
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> I'm not certain this is important enough a fix to consider for 4.10,
> and you seem to think it's good enough if this gets applied only
> after the tree is branched, as you didn't Cc Julien. Please
> indicate whether you simply weren't aware, or whether there's indeed
> an important aspect to this that I'm overlooking.

Well, at first I did not expect this to be accepted for 4.10. But since
we have met this issue in practice, when running a graphics application
which consumes memory intensively in dom0, I think it also makes sense
to fix it in a Xen release as early as possible. Do you think this is a
reasonable request? :-)

Also Cc'ing Julien. Julien, any comments? :-)

Yu
> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
Jan Beulich Nov. 13, 2017, 9:31 a.m. UTC | #3
>>> On 10.11.17 at 15:05, <yu.c.zhang@linux.intel.com> wrote:
> On 11/10/2017 5:49 PM, Jan Beulich wrote:
>> I'm not certain this is important enough a fix to consider for 4.10,
>> and you seem to think it's good enough if this gets applied only
>> after the tree is branched, as you didn't Cc Julien. Please
>> indicate whether you simply weren't aware, or whether there's indeed
>> an important aspect to this that I'm overlooking.
> 
> Well, at first I did not expect this to be accepted for 4.10. But since
> we have met this issue in practice, when running a graphics application
> which consumes memory intensively in dom0, I think it also makes sense
> to fix it in a Xen release as early as possible. Do you think this is a
> reasonable request? :-)

You'd need to provide further details for us to understand the
scenario. It obviously depends on whether you have other
patches to Xen which actually trigger this. If the problem can
be triggered from outside of a vanilla upstream Xen, then yes,
I think I would favor the fixes being included.

Jan
Yu Zhang Nov. 13, 2017, 10:34 a.m. UTC | #4
On 11/13/2017 5:31 PM, Jan Beulich wrote:
>>>> On 10.11.17 at 15:05, <yu.c.zhang@linux.intel.com> wrote:
>> On 11/10/2017 5:49 PM, Jan Beulich wrote:
>>> I'm not certain this is important enough a fix to consider for 4.10,
>>> and you seem to think it's good enough if this gets applied only
>>> after the tree is branched, as you didn't Cc Julien. Please
>>> indicate whether you simply weren't aware, or whether there's indeed
>>> an important aspect to this that I'm overlooking.
>> Well, at first I did not expect this to be accepted for 4.10. But since
>> we have met this issue in practice, when running a graphics application
>> which consumes memory intensively in dom0, I think it also makes sense
>> to fix it in a Xen release as early as possible. Do you think this is a
>> reasonable request? :-)
> You'd need to provide further details for us to understand the
> scenario. It obviously depends on whether you have other
> patches to Xen which actually trigger this. If the problem can
> be triggered from outside of a vanilla upstream Xen, then yes,
> I think I would favor the fixes being included.

Thanks, Jan. Let me try to give an explanation of the scenario. :-)

We saw an ASSERT failure in ASSERT((page->count_info & PGC_count_mask) != 0)
in is_iomem_page() <- put_page_from_l1e() <- alloc_l1_table(), when running a
graphics application (a memory eater, but closed source) in dom0. This
failure only happens when dom0 is configured with 2 vCPUs.

Our debugging showed that the page->count_info in question had already (and
unexpectedly) been cleared in free_xenheap_pages(), and the call trace should
be like this:

free_xenheap_pages()
     ^
     |
free_xen_pagetable()
     ^
     |
map_pages_to_xen()
     ^
     |
update_xen_mappings()
     ^
     |
get_page_from_l1e()
     ^
     |
mod_l1_entry()
     ^
     |
do_mmu_update()

We then realized that it happens when dom0 tries to update its page
table: when the cache attributes are about to be changed for a referenced
page frame, the corresponding mappings in Xen's VA space are updated by
map_pages_to_xen() as well.

However, since map_pages_to_xen() has the aforementioned race, when
MMU_NORMAL_PT_UPDATE is triggered concurrently on different CPUs it may
mistakenly free a superpage referenced by pl2e. That is why our ASSERT
failure only happens when dom0 has more than one vCPU configured.
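The interleaving can be replayed deterministically in a single thread:
one "CPU" re-consolidates (freeing the L1 table and setting PSE) between
the other "CPU"'s snapshot and its locked section; with the patch's
re-check, the second CPU skips and the table is freed exactly once. A
standalone sketch of that replay, where all types and names (fake_l2e,
cpu0_patched, etc.) are illustrative stand-ins for Xen's real structures
and locking is elided since the replay is single-threaded:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative stand-in for an L2 entry: a PSE flag plus a pointer to
 * the L1 page table it references. Not Xen's real types. */
typedef struct {
    bool pse;     /* models _PAGE_PSE */
    int *l1;      /* models the L1 page table page */
} fake_l2e;

static int frees;    /* counts calls to the free_xen_pagetable() stand-in */

static void fake_free_pagetable(int **p)
{
    free(*p);
    *p = NULL;
    frees++;
}

fake_l2e make_entry(void)
{
    fake_l2e e = { false, malloc(sizeof(int)) };
    return e;
}

/* CPU1 wins the race: frees the L1 table and marks the superpage. */
void cpu1_consolidates(fake_l2e *e)
{
    fake_free_pagetable(&e->l1);
    e->pse = true;
}

/* CPU0 with the patch applied: re-checks PSE (under the lock, elided
 * here) before touching the L1 table, so it skips the second free. */
void cpu0_patched(fake_l2e *e)
{
    if ( e->pse )
        return;              /* re-consolidation already done elsewhere */
    fake_free_pagetable(&e->l1);
    e->pse = true;
}

int total_frees(void)
{
    return frees;
}
```

The pre-patch code corresponds to cpu0_patched() without the PSE check:
it would free the already-freed (and possibly reused) L1 page, which is
what cleared page->count_info in the trace above.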

As to the code base, we were running XenGT code, which has only a few
non-upstreamed patches in Xen - I believe most of them are libxl related,
and none of them is MMU related. So I believe this issue could be
triggered by a PV guest on a vanilla upstream Xen.

Is above description convincing enough? :-)

Yu
>
> Jan
>
>
Jan Beulich Nov. 13, 2017, 11:06 a.m. UTC | #5
>>> On 13.11.17 at 11:34, <yu.c.zhang@linux.intel.com> wrote:
> Our debug showed the concerned page->count_info was already(and 
> unexpectedly)
> cleared in free_xenheap_pages(), and the call trace should be like this:
> 
> free_xenheap_pages()
>      ^
>      |
> free_xen_pagetable()
>      ^
>      |
> map_pages_to_xen()
>      ^
>      |
> update_xen_mappings()
>      ^
>      |
> get_page_from_l1e()
>      ^
>      |
> mod_l1_entry()
>      ^
>      |
> do_mmu_update()

This ...

> Is above description convincing enough? :-)

... is indeed enough for me to suggest to Julien that we take both
patches (once they're ready). But it's his decision in the end.

Jan
Julien Grall Nov. 13, 2017, 11:22 a.m. UTC | #6
Hi,

On 11/13/2017 11:06 AM, Jan Beulich wrote:
>>>> On 13.11.17 at 11:34, <yu.c.zhang@linux.intel.com> wrote:
>> Our debug showed the concerned page->count_info was already(and
>> unexpectedly)
>> cleared in free_xenheap_pages(), and the call trace should be like this:
>>
>> free_xenheap_pages()
>>       ^
>>       |
>> free_xen_pagetable()
>>       ^
>>       |
>> map_pages_to_xen()
>>       ^
>>       |
>> update_xen_mappings()
>>       ^
>>       |
>> get_page_from_l1e()
>>       ^
>>       |
>> mod_l1_entry()
>>       ^
>>       |
>> do_mmu_update()
> 
> This ...
> 
>> Is above description convincing enough? :-)
> 
> ... is indeed enough for me to suggest to Julien that we take both
> patches (once they're ready). But it's his decision in the end.

I will wait for the series to be ready before giving my release-ack.

Cheers,


> Jan
> 

Patch

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a20fdca..47855fb 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4844,9 +4844,19 @@  int map_pages_to_xen(
             {
                 unsigned long base_mfn;
 
-                pl1e = l2e_to_l1e(*pl2e);
                 if ( locking )
                     spin_lock(&map_pgdir_lock);
+
+                /* Skip the re-consolidation if it's done on another CPU. */
+                if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
+                {
+                    if ( locking )
+                        spin_unlock(&map_pgdir_lock);
+                    goto check_l3;
+                }
+
+                ol2e = *pl2e;
+                pl1e = l2e_to_l1e(ol2e);
                 base_mfn = l1e_get_pfn(*pl1e) & ~(L1_PAGETABLE_ENTRIES - 1);
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++, pl1e++ )
                     if ( (l1e_get_pfn(*pl1e) != (base_mfn + i)) ||
@@ -4854,7 +4864,6 @@  int map_pages_to_xen(
                         break;
                 if ( i == L1_PAGETABLE_ENTRIES )
                 {
-                    ol2e = *pl2e;
                     l2e_write_atomic(pl2e, l2e_from_pfn(base_mfn,
                                                         l1f_to_lNf(flags)));
                     if ( locking )
@@ -4880,6 +4889,15 @@  int map_pages_to_xen(
 
             if ( locking )
                 spin_lock(&map_pgdir_lock);
+
+            /* Skip the re-consolidation if it's done on another CPU. */
+            if ( l3e_get_flags(*pl3e) & _PAGE_PSE )
+            {
+                if ( locking )
+                    spin_unlock(&map_pgdir_lock);
+                continue;
+            }
+
             ol3e = *pl3e;
             pl2e = l3e_to_l2e(ol3e);
             base_mfn = l2e_get_pfn(*pl2e) & ~(L2_PAGETABLE_ENTRIES *