
x86/mm: fix a potential race condition in map_pages_to_xen().

Message ID 1510241398-25793-1-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yu Zhang Nov. 9, 2017, 3:29 p.m. UTC
In map_pages_to_xen(), an L2 page table entry may be reset to point to
a superpage, and its corresponding L1 page table needs to be freed in
such a scenario, when the L1 page table entries map consecutive page
frames and carry the same mapping flags.

However, the variable `pl1e` is not protected by the lock before the L1
page table is enumerated. A race condition can occur if this code path
is invoked simultaneously on different CPUs.

For example, `pl1e` on CPU0 may hold an obsolete value, pointing to a
page which has just been freed on CPU1. Moreover, until this page is
reused, it will still hold the old PTEs, referencing consecutive page
frames. Consequently, `free_xen_pagetable(l2e_to_l1e(ol2e))` will be
triggered on CPU0, resulting in the unexpected freeing of a normal page.

Protecting `pl1e` with the lock fixes this race condition.

Signed-off-by: Min He <min.he@intel.com>
Signed-off-by: Yi Zhang <yi.z.zhang@intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/mm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Jan Beulich Nov. 9, 2017, 9:19 a.m. UTC | #1
>>> On 09.11.17 at 16:29, <yu.c.zhang@linux.intel.com> wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4844,9 +4844,10 @@ int map_pages_to_xen(
>              {
>                  unsigned long base_mfn;
>  
> -                pl1e = l2e_to_l1e(*pl2e);
>                  if ( locking )
>                      spin_lock(&map_pgdir_lock);
> +
> +                pl1e = l2e_to_l1e(*pl2e);
>                  base_mfn = l1e_get_pfn(*pl1e) & ~(L1_PAGETABLE_ENTRIES - 1);
>                  for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++, pl1e++ )
>                      if ( (l1e_get_pfn(*pl1e) != (base_mfn + i)) ||

I agree with the general observation, but there are three things I'd
like to see considered:

1) Please extend the change slightly such that the L2E
re-consolidation code matches the L3E one (i.e. latch into ol2e
earlier and pass that one to l2e_to_l1e()). Personally I would even
prefer if the presence/absence of blank lines matched between
the two pieces of code.

2) Is your change actually enough to take care of all forms of the
race you describe? In particular, isn't it necessary to re-check PSE
after having taken the lock, in case another CPU has just finished
doing the re-consolidation?

3) What about the empty&free checks in modify_xen_mappings()?

Jan
Jan Beulich Nov. 9, 2017, 9:22 a.m. UTC | #2
>>> On 09.11.17 at 16:29, <yu.c.zhang@linux.intel.com> wrote:
> In map_pages_to_xen(), an L2 page table entry may be reset to point to
> a superpage, and its corresponding L1 page table needs to be freed in
> such a scenario, when the L1 page table entries map consecutive page
> frames and carry the same mapping flags.
> 
> However, the variable `pl1e` is not protected by the lock before the L1
> page table is enumerated. A race condition can occur if this code path
> is invoked simultaneously on different CPUs.
> 
> For example, `pl1e` on CPU0 may hold an obsolete value, pointing to a
> page which has just been freed on CPU1. Moreover, until this page is
> reused, it will still hold the old PTEs, referencing consecutive page
> frames. Consequently, `free_xen_pagetable(l2e_to_l1e(ol2e))` will be
> triggered on CPU0, resulting in the unexpected freeing of a normal page.
> 
> Protecting `pl1e` with the lock fixes this race condition.
> 
> Signed-off-by: Min He <min.he@intel.com>
> Signed-off-by: Yi Zhang <yi.z.zhang@intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>

Oh, one more thing: Is it really the case that all three of you
contributed to the patch? We don't use the Linux model of
everyone through whose hands a patch passes adding an
S-o-b of their own - that would rather be Reviewed-by then (if
applicable).

Also generally I would consider the first S-o-b to be that of the
original author, yet the absence of an explicit From: tag makes
authorship ambiguous here. Please clarify this in v2.

Jan
Yu Zhang Nov. 9, 2017, 10:24 a.m. UTC | #3
On 11/9/2017 5:19 PM, Jan Beulich wrote:
>>>> On 09.11.17 at 16:29, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -4844,9 +4844,10 @@ int map_pages_to_xen(
>>               {
>>                   unsigned long base_mfn;
>>   
>> -                pl1e = l2e_to_l1e(*pl2e);
>>                   if ( locking )
>>                       spin_lock(&map_pgdir_lock);
>> +
>> +                pl1e = l2e_to_l1e(*pl2e);
>>                   base_mfn = l1e_get_pfn(*pl1e) & ~(L1_PAGETABLE_ENTRIES - 1);
>>                   for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++, pl1e++ )
>>                       if ( (l1e_get_pfn(*pl1e) != (base_mfn + i)) ||
> I agree with the general observation, but there are three things I'd
> like to see considered:
>
> 1) Please extend the change slightly such that the L2E
> re-consolidation code matches the L3E one (i.e. latch into ol2e
> earlier and pass that one to l2e_to_l1e()). Personally I would even
> prefer if the presence/absence of blank lines matched between
> the two pieces of code.

Got it. Thanks.

>
> 2) Is your change actually enough to take care of all forms of the
> race you describe? In particular, isn't it necessary to re-check PSE
> after having taken the lock, in case another CPU has just finished
> doing the re-consolidation?

Good question. :-)

I'd thought of checking the PSE bit of pl2e, and dropped that. My
understanding was as below:
After the lock is taken, pl2e will be pointing either to an L1 page
table in normal cases, or to a superpage if another CPU has just
finished the re-consolidation and released the lock. And in the latter
scenario, l1e_get_pfn(*pl1e) shall not be equal to (base_mfn + i), so
the loop will not run to completion and no free will be triggered.

But on second thought, the above understanding is based on an
assumption about the contents of the target superpage. No matter how
small the chance is, we cannot make such an assumption.

So my suggestion is that we add a check of the PSE bit and, if it is
set, "goto check_l3". Is this reasonable to you?

>
> 3) What about the empty&free checks in modify_xen_mappings()?

Oh. Thanks for the reminder.
Just had a look. It seems pl1e or pl2e may be freed more than once in
the empty & free checks, due to the lack of protection.
So we'd better add a lock there too, right?

Yu
Yu Zhang Nov. 9, 2017, 10:32 a.m. UTC | #4
On 11/9/2017 5:22 PM, Jan Beulich wrote:
>>>> On 09.11.17 at 16:29, <yu.c.zhang@linux.intel.com> wrote:
>> In map_pages_to_xen(), an L2 page table entry may be reset to point to
>> a superpage, and its corresponding L1 page table needs to be freed in
>> such a scenario, when the L1 page table entries map consecutive page
>> frames and carry the same mapping flags.
>>
>> However, the variable `pl1e` is not protected by the lock before the L1
>> page table is enumerated. A race condition can occur if this code path
>> is invoked simultaneously on different CPUs.
>>
>> For example, `pl1e` on CPU0 may hold an obsolete value, pointing to a
>> page which has just been freed on CPU1. Moreover, until this page is
>> reused, it will still hold the old PTEs, referencing consecutive page
>> frames. Consequently, `free_xen_pagetable(l2e_to_l1e(ol2e))` will be
>> triggered on CPU0, resulting in the unexpected freeing of a normal page.
>>
>> Protecting `pl1e` with the lock fixes this race condition.
>>
>> Signed-off-by: Min He <min.he@intel.com>
>> Signed-off-by: Yi Zhang <yi.z.zhang@intel.com>
>> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Oh, one more thing: Is it really the case that all three of you
> contributed to the patch? We don't use the Linux model of
> everyone through whose hands a patch passes adding an
> S-o-b of their own - that would rather be Reviewed-by then (if
> applicable).
>
> Also generally I would consider the first S-o-b to be that of the
> original author, yet the absence of an explicit From: tag makes
> authorship ambiguous here. Please clarify this in v2.

Oh, the three of us found this issue while debugging together. And Min
is the author of this patch. So I'd like to add

"From: Min He <min.he@intel.com> "

at the beginning of the commit message in v2. :-)

Yu
> Jan
Jan Beulich Nov. 9, 2017, 12:49 p.m. UTC | #5
>>> On 09.11.17 at 11:24, <yu.c.zhang@linux.intel.com> wrote:
> On 11/9/2017 5:19 PM, Jan Beulich wrote:
>> 2) Is your change actually enough to take care of all forms of the
>> race you describe? In particular, isn't it necessary to re-check PSE
>> after having taken the lock, in case another CPU has just finished
>> doing the re-consolidation?
> 
> Good question. :-)
> 
> I'd thought of checking the PSE bit of pl2e, and dropped that. My
> understanding was as below:
> After the lock is taken, pl2e will be pointing either to an L1 page
> table in normal cases, or to a superpage if another CPU has just
> finished the re-consolidation and released the lock. And in the latter
> scenario, l1e_get_pfn(*pl1e) shall not be equal to (base_mfn + i), so
> the loop will not run to completion and no free will be triggered.
> 
> But on second thought, the above understanding is based on an
> assumption about the contents of the target superpage. No matter how
> small the chance is, we cannot make such an assumption.
> 
> So my suggestion is that we add a check of the PSE bit and, if it is
> set, "goto check_l3". Is this reasonable to you?

Yes; for the L3 case it'll be a simple "continue" afaict.

>> 3) What about the empty&free checks in modify_xen_mappings()?
> 
> Oh. Thanks for the reminder.
> Just had a look. It seems pl1e or pl2e may be freed more than once in
> the empty & free checks, due to the lack of protection.
> So we'd better add a lock there too, right?

Yes, I think so.

Jan

Patch

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a20fdca..9c9afa1 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4844,9 +4844,10 @@  int map_pages_to_xen(
             {
                 unsigned long base_mfn;
 
-                pl1e = l2e_to_l1e(*pl2e);
                 if ( locking )
                     spin_lock(&map_pgdir_lock);
+
+                pl1e = l2e_to_l1e(*pl2e);
                 base_mfn = l1e_get_pfn(*pl1e) & ~(L1_PAGETABLE_ENTRIES - 1);
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++, pl1e++ )
                     if ( (l1e_get_pfn(*pl1e) != (base_mfn + i)) ||