
[v2] x86/sgx: Fix deadlock and race conditions between fork() and EPC reclaim

Message ID 20200403093550.104789-1-jarkko.sakkinen@linux.intel.com (mailing list archive)
State New, archived
Series [v2] x86/sgx: Fix deadlock and race conditions between fork() and EPC reclaim

Commit Message

Jarkko Sakkinen April 3, 2020, 9:35 a.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

Drop the synchronize_srcu() from sgx_encl_mm_add() and replace it with a
mm_list versioning concept to avoid deadlock when adding a mm during
dup_mmap()/fork(), and to ensure copied PTEs are zapped.

When dup_mmap() runs, it holds mmap_sem for write in both the old mm and
new mm.  Invoking synchronize_srcu() while holding mmap_sem of a mm that
is already attached to the enclave will deadlock if the reclaimer is in
the process of walking mm_list, as the reclaimer will try to acquire
mmap_sem (of the old mm) while holding encl->srcu for read.

 INFO: task ksgxswapd:181 blocked for more than 120 seconds.
 ksgxswapd       D    0   181      2 0x80004000
 Call Trace:
  __schedule+0x2db/0x700
  schedule+0x44/0xb0
  rwsem_down_read_slowpath+0x370/0x470
  down_read+0x95/0xa0
  sgx_reclaim_pages+0x1d2/0x7d0
  ksgxswapd+0x151/0x2e0
  kthread+0x120/0x140
  ret_from_fork+0x35/0x40

 INFO: task fork_consistenc:18824 blocked for more than 120 seconds.
 fork_consistenc D    0 18824  18786 0x00004320
 Call Trace:
  __schedule+0x2db/0x700
  schedule+0x44/0xb0
  schedule_timeout+0x205/0x300
  wait_for_completion+0xb7/0x140
  __synchronize_srcu.part.22+0x81/0xb0
  synchronize_srcu_expedited+0x27/0x30
  synchronize_srcu+0x57/0xe0
  sgx_encl_mm_add+0x12b/0x160
  sgx_vma_open+0x22/0x40
  dup_mm+0x521/0x580
  copy_process+0x1a56/0x1b50
  _do_fork+0x85/0x3a0
  __x64_sys_clone+0x8e/0xb0
  do_syscall_64+0x57/0x1b0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Furthermore, doing synchronize_srcu() in sgx_encl_mm_add() does not
prevent the new mm from having stale PTEs pointing at the EPC page to be
reclaimed.  dup_mmap() calls vm_ops->open()/sgx_encl_mm_add() _after_
PTEs are copied to the new mm, i.e. blocking fork() until reclaim zaps
the old mm is pointless as the stale PTEs have already been created in
the new mm.

All other flows that walk mm_list can safely race with dup_mmap() or are
protected by a different mechanism.  Add comments to all srcu readers
that don't check the list version to document why it's ok for the flow to
ignore the version.

Note, synchronize_srcu() is still needed when removing a mm from an
enclave, as the srcu readers must complete their walk before the mm can
be freed.  Removing a mm is never done while holding mmap_sem.

Cc: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
v2:
* Remove smp_wmb() as x86 does not reorder writes in the pipeline.
* Refine comments to be more to the point and more maintainable when
  things might change.
* Replace the ad hoc (goto-based) loop construct with a proper loop
  construct.
 arch/x86/kernel/cpu/sgx/encl.c    | 11 +++++++--
 arch/x86/kernel/cpu/sgx/encl.h    |  1 +
 arch/x86/kernel/cpu/sgx/reclaim.c | 39 +++++++++++++++++++++----------
 3 files changed, 37 insertions(+), 14 deletions(-)
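
In short, the patch drops the synchronize_srcu() call from sgx_encl_mm_add() and
makes the reclaimer retry its mm_list walk whenever the list version changes
underneath it.  A condensed sketch of the pattern, simplified from the diff at
the bottom of this page (error handling, the PTE-zapping details and the memory
barriers are elided; the barriers are the main topic of the review below):

  /* Writer, sgx_encl_mm_add(): publish the new mm, then bump the version. */
  spin_lock(&encl->mm_lock);
  list_add_rcu(&encl_mm->list, &encl->mm_list);
  encl->mm_list_version++;
  spin_unlock(&encl->mm_lock);

  /* Reader, sgx_reclaimer_block(): redo the zap walk if the list changed. */
  do {
        mm_list_version = encl->mm_list_version;

        idx = srcu_read_lock(&encl->srcu);
        list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
                /* ... zap the page's PTEs in this mm ... */
        }
        srcu_read_unlock(&encl->srcu, idx);
  } while (encl->mm_list_version != mm_list_version);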

Comments

Jarkko Sakkinen April 3, 2020, 11:33 p.m. UTC | #1
On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Drop the synchronize_srcu() from sgx_encl_mm_add() and replace it with a
> mm_list versioning concept to avoid deadlock when adding a mm during
> dup_mmap()/fork(), and to ensure copied PTEs are zapped.
> 
> When dup_mmap() runs, it holds mmap_sem for write in both the old mm and
> new mm.  Invoking synchronize_srcu() while holding mmap_sem of a mm that
> is already attached to the enclave will deadlock if the reclaimer is in
> the process of walking mm_list, as the reclaimer will try to acquire
> mmap_sem (of the old mm) while holding encl->srcu for read.
> 
>  INFO: task ksgxswapd:181 blocked for more than 120 seconds.
>  ksgxswapd       D    0   181      2 0x80004000
>  Call Trace:
>   __schedule+0x2db/0x700
>   schedule+0x44/0xb0
>   rwsem_down_read_slowpath+0x370/0x470
>   down_read+0x95/0xa0
>   sgx_reclaim_pages+0x1d2/0x7d0
>   ksgxswapd+0x151/0x2e0
>   kthread+0x120/0x140
>   ret_from_fork+0x35/0x40
> 
>  INFO: task fork_consistenc:18824 blocked for more than 120 seconds.
>  fork_consistenc D    0 18824  18786 0x00004320
>  Call Trace:
>   __schedule+0x2db/0x700
>   schedule+0x44/0xb0
>   schedule_timeout+0x205/0x300
>   wait_for_completion+0xb7/0x140
>   __synchronize_srcu.part.22+0x81/0xb0
>   synchronize_srcu_expedited+0x27/0x30
>   synchronize_srcu+0x57/0xe0
>   sgx_encl_mm_add+0x12b/0x160
>   sgx_vma_open+0x22/0x40
>   dup_mm+0x521/0x580
>   copy_process+0x1a56/0x1b50
>   _do_fork+0x85/0x3a0
>   __x64_sys_clone+0x8e/0xb0
>   do_syscall_64+0x57/0x1b0
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Furthermore, doing synchronize_srcu() in sgx_encl_mm_add() does not
> prevent the new mm from having stale PTEs pointing at the EPC page to be
> reclaimed.  dup_mmap() calls vm_ops->open()/sgx_encl_mm_add() _after_
> PTEs are copied to the new mm, i.e. blocking fork() until reclaim zaps
> the old mm is pointless as the stale PTEs have already been created in
> the new mm.
> 
> All other flows that walk mm_list can safely race with dup_mmap() or are
> protected by a different mechanism.  Add comments to all srcu readers
> that don't check the list version to document why its ok for the flow to
> ignore the version.
> 
> Note, synchronize_srcu() is still needed when removing a mm from an
> enclave, as the srcu readers must complete their walk before the mm can
> be freed.  Removing a mm is never done while holding mmap_sem.
> 
> Cc: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> ---
> v2:
> * Remove smp_wmb() as x86 does not reorder writes in the pipeline.
> * Refine comments to be more to the point and more maintainable when
>   things might change.
> * Replace the ad hoc (goto-based) loop construct with a proper loop
>   construct.
>  arch/x86/kernel/cpu/sgx/encl.c    | 11 +++++++--
>  arch/x86/kernel/cpu/sgx/encl.h    |  1 +
>  arch/x86/kernel/cpu/sgx/reclaim.c | 39 +++++++++++++++++++++----------
>  3 files changed, 37 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index e0124a2f22d5..5b15352b3d4f 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -196,6 +196,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
>  	struct sgx_encl_mm *encl_mm;
>  	int ret;
>  
> +	/* mm_list can be accessed only by a single thread at a time. */
> +	lockdep_assert_held_write(&mm->mmap_sem);
> +
>  	if (atomic_read(&encl->flags) & SGX_ENCL_DEAD)
>  		return -EINVAL;
>  
> @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
>  		return ret;
>  	}
>  
> +	/*
> +	 * The page reclaimer uses list version for synchronization instead of
> +	 * synchronize_scru() because otherwise we could conflict with
> +	 * dup_mmap().
> +	 */
>  	spin_lock(&encl->mm_lock);
>  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> +	encl->mm_list_version++;
>  	spin_unlock(&encl->mm_lock);
>  
> -	synchronize_srcu(&encl->srcu);
> -
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index 44b353aa8866..f0f72e591244 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -74,6 +74,7 @@ struct sgx_encl {
>  	struct mutex lock;
>  	struct list_head mm_list;
>  	spinlock_t mm_lock;
> +	unsigned long mm_list_version;
>  	struct file *backing;
>  	struct kref refcount;
>  	struct srcu_struct srcu;
> diff --git a/arch/x86/kernel/cpu/sgx/reclaim.c b/arch/x86/kernel/cpu/sgx/reclaim.c
> index 39f0ddefbb79..5fb8bdfa6a1f 100644
> --- a/arch/x86/kernel/cpu/sgx/reclaim.c
> +++ b/arch/x86/kernel/cpu/sgx/reclaim.c
> @@ -184,28 +184,38 @@ static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
>  	struct sgx_encl_page *page = epc_page->owner;
>  	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
>  	struct sgx_encl *encl = page->encl;
> +	unsigned long mm_list_version;
>  	struct sgx_encl_mm *encl_mm;
>  	struct vm_area_struct *vma;
>  	int idx, ret;
>  
> -	idx = srcu_read_lock(&encl->srcu);
> +	do {
> +		mm_list_version = encl->mm_list_version;
>  
> -	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> -		if (!mmget_not_zero(encl_mm->mm))
> -			continue;
> +		/* Fence reads as the CPU can reorder them. This guarantees
> +		 * that we don't access old list with a new version.
> +		 */
> +		smp_rmb();
>  
> -		down_read(&encl_mm->mm->mmap_sem);
> +		idx = srcu_read_lock(&encl->srcu);
>  
> -		ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> -		if (!ret && encl == vma->vm_private_data)
> -			zap_vma_ptes(vma, addr, PAGE_SIZE);
> +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> +			if (!mmget_not_zero(encl_mm->mm))
> +				continue;
>  
> -		up_read(&encl_mm->mm->mmap_sem);
> +			down_read(&encl_mm->mm->mmap_sem);
>  
> -		mmput_async(encl_mm->mm);
> -	}
> +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> +			if (!ret && encl == vma->vm_private_data)
> +				zap_vma_ptes(vma, addr, PAGE_SIZE);
>  
> -	srcu_read_unlock(&encl->srcu, idx);
> +			up_read(&encl_mm->mm->mmap_sem);
> +
> +			mmput_async(encl_mm->mm);
> +		}
> +
> +		srcu_read_unlock(&encl->srcu, idx);
> +	} while (unlikely(encl->mm_list_version != mm_list_version));

This is bad, or at least makes the code unnecessarily hard to understand
given the use of the fences: encl->mm_list_version is read twice per
iteration.

I'll change the code to use "for ( ; ; )" and have exactly one read per
iteration.

/Jarkko
Sean Christopherson April 3, 2020, 11:35 p.m. UTC | #2
On Sat, Apr 04, 2020 at 02:33:31AM +0300, Jarkko Sakkinen wrote:
> On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > -	idx = srcu_read_lock(&encl->srcu);
> > +	do {
> > +		mm_list_version = encl->mm_list_version;
> >  
> > -	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > -		if (!mmget_not_zero(encl_mm->mm))
> > -			continue;
> > +		/* Fence reads as the CPU can reorder them. This guarantees
> > +		 * that we don't access old list with a new version.
> > +		 */
> > +		smp_rmb();
> >  
> > -		down_read(&encl_mm->mm->mmap_sem);
> > +		idx = srcu_read_lock(&encl->srcu);
> >  
> > -		ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > -		if (!ret && encl == vma->vm_private_data)
> > -			zap_vma_ptes(vma, addr, PAGE_SIZE);
> > +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > +			if (!mmget_not_zero(encl_mm->mm))
> > +				continue;
> >  
> > -		up_read(&encl_mm->mm->mmap_sem);
> > +			down_read(&encl_mm->mm->mmap_sem);
> >  
> > -		mmput_async(encl_mm->mm);
> > -	}
> > +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > +			if (!ret && encl == vma->vm_private_data)
> > +				zap_vma_ptes(vma, addr, PAGE_SIZE);
> >  
> > -	srcu_read_unlock(&encl->srcu, idx);
> > +			up_read(&encl_mm->mm->mmap_sem);
> > +
> > +			mmput_async(encl_mm->mm);
> > +		}
> > +
> > +		srcu_read_unlock(&encl->srcu, idx);
> > +	} while (unlikely(encl->mm_list_version != mm_list_version));
> 
> This is bad, or at least makes the code unnecessarily hard to understand
> given the use of the fences: encl->mm_list_version is read twice per
> iteration.
> 
> I'll change the code to use "for ( ; ; )" and have exactly one read per
> iteration.

It has to be read twice per iteration, the whole point is to see if the
version at the start of the iteration matches the version at the end.
Sean Christopherson April 3, 2020, 11:42 p.m. UTC | #3
On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
>  		return ret;
>  	}
>  
> +	/*
> +	 * The page reclaimer uses list version for synchronization instead of
> +	 * synchronize_scru() because otherwise we could conflict with
> +	 * dup_mmap().
> +	 */
>  	spin_lock(&encl->mm_lock);
>  	list_add_rcu(&encl_mm->list, &encl->mm_list);

You dropped the smp_wmb().

> +	encl->mm_list_version++;
>  	spin_unlock(&encl->mm_lock);
>  
> -	synchronize_srcu(&encl->srcu);
> -
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index 44b353aa8866..f0f72e591244 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -74,6 +74,7 @@ struct sgx_encl {
>  	struct mutex lock;
>  	struct list_head mm_list;
>  	spinlock_t mm_lock;
> +	unsigned long mm_list_version;
>  	struct file *backing;
>  	struct kref refcount;
>  	struct srcu_struct srcu;
> diff --git a/arch/x86/kernel/cpu/sgx/reclaim.c b/arch/x86/kernel/cpu/sgx/reclaim.c
> index 39f0ddefbb79..5fb8bdfa6a1f 100644
> --- a/arch/x86/kernel/cpu/sgx/reclaim.c
> +++ b/arch/x86/kernel/cpu/sgx/reclaim.c
> @@ -184,28 +184,38 @@ static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
>  	struct sgx_encl_page *page = epc_page->owner;
>  	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
>  	struct sgx_encl *encl = page->encl;
> +	unsigned long mm_list_version;
>  	struct sgx_encl_mm *encl_mm;
>  	struct vm_area_struct *vma;
>  	int idx, ret;
>  
> -	idx = srcu_read_lock(&encl->srcu);
> +	do {
> +		mm_list_version = encl->mm_list_version;
>  
> -	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> -		if (!mmget_not_zero(encl_mm->mm))
> -			continue;
> +		/* Fence reads as the CPU can reorder them. This guarantees
> +		 * that we don't access old list with a new version.

This comment is flat out wrong.  This has nothing to do with the CPU reordering
things.  The smp_{r,w}mb() are nothing more than compiler barriers, and
even those go away when the kernel is built with SMP=0.

I don't mind gutting the other comments, but there is a well-established
pattern for smp_wmb()/smp_rmb() pairs, and I would strongly prefer to keep
the exact comment I submitted.

> +		 */
> +		smp_rmb();


>  
> -		down_read(&encl_mm->mm->mmap_sem);
> +		idx = srcu_read_lock(&encl->srcu);
>  
> -		ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> -		if (!ret && encl == vma->vm_private_data)
> -			zap_vma_ptes(vma, addr, PAGE_SIZE);
> +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> +			if (!mmget_not_zero(encl_mm->mm))
> +				continue;
>  
> -		up_read(&encl_mm->mm->mmap_sem);
> +			down_read(&encl_mm->mm->mmap_sem);
>  
> -		mmput_async(encl_mm->mm);
> -	}
> +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> +			if (!ret && encl == vma->vm_private_data)
> +				zap_vma_ptes(vma, addr, PAGE_SIZE);
>  
> -	srcu_read_unlock(&encl->srcu, idx);
> +			up_read(&encl_mm->mm->mmap_sem);
> +
> +			mmput_async(encl_mm->mm);
> +		}
> +
> +		srcu_read_unlock(&encl->srcu, idx);
> +	} while (unlikely(encl->mm_list_version != mm_list_version));
>  
>  	mutex_lock(&encl->lock);
>  
> @@ -250,6 +260,11 @@ static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
>  	struct sgx_encl_mm *encl_mm;
>  	int idx;
>  
> +	/*
> +	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
> +	 * executed, which means that the CPUs running in the new mm will enter
> +	 * into the enclave with a fresh epoch.
> +	 */
>  	cpumask_clear(cpumask);
>  
>  	idx = srcu_read_lock(&encl->srcu);
> -- 
> 2.25.1
>
Jarkko Sakkinen April 4, 2020, 1:12 a.m. UTC | #4
On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote:
> On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
> >  		return ret;
> >  	}
> >  
> > +	/*
> > +	 * The page reclaimer uses list version for synchronization instead of
> > +	 * synchronize_scru() because otherwise we could conflict with
> > +	 * dup_mmap().
> > +	 */
> >  	spin_lock(&encl->mm_lock);
> >  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> 
> You dropped the smp_wmb().

As I said to you in my review, the x86 pipeline does not reorder writes.

/Jarkko
Jarkko Sakkinen April 4, 2020, 1:12 a.m. UTC | #5
On Fri, Apr 03, 2020 at 04:35:47PM -0700, Sean Christopherson wrote:
> On Sat, Apr 04, 2020 at 02:33:31AM +0300, Jarkko Sakkinen wrote:
> > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > -	idx = srcu_read_lock(&encl->srcu);
> > > +	do {
> > > +		mm_list_version = encl->mm_list_version;
> > >  
> > > -	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > > -		if (!mmget_not_zero(encl_mm->mm))
> > > -			continue;
> > > +		/* Fence reads as the CPU can reorder them. This guarantees
> > > +		 * that we don't access old list with a new version.
> > > +		 */
> > > +		smp_rmb();
> > >  
> > > -		down_read(&encl_mm->mm->mmap_sem);
> > > +		idx = srcu_read_lock(&encl->srcu);
> > >  
> > > -		ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > > -		if (!ret && encl == vma->vm_private_data)
> > > -			zap_vma_ptes(vma, addr, PAGE_SIZE);
> > > +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > > +			if (!mmget_not_zero(encl_mm->mm))
> > > +				continue;
> > >  
> > > -		up_read(&encl_mm->mm->mmap_sem);
> > > +			down_read(&encl_mm->mm->mmap_sem);
> > >  
> > > -		mmput_async(encl_mm->mm);
> > > -	}
> > > +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > > +			if (!ret && encl == vma->vm_private_data)
> > > +				zap_vma_ptes(vma, addr, PAGE_SIZE);
> > >  
> > > -	srcu_read_unlock(&encl->srcu, idx);
> > > +			up_read(&encl_mm->mm->mmap_sem);
> > > +
> > > +			mmput_async(encl_mm->mm);
> > > +		}
> > > +
> > > +		srcu_read_unlock(&encl->srcu, idx);
> > > +	} while (unlikely(encl->mm_list_version != mm_list_version));
> > 
> > This is bad, or at least makes the code unnecessarily hard to understand
> > given the use of the fences: encl->mm_list_version is read twice per
> > iteration.
> > 
> > I'll change the code to use "for ( ; ; )" and have exactly one read per
> > iteration.
> 
> It has to be read twice per iteration, the whole point is to see if the
> version at the start of the iteration matches the version at the end.

Nope, not if you use zero as the invalid version.

/Jarkko
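
One possible shape of the single-read loop described above (a hypothetical
sketch; the actual v3 code is not shown in this thread): keep a local snapshot,
treat 0 as "no walk done yet", and read encl->mm_list_version exactly once per
pass.  This relies on the version counter being non-zero by the time the
reclaimer runs, i.e. at least one mm has been added by then.

  /* Hypothetical sketch; not the posted v3 code. */
  unsigned long snapshot = 0;   /* 0 == no walk done yet */
  unsigned long version;

  for ( ; ; ) {
        version = encl->mm_list_version;        /* single read per pass */

        if (snapshot && snapshot == version)
                break;          /* list unchanged during the last walk */

        snapshot = version;
        smp_rmb();              /* read the version before reading the list */

        idx = srcu_read_lock(&encl->srcu);
        /* ... walk encl->mm_list and zap the page's PTEs ... */
        srcu_read_unlock(&encl->srcu, idx);
  }
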
Sean Christopherson April 6, 2020, 2:31 p.m. UTC | #6
On Sat, Apr 04, 2020 at 04:12:48AM +0300, Jarkko Sakkinen wrote:
> On Fri, Apr 03, 2020 at 04:35:47PM -0700, Sean Christopherson wrote:
> > On Sat, Apr 04, 2020 at 02:33:31AM +0300, Jarkko Sakkinen wrote:
> > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > -	idx = srcu_read_lock(&encl->srcu);
> > > > +	do {
> > > > +		mm_list_version = encl->mm_list_version;
> > > >  
> > > > -	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > > > -		if (!mmget_not_zero(encl_mm->mm))
> > > > -			continue;
> > > > +		/* Fence reads as the CPU can reorder them. This guarantees
> > > > +		 * that we don't access old list with a new version.
> > > > +		 */
> > > > +		smp_rmb();
> > > >  
> > > > -		down_read(&encl_mm->mm->mmap_sem);
> > > > +		idx = srcu_read_lock(&encl->srcu);
> > > >  
> > > > -		ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > > > -		if (!ret && encl == vma->vm_private_data)
> > > > -			zap_vma_ptes(vma, addr, PAGE_SIZE);
> > > > +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > > > +			if (!mmget_not_zero(encl_mm->mm))
> > > > +				continue;
> > > >  
> > > > -		up_read(&encl_mm->mm->mmap_sem);
> > > > +			down_read(&encl_mm->mm->mmap_sem);
> > > >  
> > > > -		mmput_async(encl_mm->mm);
> > > > -	}
> > > > +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > > > +			if (!ret && encl == vma->vm_private_data)
> > > > +				zap_vma_ptes(vma, addr, PAGE_SIZE);
> > > >  
> > > > -	srcu_read_unlock(&encl->srcu, idx);
> > > > +			up_read(&encl_mm->mm->mmap_sem);
> > > > +
> > > > +			mmput_async(encl_mm->mm);
> > > > +		}
> > > > +
> > > > +		srcu_read_unlock(&encl->srcu, idx);
> > > > +	} while (unlikely(encl->mm_list_version != mm_list_version));
> > > 
> > > This is bad, or at least makes the code unnecessarily hard to understand
> > > given the use of the fences: encl->mm_list_version is read twice per
> > > iteration.
> > > 
> > > I'll change the code to use "for ( ; ; )" and have exactly one read per
> > > iteration.
> > 
> > It has to be read twice per iteration, the whole point is to see if the
> > version at the start of the iteration matches the version at the end.
> 
> Nope if you use zero version as invalid version.

Which would break if more than one mm is added to an enclave.
Sean Christopherson April 6, 2020, 2:36 p.m. UTC | #7
On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote:
> On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote:
> > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
> > >  		return ret;
> > >  	}
> > >  
> > > +	/*
> > > +	 * The page reclaimer uses list version for synchronization instead of
> > > +	 * synchronize_scru() because otherwise we could conflict with
> > > +	 * dup_mmap().
> > > +	 */
> > >  	spin_lock(&encl->mm_lock);
> > >  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> > 
> > You dropped the smp_wmb().
> 
> As I said to you in my review x86 pipeline does not reorder writes.

And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if
and only if CONFIG_SMP=y.  The compiler can reorder list_add_rcu() and
mm_list_version++ because from its perspective there is no dependency
between the two.  And that's entirely true except for the SMP case where
the consumer of mm_list_version is relying on the list to be updated before
the version changes.
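
Presumably the pairing being defended here looks something like the following
in sgx_encl_mm_add() (a sketch; the earlier hunk with the barrier is not shown
in this thread): the write barrier orders the list insertion before the version
bump, so a reader whose version snapshot observes the bump, and which issues
the paired smp_rmb() before walking, also observes the new list entry.

  spin_lock(&encl->mm_lock);
  list_add_rcu(&encl_mm->list, &encl->mm_list);

  /*
   * Ensure the list insertion is visible before the version bump; pairs
   * with the smp_rmb() between the reclaimer's version snapshot and its
   * list walk.  On x86 this is only a compiler barrier, which is the
   * point: it keeps the compiler from reordering the two stores.
   */
  smp_wmb();

  encl->mm_list_version++;
  spin_unlock(&encl->mm_lock);
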
Sean Christopherson April 6, 2020, 2:44 p.m. UTC | #8
On Mon, Apr 06, 2020 at 07:31:55AM -0700, Sean Christopherson wrote:
> On Sat, Apr 04, 2020 at 04:12:48AM +0300, Jarkko Sakkinen wrote:
> > On Fri, Apr 03, 2020 at 04:35:47PM -0700, Sean Christopherson wrote:
> > > On Sat, Apr 04, 2020 at 02:33:31AM +0300, Jarkko Sakkinen wrote:
> > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > -	idx = srcu_read_lock(&encl->srcu);
> > > > > +	do {
> > > > > +		mm_list_version = encl->mm_list_version;
> > > > >  
> > > > > -	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > > > > -		if (!mmget_not_zero(encl_mm->mm))
> > > > > -			continue;
> > > > > +		/* Fence reads as the CPU can reorder them. This guarantees
> > > > > +		 * that we don't access old list with a new version.
> > > > > +		 */
> > > > > +		smp_rmb();
> > > > >  
> > > > > -		down_read(&encl_mm->mm->mmap_sem);
> > > > > +		idx = srcu_read_lock(&encl->srcu);
> > > > >  
> > > > > -		ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > > > > -		if (!ret && encl == vma->vm_private_data)
> > > > > -			zap_vma_ptes(vma, addr, PAGE_SIZE);
> > > > > +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > > > > +			if (!mmget_not_zero(encl_mm->mm))
> > > > > +				continue;
> > > > >  
> > > > > -		up_read(&encl_mm->mm->mmap_sem);
> > > > > +			down_read(&encl_mm->mm->mmap_sem);
> > > > >  
> > > > > -		mmput_async(encl_mm->mm);
> > > > > -	}
> > > > > +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > > > > +			if (!ret && encl == vma->vm_private_data)
> > > > > +				zap_vma_ptes(vma, addr, PAGE_SIZE);
> > > > >  
> > > > > -	srcu_read_unlock(&encl->srcu, idx);
> > > > > +			up_read(&encl_mm->mm->mmap_sem);
> > > > > +
> > > > > +			mmput_async(encl_mm->mm);
> > > > > +		}
> > > > > +
> > > > > +		srcu_read_unlock(&encl->srcu, idx);
> > > > > +	} while (unlikely(encl->mm_list_version != mm_list_version));
> > > > 
> > > > This is bad, or at least makes the code unnecessarily hard to understand
> > > > given the use of the fences: encl->mm_list_version is read twice per
> > > > iteration.
> > > > 
> > > > I'll change the code to use "for ( ; ; )" and have exactly one read per
> > > > iteration.
> > > 
> > > It has to be read twice per iteration, the whole point is to see if the
> > > version at the start of the iteration matches the version at the end.
> > 
> > Nope if you use zero version as invalid version.
> 
> Which would break if more than one mm is added to an enclave.

Ah, you mean within the function (just saw v3).
Jarkko Sakkinen April 6, 2020, 5:10 p.m. UTC | #9
On Mon, Apr 06, 2020 at 07:36:38AM -0700, Sean Christopherson wrote:
> On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote:
> > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
> > > >  		return ret;
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * The page reclaimer uses list version for synchronization instead of
> > > > +	 * synchronize_scru() because otherwise we could conflict with
> > > > +	 * dup_mmap().
> > > > +	 */
> > > >  	spin_lock(&encl->mm_lock);
> > > >  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> > > 
> > > You dropped the smp_wmb().
> > 
> > As I said to you in my review x86 pipeline does not reorder writes.
> 
> And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if
> and only if CONFIG_SMP=y.  The compiler can reorder list_add_rcu() and
> mm_list_version++ because from it's perspective there is no dependency
> between the two.  And that's entirely true except for the SMP case where
> the consumer of mm_list_version is relying on the list to be updated before
> the version changes.

I see.

So why not make the variable volatile, given that x86 is the only
arch where this code gets used?

/Jarkko
Jarkko Sakkinen April 6, 2020, 5:15 p.m. UTC | #10
On Mon, Apr 06, 2020 at 08:10:29PM +0300, Jarkko Sakkinen wrote:
> On Mon, Apr 06, 2020 at 07:36:38AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote:
> > > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote:
> > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
> > > > >  		return ret;
> > > > >  	}
> > > > >  
> > > > > +	/*
> > > > > +	 * The page reclaimer uses list version for synchronization instead of
> > > > > +	 * synchronize_scru() because otherwise we could conflict with
> > > > > +	 * dup_mmap().
> > > > > +	 */
> > > > >  	spin_lock(&encl->mm_lock);
> > > > >  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> > > > 
> > > > You dropped the smp_wmb().
> > > 
> > > As I said to you in my review x86 pipeline does not reorder writes.
> > 
> > And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if
> > and only if CONFIG_SMP=y.  The compiler can reorder list_add_rcu() and
> > mm_list_version++ because from it's perspective there is no dependency
> > between the two.  And that's entirely true except for the SMP case where
> > the consumer of mm_list_version is relying on the list to be updated before
> > the version changes.
> 
> I see.
> 
> So why not change the variable volatile given that x86 is the only
> arch that this code gets used?

Please note that I'm fully aware of

https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

Just wondering. Anyway, I'll add smp_wmb() back since it is the safe play
in terms of acceptance.

/Jarkko
Sean Christopherson April 9, 2020, 7:13 p.m. UTC | #11
On Mon, Apr 06, 2020 at 08:15:56PM +0300, Jarkko Sakkinen wrote:
> On Mon, Apr 06, 2020 at 08:10:29PM +0300, Jarkko Sakkinen wrote:
> > On Mon, Apr 06, 2020 at 07:36:38AM -0700, Sean Christopherson wrote:
> > > On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote:
> > > > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote:
> > > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > > > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
> > > > > >  		return ret;
> > > > > >  	}
> > > > > >  
> > > > > > +	/*
> > > > > > +	 * The page reclaimer uses list version for synchronization instead of
> > > > > > +	 * synchronize_scru() because otherwise we could conflict with
> > > > > > +	 * dup_mmap().
> > > > > > +	 */
> > > > > >  	spin_lock(&encl->mm_lock);
> > > > > >  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> > > > > 
> > > > > You dropped the smp_wmb().
> > > > 
> > > > As I said to you in my review x86 pipeline does not reorder writes.
> > > 
> > > And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if
> > > and only if CONFIG_SMP=y.  The compiler can reorder list_add_rcu() and
> > > mm_list_version++ because from it's perspective there is no dependency
> > > between the two.  And that's entirely true except for the SMP case where
> > > the consumer of mm_list_version is relying on the list to be updated before
> > > the version changes.
> > 
> > I see.
> > 
> > So why not change the variable volatile given that x86 is the only
> > arch that this code gets used?
> 
> Please note that I'm fully aware of
> 
> https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html
> 
> Just wondering. Anyway, I'll add smp_wmb() back since it is safe play
> in terms of acceptance.

Because volatile is overkill and too heavy-handed for what is needed here.
E.g. if SMP=n, then there are no barriers needed whatsoever, as a CPU can't
race with itself.

Even for SMP=y, volatile is too much as it prevents optimization that would
otherwise be allowed.  For the smp_wmb() it doesn't matter much, if at all,
since the list and version are updated inside of a critical section, i.e.
barriered on both sides so the compiler can't do much optimization anyway.

But for the smp_rmb() case, the compiler is free to cache the version in a
register early on in the function, it just needs to make sure that the
access is done before starting the list walk.  If encl->mm_list_version
were volatile, the compiler would not be allowed to do such an optimization
as it'd be required to access memory exactly when it's referenced in code.

This is easily visible in the compiled code, as encl->mm_list_version is
only read from memory once per iteration (in the unlikely case that there
are multiple iterations).  The compiler takes the do-while loop, which appears to
read encl->mm_list_version twice per iteration:

  do {
        mm_list_version = encl->mm_list_version;

        <walk list>

  } while (unlikely(encl->mm_list_version != mm_list_version));


And turns it into an optimized loop that loads encl->mm_list_version the
minimum number of times.  If encl->mm_list_version were volatile, the
compiler would not be allowed to cache it in %rax.

        mov    0x58(%r12),%rax          // Load from encl->mm_list_version
        lea    0x40(%r12),%rbx          // Interleaved to optimize ALU vs LD
        and    $0xfffffffffffff000,%rbp // Interleaved to optimize ALU vs LD
        mov    %rax,0x8(%rsp)           // mm_list_version = encl->mm_list_version


  walk_mm_list:
        <blah blah blah>

        mov    0x58(%r12),%rax          // Load from encl->mm_list_version
        cmp    0x8(%rsp),%rax           // Compare against mm_list_version
        jne    update_mm_list_version

        <happy path>
        ret

  update_mm_list_version:
        mov    %rax,0x8(%rsp)           // mm_list_version = encl->mm_list_version
        jmpq   0xffffffff8102e460 <walk_mm_list>
Jarkko Sakkinen April 10, 2020, 1:22 p.m. UTC | #12
On Thu, Apr 09, 2020 at 12:13:53PM -0700, Sean Christopherson wrote:
> On Mon, Apr 06, 2020 at 08:15:56PM +0300, Jarkko Sakkinen wrote:
> > On Mon, Apr 06, 2020 at 08:10:29PM +0300, Jarkko Sakkinen wrote:
> > > On Mon, Apr 06, 2020 at 07:36:38AM -0700, Sean Christopherson wrote:
> > > > On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote:
> > > > > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote:
> > > > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote:
> > > > > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
> > > > > > >  		return ret;
> > > > > > >  	}
> > > > > > >  
> > > > > > > +	/*
> > > > > > > +	 * The page reclaimer uses list version for synchronization instead of
> > > > > > > +	 * synchronize_scru() because otherwise we could conflict with
> > > > > > > +	 * dup_mmap().
> > > > > > > +	 */
> > > > > > >  	spin_lock(&encl->mm_lock);
> > > > > > >  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> > > > > > 
> > > > > > You dropped the smp_wmb().
> > > > > 
> > > > > As I said to you in my review x86 pipeline does not reorder writes.
> > > > 
> > > > And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if
> > > > and only if CONFIG_SMP=y.  The compiler can reorder list_add_rcu() and
> > > > mm_list_version++ because from it's perspective there is no dependency
> > > > between the two.  And that's entirely true except for the SMP case where
> > > > the consumer of mm_list_version is relying on the list to be updated before
> > > > the version changes.
> > > 
> > > I see.
> > > 
> > > So why not change the variable volatile given that x86 is the only
> > > arch that this code gets used?
> > 
> > Please note that I'm fully aware of
> > 
> > https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html
> > 
> > Just wondering. Anyway, I'll add smp_wmb() back since it is safe play
> > in terms of acceptance.
> 
> Because volatile is overkill and too heavy-handed for what is needed here.
> E.g. if SMP=n, then there are no barriers needed whatsover, as a CPU can't
> race with itself.
> 
> Even for SMP=y, volatile is too much as it prevents optimization that would
> otherwise be allowed.  For the smp_wmb() it doesn't matter much, if at all,
> since the list and version are updated inside of a critical section, i.e.
> barriered on both sides so the compiler can't do much optimization anyways.
> 
> But for the smp_rmb() case, the compiler is free to cache the version in a
> register early on in the function, it just needs to make sure that the
> access is done before starting the list walk.  If encl->mm_list_version
> were volatile, the compiler would not be allowed to do such an optimization
> as it'd be required to access memory exactly when it's referenced in code.
> 
> This is easily visible in the compiled code, as encl->mm_list_version is
> only read from memory once per iteration (in the unlikely case that there
> are multiple iterations).  It takes the do-while loop, which appears to
> read encl->mm_list_version twice per iteration:
> 
>   do {
>         mm_list_version = encl->mm_list_version;
> 
>         <walk list>
> 
>   } while (unlikely(encl->mm_list_version != mm_list_version));
> 
> 
> And turns it into an optimized loop that loads encl->mm_list_version the
> minimum number of times.  If encl->mm_list_version list were volatile, the
> compiler would not be allowed to cache it in %rax.
> 
>         mov    0x58(%r12),%rax          // Load from encl->mm_list_version
>         lea    0x40(%r12),%rbx          // Interleaved to optimize ALU vs LD
>         and    $0xfffffffffffff000,%rbp // Interleaved to optimize ALU vs LD
>         mov    %rax,0x8(%rsp)           // mm_list_version = encl->mm_list_version
> 
> 
>   walk_mm_list:
>         <blah blah blah>
> 
>         mov    0x58(%r12),%rax          // Load from encl->mm_list_version
>         cmp    0x8(%rsp),%rax           // Compare against mm_list_version
>         jne    update_mm_list_version
> 
>         <happy path>
>         ret
> 
>   update_mm_list_version:
>         mov    %rax,0x8(%rsp)           // mm_list_version = encl->mm_list_version
>         jmpq   0xffffffff8102e460 <walk_mm_list>

Thanks and check v4 so that I can squash it :-)

/Jarkko

Patch

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index e0124a2f22d5..5b15352b3d4f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -196,6 +196,9 @@  int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 	struct sgx_encl_mm *encl_mm;
 	int ret;
 
+	/* mm_list can be accessed only by a single thread at a time. */
+	lockdep_assert_held_write(&mm->mmap_sem);
+
 	if (atomic_read(&encl->flags) & SGX_ENCL_DEAD)
 		return -EINVAL;
 
@@ -221,12 +224,16 @@  int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 		return ret;
 	}
 
+	/*
+	 * The page reclaimer uses list version for synchronization instead of
+	 * synchronize_srcu() because otherwise we could conflict with
+	 * dup_mmap().
+	 */
 	spin_lock(&encl->mm_lock);
 	list_add_rcu(&encl_mm->list, &encl->mm_list);
+	encl->mm_list_version++;
 	spin_unlock(&encl->mm_lock);
 
-	synchronize_srcu(&encl->srcu);
-
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 44b353aa8866..f0f72e591244 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -74,6 +74,7 @@  struct sgx_encl {
 	struct mutex lock;
 	struct list_head mm_list;
 	spinlock_t mm_lock;
+	unsigned long mm_list_version;
 	struct file *backing;
 	struct kref refcount;
 	struct srcu_struct srcu;
diff --git a/arch/x86/kernel/cpu/sgx/reclaim.c b/arch/x86/kernel/cpu/sgx/reclaim.c
index 39f0ddefbb79..5fb8bdfa6a1f 100644
--- a/arch/x86/kernel/cpu/sgx/reclaim.c
+++ b/arch/x86/kernel/cpu/sgx/reclaim.c
@@ -184,28 +184,38 @@  static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 	struct sgx_encl_page *page = epc_page->owner;
 	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
 	struct sgx_encl *encl = page->encl;
+	unsigned long mm_list_version;
 	struct sgx_encl_mm *encl_mm;
 	struct vm_area_struct *vma;
 	int idx, ret;
 
-	idx = srcu_read_lock(&encl->srcu);
+	do {
+		mm_list_version = encl->mm_list_version;
 
-	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
-		if (!mmget_not_zero(encl_mm->mm))
-			continue;
+		/* Fence reads as the CPU can reorder them. This guarantees
+		 * that we don't access old list with a new version.
+		 */
+		smp_rmb();
 
-		down_read(&encl_mm->mm->mmap_sem);
+		idx = srcu_read_lock(&encl->srcu);
 
-		ret = sgx_encl_find(encl_mm->mm, addr, &vma);
-		if (!ret && encl == vma->vm_private_data)
-			zap_vma_ptes(vma, addr, PAGE_SIZE);
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
 
-		up_read(&encl_mm->mm->mmap_sem);
+			down_read(&encl_mm->mm->mmap_sem);
 
-		mmput_async(encl_mm->mm);
-	}
+			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
+			if (!ret && encl == vma->vm_private_data)
+				zap_vma_ptes(vma, addr, PAGE_SIZE);
 
-	srcu_read_unlock(&encl->srcu, idx);
+			up_read(&encl_mm->mm->mmap_sem);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (unlikely(encl->mm_list_version != mm_list_version));
 
 	mutex_lock(&encl->lock);
 
@@ -250,6 +260,11 @@  static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
 	struct sgx_encl_mm *encl_mm;
 	int idx;
 
+	/*
+	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
+	 * executed, which means that the CPUs running in the new mm will enter
+	 * into the enclave with a fresh epoch.
+	 */
 	cpumask_clear(cpumask);
 
 	idx = srcu_read_lock(&encl->srcu);