Message ID | 20200403093550.104789-1-jarkko.sakkinen@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] x86/sgx: Fix deadlock and race conditions between fork() and EPC reclaim |
On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > From: Sean Christopherson <sean.j.christopherson@intel.com> > > Drop the synchronize_srcu() from sgx_encl_mm_add() and replace it with a > mm_list versioning concept to avoid deadlock when adding a mm during > dup_mmap()/fork(), and to ensure copied PTEs are zapped. > > When dup_mmap() runs, it holds mmap_sem for write in both the old mm and > new mm. Invoking synchronize_srcu() while holding mmap_sem of a mm that > is already attached to the enclave will deadlock if the reclaimer is in > the process of walking mm_list, as the reclaimer will try to acquire > mmap_sem (of the old mm) while holding encl->srcu for read. > > INFO: task ksgxswapd:181 blocked for more than 120 seconds. > ksgxswapd D 0 181 2 0x80004000 > Call Trace: > __schedule+0x2db/0x700 > schedule+0x44/0xb0 > rwsem_down_read_slowpath+0x370/0x470 > down_read+0x95/0xa0 > sgx_reclaim_pages+0x1d2/0x7d0 > ksgxswapd+0x151/0x2e0 > kthread+0x120/0x140 > ret_from_fork+0x35/0x40 > > INFO: task fork_consistenc:18824 blocked for more than 120 seconds. > fork_consistenc D 0 18824 18786 0x00004320 > Call Trace: > __schedule+0x2db/0x700 > schedule+0x44/0xb0 > schedule_timeout+0x205/0x300 > wait_for_completion+0xb7/0x140 > __synchronize_srcu.part.22+0x81/0xb0 > synchronize_srcu_expedited+0x27/0x30 > synchronize_srcu+0x57/0xe0 > sgx_encl_mm_add+0x12b/0x160 > sgx_vma_open+0x22/0x40 > dup_mm+0x521/0x580 > copy_process+0x1a56/0x1b50 > _do_fork+0x85/0x3a0 > __x64_sys_clone+0x8e/0xb0 > do_syscall_64+0x57/0x1b0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Furthermore, doing synchronize_srcu() in sgx_encl_mm_add() does not > prevent the new mm from having stale PTEs pointing at the EPC page to be > reclaimed. dup_mmap() calls vm_ops->open()/sgx_encl_mm_add() _after_ > PTEs are copied to the new mm, i.e. blocking fork() until reclaim zaps > the old mm is pointless as the stale PTEs have already been created in > the new mm. > > All other flows that walk mm_list can safely race with dup_mmap() or are > protected by a different mechanism. Add comments to all srcu readers > that don't check the list version to document why its ok for the flow to > ignore the version. > > Note, synchronize_srcu() is still needed when removing a mm from an > enclave, as the srcu readers must complete their walk before the mm can > be freed. Removing a mm is never done while holding mmap_sem. > > Cc: Haitao Huang <haitao.huang@linux.intel.com> > Cc: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> > --- > v2: > * Remove smp_wmb() as x86 does not reorder writes in the pipeline. > * Refine comments to be more to the point and more maintainable when > things might change. > * Replace the ad hoc (goto-based) loop construct with a proper loop > construct. > arch/x86/kernel/cpu/sgx/encl.c | 11 +++++++-- > arch/x86/kernel/cpu/sgx/encl.h | 1 + > arch/x86/kernel/cpu/sgx/reclaim.c | 39 +++++++++++++++++++++---------- > 3 files changed, 37 insertions(+), 14 deletions(-) > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > index e0124a2f22d5..5b15352b3d4f 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.c > +++ b/arch/x86/kernel/cpu/sgx/encl.c > @@ -196,6 +196,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > struct sgx_encl_mm *encl_mm; > int ret; > > + /* mm_list can be accessed only by a single thread at a time. 
*/ > + lockdep_assert_held_write(&mm->mmap_sem); > + > if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) > return -EINVAL; > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > return ret; > } > > + /* > + * The page reclaimer uses list version for synchronization instead of > + * synchronize_scru() because otherwise we could conflict with > + * dup_mmap(). > + */ > spin_lock(&encl->mm_lock); > list_add_rcu(&encl_mm->list, &encl->mm_list); > + encl->mm_list_version++; > spin_unlock(&encl->mm_lock); > > - synchronize_srcu(&encl->srcu); > - > return 0; > } > > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h > index 44b353aa8866..f0f72e591244 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.h > +++ b/arch/x86/kernel/cpu/sgx/encl.h > @@ -74,6 +74,7 @@ struct sgx_encl { > struct mutex lock; > struct list_head mm_list; > spinlock_t mm_lock; > + unsigned long mm_list_version; > struct file *backing; > struct kref refcount; > struct srcu_struct srcu; > diff --git a/arch/x86/kernel/cpu/sgx/reclaim.c b/arch/x86/kernel/cpu/sgx/reclaim.c > index 39f0ddefbb79..5fb8bdfa6a1f 100644 > --- a/arch/x86/kernel/cpu/sgx/reclaim.c > +++ b/arch/x86/kernel/cpu/sgx/reclaim.c > @@ -184,28 +184,38 @@ static void sgx_reclaimer_block(struct sgx_epc_page *epc_page) > struct sgx_encl_page *page = epc_page->owner; > unsigned long addr = SGX_ENCL_PAGE_ADDR(page); > struct sgx_encl *encl = page->encl; > + unsigned long mm_list_version; > struct sgx_encl_mm *encl_mm; > struct vm_area_struct *vma; > int idx, ret; > > - idx = srcu_read_lock(&encl->srcu); > + do { > + mm_list_version = encl->mm_list_version; > > - list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > - if (!mmget_not_zero(encl_mm->mm)) > - continue; > + /* Fence reads as the CPU can reorder them. This guarantees > + * that we don't access old list with a new version. > + */ > + smp_rmb(); > > - down_read(&encl_mm->mm->mmap_sem); > + idx = srcu_read_lock(&encl->srcu); > > - ret = sgx_encl_find(encl_mm->mm, addr, &vma); > - if (!ret && encl == vma->vm_private_data) > - zap_vma_ptes(vma, addr, PAGE_SIZE); > + list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > + if (!mmget_not_zero(encl_mm->mm)) > + continue; > > - up_read(&encl_mm->mm->mmap_sem); > + down_read(&encl_mm->mm->mmap_sem); > > - mmput_async(encl_mm->mm); > - } > + ret = sgx_encl_find(encl_mm->mm, addr, &vma); > + if (!ret && encl == vma->vm_private_data) > + zap_vma_ptes(vma, addr, PAGE_SIZE); > > - srcu_read_unlock(&encl->srcu, idx); > + up_read(&encl_mm->mm->mmap_sem); > + > + mmput_async(encl_mm->mm); > + } > + > + srcu_read_unlock(&encl->srcu, idx); > + } while (unlikely(encl->mm_list_version != mm_list_version)); This is bad, or at least makes the code unnecessarily hard to understand given the use of the fences: encl->mm_list_version is read twice per iteration. I'll change the code to use "for ( ; ; )" and have exactly one read per iteration. /Jarko
On Sat, Apr 04, 2020 at 02:33:31AM +0300, Jarkko Sakkinen wrote: > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > - idx = srcu_read_lock(&encl->srcu); > > + do { > > + mm_list_version = encl->mm_list_version; > > > > - list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > > - if (!mmget_not_zero(encl_mm->mm)) > > - continue; > > + /* Fence reads as the CPU can reorder them. This guarantees > > + * that we don't access old list with a new version. > > + */ > > + smp_rmb(); > > > > - down_read(&encl_mm->mm->mmap_sem); > > + idx = srcu_read_lock(&encl->srcu); > > > > - ret = sgx_encl_find(encl_mm->mm, addr, &vma); > > - if (!ret && encl == vma->vm_private_data) > > - zap_vma_ptes(vma, addr, PAGE_SIZE); > > + list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > > + if (!mmget_not_zero(encl_mm->mm)) > > + continue; > > > > - up_read(&encl_mm->mm->mmap_sem); > > + down_read(&encl_mm->mm->mmap_sem); > > > > - mmput_async(encl_mm->mm); > > - } > > + ret = sgx_encl_find(encl_mm->mm, addr, &vma); > > + if (!ret && encl == vma->vm_private_data) > > + zap_vma_ptes(vma, addr, PAGE_SIZE); > > > > - srcu_read_unlock(&encl->srcu, idx); > > + up_read(&encl_mm->mm->mmap_sem); > > + > > + mmput_async(encl_mm->mm); > > + } > > + > > + srcu_read_unlock(&encl->srcu, idx); > > + } while (unlikely(encl->mm_list_version != mm_list_version)); > > This is bad, or at least makes the code unnecessarily hard to understand > given the use of the fences: encl->mm_list_version is read twice per > iteration. > > I'll change the code to use "for ( ; ; )" and have exactly one read per > iteration. It has to be read twice per iteration, the whole point is to see if the version at the start of the iteration matches the version at the end.
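To make the shape of that check concrete, the following is a minimal standalone C11 sketch of the snapshot-and-recheck pattern being discussed. The names, the stub list walk, and the use of C11 atomics are illustrative only; the patch itself pairs encl->mm_list_version with SRCU and smp_rmb()/smp_wmb():

#include <stdatomic.h>
#include <stdio.h>

/* Illustrative stand-in for encl->mm_list_version. */
static atomic_ulong list_version;

/* Stand-in for the SRCU-protected walk of encl->mm_list. */
static void walk_list(void)
{
	puts("walking list");
}

static void reclaimer_like_walk(void)
{
	unsigned long seen;

	do {
		/* First read: snapshot the version the walk is valid for. */
		seen = atomic_load_explicit(&list_version, memory_order_acquire);

		walk_list();

		/*
		 * Second read: if a writer bumped the version during the walk,
		 * the walk may have missed a freshly added entry, so retry.
		 */
	} while (atomic_load_explicit(&list_version, memory_order_acquire) != seen);
}

/* Writer side: publish the new entry first (omitted), then bump the version. */
static void add_entry(void)
{
	atomic_fetch_add_explicit(&list_version, 1, memory_order_release);
}

int main(void)
{
	add_entry();
	reclaimer_like_walk();
	return 0;
}

The two reads per pass are the whole mechanism: the first fixes the version the walk is valid for, the second detects a concurrent sgx_encl_mm_add().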
On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > From: Sean Christopherson <sean.j.christopherson@intel.com> > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > return ret; > } > > + /* > + * The page reclaimer uses list version for synchronization instead of > + * synchronize_scru() because otherwise we could conflict with > + * dup_mmap(). > + */ > spin_lock(&encl->mm_lock); > list_add_rcu(&encl_mm->list, &encl->mm_list); You dropped the smp_wmb(). > + encl->mm_list_version++; > spin_unlock(&encl->mm_lock); > > - synchronize_srcu(&encl->srcu); > - > return 0; > } > > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h > index 44b353aa8866..f0f72e591244 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.h > +++ b/arch/x86/kernel/cpu/sgx/encl.h > @@ -74,6 +74,7 @@ struct sgx_encl { > struct mutex lock; > struct list_head mm_list; > spinlock_t mm_lock; > + unsigned long mm_list_version; > struct file *backing; > struct kref refcount; > struct srcu_struct srcu; > diff --git a/arch/x86/kernel/cpu/sgx/reclaim.c b/arch/x86/kernel/cpu/sgx/reclaim.c > index 39f0ddefbb79..5fb8bdfa6a1f 100644 > --- a/arch/x86/kernel/cpu/sgx/reclaim.c > +++ b/arch/x86/kernel/cpu/sgx/reclaim.c > @@ -184,28 +184,38 @@ static void sgx_reclaimer_block(struct sgx_epc_page *epc_page) > struct sgx_encl_page *page = epc_page->owner; > unsigned long addr = SGX_ENCL_PAGE_ADDR(page); > struct sgx_encl *encl = page->encl; > + unsigned long mm_list_version; > struct sgx_encl_mm *encl_mm; > struct vm_area_struct *vma; > int idx, ret; > > - idx = srcu_read_lock(&encl->srcu); > + do { > + mm_list_version = encl->mm_list_version; > > - list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > - if (!mmget_not_zero(encl_mm->mm)) > - continue; > + /* Fence reads as the CPU can reorder them. This guarantees > + * that we don't access old list with a new version. This comment is flat out wrong. This has nothing to do the CPU reordering things. The smp_{r,w}mb() are nothing more than compiler barriers, and even those go away when the kernel is built with SMP=0. I don't mind gutting the other comments, but there is a well established pattern form smb_wmb()/smp_rmb() pairs, I would strongly prefer to keep the exact comment I submitted. > + */ > + smp_rmb(); > > - down_read(&encl_mm->mm->mmap_sem); > + idx = srcu_read_lock(&encl->srcu); > > - ret = sgx_encl_find(encl_mm->mm, addr, &vma); > - if (!ret && encl == vma->vm_private_data) > - zap_vma_ptes(vma, addr, PAGE_SIZE); > + list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > + if (!mmget_not_zero(encl_mm->mm)) > + continue; > > - up_read(&encl_mm->mm->mmap_sem); > + down_read(&encl_mm->mm->mmap_sem); > > - mmput_async(encl_mm->mm); > - } > + ret = sgx_encl_find(encl_mm->mm, addr, &vma); > + if (!ret && encl == vma->vm_private_data) > + zap_vma_ptes(vma, addr, PAGE_SIZE); > > - srcu_read_unlock(&encl->srcu, idx); > + up_read(&encl_mm->mm->mmap_sem); > + > + mmput_async(encl_mm->mm); > + } > + > + srcu_read_unlock(&encl->srcu, idx); > + } while (unlikely(encl->mm_list_version != mm_list_version)); > > mutex_lock(&encl->lock); > > @@ -250,6 +260,11 @@ static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl) > struct sgx_encl_mm *encl_mm; > int idx; > > + /* > + * Can race with sgx_encl_mm_add(), but ETRACK has already been > + * executed, which means that the CPUs running in the new mm will enter > + * into the enclave with a fresh epoch. 
> + */ > cpumask_clear(cpumask); > > idx = srcu_read_lock(&encl->srcu); > -- > 2.25.1 >
On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote: > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > > return ret; > > } > > > > + /* > > + * The page reclaimer uses list version for synchronization instead of > > + * synchronize_scru() because otherwise we could conflict with > > + * dup_mmap(). > > + */ > > spin_lock(&encl->mm_lock); > > list_add_rcu(&encl_mm->list, &encl->mm_list); > > You dropped the smp_wmb(). As I said to you in my review x86 pipeline does not reorder writes. /Jarkko
On Fri, Apr 03, 2020 at 04:35:47PM -0700, Sean Christopherson wrote: > On Sat, Apr 04, 2020 at 02:33:31AM +0300, Jarkko Sakkinen wrote: > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > - idx = srcu_read_lock(&encl->srcu); > > > + do { > > > + mm_list_version = encl->mm_list_version; > > > > > > - list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > > > - if (!mmget_not_zero(encl_mm->mm)) > > > - continue; > > > + /* Fence reads as the CPU can reorder them. This guarantees > > > + * that we don't access old list with a new version. > > > + */ > > > + smp_rmb(); > > > > > > - down_read(&encl_mm->mm->mmap_sem); > > > + idx = srcu_read_lock(&encl->srcu); > > > > > > - ret = sgx_encl_find(encl_mm->mm, addr, &vma); > > > - if (!ret && encl == vma->vm_private_data) > > > - zap_vma_ptes(vma, addr, PAGE_SIZE); > > > + list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > > > + if (!mmget_not_zero(encl_mm->mm)) > > > + continue; > > > > > > - up_read(&encl_mm->mm->mmap_sem); > > > + down_read(&encl_mm->mm->mmap_sem); > > > > > > - mmput_async(encl_mm->mm); > > > - } > > > + ret = sgx_encl_find(encl_mm->mm, addr, &vma); > > > + if (!ret && encl == vma->vm_private_data) > > > + zap_vma_ptes(vma, addr, PAGE_SIZE); > > > > > > - srcu_read_unlock(&encl->srcu, idx); > > > + up_read(&encl_mm->mm->mmap_sem); > > > + > > > + mmput_async(encl_mm->mm); > > > + } > > > + > > > + srcu_read_unlock(&encl->srcu, idx); > > > + } while (unlikely(encl->mm_list_version != mm_list_version)); > > > > This is bad, or at least makes the code unnecessarily hard to understand > > given the use of the fences: encl->mm_list_version is read twice per > > iteration. > > > > I'll change the code to use "for ( ; ; )" and have exactly one read per > > iteration. > > It has to be read twice per iteration, the whole point is to see if the > version at the start of the iteration matches the version at the end. Nope if you use zero version as invalid version. /Jarkko
On Sat, Apr 04, 2020 at 04:12:48AM +0300, Jarkko Sakkinen wrote: > On Fri, Apr 03, 2020 at 04:35:47PM -0700, Sean Christopherson wrote: > > On Sat, Apr 04, 2020 at 02:33:31AM +0300, Jarkko Sakkinen wrote: > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > - idx = srcu_read_lock(&encl->srcu); > > > > + do { > > > > + mm_list_version = encl->mm_list_version; > > > > > > > > - list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > > > > - if (!mmget_not_zero(encl_mm->mm)) > > > > - continue; > > > > + /* Fence reads as the CPU can reorder them. This guarantees > > > > + * that we don't access old list with a new version. > > > > + */ > > > > + smp_rmb(); > > > > > > > > - down_read(&encl_mm->mm->mmap_sem); > > > > + idx = srcu_read_lock(&encl->srcu); > > > > > > > > - ret = sgx_encl_find(encl_mm->mm, addr, &vma); > > > > - if (!ret && encl == vma->vm_private_data) > > > > - zap_vma_ptes(vma, addr, PAGE_SIZE); > > > > + list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > > > > + if (!mmget_not_zero(encl_mm->mm)) > > > > + continue; > > > > > > > > - up_read(&encl_mm->mm->mmap_sem); > > > > + down_read(&encl_mm->mm->mmap_sem); > > > > > > > > - mmput_async(encl_mm->mm); > > > > - } > > > > + ret = sgx_encl_find(encl_mm->mm, addr, &vma); > > > > + if (!ret && encl == vma->vm_private_data) > > > > + zap_vma_ptes(vma, addr, PAGE_SIZE); > > > > > > > > - srcu_read_unlock(&encl->srcu, idx); > > > > + up_read(&encl_mm->mm->mmap_sem); > > > > + > > > > + mmput_async(encl_mm->mm); > > > > + } > > > > + > > > > + srcu_read_unlock(&encl->srcu, idx); > > > > + } while (unlikely(encl->mm_list_version != mm_list_version)); > > > > > > This is bad, or at least makes the code unnecessarily hard to understand > > > given the use of the fences: encl->mm_list_version is read twice per > > > iteration. > > > > > > I'll change the code to use "for ( ; ; )" and have exactly one read per > > > iteration. > > > > It has to be read twice per iteration, the whole point is to see if the > > version at the start of the iteration matches the version at the end. > > Nope if you use zero version as invalid version. Which would break if more than one mm is added to an enclave.
On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote: > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote: > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > > > return ret; > > > } > > > > > > + /* > > > + * The page reclaimer uses list version for synchronization instead of > > > + * synchronize_scru() because otherwise we could conflict with > > > + * dup_mmap(). > > > + */ > > > spin_lock(&encl->mm_lock); > > > list_add_rcu(&encl_mm->list, &encl->mm_list); > > > > You dropped the smp_wmb(). > > As I said to you in my review x86 pipeline does not reorder writes. And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if and only if CONFIG_SMP=y. The compiler can reorder list_add_rcu() and mm_list_version++ because from it's perspective there is no dependency between the two. And that's entirely true except for the SMP case where the consumer of mm_list_version is relying on the list to be updated before the version changes.
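As an illustration of that pairing, here is a standalone C11 sketch in which atomic_thread_fence() stands in for the kernel's smp_wmb()/smp_rmb(); everything below is an assumed example rather than code from the patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins for the new mm_list entry and encl->mm_list_version. */
static _Atomic bool entry_published;
static atomic_ulong list_version;

static void writer(void)
{
	/* Publish the new entry first (list_add_rcu() in the patch). */
	atomic_store_explicit(&entry_published, true, memory_order_relaxed);

	/* Order the list update before the version bump (the smp_wmb()). */
	atomic_thread_fence(memory_order_release);

	atomic_fetch_add_explicit(&list_version, 1, memory_order_relaxed);
}

static bool reader(void)
{
	/* Snapshot the version before walking the list. */
	unsigned long v = atomic_load_explicit(&list_version, memory_order_relaxed);

	/* Order the version read before the list walk (the smp_rmb()). */
	atomic_thread_fence(memory_order_acquire);

	/*
	 * If the snapshot already observed the bumped version, the entry
	 * published before the bump is guaranteed to be visible here.
	 */
	return v ? atomic_load_explicit(&entry_published, memory_order_relaxed) : false;
}

int main(void)
{
	writer();
	printf("entry visible: %d\n", reader());
	return 0;
}

Without the release-side ordering, a reader could observe the new version while still walking a list that does not yet contain the new entry, and its final recheck would then pass despite having missed it.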
On Mon, Apr 06, 2020 at 07:31:55AM -0700, Sean Christopherson wrote: > On Sat, Apr 04, 2020 at 04:12:48AM +0300, Jarkko Sakkinen wrote: > > On Fri, Apr 03, 2020 at 04:35:47PM -0700, Sean Christopherson wrote: > > > On Sat, Apr 04, 2020 at 02:33:31AM +0300, Jarkko Sakkinen wrote: > > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > > - idx = srcu_read_lock(&encl->srcu); > > > > > + do { > > > > > + mm_list_version = encl->mm_list_version; > > > > > > > > > > - list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > > > > > - if (!mmget_not_zero(encl_mm->mm)) > > > > > - continue; > > > > > + /* Fence reads as the CPU can reorder them. This guarantees > > > > > + * that we don't access old list with a new version. > > > > > + */ > > > > > + smp_rmb(); > > > > > > > > > > - down_read(&encl_mm->mm->mmap_sem); > > > > > + idx = srcu_read_lock(&encl->srcu); > > > > > > > > > > - ret = sgx_encl_find(encl_mm->mm, addr, &vma); > > > > > - if (!ret && encl == vma->vm_private_data) > > > > > - zap_vma_ptes(vma, addr, PAGE_SIZE); > > > > > + list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) { > > > > > + if (!mmget_not_zero(encl_mm->mm)) > > > > > + continue; > > > > > > > > > > - up_read(&encl_mm->mm->mmap_sem); > > > > > + down_read(&encl_mm->mm->mmap_sem); > > > > > > > > > > - mmput_async(encl_mm->mm); > > > > > - } > > > > > + ret = sgx_encl_find(encl_mm->mm, addr, &vma); > > > > > + if (!ret && encl == vma->vm_private_data) > > > > > + zap_vma_ptes(vma, addr, PAGE_SIZE); > > > > > > > > > > - srcu_read_unlock(&encl->srcu, idx); > > > > > + up_read(&encl_mm->mm->mmap_sem); > > > > > + > > > > > + mmput_async(encl_mm->mm); > > > > > + } > > > > > + > > > > > + srcu_read_unlock(&encl->srcu, idx); > > > > > + } while (unlikely(encl->mm_list_version != mm_list_version)); > > > > > > > > This is bad, or at least makes the code unnecessarily hard to understand > > > > given the use of the fences: encl->mm_list_version is read twice per > > > > iteration. > > > > > > > > I'll change the code to use "for ( ; ; )" and have exactly one read per > > > > iteration. > > > > > > It has to be read twice per iteration, the whole point is to see if the > > > version at the start of the iteration matches the version at the end. > > > > Nope if you use zero version as invalid version. > > Which would break if more than one mm is added to an enclave. Ah, you mean within the function (just saw v3).
On Mon, Apr 06, 2020 at 07:36:38AM -0700, Sean Christopherson wrote: > On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote: > > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote: > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > > > > return ret; > > > > } > > > > > > > > + /* > > > > + * The page reclaimer uses list version for synchronization instead of > > > > + * synchronize_scru() because otherwise we could conflict with > > > > + * dup_mmap(). > > > > + */ > > > > spin_lock(&encl->mm_lock); > > > > list_add_rcu(&encl_mm->list, &encl->mm_list); > > > > > > You dropped the smp_wmb(). > > > > As I said to you in my review x86 pipeline does not reorder writes. > > And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if > and only if CONFIG_SMP=y. The compiler can reorder list_add_rcu() and > mm_list_version++ because from it's perspective there is no dependency > between the two. And that's entirely true except for the SMP case where > the consumer of mm_list_version is relying on the list to be updated before > the version changes. I see. So why not change the variable volatile given that x86 is the only arch that this code gets used? /Jarkko
On Mon, Apr 06, 2020 at 08:10:29PM +0300, Jarkko Sakkinen wrote: > On Mon, Apr 06, 2020 at 07:36:38AM -0700, Sean Christopherson wrote: > > On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote: > > > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote: > > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > > > > > return ret; > > > > > } > > > > > > > > > > + /* > > > > > + * The page reclaimer uses list version for synchronization instead of > > > > > + * synchronize_scru() because otherwise we could conflict with > > > > > + * dup_mmap(). > > > > > + */ > > > > > spin_lock(&encl->mm_lock); > > > > > list_add_rcu(&encl_mm->list, &encl->mm_list); > > > > > > > > You dropped the smp_wmb(). > > > > > > As I said to you in my review x86 pipeline does not reorder writes. > > > > And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if > > and only if CONFIG_SMP=y. The compiler can reorder list_add_rcu() and > > mm_list_version++ because from it's perspective there is no dependency > > between the two. And that's entirely true except for the SMP case where > > the consumer of mm_list_version is relying on the list to be updated before > > the version changes. > > I see. > > So why not change the variable volatile given that x86 is the only > arch that this code gets used? Please note that I'm fully aware of https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html Just wondering. Anyway, I'll add smp_wmb() back since it is safe play in terms of acceptance. /Jarkko
On Mon, Apr 06, 2020 at 08:15:56PM +0300, Jarkko Sakkinen wrote: > On Mon, Apr 06, 2020 at 08:10:29PM +0300, Jarkko Sakkinen wrote: > > On Mon, Apr 06, 2020 at 07:36:38AM -0700, Sean Christopherson wrote: > > > On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote: > > > > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote: > > > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > > > > > > return ret; > > > > > > } > > > > > > > > > > > > + /* > > > > > > + * The page reclaimer uses list version for synchronization instead of > > > > > > + * synchronize_scru() because otherwise we could conflict with > > > > > > + * dup_mmap(). > > > > > > + */ > > > > > > spin_lock(&encl->mm_lock); > > > > > > list_add_rcu(&encl_mm->list, &encl->mm_list); > > > > > > > > > > You dropped the smp_wmb(). > > > > > > > > As I said to you in my review x86 pipeline does not reorder writes. > > > > > > And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if > > > and only if CONFIG_SMP=y. The compiler can reorder list_add_rcu() and > > > mm_list_version++ because from it's perspective there is no dependency > > > between the two. And that's entirely true except for the SMP case where > > > the consumer of mm_list_version is relying on the list to be updated before > > > the version changes. > > > > I see. > > > > So why not change the variable volatile given that x86 is the only > > arch that this code gets used? > > Please note that I'm fully aware of > > https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html > > Just wondering. Anyway, I'll add smp_wmb() back since it is safe play > in terms of acceptance. Because volatile is overkill and too heavy-handed for what is needed here. E.g. if SMP=n, then there are no barriers needed whatsover, as a CPU can't race with itself. Even for SMP=y, volatile is too much as it prevents optimization that would otherwise be allowed. For the smp_wmb() it doesn't matter much, if at all, since the list and version are updated inside of a critical section, i.e. barriered on both sides so the compiler can't do much optimization anyways. But for the smp_rmb() case, the compiler is free to cache the version in a register early on in the function, it just needs to make sure that the access is done before starting the list walk. If encl->mm_list_version were volatile, the compiler would not be allowed to do such an optimization as it'd be required to access memory exactly when it's referenced in code. This is easily visible in the compiled code, as encl->mm_list_version is only read from memory once per iteration (in the unlikely case that there are multiple iterations). It takes the do-while loop, which appears to read encl->mm_list_version twice per iteration: do { mm_list_version = encl->mm_list_version; <walk list> } while (unlikely(encl->mm_list_version != mm_list_version)); And turns it into an optimized loop that loads encl->mm_list_version the minimum number of times. If encl->mm_list_version list were volatile, the compiler would not be allowed to cache it in %rax. 
	mov    0x58(%r12),%rax           // Load from encl->mm_list_version
	lea    0x40(%r12),%rbx           // Interleaved to optimize ALU vs LD
	and    $0xfffffffffffff000,%rbp  // Interleaved to optimize ALU vs LD
	mov    %rax,0x8(%rsp)            // mm_list_version = encl->mm_list_version


walk_mm_list:
	<blah blah blah>

	mov    0x58(%r12),%rax           // Load from encl->mm_list_version
	cmp    0x8(%rsp),%rax            // Compare against mm_list_version
	jne    update_mm_list_version

	<happy path>
	ret

update_mm_list_version:
	mov    %rax,0x8(%rsp)            // mm_list_version = encl->mm_list_version
	jmpq   0xffffffff8102e460 <walk_mm_list>
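Expressed back in C, the listing above corresponds roughly to the following shape (an illustrative reconstruction with stand-in types, not code from the patch): the version is loaded once up front, and the reload used for the comparison doubles as the snapshot for the next pass.

/* Illustrative stand-in for the enclave structure used in the listing. */
struct encl_like {
	unsigned long mm_list_version;
};

/* Corresponds to the "<blah blah blah>" body between the two loads. */
static void walk_mm_list(struct encl_like *encl)
{
	(void)encl;
}

void walk_until_stable(struct encl_like *encl)
{
	unsigned long cur = encl->mm_list_version;	/* first mov 0x58(%r12),%rax */

	for (;;) {
		unsigned long seen = cur;		/* mov %rax,0x8(%rsp) */

		walk_mm_list(encl);

		cur = encl->mm_list_version;		/* second mov 0x58(%r12),%rax */
		if (cur == seen)			/* cmp 0x8(%rsp),%rax */
			break;				/* happy path */
		/* else: reuse the reloaded value as the new snapshot and retry */
	}
}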
On Thu, Apr 09, 2020 at 12:13:53PM -0700, Sean Christopherson wrote: > On Mon, Apr 06, 2020 at 08:15:56PM +0300, Jarkko Sakkinen wrote: > > On Mon, Apr 06, 2020 at 08:10:29PM +0300, Jarkko Sakkinen wrote: > > > On Mon, Apr 06, 2020 at 07:36:38AM -0700, Sean Christopherson wrote: > > > > On Sat, Apr 04, 2020 at 04:12:02AM +0300, Jarkko Sakkinen wrote: > > > > > On Fri, Apr 03, 2020 at 04:42:39PM -0700, Sean Christopherson wrote: > > > > > > On Fri, Apr 03, 2020 at 12:35:50PM +0300, Jarkko Sakkinen wrote: > > > > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > > > > @@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm) > > > > > > > return ret; > > > > > > > } > > > > > > > > > > > > > > + /* > > > > > > > + * The page reclaimer uses list version for synchronization instead of > > > > > > > + * synchronize_scru() because otherwise we could conflict with > > > > > > > + * dup_mmap(). > > > > > > > + */ > > > > > > > spin_lock(&encl->mm_lock); > > > > > > > list_add_rcu(&encl_mm->list, &encl->mm_list); > > > > > > > > > > > > You dropped the smp_wmb(). > > > > > > > > > > As I said to you in my review x86 pipeline does not reorder writes. > > > > > > > > And as I pointed out in this thread, smp_wmb() is a _compiler_ barrier if > > > > and only if CONFIG_SMP=y. The compiler can reorder list_add_rcu() and > > > > mm_list_version++ because from it's perspective there is no dependency > > > > between the two. And that's entirely true except for the SMP case where > > > > the consumer of mm_list_version is relying on the list to be updated before > > > > the version changes. > > > > > > I see. > > > > > > So why not change the variable volatile given that x86 is the only > > > arch that this code gets used? > > > > Please note that I'm fully aware of > > > > https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html > > > > Just wondering. Anyway, I'll add smp_wmb() back since it is safe play > > in terms of acceptance. > > Because volatile is overkill and too heavy-handed for what is needed here. > E.g. if SMP=n, then there are no barriers needed whatsover, as a CPU can't > race with itself. > > Even for SMP=y, volatile is too much as it prevents optimization that would > otherwise be allowed. For the smp_wmb() it doesn't matter much, if at all, > since the list and version are updated inside of a critical section, i.e. > barriered on both sides so the compiler can't do much optimization anyways. > > But for the smp_rmb() case, the compiler is free to cache the version in a > register early on in the function, it just needs to make sure that the > access is done before starting the list walk. If encl->mm_list_version > were volatile, the compiler would not be allowed to do such an optimization > as it'd be required to access memory exactly when it's referenced in code. > > This is easily visible in the compiled code, as encl->mm_list_version is > only read from memory once per iteration (in the unlikely case that there > are multiple iterations). It takes the do-while loop, which appears to > read encl->mm_list_version twice per iteration: > > do { > mm_list_version = encl->mm_list_version; > > <walk list> > > } while (unlikely(encl->mm_list_version != mm_list_version)); > > > And turns it into an optimized loop that loads encl->mm_list_version the > minimum number of times. If encl->mm_list_version list were volatile, the > compiler would not be allowed to cache it in %rax. 
> 
> 	mov    0x58(%r12),%rax           // Load from encl->mm_list_version
> 	lea    0x40(%r12),%rbx           // Interleaved to optimize ALU vs LD
> 	and    $0xfffffffffffff000,%rbp  // Interleaved to optimize ALU vs LD
> 	mov    %rax,0x8(%rsp)            // mm_list_version = encl->mm_list_version
> 
> 
> walk_mm_list:
> 	<blah blah blah>
> 
> 	mov    0x58(%r12),%rax           // Load from encl->mm_list_version
> 	cmp    0x8(%rsp),%rax            // Compare against mm_list_version
> 	jne    update_mm_list_version
> 
> 	<happy path>
> 	ret
> 
> update_mm_list_version:
> 	mov    %rax,0x8(%rsp)            // mm_list_version = encl->mm_list_version
> 	jmpq   0xffffffff8102e460 <walk_mm_list>

Thanks and check v4 so that I can squash it :-)

/Jarkko
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index e0124a2f22d5..5b15352b3d4f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -196,6 +196,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 	struct sgx_encl_mm *encl_mm;
 	int ret;
 
+	/* mm_list can be accessed only by a single thread at a time. */
+	lockdep_assert_held_write(&mm->mmap_sem);
+
 	if (atomic_read(&encl->flags) & SGX_ENCL_DEAD)
 		return -EINVAL;
 
@@ -221,12 +224,16 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 		return ret;
 	}
 
+	/*
+	 * The page reclaimer uses list version for synchronization instead of
+	 * synchronize_scru() because otherwise we could conflict with
+	 * dup_mmap().
+	 */
 	spin_lock(&encl->mm_lock);
 	list_add_rcu(&encl_mm->list, &encl->mm_list);
+	encl->mm_list_version++;
 	spin_unlock(&encl->mm_lock);
 
-	synchronize_srcu(&encl->srcu);
-
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 44b353aa8866..f0f72e591244 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -74,6 +74,7 @@ struct sgx_encl {
 	struct mutex lock;
 	struct list_head mm_list;
 	spinlock_t mm_lock;
+	unsigned long mm_list_version;
 	struct file *backing;
 	struct kref refcount;
 	struct srcu_struct srcu;
diff --git a/arch/x86/kernel/cpu/sgx/reclaim.c b/arch/x86/kernel/cpu/sgx/reclaim.c
index 39f0ddefbb79..5fb8bdfa6a1f 100644
--- a/arch/x86/kernel/cpu/sgx/reclaim.c
+++ b/arch/x86/kernel/cpu/sgx/reclaim.c
@@ -184,28 +184,38 @@ static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 	struct sgx_encl_page *page = epc_page->owner;
 	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
 	struct sgx_encl *encl = page->encl;
+	unsigned long mm_list_version;
 	struct sgx_encl_mm *encl_mm;
 	struct vm_area_struct *vma;
 	int idx, ret;
 
-	idx = srcu_read_lock(&encl->srcu);
+	do {
+		mm_list_version = encl->mm_list_version;
 
-	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
-		if (!mmget_not_zero(encl_mm->mm))
-			continue;
+		/* Fence reads as the CPU can reorder them. This guarantees
+		 * that we don't access old list with a new version.
+		 */
+		smp_rmb();
 
-		down_read(&encl_mm->mm->mmap_sem);
+		idx = srcu_read_lock(&encl->srcu);
 
-		ret = sgx_encl_find(encl_mm->mm, addr, &vma);
-		if (!ret && encl == vma->vm_private_data)
-			zap_vma_ptes(vma, addr, PAGE_SIZE);
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
 
-		up_read(&encl_mm->mm->mmap_sem);
+			down_read(&encl_mm->mm->mmap_sem);
 
-		mmput_async(encl_mm->mm);
-	}
+			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
+			if (!ret && encl == vma->vm_private_data)
+				zap_vma_ptes(vma, addr, PAGE_SIZE);
 
-	srcu_read_unlock(&encl->srcu, idx);
+			up_read(&encl_mm->mm->mmap_sem);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (unlikely(encl->mm_list_version != mm_list_version));
 
 	mutex_lock(&encl->lock);
 
@@ -250,6 +260,11 @@ static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
 	struct sgx_encl_mm *encl_mm;
 	int idx;
 
+	/*
+	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
+	 * executed, which means that the CPUs running in the new mm will enter
+	 * into the enclave with a fresh epoch.
+	 */
 	cpumask_clear(cpumask);
 
 	idx = srcu_read_lock(&encl->srcu);