From patchwork Mon Apr 23 22:08:25 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alex Williamson X-Patchwork-Id: 10358185 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 92775601BE for ; Mon, 23 Apr 2018 22:09:00 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7D1A728C6E for ; Mon, 23 Apr 2018 22:09:00 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 7194C28C7F; Mon, 23 Apr 2018 22:09:00 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00, MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AEFC228C6E for ; Mon, 23 Apr 2018 22:08:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932569AbeDWWIc (ORCPT ); Mon, 23 Apr 2018 18:08:32 -0400 Received: from mx1.redhat.com ([209.132.183.28]:44074 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932315AbeDWWIb (ORCPT ); Mon, 23 Apr 2018 18:08:31 -0400 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id B4D0181DE2; Mon, 23 Apr 2018 22:08:31 +0000 (UTC) Received: from gimli.home (ovpn-116-103.phx2.redhat.com [10.3.116.103]) by smtp.corp.redhat.com (Postfix) with ESMTP id 709AF16909; Mon, 23 Apr 2018 22:08:25 +0000 (UTC) Subject: [PATCH] vfio/type1: Fix task tracking for QEMU vCPU hotplug From: Alex Williamson To: alex.williamson@redhat.com Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xuyandong2@huawei.com, zhang.zhanghailiang@huawei.com, wangxinxin.wang@huawei.com, kwankhede@nvidia.com Date: Mon, 23 Apr 2018 16:08:25 -0600 Message-ID: <20180423220808.27282.8721.stgit@gimli.home> User-Agent: StGit/0.18-102-gdf9f MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.25]); Mon, 23 Apr 2018 22:08:31 +0000 (UTC) Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP MAP_DMA ioctls might be called from various threads within a process, for example when using QEMU, the vCPU threads are often generating these calls and we therefore take a reference to that vCPU task. However, QEMU also supports vCPU hotplug on some machines and the task that called MAP_DMA may have exited by the time UNMAP_DMA is called, resulting in the mm_struct pointer being NULL and thus a failure to match against the existing mapping. To resolve this, we instead take a reference to the thread group_leader, which has the same mm_struct and resource limits, but is less likely exit, at least in the QEMU case. A difficulty here is guaranteeing that the capabilities of the group_leader match that of the calling thread, which we resolve by tracking CAP_IPC_LOCK at the time of calling rather than at an indeterminate time in the future. Potentially this also results in better efficiency as this is now recorded once per MAP_DMA ioctl. Reported-by: Xu Yandong Signed-off-by: Alex Williamson --- drivers/vfio/vfio_iommu_type1.c | 73 +++++++++++++++++++++++++-------------- 1 file changed, 47 insertions(+), 26 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 5c212bf29640..1dca77f37fd0 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -83,6 +83,7 @@ struct vfio_dma { size_t size; /* Map size (bytes) */ int prot; /* IOMMU_READ/WRITE */ bool iommu_mapped; + bool lock_cap; /* capable(CAP_IPC_LOCK) */ struct task_struct *task; struct rb_root pfn_list; /* Ex-user pinned pfn list */ }; @@ -253,29 +254,25 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn) return ret; } -static int vfio_lock_acct(struct task_struct *task, long npage, bool *lock_cap) +static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) { struct mm_struct *mm; - bool is_current; int ret; if (!npage) return 0; - is_current = (task->mm == current->mm); - - mm = is_current ? task->mm : get_task_mm(task); + mm = async ? get_task_mm(dma->task) : dma->task->mm; if (!mm) return -ESRCH; /* process exited */ ret = down_write_killable(&mm->mmap_sem); if (!ret) { if (npage > 0) { - if (lock_cap ? !*lock_cap : - !has_capability(task, CAP_IPC_LOCK)) { + if (!dma->lock_cap) { unsigned long limit; - limit = task_rlimit(task, + limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (mm->locked_vm + npage > limit) @@ -289,7 +286,7 @@ static int vfio_lock_acct(struct task_struct *task, long npage, bool *lock_cap) up_write(&mm->mmap_sem); } - if (!is_current) + if (async) mmput(mm); return ret; @@ -400,7 +397,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, */ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, long npage, unsigned long *pfn_base, - bool lock_cap, unsigned long limit) + unsigned long limit) { unsigned long pfn = 0; long ret, pinned = 0, lock_acct = 0; @@ -431,7 +428,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, * pages are already counted against the user. */ if (!vfio_find_vpfn(dma, iova)) { - if (!lock_cap && current->mm->locked_vm + 1 > limit) { + if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) { put_pfn(*pfn_base, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); @@ -456,7 +453,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, } if (!vfio_find_vpfn(dma, iova)) { - if (!lock_cap && + if (!dma->lock_cap && current->mm->locked_vm + lock_acct + 1 > limit) { put_pfn(pfn, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", @@ -469,7 +466,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, } out: - ret = vfio_lock_acct(current, lock_acct, &lock_cap); + ret = vfio_lock_acct(dma, lock_acct, false); unpin_out: if (ret) { @@ -498,7 +495,7 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova, } if (do_accounting) - vfio_lock_acct(dma->task, locked - unlocked, NULL); + vfio_lock_acct(dma, locked - unlocked, true); return unlocked; } @@ -515,7 +512,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr, ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base); if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) { - ret = vfio_lock_acct(dma->task, 1, NULL); + ret = vfio_lock_acct(dma, 1, true); if (ret) { put_pfn(*pfn_base, dma->prot); if (ret == -ENOMEM) @@ -542,7 +539,7 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova, unlocked = vfio_iova_put_vfio_pfn(dma, vpfn); if (do_accounting) - vfio_lock_acct(dma->task, -unlocked, NULL); + vfio_lock_acct(dma, -unlocked, true); return unlocked; } @@ -834,7 +831,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, unlocked += vfio_sync_unpin(dma, domain, &unmapped_region_list); if (do_accounting) { - vfio_lock_acct(dma->task, -unlocked, NULL); + vfio_lock_acct(dma, -unlocked, true); return 0; } return unlocked; @@ -1049,14 +1046,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, size_t size = map_size; long npage; unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - bool lock_cap = capable(CAP_IPC_LOCK); int ret = 0; while (size) { /* Pin a contiguous chunk of memory */ npage = vfio_pin_pages_remote(dma, vaddr + dma->size, - size >> PAGE_SHIFT, &pfn, - lock_cap, limit); + size >> PAGE_SHIFT, &pfn, limit); if (npage <= 0) { WARN_ON(!npage); ret = (int)npage; @@ -1131,8 +1126,36 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, dma->iova = iova; dma->vaddr = vaddr; dma->prot = prot; - get_task_struct(current); - dma->task = current; + + /* + * We need to be able to both add to a task's locked memory and test + * against the locked memory limit and we need to be able to do both + * outside of this call path as pinning can be asynchronous via the + * external interfaces for mdev devices. RLIMIT_MEMLOCK requires a + * task_struct and VM locked pages requires an mm_struct, however + * holding an indefinite mm reference is not recommended, therefore we + * only hold a reference to a task. We could hold a reference to + * current, however QEMU uses this call path through vCPU threads, + * which can be killed resulting in a NULL mm and failure in the unmap + * path when called via a different thread. Avoid this problem by + * using the group_leader as threads within the same group require + * both CLONE_THREAD and CLONE_VM and will therefore use the same + * mm_struct. + * + * Previously we also used the task for testing CAP_IPC_LOCK at the + * time of pinning and accounting, however has_capability() makes use + * of real_cred, a copy-on-write field, so we can't guarantee that it + * matches group_leader, or in fact that it might not change by the + * time it's evaluated. If a process were to call MAP_DMA with + * CAP_IPC_LOCK but later drop it, it doesn't make sense that they + * possibly see different results for an iommu_mapped vfio_dma vs + * externally mapped. Therefore track CAP_IPC_LOCK in vfio_dma at the + * time of calling MAP_DMA. + */ + get_task_struct(current->group_leader); + dma->task = current->group_leader; + dma->lock_cap = capable(CAP_IPC_LOCK); + dma->pfn_list = RB_ROOT; /* Insert zero-sized and grow as we map chunks of it */ @@ -1167,7 +1190,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, struct vfio_domain *d; struct rb_node *n; unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - bool lock_cap = capable(CAP_IPC_LOCK); int ret; /* Arbitrarily pick the first domain in the list for lookups */ @@ -1214,8 +1236,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, npage = vfio_pin_pages_remote(dma, vaddr, n >> PAGE_SHIFT, - &pfn, lock_cap, - limit); + &pfn, limit); if (npage <= 0) { WARN_ON(!npage); ret = (int)npage; @@ -1492,7 +1513,7 @@ static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu) if (!is_invalid_reserved_pfn(vpfn->pfn)) locked++; } - vfio_lock_acct(dma->task, locked - unlocked, NULL); + vfio_lock_acct(dma, locked - unlocked, true); } }