Message ID | 1585078359-20124-6-git-send-email-kwankhede@nvidia.com (mailing list archive)
---|---
State | New, archived
Series | KABIs to support migration for VFIO devices
On Wed, Mar 25, 2020 at 03:32:37AM +0800, Kirti Wankhede wrote:
> DMA mapped pages, including those pinned by mdev vendor drivers, might
> get unpinned and unmapped while migration is active and device is still
> running. For example, in pre-copy phase while guest driver could access
> those pages, host device or vendor driver can dirty these mapped pages.
> Such pages should be marked dirty so as to maintain memory consistency
> for a user making use of dirty page tracking.
>
> To get bitmap during unmap, user should allocate memory for bitmap, set
> size of allocated memory, set page size to be considered for bitmap and
> set flag VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP.
>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 54 ++++++++++++++++++++++++++++++++++++++---
>  include/uapi/linux/vfio.h       | 10 ++++++++
>  2 files changed, 60 insertions(+), 4 deletions(-)
>
> [... full diff quoted; see the complete patch at the end of this page ...]
>
> @@ -1040,6 +1045,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  			ret = -EINVAL;
>  			goto unlock;
>  		}
> +
> +		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
> +		    (dma->iova != unmap->iova || dma->size != unmap->size)) {

potential NULL pointer!

And could you address the comments in v14?
How to handle DSI unmaps in vIOMMU
(https://lore.kernel.org/kvm/20200323011041.GB5456@joy-OptiPlex-7040/)
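For reference, here is a minimal sketch of one way the NULL dereference flagged above could be avoided. It is illustrative only, assumes that a missing mapping should simply be rejected when the flag is set, and is not necessarily how the series resolves the issue:

		/*
		 * Sketch only: vfio_find_dma() may return NULL when no
		 * mapping covers unmap->iova, so test dma before
		 * dereferencing it when the bitmap flag is set.
		 */
		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
		    (!dma || dma->iova != unmap->iova ||
		     dma->size != unmap->size)) {
			ret = -EINVAL;
			goto unlock;
		}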
On 3/25/2020 7:48 AM, Yan Zhao wrote:
> On Wed, Mar 25, 2020 at 03:32:37AM +0800, Kirti Wankhede wrote:
>> [... full patch quoted ...]
>>
>> +		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
>> +		    (dma->iova != unmap->iova || dma->size != unmap->size)) {
>
> potential NULL pointer!
>
> And could you address the comments in v14?
> How to handle DSI unmaps in vIOMMU
> (https://lore.kernel.org/kvm/20200323011041.GB5456@joy-OptiPlex-7040/)
>

Sorry, I drafted a reply to it, but I missed sending it; it remained in my
drafts.

> > it happens in vIOMMU Domain level invalidation of IOTLB
> > (domain-selective invalidation, see vtd_iotlb_domain_invalidate() in qemu).
> > common in VTD lazy mode, and NOT just happening once at boot time.
> > rather than invalidate page by page, it batches the page invalidation.
> > so, when this invalidation takes place, even higher level page tables
> > have been invalid and therefore it has to invalidate a bigger combined range.
> > That's why we see IOVAs are mapped in 4k pages, but are unmapped in 2M
> > pages.
> >
> > I think those UNMAPs should also have GET_DIRTY_BITMAP flag on, right?

vtd_iotlb_domain_invalidate()
  vtd_sync_shadow_page_table()
    vtd_sync_shadow_page_table_range(vtd_as, &ce, 0, UINT64_MAX)
      vtd_page_walk()
        vtd_page_walk_level() - walk over specific level for IOVA range
          vtd_page_walk_one()
            memory_region_notify_iommu()
            ...
              vfio_iommu_map_notify()

In the above trace, won't the page walk take care of creating a proper IOTLB
entry, the same as the one created during mapping for that IOTLB entry?

>>> Such unmap would callback vfio_iommu_map_notify() in QEMU. In
>>> vfio_iommu_map_notify(), unmap is called on same range <iova,
>>> iotlb->addr_mask + 1> which was used for map. Secondly unmap with bitmap
>>> will be called only when device state has _SAVING flag set.
>
> in this case, iotlb->addr_mask in unmap is 0x200000 -1.
> different than 0x1000 -1 used for map.

>> It might be helpful for Yan, and everyone else, to see the latest QEMU
>> patch series. Thanks,
>
> yes, please. also curious of log_sync part for vIOMMU. given most IOVAs in
> address space are unmapped and therefore no IOTLBs are able to be found.

QEMU patches compatible with the v16 version are at:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg691806.html

Hope that helps.

Thanks,
Kirti
On Fri, Mar 27, 2020 at 05:39:44AM +0800, Kirti Wankhede wrote:
> [... earlier discussion and full patch quote trimmed ...]
>
> vtd_iotlb_domain_invalidate()
>   vtd_sync_shadow_page_table()
>     vtd_sync_shadow_page_table_range(vtd_as, &ce, 0, UINT64_MAX)
>       vtd_page_walk()
>         vtd_page_walk_level() - walk over specific level for IOVA range
>           vtd_page_walk_one()
>             memory_region_notify_iommu()
>             ...
>               vfio_iommu_map_notify()
>
> In the above trace, won't the page walk take care of creating a proper IOTLB
> entry, the same as the one created during mapping for that IOTLB entry?
>
No. It does walk the page table, but as it's DSI (delayed & batched unmap),
the page table entry for a whole 2M range (the higher level, not the last
level for 4K) is invalid, so the iotlb->addr_mask that vfio_iommu_map_notify()
receives is (2M - 1), not the same as the size used for map.
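To make the addr_mask point easier to follow, below is a rough sketch of QEMU's notifier path, paraphrased from memory of hw/vfio/common.c from around this period; names and error handling are approximate rather than authoritative. The detail that matters here is that the unmap size handed to the type1 driver comes from iotlb->addr_mask, so a domain-selective invalidation can produce a 2M unmap over a region that was mapped 4K at a time.

/* Approximate sketch, not verbatim QEMU source. */
static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
{
    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
    VFIOContainer *container = giommu->container;
    hwaddr iova = iotlb->iova + giommu->iommu_offset;
    void *vaddr;
    bool read_only;

    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
            return;
        }
        /* the map covers iotlb->addr_mask + 1 bytes, e.g. 4K per guest PTE */
        vfio_dma_map(container, iova, iotlb->addr_mask + 1, vaddr, read_only);
    } else {
        /*
         * the unmap also covers iotlb->addr_mask + 1 bytes; for a DSI
         * invalidation this can be 2M (or more) even though the range
         * was mapped in 4K chunks
         */
        vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
    }
}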
On 3/27/2020 5:34 AM, Yan Zhao wrote:
> On Fri, Mar 27, 2020 at 05:39:44AM +0800, Kirti Wankhede wrote:
>> [... earlier discussion and full patch quote trimmed ...]
>>
>> In the above trace, won't the page walk take care of creating a proper IOTLB
>> entry, the same as the one created during mapping for that IOTLB entry?
>>
> No. It does walk the page table, but as it's DSI (delayed & batched unmap),
> the page table entry for a whole 2M range (the higher level, not the last
> level for 4K) is invalid, so the iotlb->addr_mask that vfio_iommu_map_notify()
> receives is (2M - 1), not the same as the size used for map.
>

When does this happen? During my testing I never hit this case. How can I
hit this case?

In this case, will adjacent whole vfio_dmas be clubbed together, or will
there be any intersection of vfio_dmas?

Thanks,
Kirti
On Fri, Mar 27, 2020 at 12:42:43PM +0800, Kirti Wankhede wrote:
> [... earlier discussion and full patch quote trimmed ...]
>
> When does this happen? During my testing I never hit this case. How can I
> hit this case?

Just common settings to enable vIOMMU:
Qemu: -device intel-iommu,caching-mode=true
guest kernel parameter: intel_iommu=on

(intel_iommu=on turns on lazy mode by default)

In lazy mode, the guest notifies DMA MAPs at page level, but notifies DMA
UNMAPs in batch.
With a pass-through NVMe, there are 89 DSI unmaps in 1 second for a typical
fio run. With a pass-through GPU, there are 22 DSI unmaps in total for the
openArena benchmark (which lasts around 55 secs).

> In this case, will adjacent whole vfio_dmas be clubbed together, or will
> there be any intersection of vfio_dmas?
>
clubbed together.
On 3/30/2020 7:45 AM, Yan Zhao wrote:
> On Fri, Mar 27, 2020 at 12:42:43PM +0800, Kirti Wankhede wrote:
>> [... earlier discussion and full patch quote trimmed ...]
>>
>> In this case, will adjacent whole vfio_dmas be clubbed together, or will
>> there be any intersection of vfio_dmas?
>>
> clubbed together.
>

Even if support for clubbing bitmaps together is added, there will still be
the limitation that the clubbed vfio_dmas' size must not exceed INT_MAX
pages.

Thanks,
Kirti
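To put the INT_MAX-pages limitation in concrete terms, here is a small back-of-the-envelope sketch. It assumes a 4K minimum IOMMU page size; the numbers are illustrative and are not taken from the patch itself.

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void)
{
	uint64_t pgsize    = 4096;                /* assumed minimum IOMMU page size */
	uint64_t max_pages = INT_MAX;             /* cap on pages per bitmap request */
	uint64_t max_range = max_pages * pgsize;  /* largest range one request covers */
	uint64_t bitmap_sz = (max_pages + 7) / 8; /* one bit per page */

	/* prints roughly 8 TB of IOVA space and a bitmap of roughly 256 MB */
	printf("max range: %llu bytes, bitmap: %llu bytes\n",
	       (unsigned long long)max_range, (unsigned long long)bitmap_sz);
	return 0;
}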
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 27ed069c5053..b98a8d79e13a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -982,7 +982,8 @@ static int verify_bitmap_size(uint64_t npages, uint64_t bitmap_size)
 }
 
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
-			     struct vfio_iommu_type1_dma_unmap *unmap)
+			     struct vfio_iommu_type1_dma_unmap *unmap,
+			     struct vfio_bitmap *bitmap)
 {
 	uint64_t mask;
 	struct vfio_dma *dma, *dma_last = NULL;
@@ -1033,6 +1034,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	 * will be returned if these conditions are not met. The v2 interface
 	 * will only return success and a size of zero if there were no
 	 * mappings within the range.
+	 *
+	 * When VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP flag is set, unmap request
+	 * must be for single mapping. Multiple mappings with this flag set is
+	 * not supported.
 	 */
 	if (iommu->v2) {
 		dma = vfio_find_dma(iommu, unmap->iova, 1);
@@ -1040,6 +1045,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 			ret = -EINVAL;
 			goto unlock;
 		}
+
+		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+		    (dma->iova != unmap->iova || dma->size != unmap->size)) {
+			ret = -EINVAL;
+			goto unlock;
+		}
+
 		dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
 		if (dma && dma->iova + dma->size != unmap->iova + unmap->size) {
 			ret = -EINVAL;
@@ -1057,6 +1069,11 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (dma->task->mm != current->mm)
 			break;
 
+		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+		    iommu->dirty_page_tracking)
+			vfio_iova_dirty_bitmap(iommu, dma->iova, dma->size,
+					       bitmap->pgsize, bitmap->data);
+
 		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
 			struct vfio_iommu_type1_dma_unmap nb_unmap;
 
@@ -2418,17 +2435,46 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
 		struct vfio_iommu_type1_dma_unmap unmap;
-		long ret;
+		struct vfio_bitmap bitmap = { 0 };
+		int ret;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
 
 		if (copy_from_user(&unmap, (void __user *)arg, minsz))
 			return -EFAULT;
 
-		if (unmap.argsz < minsz || unmap.flags)
+		if (unmap.argsz < minsz ||
+		    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
 			return -EINVAL;
 
-		ret = vfio_dma_do_unmap(iommu, &unmap);
+		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+			unsigned long pgshift;
+			uint64_t iommu_pgsize =
+					 1 << __ffs(vfio_pgsize_bitmap(iommu));
+
+			if (unmap.argsz < (minsz + sizeof(bitmap)))
+				return -EINVAL;
+
+			if (copy_from_user(&bitmap,
+					   (void __user *)(arg + minsz),
+					   sizeof(bitmap)))
+				return -EFAULT;
+
+			/* allow only min supported pgsize */
+			if (bitmap.pgsize != iommu_pgsize)
+				return -EINVAL;
+			if (!access_ok((void __user *)bitmap.data, bitmap.size))
+				return -EINVAL;
+
+			pgshift = __ffs(bitmap.pgsize);
+			ret = verify_bitmap_size(unmap.size >> pgshift,
+						 bitmap.size);
+			if (ret)
+				return ret;
+
+		}
+
+		ret = vfio_dma_do_unmap(iommu, &unmap, &bitmap);
 		if (ret)
 			return ret;
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0018721fb744..7c888041136f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1011,12 +1011,22 @@ struct vfio_bitmap {
  * field. No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
+ * before unmapping IO virtual addresses. When this flag is set, user must
+ * provide data[] as structure vfio_bitmap. User must allocate memory to get
+ * bitmap and must set size of allocated memory in vfio_bitmap.size field.
+ * A bit in bitmap represents one page of user provided page size in 'pgsize',
+ * consecutively starting from iova offset. Bit set indicates page at that
+ * offset from iova is dirty. Bitmap of pages in the range of unmapped size is
+ * returned in vfio_bitmap.data
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
+	__u8    data[];
 };
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
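To show the documented flow end to end, here is a hedged userspace sketch. It assumes kernel headers that already carry this series' UAPI additions, assumes the struct vfio_bitmap layout (pgsize, size, data) introduced earlier in the series, assumes dirty page tracking has already been started, and uses a hypothetical helper name; it illustrates how the UAPI above is intended to be driven rather than reproducing code from the series.

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Unmap <iova, size> and fetch its dirty bitmap in one call.
 * <iova, size> must exactly match a single existing mapping, as the
 * kernel side requires when VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set.
 * Error handling is minimal for brevity.
 */
static int unmap_and_get_dirty(int container_fd, __u64 iova, __u64 size,
			       __u64 pgsize, __u64 *dirty_bits)
{
	struct vfio_iommu_type1_dma_unmap *unmap;
	struct vfio_bitmap *bitmap;
	int ret;

	/* data[] carries a struct vfio_bitmap right after the header */
	unmap = calloc(1, sizeof(*unmap) + sizeof(*bitmap));
	if (!unmap)
		return -1;

	unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
	unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
	unmap->iova  = iova;
	unmap->size  = size;

	bitmap = (struct vfio_bitmap *)unmap->data;
	bitmap->pgsize = pgsize;	/* must be the minimum IOMMU page size */
	/* one bit per page, rounded up to u64 granularity (assumed) */
	bitmap->size = ((size / pgsize + 63) / 64) * sizeof(__u64);
	bitmap->data = dirty_bits;	/* caller-allocated, >= bitmap->size bytes */

	ret = ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, unmap);
	free(unmap);
	return ret;
}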