[v4,0/1] Use HMM for ODP v4

Message ID 20190411181314.19465-1-jglisse@redhat.com (mailing list archive)

Message

Jerome Glisse April 11, 2019, 6:13 p.m. UTC
From: Jérôme Glisse <jglisse@redhat.com>

Just fixed Kconfig and the build when ODP is not enabled; other than that
this is the same as v3. Here is the previous cover letter:

Git tree with all prerequisites:
https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4

This patchset converts RDMA ODP to use HMM underneath. This is motivated
by stronger code sharing for the same feature (Shared Virtual Memory, SVM,
or Shared Virtual Address, SVA) and stronger integration with mm code to
achieve that. It depends on the HMM patchset posted for inclusion in 5.2
[2] and on [3].

It has been tested with the pingpong test, using -o and other flags to
exercise the different sizes/features associated with ODP.

Moreover, there are some features of HMM in the works, like peer-to-peer
support, fast CPU page table snapshots, fast IOMMU mapping updates, ...
It will be easier for RDMA devices with ODP to leverage those if they
use HMM underneath.

Quick summary of what HMM is:
    HMM is a toolbox for device drivers to implement software support for
    Shared Virtual Memory (SVM). Not only does it provide helpers to mirror
    a process address space on a device (hmm_mirror), it also provides
    helpers that allow device memory to back regular valid virtual
    addresses of a process (any valid mmap that is not an mmap of a device
    or a DAX mapping). There are two kinds of device memory: private memory
    that is not accessible to the CPU because it does not have all the
    expected properties (this is the case for all PCIe devices), and public
    memory which can also be accessed by the CPU without restriction (with
    OpenCAPI or CCIX or a similar cache-coherent and atomic interconnect).

    Device drivers can use each of the HMM tools separately. You do not
    have to use all the tools it provides.

For RDMA devices I do not expect a need to use the device memory support
of HMM. That device memory support is geared toward accelerators like GPUs.
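
For illustration only, here is a rough sketch of the mirroring side of the
toolbox, using the same calls and callbacks this conversion relies on
(hmm_mirror_register() and struct hmm_mirror_ops); the my_* names are
hypothetical and the callback bodies are only placeholders:

#include <linux/hmm.h>	/* 5.2-era mirror API assumed, as used by this patch */

struct my_mirror {			/* hypothetical driver state */
	struct hmm_mirror mirror;
};

static int my_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
					 const struct hmm_update *update)
{
	/*
	 * container_of(mirror, struct my_mirror, mirror) recovers the driver
	 * state. Invalidate device mappings covering [update->start,
	 * update->end); if update->blockable is false, return -EAGAIN
	 * instead of blocking.
	 */
	return 0;
}

static void my_release(struct hmm_mirror *mirror)
{
	/* The mm is going away: tear down all device mappings. */
}

static const struct hmm_mirror_ops my_mirror_ops = {
	.sync_cpu_device_pagetables	= my_sync_cpu_device_pagetables,
	.release			= my_release,
};

static int my_mirror_register(struct my_mirror *m, struct mm_struct *mm)
{
	int ret;

	m->mirror.ops = &my_mirror_ops;
	down_write(&mm->mmap_sem);	/* same locking as in this patch */
	ret = hmm_mirror_register(&m->mirror, mm);
	up_write(&mm->mmap_sem);
	return ret;
}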


You can find a branch [1] with all the prerequisites in it. This patch is
on top of rdma-next with the HMM patchset [2] and the mmu notifier patchset
[3] applied on top of it.

[1] https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4
[2] https://lkml.org/lkml/2019/4/3/1032
[3] https://lkml.org/lkml/2019/3/26/900

Cc: linux-rdma@vger.kernel.org
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Moni Shoua <monis@mellanox.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Kaike Wan <kaike.wan@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>

Jérôme Glisse (1):
  RDMA/odp: convert to use HMM for ODP v4

 drivers/infiniband/Kconfig         |   3 +-
 drivers/infiniband/core/umem_odp.c | 499 ++++++++---------------------
 drivers/infiniband/hw/mlx5/mem.c   |  20 +-
 drivers/infiniband/hw/mlx5/mr.c    |   2 +-
 drivers/infiniband/hw/mlx5/odp.c   | 106 +++---
 include/rdma/ib_umem_odp.h         |  49 ++-
 6 files changed, 231 insertions(+), 448 deletions(-)

Comments

Jason Gunthorpe May 6, 2019, 7:56 p.m. UTC | #1
On Thu, Apr 11, 2019 at 02:13:13PM -0400, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Just fixed Kconfig and build when ODP was not enabled, other than that
> this is the same as v3. Here is previous cover letter:
> 
> Git tree with all prerequisite:
> https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4
> 
> This patchset convert RDMA ODP to use HMM underneath this is motivated
> by stronger code sharing for same feature (share virtual memory SVM or
> Share Virtual Address SVA) and also stronger integration with mm code to
> achieve that. It depends on HMM patchset posted for inclusion in 5.2 [2]
> and [3].
> 
> It has been tested with pingpong test with -o and others flags to test
> different size/features associated with ODP.
> 
> Moreover they are some features of HMM in the works like peer to peer
> support, fast CPU page table snapshot, fast IOMMU mapping update ...
> It will be easier for RDMA devices with ODP to leverage those if they
> use HMM underneath.
> 
> Quick summary of what HMM is:
>     HMM is a toolbox for device driver to implement software support for
>     Share Virtual Memory (SVM). Not only it provides helpers to mirror a
>     process address space on a device (hmm_mirror). It also provides
>     helper to allow to use device memory to back regular valid virtual
>     address of a process (any valid mmap that is not an mmap of a device
>     or a DAX mapping). They are two kinds of device memory. Private memory
>     that is not accessible to CPU because it does not have all the expected
>     properties (this is for all PCIE devices) or public memory which can
>     also be access by CPU without restriction (with OpenCAPI or CCIX or
>     similar cache-coherent and atomic inter-connect).
> 
>     Device driver can use each of HMM tools separatly. You do not have to
>     use all the tools it provides.
> 
> For RDMA device i do not expect a need to use the device memory support
> of HMM. This device memory support is geared toward accelerator like GPU.
> 
> 
> You can find a branch [1] with all the prerequisite in. This patch is on
> top of rdma-next with the HMM patchset [2] and mmu notifier patchset [3]
> applied on top of it.
> 
> [1] https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4
> [2] https://lkml.org/lkml/2019/4/3/1032
> [3] https://lkml.org/lkml/2019/3/26/900

Jerome, please let me know if these dependent series are merged during
the first week of the merge window.

This patch has been tested and could go along next week if the
dependencies are met.

Thanks,
Jason
Jerome Glisse May 21, 2019, 8:53 p.m. UTC | #2
On Mon, May 06, 2019 at 04:56:57PM -0300, Jason Gunthorpe wrote:
> Jerome, please let me know if these dependent series are merged during
> the first week of the merge window.
> 
> This patch has been tested and could go along next week if the
> dependencies are met.
> 

So attached is a rebase on top of 5.2-rc1. I have tested with pingpong
(prefetch and not, and different sizes). Seems to work OK.

Cheers,
Jérôme
From 80d98b62b0106d94825eccacd5035fb67ad7b825 Mon Sep 17 00:00:00 2001
From: Jérôme Glisse <jglisse@redhat.com>
Date: Sat, 8 Dec 2018 15:47:55 -0500
Subject: [PATCH] RDMA/odp: convert to use HMM for ODP v4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Convert ODP to use HMM so that we can build on common infrastructure
for the different classes of devices that want to mirror a process address
space into a device. There are no functional changes.

Changes since v3:
    - Rebase on top of 5.2-rc1
Changes since v2:
    - Update to match changes to HMM API
Changes since v1:
    - improved comments
    - simplified page alignment computation

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Moni Shoua <monis@mellanox.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Kaike Wan <kaike.wan@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/core/umem_odp.c | 491 ++++++++---------------------
 drivers/infiniband/hw/mlx5/mem.c   |  20 +-
 drivers/infiniband/hw/mlx5/mr.c    |   2 +-
 drivers/infiniband/hw/mlx5/odp.c   | 107 ++++---
 include/rdma/ib_umem_odp.h         |  47 ++-
 5 files changed, 224 insertions(+), 443 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index f962b5bbfa40..b94ab0d34f1b 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -46,6 +46,20 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 
+
+static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
+	ODP_READ_BIT,	/* HMM_PFN_VALID */
+	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
+	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
+};
+
+static uint64_t odp_hmm_values[HMM_PFN_VALUE_MAX] = {
+	-1UL,	/* HMM_PFN_ERROR */
+	0UL,	/* HMM_PFN_NONE */
+	-2UL,	/* HMM_PFN_SPECIAL */
+};
+
+
 /*
  * The ib_umem list keeps track of memory regions for which the HW
  * device request to receive notification when the related memory
@@ -78,57 +92,25 @@ static u64 node_last(struct umem_odp_node *n)
 INTERVAL_TREE_DEFINE(struct umem_odp_node, rb, u64, __subtree_last,
 		     node_start, node_last, static, rbt_ib_umem)
 
-static void ib_umem_notifier_start_account(struct ib_umem_odp *umem_odp)
-{
-	mutex_lock(&umem_odp->umem_mutex);
-	if (umem_odp->notifiers_count++ == 0)
-		/*
-		 * Initialize the completion object for waiting on
-		 * notifiers. Since notifier_count is zero, no one should be
-		 * waiting right now.
-		 */
-		reinit_completion(&umem_odp->notifier_completion);
-	mutex_unlock(&umem_odp->umem_mutex);
-}
-
-static void ib_umem_notifier_end_account(struct ib_umem_odp *umem_odp)
-{
-	mutex_lock(&umem_odp->umem_mutex);
-	/*
-	 * This sequence increase will notify the QP page fault that the page
-	 * that is going to be mapped in the spte could have been freed.
-	 */
-	++umem_odp->notifiers_seq;
-	if (--umem_odp->notifiers_count == 0)
-		complete_all(&umem_odp->notifier_completion);
-	mutex_unlock(&umem_odp->umem_mutex);
-}
-
 static int ib_umem_notifier_release_trampoline(struct ib_umem_odp *umem_odp,
 					       u64 start, u64 end, void *cookie)
 {
 	struct ib_umem *umem = &umem_odp->umem;
 
-	/*
-	 * Increase the number of notifiers running, to
-	 * prevent any further fault handling on this MR.
-	 */
-	ib_umem_notifier_start_account(umem_odp);
 	umem_odp->dying = 1;
 	/* Make sure that the fact the umem is dying is out before we release
 	 * all pending page faults. */
 	smp_wmb();
-	complete_all(&umem_odp->notifier_completion);
 	umem->context->invalidate_range(umem_odp, ib_umem_start(umem),
 					ib_umem_end(umem));
 	return 0;
 }
 
-static void ib_umem_notifier_release(struct mmu_notifier *mn,
-				     struct mm_struct *mm)
+static void ib_umem_notifier_release(struct hmm_mirror *mirror)
 {
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
+	struct ib_ucontext_per_mm *per_mm;
+
+	per_mm = container_of(mirror, struct ib_ucontext_per_mm, mirror);
 
 	down_read(&per_mm->umem_rwsem);
 	if (per_mm->active)
@@ -136,23 +118,26 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
 			&per_mm->umem_tree, 0, ULLONG_MAX,
 			ib_umem_notifier_release_trampoline, true, NULL);
 	up_read(&per_mm->umem_rwsem);
+
+	per_mm->mm = NULL;
 }
 
-static int invalidate_range_start_trampoline(struct ib_umem_odp *item,
-					     u64 start, u64 end, void *cookie)
+static int invalidate_range_trampoline(struct ib_umem_odp *item,
+				       u64 start, u64 end, void *cookie)
 {
-	ib_umem_notifier_start_account(item);
 	item->umem.context->invalidate_range(item, start, end);
 	return 0;
 }
 
-static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
-				const struct mmu_notifier_range *range)
+static int ib_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
+			    const struct hmm_update *range)
 {
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
+	struct ib_ucontext_per_mm *per_mm;
+	int ret;
+
+	per_mm = container_of(mirror, struct ib_ucontext_per_mm, mirror);
 
-	if (mmu_notifier_range_blockable(range))
+	if (range->blockable)
 		down_read(&per_mm->umem_rwsem);
 	else if (!down_read_trylock(&per_mm->umem_rwsem))
 		return -EAGAIN;
@@ -167,39 +152,17 @@ static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		return 0;
 	}
 
-	return rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
+	ret = rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
 					     range->end,
-					     invalidate_range_start_trampoline,
-					     mmu_notifier_range_blockable(range),
-					     NULL);
-}
-
-static int invalidate_range_end_trampoline(struct ib_umem_odp *item, u64 start,
-					   u64 end, void *cookie)
-{
-	ib_umem_notifier_end_account(item);
-	return 0;
-}
-
-static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
-				const struct mmu_notifier_range *range)
-{
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
-
-	if (unlikely(!per_mm->active))
-		return;
-
-	rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
-				      range->end,
-				      invalidate_range_end_trampoline, true, NULL);
+					     invalidate_range_trampoline,
+					     range->blockable, NULL);
 	up_read(&per_mm->umem_rwsem);
+	return ret;
 }
 
-static const struct mmu_notifier_ops ib_umem_notifiers = {
+static const struct hmm_mirror_ops ib_umem_notifiers = {
 	.release                    = ib_umem_notifier_release,
-	.invalidate_range_start     = ib_umem_notifier_invalidate_range_start,
-	.invalidate_range_end       = ib_umem_notifier_invalidate_range_end,
+	.sync_cpu_device_pagetables = ib_sync_cpu_device_pagetables,
 };
 
 static void add_umem_to_per_mm(struct ib_umem_odp *umem_odp)
@@ -223,7 +186,6 @@ static void remove_umem_from_per_mm(struct ib_umem_odp *umem_odp)
 	if (likely(ib_umem_start(umem) != ib_umem_end(umem)))
 		rbt_ib_umem_remove(&umem_odp->interval_tree,
 				   &per_mm->umem_tree);
-	complete_all(&umem_odp->notifier_completion);
 
 	up_write(&per_mm->umem_rwsem);
 }
@@ -250,11 +212,13 @@ static struct ib_ucontext_per_mm *alloc_per_mm(struct ib_ucontext *ctx,
 
 	WARN_ON(mm != current->mm);
 
-	per_mm->mn.ops = &ib_umem_notifiers;
-	ret = mmu_notifier_register(&per_mm->mn, per_mm->mm);
+	per_mm->mirror.ops = &ib_umem_notifiers;
+	down_write(&mm->mmap_sem);
+	ret = hmm_mirror_register(&per_mm->mirror, per_mm->mm);
+	up_write(&mm->mmap_sem);
 	if (ret) {
 		dev_err(&ctx->device->dev,
-			"Failed to register mmu_notifier %d\n", ret);
+			"Failed to register HMM mirror %d\n", ret);
 		goto out_pid;
 	}
 
@@ -296,11 +260,6 @@ static int get_per_mm(struct ib_umem_odp *umem_odp)
 	return 0;
 }
 
-static void free_per_mm(struct rcu_head *rcu)
-{
-	kfree(container_of(rcu, struct ib_ucontext_per_mm, rcu));
-}
-
 static void put_per_mm(struct ib_umem_odp *umem_odp)
 {
 	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
@@ -329,9 +288,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
 	up_write(&per_mm->umem_rwsem);
 
 	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
-	mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
+	hmm_mirror_unregister(&per_mm->mirror);
 	put_pid(per_mm->tgid);
-	mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
+
+	kfree(per_mm);
 }
 
 struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
@@ -359,11 +319,9 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
 	mmgrab(umem->owning_mm);
 
 	mutex_init(&odp_data->umem_mutex);
-	init_completion(&odp_data->notifier_completion);
 
-	odp_data->page_list =
-		vzalloc(array_size(pages, sizeof(*odp_data->page_list)));
-	if (!odp_data->page_list) {
+	odp_data->pfns = vzalloc(array_size(pages, sizeof(*odp_data->pfns)));
+	if (!odp_data->pfns) {
 		ret = -ENOMEM;
 		goto out_odp_data;
 	}
@@ -372,7 +330,7 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
 		vzalloc(array_size(pages, sizeof(*odp_data->dma_list)));
 	if (!odp_data->dma_list) {
 		ret = -ENOMEM;
-		goto out_page_list;
+		goto out_pfns;
 	}
 
 	/*
@@ -386,8 +344,8 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
 
 	return odp_data;
 
-out_page_list:
-	vfree(odp_data->page_list);
+out_pfns:
+	vfree(odp_data->pfns);
 out_odp_data:
 	mmdrop(umem->owning_mm);
 	kfree(odp_data);
@@ -422,13 +380,11 @@ int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
 
 	mutex_init(&umem_odp->umem_mutex);
 
-	init_completion(&umem_odp->notifier_completion);
-
 	if (ib_umem_num_pages(umem)) {
-		umem_odp->page_list =
-			vzalloc(array_size(sizeof(*umem_odp->page_list),
+		umem_odp->pfns =
+			vzalloc(array_size(sizeof(*umem_odp->pfns),
 					   ib_umem_num_pages(umem)));
-		if (!umem_odp->page_list)
+		if (!umem_odp->pfns)
 			return -ENOMEM;
 
 		umem_odp->dma_list =
@@ -436,7 +392,7 @@ int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
 					   ib_umem_num_pages(umem)));
 		if (!umem_odp->dma_list) {
 			ret_val = -ENOMEM;
-			goto out_page_list;
+			goto out_pfns;
 		}
 	}
 
@@ -449,8 +405,8 @@ int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
 
 out_dma_list:
 	vfree(umem_odp->dma_list);
-out_page_list:
-	vfree(umem_odp->page_list);
+out_pfns:
+	vfree(umem_odp->pfns);
 	return ret_val;
 }
 
@@ -470,289 +426,118 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 	remove_umem_from_per_mm(umem_odp);
 	put_per_mm(umem_odp);
 	vfree(umem_odp->dma_list);
-	vfree(umem_odp->page_list);
-}
-
-/*
- * Map for DMA and insert a single page into the on-demand paging page tables.
- *
- * @umem: the umem to insert the page to.
- * @page_index: index in the umem to add the page to.
- * @page: the page struct to map and add.
- * @access_mask: access permissions needed for this page.
- * @current_seq: sequence number for synchronization with invalidations.
- *               the sequence number is taken from
- *               umem_odp->notifiers_seq.
- *
- * The function returns -EFAULT if the DMA mapping operation fails. It returns
- * -EAGAIN if a concurrent invalidation prevents us from updating the page.
- *
- * The page is released via put_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
- */
-static int ib_umem_odp_map_dma_single_page(
-		struct ib_umem_odp *umem_odp,
-		int page_index,
-		struct page *page,
-		u64 access_mask,
-		unsigned long current_seq)
-{
-	struct ib_umem *umem = &umem_odp->umem;
-	struct ib_device *dev = umem->context->device;
-	dma_addr_t dma_addr;
-	int remove_existing_mapping = 0;
-	int ret = 0;
-
-	/*
-	 * Note: we avoid writing if seq is different from the initial seq, to
-	 * handle case of a racing notifier. This check also allows us to bail
-	 * early if we have a notifier running in parallel with us.
-	 */
-	if (ib_umem_mmu_notifier_retry(umem_odp, current_seq)) {
-		ret = -EAGAIN;
-		goto out;
-	}
-	if (!(umem_odp->dma_list[page_index])) {
-		dma_addr = ib_dma_map_page(dev,
-					   page,
-					   0, BIT(umem->page_shift),
-					   DMA_BIDIRECTIONAL);
-		if (ib_dma_mapping_error(dev, dma_addr)) {
-			ret = -EFAULT;
-			goto out;
-		}
-		umem_odp->dma_list[page_index] = dma_addr | access_mask;
-		umem_odp->page_list[page_index] = page;
-		umem_odp->npages++;
-	} else if (umem_odp->page_list[page_index] == page) {
-		umem_odp->dma_list[page_index] |= access_mask;
-	} else {
-		pr_err("error: got different pages in IB device and from get_user_pages. IB device page: %p, gup page: %p\n",
-		       umem_odp->page_list[page_index], page);
-		/* Better remove the mapping now, to prevent any further
-		 * damage. */
-		remove_existing_mapping = 1;
-	}
-
-out:
-	put_page(page);
-
-	if (remove_existing_mapping) {
-		ib_umem_notifier_start_account(umem_odp);
-		umem->context->invalidate_range(
-			umem_odp,
-			ib_umem_start(umem) + (page_index << umem->page_shift),
-			ib_umem_start(umem) +
-				((page_index + 1) << umem->page_shift));
-		ib_umem_notifier_end_account(umem_odp);
-		ret = -EAGAIN;
-	}
-
-	return ret;
+	vfree(umem_odp->pfns);
 }
 
 /**
  * ib_umem_odp_map_dma_pages - Pin and DMA map userspace memory in an ODP MR.
- *
- * Pins the range of pages passed in the argument, and maps them to
- * DMA addresses. The DMA addresses of the mapped pages is updated in
- * umem_odp->dma_list.
- *
- * Returns the number of pages mapped in success, negative error code
- * for failure.
- * An -EAGAIN error code is returned when a concurrent mmu notifier prevents
- * the function from completing its task.
- * An -ENOENT error code indicates that userspace process is being terminated
- * and mm was already destroyed.
  * @umem_odp: the umem to map and pin
- * @user_virt: the address from which we need to map.
- * @bcnt: the minimal number of bytes to pin and map. The mapping might be
- *        bigger due to alignment, and may also be smaller in case of an error
- *        pinning or mapping a page. The actual pages mapped is returned in
- *        the return value.
- * @access_mask: bit mask of the requested access permissions for the given
- *               range.
- * @current_seq: the MMU notifiers sequance value for synchronization with
- *               invalidations. the sequance number is read from
- *               umem_odp->notifiers_seq before calling this function
+ * @range: range of virtual address to be mapped to the device
+ * Returns: -EINVAL some invalid arguments, -EAGAIN need to try again, -ENOENT
+ *          if process is being terminated, number of pages mapped otherwise.
+ *
+ * Map to device a range of virtual address passed in the argument. The DMA
+ * addresses are in umem_odp->dma_list and the corresponding page informations
+ * in umem_odp->pfns.
  */
-int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
-			      u64 bcnt, u64 access_mask,
-			      unsigned long current_seq)
+long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
+			       struct hmm_range *range)
 {
+	struct device *device = umem_odp->umem.context->device->dma_device;
+	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
 	struct ib_umem *umem = &umem_odp->umem;
-	struct task_struct *owning_process  = NULL;
-	struct mm_struct *owning_mm = umem_odp->umem.owning_mm;
-	struct page       **local_page_list = NULL;
-	u64 page_mask, off;
-	int j, k, ret = 0, start_idx, npages = 0, page_shift;
-	unsigned int flags = 0;
-	phys_addr_t p = 0;
-
-	if (access_mask == 0)
+	struct mm_struct *mm = per_mm->mm;
+	unsigned long idx, npages;
+	long ret;
+
+	if (mm == NULL)
+		return -ENOENT;
+
+	/* Only drivers with invalidate support can use this function. */
+	if (!umem->context->invalidate_range)
 		return -EINVAL;
 
-	if (user_virt < ib_umem_start(umem) ||
-	    user_virt + bcnt > ib_umem_end(umem))
-		return -EFAULT;
+	/* Sanity checks. */
+	if (range->default_flags == 0)
+		return -EINVAL;
 
-	local_page_list = (struct page **)__get_free_page(GFP_KERNEL);
-	if (!local_page_list)
-		return -ENOMEM;
+	if (range->start < ib_umem_start(umem) ||
+	    range->end > ib_umem_end(umem))
+		return -EINVAL;
 
-	page_shift = umem->page_shift;
-	page_mask = ~(BIT(page_shift) - 1);
-	off = user_virt & (~page_mask);
-	user_virt = user_virt & page_mask;
-	bcnt += off; /* Charge for the first page offset as well. */
+	idx = (range->start - ib_umem_start(umem)) >> umem->page_shift;
+	range->pfns = &umem_odp->pfns[idx];
+	range->pfn_shift = ODP_FLAGS_BITS;
+	range->values = odp_hmm_values;
+	range->flags = odp_hmm_flags;
 
 	/*
-	 * owning_process is allowed to be NULL, this means somehow the mm is
-	 * existing beyond the lifetime of the originating process.. Presumably
-	 * mmget_not_zero will fail in this case.
+	 * If mm is dying just bail out early without trying to take mmap_sem.
+	 * Note that this might race with mm destruction but that is fine the
+	 * is properly refcounted so are all HMM structure.
 	 */
-	owning_process = get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID);
-	if (!owning_process || !mmget_not_zero(owning_mm)) {
-		ret = -EINVAL;
-		goto out_put_task;
-	}
-
-	if (access_mask & ODP_WRITE_ALLOWED_BIT)
-		flags |= FOLL_WRITE;
-
-	start_idx = (user_virt - ib_umem_start(umem)) >> page_shift;
-	k = start_idx;
-
-	while (bcnt > 0) {
-		const size_t gup_num_pages = min_t(size_t,
-				(bcnt + BIT(page_shift) - 1) >> page_shift,
-				PAGE_SIZE / sizeof(struct page *));
-
-		down_read(&owning_mm->mmap_sem);
-		/*
-		 * Note: this might result in redundent page getting. We can
-		 * avoid this by checking dma_list to be 0 before calling
-		 * get_user_pages. However, this make the code much more
-		 * complex (and doesn't gain us much performance in most use
-		 * cases).
-		 */
-		npages = get_user_pages_remote(owning_process, owning_mm,
-				user_virt, gup_num_pages,
-				flags, local_page_list, NULL, NULL);
-		up_read(&owning_mm->mmap_sem);
-
-		if (npages < 0) {
-			if (npages != -EAGAIN)
-				pr_warn("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
-			else
-				pr_debug("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
-			break;
-		}
+	if (!hmm_mirror_mm_is_alive(&per_mm->mirror))
+		return -EINVAL;
+	down_read(&mm->mmap_sem);
+	mutex_lock(&umem_odp->umem_mutex);
+	ret = hmm_range_dma_map(range, device,
+		&umem_odp->dma_list[idx], true);
+	mutex_unlock(&umem_odp->umem_mutex);
+	npages = ret;
 
-		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
-		mutex_lock(&umem_odp->umem_mutex);
-		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
-			if (user_virt & ~page_mask) {
-				p += PAGE_SIZE;
-				if (page_to_phys(local_page_list[j]) != p) {
-					ret = -EFAULT;
-					break;
-				}
-				put_page(local_page_list[j]);
-				continue;
-			}
-
-			ret = ib_umem_odp_map_dma_single_page(
-					umem_odp, k, local_page_list[j],
-					access_mask, current_seq);
-			if (ret < 0) {
-				if (ret != -EAGAIN)
-					pr_warn("ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
-				else
-					pr_debug("ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
-				break;
-			}
-
-			p = page_to_phys(local_page_list[j]);
-			k++;
-		}
-		mutex_unlock(&umem_odp->umem_mutex);
-
-		if (ret < 0) {
-			/*
-			 * Release pages, remembering that the first page
-			 * to hit an error was already released by
-			 * ib_umem_odp_map_dma_single_page().
-			 */
-			if (npages - (j + 1) > 0)
-				release_pages(&local_page_list[j+1],
-					      npages - (j + 1));
-			break;
-		}
-	}
+	/*
+	 * The mmap_sem have been drop if hmm_vma_fault_and_dma_map() returned
+	 * with -EAGAIN. In which case we need to retry as -EBUSY but we also
+	 * need to take the mmap_sem again.
+	 */
+	if (ret != -EAGAIN)
+		 up_read(&mm->mmap_sem);
 
-	if (ret >= 0) {
-		if (npages < 0 && k == start_idx)
-			ret = npages;
-		else
-			ret = k - start_idx;
+	if (ret <= 0) {
+		/* Convert -EBUSY to -EAGAIN and 0 to -EAGAIN */
+		ret = ret == -EBUSY ? -EAGAIN : ret;
+		return ret ? ret : -EAGAIN;
 	}
 
-	mmput(owning_mm);
-out_put_task:
-	if (owning_process)
-		put_task_struct(owning_process);
-	free_page((unsigned long)local_page_list);
-	return ret;
+	umem_odp->npages += npages;
+	return npages;
 }
 EXPORT_SYMBOL(ib_umem_odp_map_dma_pages);
 
-void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
-				 u64 bound)
+void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp,
+				 u64 virt, u64 bound)
 {
+	struct device *device = umem_odp->umem.context->device->dma_device;
 	struct ib_umem *umem = &umem_odp->umem;
-	int idx;
-	u64 addr;
-	struct ib_device *dev = umem->context->device;
+	unsigned long idx, page_mask;
+	struct hmm_range range;
+	long ret;
+
+	if (!umem_odp->npages)
+		return;
+
+	bound = ALIGN(bound, 1UL << umem->page_shift);
+	page_mask = ~(BIT(umem->page_shift) - 1);
+	virt &= page_mask;
 
 	virt  = max_t(u64, virt,  ib_umem_start(umem));
 	bound = min_t(u64, bound, ib_umem_end(umem));
-	/* Note that during the run of this function, the
-	 * notifiers_count of the MR is > 0, preventing any racing
-	 * faults from completion. We might be racing with other
-	 * invalidations, so we must make sure we free each page only
-	 * once. */
+
+	idx = ((unsigned long)virt - ib_umem_start(umem)) >> PAGE_SHIFT;
+
+	range.page_shift = umem->page_shift;
+	range.pfns = &umem_odp->pfns[idx];
+	range.pfn_shift = ODP_FLAGS_BITS;
+	range.values = odp_hmm_values;
+	range.flags = odp_hmm_flags;
+	range.start = virt;
+	range.end = bound;
+
 	mutex_lock(&umem_odp->umem_mutex);
-	for (addr = virt; addr < bound; addr += BIT(umem->page_shift)) {
-		idx = (addr - ib_umem_start(umem)) >> umem->page_shift;
-		if (umem_odp->page_list[idx]) {
-			struct page *page = umem_odp->page_list[idx];
-			dma_addr_t dma = umem_odp->dma_list[idx];
-			dma_addr_t dma_addr = dma & ODP_DMA_ADDR_MASK;
-
-			WARN_ON(!dma_addr);
-
-			ib_dma_unmap_page(dev, dma_addr, PAGE_SIZE,
-					  DMA_BIDIRECTIONAL);
-			if (dma & ODP_WRITE_ALLOWED_BIT) {
-				struct page *head_page = compound_head(page);
-				/*
-				 * set_page_dirty prefers being called with
-				 * the page lock. However, MMU notifiers are
-				 * called sometimes with and sometimes without
-				 * the lock. We rely on the umem_mutex instead
-				 * to prevent other mmu notifiers from
-				 * continuing and allowing the page mapping to
-				 * be removed.
-				 */
-				set_page_dirty(head_page);
-			}
-			umem_odp->page_list[idx] = NULL;
-			umem_odp->dma_list[idx] = 0;
-			umem_odp->npages--;
-		}
-	}
+	ret = hmm_range_dma_unmap(&range, NULL, device,
+		&umem_odp->dma_list[idx], true);
+	if (ret > 0)
+		umem_odp->npages -= ret;
 	mutex_unlock(&umem_odp->umem_mutex);
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 9f90be296ee0..e2481509b913 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -111,16 +111,16 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr,
 	*count = i;
 }
 
-static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
+static u64 umem_dma_to_mtt(struct ib_umem_odp *odp, size_t idx)
 {
-	u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
+	u64 mtt_entry = odp->dma_list[idx];
 
-	if (umem_dma & ODP_READ_ALLOWED_BIT)
+	if (odp->pfns[idx] & ODP_READ_BIT)
 		mtt_entry |= MLX5_IB_MTT_READ;
-	if (umem_dma & ODP_WRITE_ALLOWED_BIT)
+	if (odp->pfns[idx] & ODP_WRITE_BIT)
 		mtt_entry |= MLX5_IB_MTT_WRITE;
 
-	return mtt_entry;
+	return cpu_to_be64(mtt_entry);
 }
 
 /*
@@ -151,15 +151,13 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 	int entry;
 
 	if (umem->is_odp) {
+		struct ib_umem_odp *odp = to_ib_umem_odp(umem);
+
 		WARN_ON(shift != 0);
 		WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE));
 
-		for (i = 0; i < num_pages; ++i) {
-			dma_addr_t pa =
-				to_ib_umem_odp(umem)->dma_list[offset + i];
-
-			pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
-		}
+		for (i = 0; i < num_pages; ++i)
+			pas[i] = umem_dma_to_mtt(odp, offset + i);
 		return;
 	}
 
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 5f09699fab98..978e0cfad643 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1605,7 +1605,7 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 		/* Wait for all running page-fault handlers to finish. */
 		synchronize_srcu(&dev->mr_srcu);
 		/* Destroy all page mappings */
-		if (umem_odp->page_list)
+		if (umem_odp->pfns)
 			mlx5_ib_invalidate_range(umem_odp, ib_umem_start(umem),
 						 ib_umem_end(umem));
 		else
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 91507a2e9290..46a951d643e2 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -257,8 +257,7 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 		 * estimate the cost of another UMR vs. the cost of bigger
 		 * UMR.
 		 */
-		if (umem_odp->dma_list[idx] &
-		    (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+		if (umem_odp->pfns[idx] & ODP_READ_BIT) {
 			if (!in_block) {
 				blk_start_idx = idx;
 				in_block = 1;
@@ -580,17 +579,18 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 			u64 io_virt, size_t bcnt, u32 *bytes_mapped,
 			u32 flags)
 {
-	int npages = 0, current_seq, page_shift, ret, np;
-	bool implicit = false;
 	struct ib_umem_odp *odp_mr = to_ib_umem_odp(mr->umem);
 	bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
 	bool prefetch = flags & MLX5_PF_FLAGS_PREFETCH;
-	u64 access_mask;
+	unsigned long npages = 0, page_shift, np, off;
 	u64 start_idx, page_mask;
 	struct ib_umem_odp *odp;
-	size_t size;
+	struct hmm_range range;
+	bool implicit = false;
+	size_t size, fault_size;
+	long ret;
 
-	if (!odp_mr->page_list) {
+	if (!odp_mr->pfns) {
 		odp = implicit_mr_get_data(mr, io_virt, bcnt);
 
 		if (IS_ERR(odp))
@@ -603,11 +603,29 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 
 next_mr:
 	size = min_t(size_t, bcnt, ib_umem_end(&odp->umem) - io_virt);
-
 	page_shift = mr->umem->page_shift;
 	page_mask = ~(BIT(page_shift) - 1);
+	/*
+	 * We need to align io_virt on page size so off is the extra bytes we
+	 * will be faulting and fault_size is the page aligned size we are
+	 * faulting.
+	 */
+	io_virt = io_virt & page_mask;
+	off = (io_virt & (~page_mask));
+	fault_size = ALIGN(size + off, 1UL << page_shift);
+
+	if (io_virt < ib_umem_start(&odp->umem))
+		return -EINVAL;
+
 	start_idx = (io_virt - (mr->mmkey.iova & page_mask)) >> page_shift;
-	access_mask = ODP_READ_ALLOWED_BIT;
+
+	if (odp_mr->per_mm == NULL || odp_mr->per_mm->mm == NULL)
+		return -ENOENT;
+
+	ret = hmm_range_register(&range, odp_mr->per_mm->mm,
+				 io_virt, io_virt + fault_size, page_shift);
+	if (ret)
+		return ret;
 
 	if (prefetch && !downgrade && !mr->umem->writable) {
 		/* prefetch with write-access must
@@ -617,58 +635,55 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 		goto out;
 	}
 
+	range.default_flags = ODP_READ_BIT;
 	if (mr->umem->writable && !downgrade)
-		access_mask |= ODP_WRITE_ALLOWED_BIT;
-
-	current_seq = READ_ONCE(odp->notifiers_seq);
-	/*
-	 * Ensure the sequence number is valid for some time before we call
-	 * gup.
-	 */
-	smp_rmb();
-
-	ret = ib_umem_odp_map_dma_pages(to_ib_umem_odp(mr->umem), io_virt, size,
-					access_mask, current_seq);
+		range.default_flags |= ODP_WRITE_BIT;
 
+	ret = ib_umem_odp_map_dma_pages(to_ib_umem_odp(mr->umem), &range);
 	if (ret < 0)
-		goto out;
+		goto again;
 
 	np = ret;
 
 	mutex_lock(&odp->umem_mutex);
-	if (!ib_umem_mmu_notifier_retry(to_ib_umem_odp(mr->umem),
-					current_seq)) {
+	if (hmm_range_valid(&range)) {
 		/*
 		 * No need to check whether the MTTs really belong to
-		 * this MR, since ib_umem_odp_map_dma_pages already
+		 * this MR, since ib_umem_odp_map_dma_pages() already
 		 * checks this.
 		 */
 		ret = mlx5_ib_update_xlt(mr, start_idx, np,
 					 page_shift, MLX5_IB_UPD_XLT_ATOMIC);
-	} else {
+	} else
 		ret = -EAGAIN;
-	}
 	mutex_unlock(&odp->umem_mutex);
 
 	if (ret < 0) {
-		if (ret != -EAGAIN)
+		if (ret != -EAGAIN) {
 			mlx5_ib_err(dev, "Failed to update mkey page tables\n");
-		goto out;
+			goto out;
+		}
+		goto again;
 	}
 
 	if (bytes_mapped) {
-		u32 new_mappings = (np << page_shift) -
-			(io_virt - round_down(io_virt, 1 << page_shift));
+		long new_mappings = (np << page_shift) - off;
+		new_mappings = new_mappings < 0 ? 0 : new_mappings;
 		*bytes_mapped += min_t(u32, new_mappings, size);
 	}
 
 	npages += np << (page_shift - PAGE_SHIFT);
+	hmm_range_unregister(&range);
 	bcnt -= size;
 
-	if (unlikely(bcnt)) {
+	if (unlikely(bcnt > 0)) {
 		struct ib_umem_odp *next;
 
-		io_virt += size;
+		/*
+		 * Next virtual address is after the number of bytes we faulted
+		 * in this step.
+		 */
+		io_virt += fault_size;
 		next = odp_next(odp);
 		if (unlikely(!next || next->umem.address != io_virt)) {
 			mlx5_ib_dbg(dev, "next implicit leaf removed at 0x%llx. got %p\n",
@@ -682,24 +697,18 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 
 	return npages;
 
-out:
-	if (ret == -EAGAIN) {
-		if (implicit || !odp->dying) {
-			unsigned long timeout =
-				msecs_to_jiffies(MMU_NOTIFIER_TIMEOUT);
-
-			if (!wait_for_completion_timeout(
-					&odp->notifier_completion,
-					timeout)) {
-				mlx5_ib_warn(dev, "timeout waiting for mmu notifier. seq %d against %d. notifiers_count=%d\n",
-					     current_seq, odp->notifiers_seq, odp->notifiers_count);
-			}
-		} else {
-			/* The MR is being killed, kill the QP as well. */
-			ret = -EFAULT;
-		}
-	}
+again:
+	if (ret != -EAGAIN)
+		goto out;
+
+	/* Check if the MR is being killed, kill the QP as well. */
+	if (!implicit || odp->dying)
+		ret = -EFAULT;
+	else if (!hmm_range_wait_until_valid(&range, MMU_NOTIFIER_TIMEOUT))
+		mlx5_ib_warn(dev, "timeout waiting for mmu notifier.\n");
 
+out:
+	hmm_range_unregister(&range);
 	return ret;
 }
 
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index eeec4e53c448..70b2df8e5a6c 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -36,6 +36,7 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_verbs.h>
 #include <linux/interval_tree.h>
+#include <linux/hmm.h>
 
 struct umem_odp_node {
 	u64 __subtree_last;
@@ -47,11 +48,11 @@ struct ib_umem_odp {
 	struct ib_ucontext_per_mm *per_mm;
 
 	/*
-	 * An array of the pages included in the on-demand paging umem.
-	 * Indices of pages that are currently not mapped into the device will
-	 * contain NULL.
+	 * An array of the pages included in the on-demand paging umem. Indices
+	 * of pages that are currently not mapped into the device will contain
+	 * 0.
 	 */
-	struct page		**page_list;
+	uint64_t *pfns;
 	/*
 	 * An array of the same size as page_list, with DMA addresses mapped
 	 * for pages the pages in page_list. The lower two bits designate
@@ -67,14 +68,11 @@ struct ib_umem_odp {
 	struct mutex		umem_mutex;
 	void			*private; /* for the HW driver to use. */
 
-	int notifiers_seq;
-	int notifiers_count;
 	int npages;
 
 	/* Tree tracking */
 	struct umem_odp_node	interval_tree;
 
-	struct completion	notifier_completion;
 	int			dying;
 	struct work_struct	work;
 };
@@ -109,11 +107,10 @@ struct ib_ucontext_per_mm {
 	/* Protects umem_tree */
 	struct rw_semaphore umem_rwsem;
 
-	struct mmu_notifier mn;
+	struct hmm_mirror mirror;
 	unsigned int odp_mrs_count;
 
 	struct list_head ucontext_list;
-	struct rcu_head rcu;
 };
 
 int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access);
@@ -121,9 +118,18 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root_umem,
 				      unsigned long addr, size_t size);
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp);
 
-int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
-			      u64 bcnt, u64 access_mask,
-			      unsigned long current_seq);
+#define ODP_READ_BIT	(1<<0ULL)
+#define ODP_WRITE_BIT	(1<<1ULL)
+/*
+ * The device bit is not use by ODP but is there to full-fill HMM API which
+ * also support device with device memory (like GPU). So from ODP/RDMA POV
+ * this can be ignored.
+ */
+#define ODP_DEVICE_BIT	(1<<2ULL)
+#define ODP_FLAGS_BITS	3
+
+long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
+			       struct hmm_range *range);
 
 void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
 				 u64 bound);
@@ -146,23 +152,6 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 struct ib_umem_odp *rbt_ib_umem_lookup(struct rb_root_cached *root,
 				       u64 addr, u64 length);
 
-static inline int ib_umem_mmu_notifier_retry(struct ib_umem_odp *umem_odp,
-					     unsigned long mmu_seq)
-{
-	/*
-	 * This code is strongly based on the KVM code from
-	 * mmu_notifier_retry. Should be called with
-	 * the relevant locks taken (umem_odp->umem_mutex
-	 * and the ucontext umem_mutex semaphore locked for read).
-	 */
-
-	if (unlikely(umem_odp->notifiers_count))
-		return 1;
-	if (umem_odp->notifiers_seq != mmu_seq)
-		return 1;
-	return 0;
-}
-
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 static inline int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
Jason Gunthorpe May 22, 2019, 12:52 a.m. UTC | #3
On Tue, May 21, 2019 at 04:53:22PM -0400, Jerome Glisse wrote:
> On Mon, May 06, 2019 at 04:56:57PM -0300, Jason Gunthorpe wrote:
> > Jerome, please let me know if these dependent series are merged during
> > the first week of the merge window.
> > 
> > This patch has been tested and could go along next week if the
> > dependencies are met.
> > 
> 
> So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> (prefetch and not and different sizes). Seems to work ok.

Urk, it already doesn't apply to the rdma tree :(

The conflicts are a little more extensive than I'd prefer to handle..
Can I ask you to rebase it on top of this branch please:

https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next

Specifically it conflicts with this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34

> +long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
> +			       struct hmm_range *range)
>  {
> +	struct device *device = umem_odp->umem.context->device->dma_device;
> +	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
>  	struct ib_umem *umem = &umem_odp->umem;
> -	struct task_struct *owning_process  = NULL;
> -	struct mm_struct *owning_mm = umem_odp->umem.owning_mm;
> -	struct page       **local_page_list = NULL;
> -	u64 page_mask, off;
> -	int j, k, ret = 0, start_idx, npages = 0, page_shift;
> -	unsigned int flags = 0;
> -	phys_addr_t p = 0;
> -
> -	if (access_mask == 0)
> +	struct mm_struct *mm = per_mm->mm;
> +	unsigned long idx, npages;
> +	long ret;
> +
> +	if (mm == NULL)
> +		return -ENOENT;
> +
> +	/* Only drivers with invalidate support can use this function. */
> +	if (!umem->context->invalidate_range)
>  		return -EINVAL;
>  
> -	if (user_virt < ib_umem_start(umem) ||
> -	    user_virt + bcnt > ib_umem_end(umem))
> -		return -EFAULT;
> +	/* Sanity checks. */
> +	if (range->default_flags == 0)
> +		return -EINVAL;
>  
> -	local_page_list = (struct page **)__get_free_page(GFP_KERNEL);
> -	if (!local_page_list)
> -		return -ENOMEM;
> +	if (range->start < ib_umem_start(umem) ||
> +	    range->end > ib_umem_end(umem))
> +		return -EINVAL;
>  
> -	page_shift = umem->page_shift;
> -	page_mask = ~(BIT(page_shift) - 1);
> -	off = user_virt & (~page_mask);
> -	user_virt = user_virt & page_mask;
> -	bcnt += off; /* Charge for the first page offset as well. */
> +	idx = (range->start - ib_umem_start(umem)) >> umem->page_shift;

Is this math OK? What is supposed to happen if the range->start is not
page aligned to the internal page size?

> +	range->pfns = &umem_odp->pfns[idx];
> +	range->pfn_shift = ODP_FLAGS_BITS;
> +	range->values = odp_hmm_values;
> +	range->flags = odp_hmm_flags;
>  
>  	/*
> -	 * owning_process is allowed to be NULL, this means somehow the mm is
> -	 * existing beyond the lifetime of the originating process.. Presumably
> -	 * mmget_not_zero will fail in this case.
> +	 * If mm is dying just bail out early without trying to take mmap_sem.
> +	 * Note that this might race with mm destruction but that is fine the
> +	 * is properly refcounted so are all HMM structure.
>  	 */
> -	owning_process = get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID);
> -	if (!owning_process || !mmget_not_zero(owning_mm)) {

But we are not in a HMM context here, and per_mm is not a HMM
structure. 

So why is mm suddenly guaranteed valid? It was a bug report that
triggered the race that mmget_not_zero is fixing, so I need a better
explanation of why it is now safe. From what I see, hmm_range_fault
is doing stuff like find_vma without an active mmget??

> @@ -603,11 +603,29 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
>  
>  next_mr:
>  	size = min_t(size_t, bcnt, ib_umem_end(&odp->umem) - io_virt);
> -
>  	page_shift = mr->umem->page_shift;
>  	page_mask = ~(BIT(page_shift) - 1);
> +	/*
> +	 * We need to align io_virt on page size so off is the extra bytes we
> +	 * will be faulting and fault_size is the page aligned size we are
> +	 * faulting.
> +	 */
> +	io_virt = io_virt & page_mask;
> +	off = (io_virt & (~page_mask));
> +	fault_size = ALIGN(size + off, 1UL << page_shift);
> +
> +	if (io_virt < ib_umem_start(&odp->umem))
> +		return -EINVAL;
> +
>  	start_idx = (io_virt - (mr->mmkey.iova & page_mask)) >> page_shift;
> -	access_mask = ODP_READ_ALLOWED_BIT;
> +
> +	if (odp_mr->per_mm == NULL || odp_mr->per_mm->mm == NULL)
> +		return -ENOENT;

How can this happen? Where is the locking?

per_mm is supposed to outlive any odp_mrs that refer to it, and the mm
is supposed to remain grabbed as long as the per_mm exists.

> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index eeec4e53c448..70b2df8e5a6c 100644
> +++ b/include/rdma/ib_umem_odp.h
> @@ -36,6 +36,7 @@
>  #include <rdma/ib_umem.h>
>  #include <rdma/ib_verbs.h>
>  #include <linux/interval_tree.h>
> +#include <linux/hmm.h>
>  
>  struct umem_odp_node {
>  	u64 __subtree_last;
> @@ -47,11 +48,11 @@ struct ib_umem_odp {
>  	struct ib_ucontext_per_mm *per_mm;
>  
>  	/*
> -	 * An array of the pages included in the on-demand paging umem.
> -	 * Indices of pages that are currently not mapped into the device will
> -	 * contain NULL.
> +	 * An array of the pages included in the on-demand paging umem. Indices
> +	 * of pages that are currently not mapped into the device will contain
> +	 * 0.
>  	 */
> -	struct page		**page_list;
> +	uint64_t *pfns;

Are these actually pfns, or are they mangled with some shift? (what is range->pfn_shift?)

Jason
Jerome Glisse May 22, 2019, 5:48 p.m. UTC | #4
On Tue, May 21, 2019 at 09:52:25PM -0300, Jason Gunthorpe wrote:
> On Tue, May 21, 2019 at 04:53:22PM -0400, Jerome Glisse wrote:
> > On Mon, May 06, 2019 at 04:56:57PM -0300, Jason Gunthorpe wrote:
> > > Jerome, please let me know if these dependent series are merged during
> > > the first week of the merge window.
> > > 
> > > This patch has been tested and could go along next week if the
> > > dependencies are met.
> > > 
> > 
> > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > (prefetch and not and different sizes). Seems to work ok.
> 
> Urk, it already doesn't apply to the rdma tree :(
> 
> The conflicts are a little more extensive than I'd prefer to handle..
> Can I ask you to rebase it on top of this branch please:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> 
> Specifically it conflicts with this patch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> 
> > +long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
> > +			       struct hmm_range *range)
> >  {
> > +	struct device *device = umem_odp->umem.context->device->dma_device;
> > +	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> >  	struct ib_umem *umem = &umem_odp->umem;
> > -	struct task_struct *owning_process  = NULL;
> > -	struct mm_struct *owning_mm = umem_odp->umem.owning_mm;
> > -	struct page       **local_page_list = NULL;
> > -	u64 page_mask, off;
> > -	int j, k, ret = 0, start_idx, npages = 0, page_shift;
> > -	unsigned int flags = 0;
> > -	phys_addr_t p = 0;
> > -
> > -	if (access_mask == 0)
> > +	struct mm_struct *mm = per_mm->mm;
> > +	unsigned long idx, npages;
> > +	long ret;
> > +
> > +	if (mm == NULL)
> > +		return -ENOENT;
> > +
> > +	/* Only drivers with invalidate support can use this function. */
> > +	if (!umem->context->invalidate_range)
> >  		return -EINVAL;
> >  
> > -	if (user_virt < ib_umem_start(umem) ||
> > -	    user_virt + bcnt > ib_umem_end(umem))
> > -		return -EFAULT;
> > +	/* Sanity checks. */
> > +	if (range->default_flags == 0)
> > +		return -EINVAL;
> >  
> > -	local_page_list = (struct page **)__get_free_page(GFP_KERNEL);
> > -	if (!local_page_list)
> > -		return -ENOMEM;
> > +	if (range->start < ib_umem_start(umem) ||
> > +	    range->end > ib_umem_end(umem))
> > +		return -EINVAL;
> >  
> > -	page_shift = umem->page_shift;
> > -	page_mask = ~(BIT(page_shift) - 1);
> > -	off = user_virt & (~page_mask);
> > -	user_virt = user_virt & page_mask;
> > -	bcnt += off; /* Charge for the first page offset as well. */
> > +	idx = (range->start - ib_umem_start(umem)) >> umem->page_shift;
> 
> Is this math OK? What is supposed to happen if the range->start is not
> page aligned to the internal page size?

range->start is aligned on a 1 << page_shift boundary within pagefault_mr,
thus the above math is OK. We can add a BUG_ON() and comments if you
want.
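
(For reference, this is the relevant, condensed excerpt from pagefault_mr()
in the patch above; io_virt is aligned down before it is passed to
hmm_range_register() as range->start:)

	/* From pagefault_mr(): io_virt is aligned down to the MR page size
	 * before the range is registered, so the index math in
	 * ib_umem_odp_map_dma_pages() always starts on a page boundary. */
	page_shift = mr->umem->page_shift;
	page_mask = ~(BIT(page_shift) - 1);
	io_virt = io_virt & page_mask;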

> 
> > +	range->pfns = &umem_odp->pfns[idx];
> > +	range->pfn_shift = ODP_FLAGS_BITS;
> > +	range->values = odp_hmm_values;
> > +	range->flags = odp_hmm_flags;
> >  
> >  	/*
> > -	 * owning_process is allowed to be NULL, this means somehow the mm is
> > -	 * existing beyond the lifetime of the originating process.. Presumably
> > -	 * mmget_not_zero will fail in this case.
> > +	 * If mm is dying just bail out early without trying to take mmap_sem.
> > +	 * Note that this might race with mm destruction but that is fine the
> > +	 * is properly refcounted so are all HMM structure.
> >  	 */
> > -	owning_process = get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID);
> > -	if (!owning_process || !mmget_not_zero(owning_mm)) {
> 
> But we are not in a HMM context here, and per_mm is not a HMM
> structure. 
> 
> So why is mm suddenly guarenteed valid? It was a bug report that
> triggered the race the mmget_not_zero is fixing, so I need a better
> explanation why it is now safe. From what I see the hmm_range_fault
> is doing stuff like find_vma without an active mmget??

So the mm struct can not go away as long as we hold a reference on
the hmm struct, and we hold a reference on it through both the
hmm_mirror and hmm_range structs. Thus the mm struct can not go away
and it is safe to try to take its mmap_sem.

Now if we race with a destruction (ie all processes that referenced
the mm struct are dead/dying) the hmm struct will be marked as dead.
This is tested by hmm_mirror_mm_is_alive() and also by
hmm_range_snapshot() (which is called by hmm_range_dma_map()). In
other words, if mm->mm_users is zero then hmm will be marked as dead.

Hence it is safe to take mmap_sem and it is safe to call into hmm; if
the mm has been killed it will return -EFAULT and this will propagate
to RDMA.

As for per_mm, I removed the per_mm->mm = NULL from release so that
it is always safe to use that field even in face of racing mm
"killing".
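
Condensed, the pattern described above looks roughly like this in
ib_umem_odp_map_dma_pages() (a sketch only, error handling trimmed):

	if (!hmm_mirror_mm_is_alive(&per_mm->mirror))
		return -EINVAL;

	down_read(&mm->mmap_sem);
	mutex_lock(&umem_odp->umem_mutex);
	ret = hmm_range_dma_map(range, device, &umem_odp->dma_list[idx], true);
	mutex_unlock(&umem_odp->umem_mutex);
	/* hmm_range_dma_map() drops mmap_sem itself when it returns -EAGAIN */
	if (ret != -EAGAIN)
		up_read(&mm->mmap_sem);
	/* a dying mm is caught inside HMM and comes back as -EFAULT */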



> 
> > @@ -603,11 +603,29 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
> >  
> >  next_mr:
> >  	size = min_t(size_t, bcnt, ib_umem_end(&odp->umem) - io_virt);
> > -
> >  	page_shift = mr->umem->page_shift;
> >  	page_mask = ~(BIT(page_shift) - 1);
> > +	/*
> > +	 * We need to align io_virt on page size so off is the extra bytes we
> > +	 * will be faulting and fault_size is the page aligned size we are
> > +	 * faulting.
> > +	 */
> > +	io_virt = io_virt & page_mask;
> > +	off = (io_virt & (~page_mask));
> > +	fault_size = ALIGN(size + off, 1UL << page_shift);
> > +
> > +	if (io_virt < ib_umem_start(&odp->umem))
> > +		return -EINVAL;
> > +
> >  	start_idx = (io_virt - (mr->mmkey.iova & page_mask)) >> page_shift;
> > -	access_mask = ODP_READ_ALLOWED_BIT;
> > +
> > +	if (odp_mr->per_mm == NULL || odp_mr->per_mm->mm == NULL)
> > +		return -ENOENT;
> 
> How can this happen? Where is the locking?
> 
> per_mm is supposed to outlive any odp_mr's the refer to it, and the mm
> is supposed to remain grab'd as long as the per_mm exists..

This can not happen; I removed the test.

> 
> > diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> > index eeec4e53c448..70b2df8e5a6c 100644
> > +++ b/include/rdma/ib_umem_odp.h
> > @@ -36,6 +36,7 @@
> >  #include <rdma/ib_umem.h>
> >  #include <rdma/ib_verbs.h>
> >  #include <linux/interval_tree.h>
> > +#include <linux/hmm.h>
> >  
> >  struct umem_odp_node {
> >  	u64 __subtree_last;
> > @@ -47,11 +48,11 @@ struct ib_umem_odp {
> >  	struct ib_ucontext_per_mm *per_mm;
> >  
> >  	/*
> > -	 * An array of the pages included in the on-demand paging umem.
> > -	 * Indices of pages that are currently not mapped into the device will
> > -	 * contain NULL.
> > +	 * An array of the pages included in the on-demand paging umem. Indices
> > +	 * of pages that are currently not mapped into the device will contain
> > +	 * 0.
> >  	 */
> > -	struct page		**page_list;
> > +	uint64_t *pfns;
> 
> Are these actually pfns, or are they mangled with some shift? (what is range->pfn_shift?)

They are not plain pfns, they have flags (hence range->pfn_shift) in
the bottom bits. I just do not have a better name for this.
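
For reference, a minimal sketch of the layout (the helpers here are
hypothetical, illustration only): each 64bit entry packs the pfn above
ODP_FLAGS_BITS and the ODP_*_BIT flags below it.

static inline unsigned long odp_entry_to_pfn(uint64_t entry)
{
	/* pfn lives above the flag bits (range->pfn_shift == ODP_FLAGS_BITS) */
	return entry >> ODP_FLAGS_BITS;
}

static inline bool odp_entry_is_writable(uint64_t entry)
{
	/* flags occupy the low ODP_FLAGS_BITS bits */
	return entry & ODP_WRITE_BIT;
}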

Cheers,
Jérôme
From 0b429b2ffbec348e283693cb97d7ffce760d89da Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Sat, 8 Dec 2018 15:47:55 -0500
Subject: [PATCH] RDMA/odp: convert to use HMM for ODP v5
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Convert ODP to use HMM so that we can build on common infrastructure
for different classes of devices that want to mirror a process address
space into a device. There are no functional changes.

Changes since v4:
    - Rebase on top of rdma-next
Changes since v3:
    - Rebase on top of 5.2-rc1
Changes since v2:
    - Update to match changes to HMM API
Changes since v1:
    - improved comments
    - simplified page alignment computation

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Moni Shoua <monis@mellanox.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Kaike Wan <kaike.wan@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/core/umem_odp.c | 506 +++++++++--------------------
 drivers/infiniband/hw/mlx5/mem.c   |  20 +-
 drivers/infiniband/hw/mlx5/mr.c    |   2 +-
 drivers/infiniband/hw/mlx5/odp.c   | 104 +++---
 include/rdma/ib_umem_odp.h         |  47 +--
 5 files changed, 233 insertions(+), 446 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index c3b3c523401f..7aca59ac8551 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -46,6 +46,20 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 
+
+static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
+	ODP_READ_BIT,	/* HMM_PFN_VALID */
+	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
+	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
+};
+
+static uint64_t odp_hmm_values[HMM_PFN_VALUE_MAX] = {
+	-1UL,	/* HMM_PFN_ERROR */
+	0UL,	/* HMM_PFN_NONE */
+	-2UL,	/* HMM_PFN_SPECIAL */
+};
+
+
 /*
  * The ib_umem list keeps track of memory regions for which the HW
  * device request to receive notification when the related memory
@@ -78,55 +92,24 @@ static u64 node_last(struct umem_odp_node *n)
 INTERVAL_TREE_DEFINE(struct umem_odp_node, rb, u64, __subtree_last,
 		     node_start, node_last, static, rbt_ib_umem)
 
-static void ib_umem_notifier_start_account(struct ib_umem_odp *umem_odp)
-{
-	mutex_lock(&umem_odp->umem_mutex);
-	if (umem_odp->notifiers_count++ == 0)
-		/*
-		 * Initialize the completion object for waiting on
-		 * notifiers. Since notifier_count is zero, no one should be
-		 * waiting right now.
-		 */
-		reinit_completion(&umem_odp->notifier_completion);
-	mutex_unlock(&umem_odp->umem_mutex);
-}
-
-static void ib_umem_notifier_end_account(struct ib_umem_odp *umem_odp)
-{
-	mutex_lock(&umem_odp->umem_mutex);
-	/*
-	 * This sequence increase will notify the QP page fault that the page
-	 * that is going to be mapped in the spte could have been freed.
-	 */
-	++umem_odp->notifiers_seq;
-	if (--umem_odp->notifiers_count == 0)
-		complete_all(&umem_odp->notifier_completion);
-	mutex_unlock(&umem_odp->umem_mutex);
-}
-
 static int ib_umem_notifier_release_trampoline(struct ib_umem_odp *umem_odp,
 					       u64 start, u64 end, void *cookie)
 {
-	/*
-	 * Increase the number of notifiers running, to
-	 * prevent any further fault handling on this MR.
-	 */
-	ib_umem_notifier_start_account(umem_odp);
 	umem_odp->dying = 1;
 	/* Make sure that the fact the umem is dying is out before we release
 	 * all pending page faults. */
 	smp_wmb();
-	complete_all(&umem_odp->notifier_completion);
-	umem_odp->umem.context->invalidate_range(
-		umem_odp, ib_umem_start(umem_odp), ib_umem_end(umem_odp));
+	umem_odp->umem.context->invalidate_range(umem_odp,
+						 ib_umem_start(umem_odp),
+						 ib_umem_end(umem_odp));
 	return 0;
 }
 
-static void ib_umem_notifier_release(struct mmu_notifier *mn,
-				     struct mm_struct *mm)
+static void ib_umem_notifier_release(struct hmm_mirror *mirror)
 {
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
+	struct ib_ucontext_per_mm *per_mm;
+
+	per_mm = container_of(mirror, struct ib_ucontext_per_mm, mirror);
 
 	down_read(&per_mm->umem_rwsem);
 	if (per_mm->active)
@@ -136,21 +119,22 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
 	up_read(&per_mm->umem_rwsem);
 }
 
-static int invalidate_range_start_trampoline(struct ib_umem_odp *item,
-					     u64 start, u64 end, void *cookie)
+static int invalidate_range_trampoline(struct ib_umem_odp *item,
+				       u64 start, u64 end, void *cookie)
 {
-	ib_umem_notifier_start_account(item);
 	item->umem.context->invalidate_range(item, start, end);
 	return 0;
 }
 
-static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
-				const struct mmu_notifier_range *range)
+static int ib_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
+			    const struct hmm_update *range)
 {
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
+	struct ib_ucontext_per_mm *per_mm;
+	int ret;
+
+	per_mm = container_of(mirror, struct ib_ucontext_per_mm, mirror);
 
-	if (mmu_notifier_range_blockable(range))
+	if (range->blockable)
 		down_read(&per_mm->umem_rwsem);
 	else if (!down_read_trylock(&per_mm->umem_rwsem))
 		return -EAGAIN;
@@ -165,39 +149,17 @@ static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		return 0;
 	}
 
-	return rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
+	ret = rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
 					     range->end,
-					     invalidate_range_start_trampoline,
-					     mmu_notifier_range_blockable(range),
-					     NULL);
-}
-
-static int invalidate_range_end_trampoline(struct ib_umem_odp *item, u64 start,
-					   u64 end, void *cookie)
-{
-	ib_umem_notifier_end_account(item);
-	return 0;
-}
-
-static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
-				const struct mmu_notifier_range *range)
-{
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
-
-	if (unlikely(!per_mm->active))
-		return;
-
-	rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
-				      range->end,
-				      invalidate_range_end_trampoline, true, NULL);
+					     invalidate_range_trampoline,
+					     range->blockable, NULL);
 	up_read(&per_mm->umem_rwsem);
+	return ret;
 }
 
-static const struct mmu_notifier_ops ib_umem_notifiers = {
+static const struct hmm_mirror_ops ib_umem_notifiers = {
 	.release                    = ib_umem_notifier_release,
-	.invalidate_range_start     = ib_umem_notifier_invalidate_range_start,
-	.invalidate_range_end       = ib_umem_notifier_invalidate_range_end,
+	.sync_cpu_device_pagetables = ib_sync_cpu_device_pagetables,
 };
 
 static void add_umem_to_per_mm(struct ib_umem_odp *umem_odp)
@@ -219,7 +181,6 @@ static void remove_umem_from_per_mm(struct ib_umem_odp *umem_odp)
 	if (likely(ib_umem_start(umem_odp) != ib_umem_end(umem_odp)))
 		rbt_ib_umem_remove(&umem_odp->interval_tree,
 				   &per_mm->umem_tree);
-	complete_all(&umem_odp->notifier_completion);
 
 	up_write(&per_mm->umem_rwsem);
 }
@@ -246,11 +207,13 @@ static struct ib_ucontext_per_mm *alloc_per_mm(struct ib_ucontext *ctx,
 
 	WARN_ON(mm != current->mm);
 
-	per_mm->mn.ops = &ib_umem_notifiers;
-	ret = mmu_notifier_register(&per_mm->mn, per_mm->mm);
+	per_mm->mirror.ops = &ib_umem_notifiers;
+	down_write(&mm->mmap_sem);
+	ret = hmm_mirror_register(&per_mm->mirror, per_mm->mm);
+	up_write(&mm->mmap_sem);
 	if (ret) {
 		dev_err(&ctx->device->dev,
-			"Failed to register mmu_notifier %d\n", ret);
+			"Failed to register HMM mirror %d\n", ret);
 		goto out_pid;
 	}
 
@@ -292,11 +255,6 @@ static int get_per_mm(struct ib_umem_odp *umem_odp)
 	return 0;
 }
 
-static void free_per_mm(struct rcu_head *rcu)
-{
-	kfree(container_of(rcu, struct ib_ucontext_per_mm, rcu));
-}
-
 static void put_per_mm(struct ib_umem_odp *umem_odp)
 {
 	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
@@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
 	up_write(&per_mm->umem_rwsem);
 
 	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
-	mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
+	hmm_mirror_unregister(&per_mm->mirror);
 	put_pid(per_mm->tgid);
-	mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
+
+	kfree(per_mm);
 }
 
 struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
@@ -355,11 +314,9 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
 	mmgrab(umem->owning_mm);
 
 	mutex_init(&odp_data->umem_mutex);
-	init_completion(&odp_data->notifier_completion);
 
-	odp_data->page_list =
-		vzalloc(array_size(pages, sizeof(*odp_data->page_list)));
-	if (!odp_data->page_list) {
+	odp_data->pfns = vzalloc(array_size(pages, sizeof(*odp_data->pfns)));
+	if (!odp_data->pfns) {
 		ret = -ENOMEM;
 		goto out_odp_data;
 	}
@@ -368,7 +325,7 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
 		vzalloc(array_size(pages, sizeof(*odp_data->dma_list)));
 	if (!odp_data->dma_list) {
 		ret = -ENOMEM;
-		goto out_page_list;
+		goto out_pfns;
 	}
 
 	/*
@@ -382,8 +339,8 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
 
 	return odp_data;
 
-out_page_list:
-	vfree(odp_data->page_list);
+out_pfns:
+	vfree(odp_data->pfns);
 out_odp_data:
 	mmdrop(umem->owning_mm);
 	kfree(odp_data);
@@ -419,13 +376,11 @@ int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
 
 	mutex_init(&umem_odp->umem_mutex);
 
-	init_completion(&umem_odp->notifier_completion);
-
 	if (ib_umem_odp_num_pages(umem_odp)) {
-		umem_odp->page_list =
-			vzalloc(array_size(sizeof(*umem_odp->page_list),
+		umem_odp->pfns =
+			vzalloc(array_size(sizeof(*umem_odp->pfns),
 					   ib_umem_odp_num_pages(umem_odp)));
-		if (!umem_odp->page_list)
+		if (!umem_odp->pfns)
 			return -ENOMEM;
 
 		umem_odp->dma_list =
@@ -433,7 +388,7 @@ int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
 					   ib_umem_odp_num_pages(umem_odp)));
 		if (!umem_odp->dma_list) {
 			ret_val = -ENOMEM;
-			goto out_page_list;
+			goto out_pfns;
 		}
 	}
 
@@ -446,8 +401,8 @@ int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
 
 out_dma_list:
 	vfree(umem_odp->dma_list);
-out_page_list:
-	vfree(umem_odp->page_list);
+out_pfns:
+	vfree(umem_odp->pfns);
 	return ret_val;
 }
 
@@ -465,287 +420,126 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 	remove_umem_from_per_mm(umem_odp);
 	put_per_mm(umem_odp);
 	vfree(umem_odp->dma_list);
-	vfree(umem_odp->page_list);
-}
-
-/*
- * Map for DMA and insert a single page into the on-demand paging page tables.
- *
- * @umem: the umem to insert the page to.
- * @page_index: index in the umem to add the page to.
- * @page: the page struct to map and add.
- * @access_mask: access permissions needed for this page.
- * @current_seq: sequence number for synchronization with invalidations.
- *               the sequence number is taken from
- *               umem_odp->notifiers_seq.
- *
- * The function returns -EFAULT if the DMA mapping operation fails. It returns
- * -EAGAIN if a concurrent invalidation prevents us from updating the page.
- *
- * The page is released via put_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
- */
-static int ib_umem_odp_map_dma_single_page(
-		struct ib_umem_odp *umem_odp,
-		int page_index,
-		struct page *page,
-		u64 access_mask,
-		unsigned long current_seq)
-{
-	struct ib_ucontext *context = umem_odp->umem.context;
-	struct ib_device *dev = context->device;
-	dma_addr_t dma_addr;
-	int remove_existing_mapping = 0;
-	int ret = 0;
-
-	/*
-	 * Note: we avoid writing if seq is different from the initial seq, to
-	 * handle case of a racing notifier. This check also allows us to bail
-	 * early if we have a notifier running in parallel with us.
-	 */
-	if (ib_umem_mmu_notifier_retry(umem_odp, current_seq)) {
-		ret = -EAGAIN;
-		goto out;
-	}
-	if (!(umem_odp->dma_list[page_index])) {
-		dma_addr =
-			ib_dma_map_page(dev, page, 0, BIT(umem_odp->page_shift),
-					DMA_BIDIRECTIONAL);
-		if (ib_dma_mapping_error(dev, dma_addr)) {
-			ret = -EFAULT;
-			goto out;
-		}
-		umem_odp->dma_list[page_index] = dma_addr | access_mask;
-		umem_odp->page_list[page_index] = page;
-		umem_odp->npages++;
-	} else if (umem_odp->page_list[page_index] == page) {
-		umem_odp->dma_list[page_index] |= access_mask;
-	} else {
-		pr_err("error: got different pages in IB device and from get_user_pages. IB device page: %p, gup page: %p\n",
-		       umem_odp->page_list[page_index], page);
-		/* Better remove the mapping now, to prevent any further
-		 * damage. */
-		remove_existing_mapping = 1;
-	}
-
-out:
-	put_page(page);
-
-	if (remove_existing_mapping) {
-		ib_umem_notifier_start_account(umem_odp);
-		context->invalidate_range(
-			umem_odp,
-			ib_umem_start(umem_odp) +
-				(page_index << umem_odp->page_shift),
-			ib_umem_start(umem_odp) +
-				((page_index + 1) << umem_odp->page_shift));
-		ib_umem_notifier_end_account(umem_odp);
-		ret = -EAGAIN;
-	}
-
-	return ret;
+	vfree(umem_odp->pfns);
 }
 
 /**
  * ib_umem_odp_map_dma_pages - Pin and DMA map userspace memory in an ODP MR.
- *
- * Pins the range of pages passed in the argument, and maps them to
- * DMA addresses. The DMA addresses of the mapped pages is updated in
- * umem_odp->dma_list.
- *
- * Returns the number of pages mapped in success, negative error code
- * for failure.
- * An -EAGAIN error code is returned when a concurrent mmu notifier prevents
- * the function from completing its task.
- * An -ENOENT error code indicates that userspace process is being terminated
- * and mm was already destroyed.
  * @umem_odp: the umem to map and pin
- * @user_virt: the address from which we need to map.
- * @bcnt: the minimal number of bytes to pin and map. The mapping might be
- *        bigger due to alignment, and may also be smaller in case of an error
- *        pinning or mapping a page. The actual pages mapped is returned in
- *        the return value.
- * @access_mask: bit mask of the requested access permissions for the given
- *               range.
- * @current_seq: the MMU notifiers sequance value for synchronization with
- *               invalidations. the sequance number is read from
- *               umem_odp->notifiers_seq before calling this function
+ * @range: range of virtual addresses to be mapped to the device
+ * Returns: -EINVAL on invalid arguments, -EAGAIN to retry, -ENOENT if the
+ *          process is being terminated, otherwise the number of pages mapped.
+ *
+ * Map to the device the range of virtual addresses passed in the argument.
+ * The DMA addresses are stored in umem_odp->dma_list and the corresponding
+ * page information in umem_odp->pfns.
  */
-int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
-			      u64 bcnt, u64 access_mask,
-			      unsigned long current_seq)
+long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
+			       struct hmm_range *range)
 {
-	struct task_struct *owning_process  = NULL;
-	struct mm_struct *owning_mm = umem_odp->umem.owning_mm;
-	struct page       **local_page_list = NULL;
-	u64 page_mask, off;
-	int j, k, ret = 0, start_idx, npages = 0;
-	unsigned int flags = 0, page_shift;
-	phys_addr_t p = 0;
-
-	if (access_mask == 0)
+	struct device *device = umem_odp->umem.context->device->dma_device;
+	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
+	struct mm_struct *mm = per_mm->mm;
+	unsigned long idx, npages;
+	long ret;
+
+	if (mm == NULL)
+		return -ENOENT;
+
+	/* Only drivers with invalidate support can use this function. */
+	if (!umem_odp->umem.context->invalidate_range)
 		return -EINVAL;
 
-	if (user_virt < ib_umem_start(umem_odp) ||
-	    user_virt + bcnt > ib_umem_end(umem_odp))
-		return -EFAULT;
+	/* Sanity checks. */
+	if (range->default_flags == 0)
+		return -EINVAL;
 
-	local_page_list = (struct page **)__get_free_page(GFP_KERNEL);
-	if (!local_page_list)
-		return -ENOMEM;
+	if (range->start < ib_umem_start(umem_odp) ||
+	    range->end > ib_umem_end(umem_odp))
+		return -EINVAL;
 
-	page_shift = umem_odp->page_shift;
-	page_mask = ~(BIT(page_shift) - 1);
-	off = user_virt & (~page_mask);
-	user_virt = user_virt & page_mask;
-	bcnt += off; /* Charge for the first page offset as well. */
+	idx = (range->start - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
+	range->pfns = &umem_odp->pfns[idx];
+	range->pfn_shift = ODP_FLAGS_BITS;
+	range->values = odp_hmm_values;
+	range->flags = odp_hmm_flags;
 
 	/*
-	 * owning_process is allowed to be NULL, this means somehow the mm is
-	 * existing beyond the lifetime of the originating process.. Presumably
-	 * mmget_not_zero will fail in this case.
+	 * If mm is dying just bail out early without trying to take mmap_sem.
+	 * Note that this might race with mm destruction but that is fine as
+	 * the mm is properly refcounted and so are all the HMM structures.
+	 *
+	 * This means that per_mm->mm is always a valid pointer as the hmm
+	 * struct holds a reference on the mm struct, and the hmm_mirror and
+	 * hmm_range structs hold a reference on the hmm struct. Thus at this
+	 * point the mm struct can not be freed from under us and it is safe
+	 * to dereference it and try to take the mmap_sem.
+	 *
+	 * If all processes owning the mm struct are dead or dying then HMM
+	 * will catch that and it is safe to call into HMM, which will return
+	 * -EFAULT.
 	 */
-	owning_process = get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID);
-	if (!owning_process || !mmget_not_zero(owning_mm)) {
-		ret = -EINVAL;
-		goto out_put_task;
-	}
-
-	if (access_mask & ODP_WRITE_ALLOWED_BIT)
-		flags |= FOLL_WRITE;
-
-	start_idx = (user_virt - ib_umem_start(umem_odp)) >> page_shift;
-	k = start_idx;
-
-	while (bcnt > 0) {
-		const size_t gup_num_pages = min_t(size_t,
-				(bcnt + BIT(page_shift) - 1) >> page_shift,
-				PAGE_SIZE / sizeof(struct page *));
-
-		down_read(&owning_mm->mmap_sem);
-		/*
-		 * Note: this might result in redundent page getting. We can
-		 * avoid this by checking dma_list to be 0 before calling
-		 * get_user_pages. However, this make the code much more
-		 * complex (and doesn't gain us much performance in most use
-		 * cases).
-		 */
-		npages = get_user_pages_remote(owning_process, owning_mm,
-				user_virt, gup_num_pages,
-				flags, local_page_list, NULL, NULL);
-		up_read(&owning_mm->mmap_sem);
-
-		if (npages < 0) {
-			if (npages != -EAGAIN)
-				pr_warn("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
-			else
-				pr_debug("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
-			break;
-		}
+	if (!hmm_mirror_mm_is_alive(&per_mm->mirror))
+		return -EINVAL;
+	down_read(&mm->mmap_sem);
+	mutex_lock(&umem_odp->umem_mutex);
+	ret = hmm_range_dma_map(range, device,
+		&umem_odp->dma_list[idx], true);
+	mutex_unlock(&umem_odp->umem_mutex);
+	npages = ret;
 
-		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
-		mutex_lock(&umem_odp->umem_mutex);
-		for (j = 0; j < npages; j++, user_virt += PAGE_SIZE) {
-			if (user_virt & ~page_mask) {
-				p += PAGE_SIZE;
-				if (page_to_phys(local_page_list[j]) != p) {
-					ret = -EFAULT;
-					break;
-				}
-				put_page(local_page_list[j]);
-				continue;
-			}
-
-			ret = ib_umem_odp_map_dma_single_page(
-					umem_odp, k, local_page_list[j],
-					access_mask, current_seq);
-			if (ret < 0) {
-				if (ret != -EAGAIN)
-					pr_warn("ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
-				else
-					pr_debug("ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
-				break;
-			}
-
-			p = page_to_phys(local_page_list[j]);
-			k++;
-		}
-		mutex_unlock(&umem_odp->umem_mutex);
-
-		if (ret < 0) {
-			/*
-			 * Release pages, remembering that the first page
-			 * to hit an error was already released by
-			 * ib_umem_odp_map_dma_single_page().
-			 */
-			if (npages - (j + 1) > 0)
-				release_pages(&local_page_list[j+1],
-					      npages - (j + 1));
-			break;
-		}
-	}
+	/*
+	 * The mmap_sem has already been dropped if hmm_range_dma_map()
+	 * returned with -EAGAIN. In that case the caller needs to retry
+	 * (just like for -EBUSY) and will take the mmap_sem again.
+	 */
+	if (ret != -EAGAIN)
+		 up_read(&mm->mmap_sem);
 
-	if (ret >= 0) {
-		if (npages < 0 && k == start_idx)
-			ret = npages;
-		else
-			ret = k - start_idx;
+	if (ret <= 0) {
+		/* Convert -EBUSY to -EAGAIN and 0 to -EAGAIN */
+		ret = ret == -EBUSY ? -EAGAIN : ret;
+		return ret ? ret : -EAGAIN;
 	}
 
-	mmput(owning_mm);
-out_put_task:
-	if (owning_process)
-		put_task_struct(owning_process);
-	free_page((unsigned long)local_page_list);
-	return ret;
+	umem_odp->npages += npages;
+	return npages;
 }
 EXPORT_SYMBOL(ib_umem_odp_map_dma_pages);
 
-void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
-				 u64 bound)
+void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp,
+				 u64 virt, u64 bound)
 {
-	int idx;
-	u64 addr;
-	struct ib_device *dev = umem_odp->umem.context->device;
+	struct device *device = umem_odp->umem.context->device->dma_device;
+	unsigned long idx, page_mask;
+	struct hmm_range range;
+	long ret;
+
+	if (!umem_odp->npages)
+		return;
 
-	virt = max_t(u64, virt, ib_umem_start(umem_odp));
+	bound = ALIGN(bound, 1UL << umem_odp->page_shift);
+	page_mask = ~(BIT(umem_odp->page_shift) - 1);
+	virt &= page_mask;
+
+	virt  = max_t(u64, virt,  ib_umem_start(umem_odp));
 	bound = min_t(u64, bound, ib_umem_end(umem_odp));
-	/* Note that during the run of this function, the
-	 * notifiers_count of the MR is > 0, preventing any racing
-	 * faults from completion. We might be racing with other
-	 * invalidations, so we must make sure we free each page only
-	 * once. */
+
+	idx = ((unsigned long)virt - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
+
+	range.page_shift = umem_odp->page_shift;
+	range.pfns = &umem_odp->pfns[idx];
+	range.pfn_shift = ODP_FLAGS_BITS;
+	range.values = odp_hmm_values;
+	range.flags = odp_hmm_flags;
+	range.start = virt;
+	range.end = bound;
+
 	mutex_lock(&umem_odp->umem_mutex);
-	for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
-		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
-		if (umem_odp->page_list[idx]) {
-			struct page *page = umem_odp->page_list[idx];
-			dma_addr_t dma = umem_odp->dma_list[idx];
-			dma_addr_t dma_addr = dma & ODP_DMA_ADDR_MASK;
-
-			WARN_ON(!dma_addr);
-
-			ib_dma_unmap_page(dev, dma_addr, PAGE_SIZE,
-					  DMA_BIDIRECTIONAL);
-			if (dma & ODP_WRITE_ALLOWED_BIT) {
-				struct page *head_page = compound_head(page);
-				/*
-				 * set_page_dirty prefers being called with
-				 * the page lock. However, MMU notifiers are
-				 * called sometimes with and sometimes without
-				 * the lock. We rely on the umem_mutex instead
-				 * to prevent other mmu notifiers from
-				 * continuing and allowing the page mapping to
-				 * be removed.
-				 */
-				set_page_dirty(head_page);
-			}
-			umem_odp->page_list[idx] = NULL;
-			umem_odp->dma_list[idx] = 0;
-			umem_odp->npages--;
-		}
-	}
+	ret = hmm_range_dma_unmap(&range, NULL, device,
+		&umem_odp->dma_list[idx], true);
+	if (ret > 0)
+		umem_odp->npages -= ret;
 	mutex_unlock(&umem_odp->umem_mutex);
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index fe1a76d8531c..f2d92f0eb3f4 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -112,16 +112,16 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr,
 	*count = i;
 }
 
-static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
+static u64 umem_dma_to_mtt(struct ib_umem_odp *odp, size_t idx)
 {
-	u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
+	u64 mtt_entry = odp->dma_list[idx];
 
-	if (umem_dma & ODP_READ_ALLOWED_BIT)
+	if (odp->pfns[idx] & ODP_READ_BIT)
 		mtt_entry |= MLX5_IB_MTT_READ;
-	if (umem_dma & ODP_WRITE_ALLOWED_BIT)
+	if (odp->pfns[idx] & ODP_WRITE_BIT)
 		mtt_entry |= MLX5_IB_MTT_WRITE;
 
-	return mtt_entry;
+	return cpu_to_be64(mtt_entry);
 }
 
 /*
@@ -151,15 +151,13 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 	int entry;
 
 	if (umem->is_odp) {
+		struct ib_umem_odp *odp = to_ib_umem_odp(umem);
+
 		WARN_ON(shift != 0);
 		WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE));
 
-		for (i = 0; i < num_pages; ++i) {
-			dma_addr_t pa =
-				to_ib_umem_odp(umem)->dma_list[offset + i];
-
-			pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
-		}
+		for (i = 0; i < num_pages; ++i)
+			pas[i] = umem_dma_to_mtt(odp, offset + i);
 		return;
 	}
 
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 4d033796dcfc..a8e7ac685ce1 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1605,7 +1605,7 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 		/* Wait for all running page-fault handlers to finish. */
 		synchronize_srcu(&dev->mr_srcu);
 		/* Destroy all page mappings */
-		if (umem_odp->page_list)
+		if (umem_odp->pfns)
 			mlx5_ib_invalidate_range(umem_odp,
 						 ib_umem_start(umem_odp),
 						 ib_umem_end(umem_odp));
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index d0c6f9cc97ef..45438620c0fa 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -255,8 +255,7 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 		 * estimate the cost of another UMR vs. the cost of bigger
 		 * UMR.
 		 */
-		if (umem_odp->dma_list[idx] &
-		    (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+		if (umem_odp->pfns[idx] & ODP_READ_BIT) {
 			if (!in_block) {
 				blk_start_idx = idx;
 				in_block = 1;
@@ -577,17 +576,18 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 			u64 io_virt, size_t bcnt, u32 *bytes_mapped,
 			u32 flags)
 {
-	int npages = 0, current_seq, page_shift, ret, np;
-	bool implicit = false;
 	struct ib_umem_odp *odp_mr = to_ib_umem_odp(mr->umem);
 	bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
 	bool prefetch = flags & MLX5_PF_FLAGS_PREFETCH;
-	u64 access_mask;
+	unsigned long npages = 0, page_shift, np, off;
 	u64 start_idx, page_mask;
 	struct ib_umem_odp *odp;
-	size_t size;
+	struct hmm_range range;
+	bool implicit = false;
+	size_t size, fault_size;
+	long ret;
 
-	if (!odp_mr->page_list) {
+	if (!odp_mr->pfns) {
 		odp = implicit_mr_get_data(mr, io_virt, bcnt);
 
 		if (IS_ERR(odp))
@@ -600,11 +600,26 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 
 next_mr:
 	size = min_t(size_t, bcnt, ib_umem_end(odp) - io_virt);
-
 	page_shift = odp->page_shift;
 	page_mask = ~(BIT(page_shift) - 1);
+	/*
+	 * We need to align io_virt on page size so off is the extra bytes we
+	 * will be faulting and fault_size is the page aligned size we are
+	 * faulting.
+	 */
+	io_virt = io_virt & page_mask;
+	off = (io_virt & (~page_mask));
+	fault_size = ALIGN(size + off, 1UL << page_shift);
+
+	if (io_virt < ib_umem_start(odp))
+		return -EINVAL;
+
 	start_idx = (io_virt - (mr->mmkey.iova & page_mask)) >> page_shift;
-	access_mask = ODP_READ_ALLOWED_BIT;
+
+	ret = hmm_range_register(&range, odp_mr->per_mm->mm,
+				 io_virt, io_virt + fault_size, page_shift);
+	if (ret)
+		return ret;
 
 	if (prefetch && !downgrade && !mr->umem->writable) {
 		/* prefetch with write-access must
@@ -614,58 +629,55 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 		goto out;
 	}
 
+	range.default_flags = ODP_READ_BIT;
 	if (mr->umem->writable && !downgrade)
-		access_mask |= ODP_WRITE_ALLOWED_BIT;
-
-	current_seq = READ_ONCE(odp->notifiers_seq);
-	/*
-	 * Ensure the sequence number is valid for some time before we call
-	 * gup.
-	 */
-	smp_rmb();
-
-	ret = ib_umem_odp_map_dma_pages(to_ib_umem_odp(mr->umem), io_virt, size,
-					access_mask, current_seq);
+		range.default_flags |= ODP_WRITE_BIT;
 
+	ret = ib_umem_odp_map_dma_pages(to_ib_umem_odp(mr->umem), &range);
 	if (ret < 0)
-		goto out;
+		goto again;
 
 	np = ret;
 
 	mutex_lock(&odp->umem_mutex);
-	if (!ib_umem_mmu_notifier_retry(to_ib_umem_odp(mr->umem),
-					current_seq)) {
+	if (hmm_range_valid(&range)) {
 		/*
 		 * No need to check whether the MTTs really belong to
-		 * this MR, since ib_umem_odp_map_dma_pages already
+		 * this MR, since ib_umem_odp_map_dma_pages() already
 		 * checks this.
 		 */
 		ret = mlx5_ib_update_xlt(mr, start_idx, np,
 					 page_shift, MLX5_IB_UPD_XLT_ATOMIC);
-	} else {
+	} else
 		ret = -EAGAIN;
-	}
 	mutex_unlock(&odp->umem_mutex);
 
 	if (ret < 0) {
-		if (ret != -EAGAIN)
+		if (ret != -EAGAIN) {
 			mlx5_ib_err(dev, "Failed to update mkey page tables\n");
-		goto out;
+			goto out;
+		}
+		goto again;
 	}
 
 	if (bytes_mapped) {
-		u32 new_mappings = (np << page_shift) -
-			(io_virt - round_down(io_virt, 1 << page_shift));
+		long new_mappings = (np << page_shift) - off;
+		new_mappings = new_mappings < 0 ? 0 : new_mappings;
 		*bytes_mapped += min_t(u32, new_mappings, size);
 	}
 
 	npages += np << (page_shift - PAGE_SHIFT);
+	hmm_range_unregister(&range);
 	bcnt -= size;
 
-	if (unlikely(bcnt)) {
+	if (unlikely(bcnt > 0)) {
 		struct ib_umem_odp *next;
 
-		io_virt += size;
+		/*
+		 * Next virtual address is after the number of bytes we faulted
+		 * in this step.
+		 */
+		io_virt += fault_size;
 		next = odp_next(odp);
 		if (unlikely(!next || next->umem.address != io_virt)) {
 			mlx5_ib_dbg(dev, "next implicit leaf removed at 0x%llx. got %p\n",
@@ -679,24 +691,18 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 
 	return npages;
 
-out:
-	if (ret == -EAGAIN) {
-		if (implicit || !odp->dying) {
-			unsigned long timeout =
-				msecs_to_jiffies(MMU_NOTIFIER_TIMEOUT);
-
-			if (!wait_for_completion_timeout(
-					&odp->notifier_completion,
-					timeout)) {
-				mlx5_ib_warn(dev, "timeout waiting for mmu notifier. seq %d against %d. notifiers_count=%d\n",
-					     current_seq, odp->notifiers_seq, odp->notifiers_count);
-			}
-		} else {
-			/* The MR is being killed, kill the QP as well. */
-			ret = -EFAULT;
-		}
-	}
+again:
+	if (ret != -EAGAIN)
+		goto out;
+
+	/* Check if the MR is being killed, kill the QP as well. */
+	if (!implicit || odp->dying)
+		ret = -EFAULT;
+	else if (!hmm_range_wait_until_valid(&range, MMU_NOTIFIER_TIMEOUT))
+		mlx5_ib_warn(dev, "timeout waiting for mmu notifier.\n");
 
+out:
+	hmm_range_unregister(&range);
 	return ret;
 }
 
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 479db5c98ff6..e1476e9ebb79 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -36,6 +36,7 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_verbs.h>
 #include <linux/interval_tree.h>
+#include <linux/hmm.h>
 
 struct umem_odp_node {
 	u64 __subtree_last;
@@ -47,11 +48,11 @@ struct ib_umem_odp {
 	struct ib_ucontext_per_mm *per_mm;
 
 	/*
-	 * An array of the pages included in the on-demand paging umem.
-	 * Indices of pages that are currently not mapped into the device will
-	 * contain NULL.
+	 * An array of the pages included in the on-demand paging umem. Indices
+	 * of pages that are currently not mapped into the device will contain
+	 * 0.
 	 */
-	struct page		**page_list;
+	uint64_t *pfns;
 	/*
 	 * An array of the same size as page_list, with DMA addresses mapped
 	 * for pages the pages in page_list. The lower two bits designate
@@ -67,14 +68,11 @@ struct ib_umem_odp {
 	struct mutex		umem_mutex;
 	void			*private; /* for the HW driver to use. */
 
-	int notifiers_seq;
-	int notifiers_count;
 	int npages;
 
 	/* Tree tracking */
 	struct umem_odp_node	interval_tree;
 
-	struct completion	notifier_completion;
 	int			dying;
 	unsigned int		page_shift;
 	struct work_struct	work;
@@ -129,11 +127,10 @@ struct ib_ucontext_per_mm {
 	/* Protects umem_tree */
 	struct rw_semaphore umem_rwsem;
 
-	struct mmu_notifier mn;
+	struct hmm_mirror mirror;
 	unsigned int odp_mrs_count;
 
 	struct list_head ucontext_list;
-	struct rcu_head rcu;
 };
 
 int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access);
@@ -141,9 +138,18 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root_umem,
 				      unsigned long addr, size_t size);
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp);
 
-int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
-			      u64 bcnt, u64 access_mask,
-			      unsigned long current_seq);
+#define ODP_READ_BIT	(1<<0ULL)
+#define ODP_WRITE_BIT	(1<<1ULL)
+/*
+ * The device bit is not used by ODP but is there to fulfill the HMM API,
+ * which also supports devices with device memory (like a GPU). So from the
+ * ODP/RDMA POV this can be ignored.
+ */
+#define ODP_DEVICE_BIT	(1<<2ULL)
+#define ODP_FLAGS_BITS	3
+
+long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
+			       struct hmm_range *range);
 
 void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
 				 u64 bound);
@@ -166,23 +172,6 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 struct ib_umem_odp *rbt_ib_umem_lookup(struct rb_root_cached *root,
 				       u64 addr, u64 length);
 
-static inline int ib_umem_mmu_notifier_retry(struct ib_umem_odp *umem_odp,
-					     unsigned long mmu_seq)
-{
-	/*
-	 * This code is strongly based on the KVM code from
-	 * mmu_notifier_retry. Should be called with
-	 * the relevant locks taken (umem_odp->umem_mutex
-	 * and the ucontext umem_mutex semaphore locked for read).
-	 */
-
-	if (unlikely(umem_odp->notifiers_count))
-		return 1;
-	if (umem_odp->notifiers_seq != mmu_seq)
-		return 1;
-	return 0;
-}
-
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 static inline int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
Jason Gunthorpe May 22, 2019, 6:32 p.m. UTC | #5
On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:

> From 0b429b2ffbec348e283693cb97d7ffce760d89da Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
> Date: Sat, 8 Dec 2018 15:47:55 -0500
> Subject: [PATCH] RDMA/odp: convert to use HMM for ODP v5
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> Convert ODP to use HMM so that we can build on common infrastructure
> for different class of devices that want to mirror a process address
> space into a device. There is no functional changes.
> 
> Changes since v4:
>     - Rebase on top of rdma-next
> Changes since v3:
>     - Rebase on top of 5.2-rc1
> Changes since v2:
>     - Update to match changes to HMM API
> Changes since v1:
>     - improved comments
>     - simplified page alignment computation
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Leon Romanovsky <leonro@mellanox.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Artemy Kovalyov <artemyko@mellanox.com>
> Cc: Moni Shoua <monis@mellanox.com>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Kaike Wan <kaike.wan@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
>  drivers/infiniband/core/umem_odp.c | 506 +++++++++--------------------
>  drivers/infiniband/hw/mlx5/mem.c   |  20 +-
>  drivers/infiniband/hw/mlx5/mr.c    |   2 +-
>  drivers/infiniband/hw/mlx5/odp.c   | 104 +++---
>  include/rdma/ib_umem_odp.h         |  47 +--
>  5 files changed, 233 insertions(+), 446 deletions(-)

The kconfig stuff is missing, and it doesn't compile in various cases
I tried.

The kconfig stuff for hmm is also really obnoxious, you can't just
enable HMM_MIRROR, you have to track down all the little required
elements to get it to turn on..

Once I did get it to compile, I also get warnings:

mm/hmm.c: In function ‘hmm_vma_walk_pud’:
mm/hmm.c:782:28: warning: unused variable ‘pfn’ [-Wunused-variable]
   unsigned long i, npages, pfn;
                            ^~~
mm/hmm.c: In function ‘hmm_range_snapshot’:
mm/hmm.c:1027:19: warning: unused variable ‘h’ [-Wunused-variable]
    struct hstate *h = hstate_vma(vma);

Because this kernel doesn't have CONFIG_HUGETLB_PAGE

Please fold this into your patch if it has to be resent.. I think it
fixes the compilation problems.

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index cbfbea49f126cd..e3eefd0917985a 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -63,7 +63,7 @@ config INFINIBAND_USER_MEM
 config INFINIBAND_ON_DEMAND_PAGING
 	bool "InfiniBand on-demand paging support"
 	depends on INFINIBAND_USER_MEM
-	select MMU_NOTIFIER
+	depends on HMM_MIRROR
 	default y
 	---help---
 	  On demand paging support for the InfiniBand subsystem.
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index e1476e9ebb7906..f760103c07349a 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -115,6 +115,16 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
 
 #define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
 
+#define ODP_READ_BIT	(1<<0ULL)
+#define ODP_WRITE_BIT	(1<<1ULL)
+/*
+ * The device bit is not used by ODP but is there to fulfill the HMM API,
+ * which also supports devices with device memory (like a GPU). So from the
+ * ODP/RDMA POV this can be ignored.
+ */
+#define ODP_DEVICE_BIT	(1<<2ULL)
+#define ODP_FLAGS_BITS	3
+
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 
 struct ib_ucontext_per_mm {
@@ -138,16 +148,6 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root_umem,
 				      unsigned long addr, size_t size);
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp);
 
-#define ODP_READ_BIT	(1<<0ULL)
-#define ODP_WRITE_BIT	(1<<1ULL)
-/*
- * The device bit is not used by ODP but is there to fulfill the HMM API,
- * which also supports devices with device memory (like a GPU). So from the
- * ODP/RDMA POV this can be ignored.
- */
-#define ODP_DEVICE_BIT	(1<<2ULL)
-#define ODP_FLAGS_BITS	3
-
 long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
 			       struct hmm_range *range);
Jason Gunthorpe May 22, 2019, 7:22 p.m. UTC | #6
On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:

> > > +long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
> > > +			       struct hmm_range *range)
> > >  {
> > > +	struct device *device = umem_odp->umem.context->device->dma_device;
> > > +	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > >  	struct ib_umem *umem = &umem_odp->umem;
> > > -	struct task_struct *owning_process  = NULL;
> > > -	struct mm_struct *owning_mm = umem_odp->umem.owning_mm;
> > > -	struct page       **local_page_list = NULL;
> > > -	u64 page_mask, off;
> > > -	int j, k, ret = 0, start_idx, npages = 0, page_shift;
> > > -	unsigned int flags = 0;
> > > -	phys_addr_t p = 0;
> > > -
> > > -	if (access_mask == 0)
> > > +	struct mm_struct *mm = per_mm->mm;
> > > +	unsigned long idx, npages;
> > > +	long ret;
> > > +
> > > +	if (mm == NULL)
> > > +		return -ENOENT;
> > > +
> > > +	/* Only drivers with invalidate support can use this function. */
> > > +	if (!umem->context->invalidate_range)
> > >  		return -EINVAL;
> > >  
> > > -	if (user_virt < ib_umem_start(umem) ||
> > > -	    user_virt + bcnt > ib_umem_end(umem))
> > > -		return -EFAULT;
> > > +	/* Sanity checks. */
> > > +	if (range->default_flags == 0)
> > > +		return -EINVAL;
> > >  
> > > -	local_page_list = (struct page **)__get_free_page(GFP_KERNEL);
> > > -	if (!local_page_list)
> > > -		return -ENOMEM;
> > > +	if (range->start < ib_umem_start(umem) ||
> > > +	    range->end > ib_umem_end(umem))
> > > +		return -EINVAL;
> > >  
> > > -	page_shift = umem->page_shift;
> > > -	page_mask = ~(BIT(page_shift) - 1);
> > > -	off = user_virt & (~page_mask);
> > > -	user_virt = user_virt & page_mask;
> > > -	bcnt += off; /* Charge for the first page offset as well. */
> > > +	idx = (range->start - ib_umem_start(umem)) >> umem->page_shift;
> > 
> > Is this math OK? What is supposed to happen if the range->start is not
> > page aligned to the internal page size?
> 
> range->start is aligned on a 1 << page_shift boundary within
> pagefault_mr, thus the above math is ok. We can add a BUG_ON() and
> comments if you want.

OK

> > > +	range->pfns = &umem_odp->pfns[idx];
> > > +	range->pfn_shift = ODP_FLAGS_BITS;
> > > +	range->values = odp_hmm_values;
> > > +	range->flags = odp_hmm_flags;
> > >  
> > >  	/*
> > > -	 * owning_process is allowed to be NULL, this means somehow the mm is
> > > -	 * existing beyond the lifetime of the originating process.. Presumably
> > > -	 * mmget_not_zero will fail in this case.
> > > +	 * If mm is dying just bail out early without trying to take mmap_sem.
> > > +	 * Note that this might race with mm destruction but that is fine the
> > > +	 * is properly refcounted so are all HMM structure.
> > >  	 */
> > > -	owning_process = get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID);
> > > -	if (!owning_process || !mmget_not_zero(owning_mm)) {
> > 
> > But we are not in a HMM context here, and per_mm is not a HMM
> > structure. 
> > 
> > So why is mm suddenly guarenteed valid? It was a bug report that
> > triggered the race the mmget_not_zero is fixing, so I need a better
> > explanation why it is now safe. From what I see the hmm_range_fault
> > is doing stuff like find_vma without an active mmget??
> 
> So the mm struct can not go away as long as we hold a reference on
> the hmm struct, and we hold a reference on it through both the
> hmm_mirror and hmm_range structs. Thus the mm struct can not go away
> and it is safe to try to take its mmap_sem.

This was always true here, though: so long as the umem_odp exists,
the mm has a grab on it. But a grab is not a get..

The point here was the old code needed an mmget() in order to do
get_user_pages_remote()

If hmm does not need an external mmget() then fine, we delete this
stuff and rely on hmm.

But I don't think that is true as we have:

          CPU 0                                           CPU1
                                                       mmput()
                       				        __mmput()
							 exit_mmap()
down_read(&mm->mmap_sem);
hmm_range_dma_map(range, device,..
  ret = hmm_range_fault(range, block);
     if (hmm->mm == NULL || hmm->dead)
							   mmu_notifier_release()
							     hmm->dead = true
     vma = find_vma(hmm->mm, start);
        .. rb traversal ..                                 while (vma) remove_vma()

*goes boom*

I think this is violating the basic constraint of the mm by acting on
an mm's VMAs without holding a mmget() to prevent concurrent
destruction.

In other words, mmput() destruction does not respect the mmap_sem - so
holding the mmap_sem alone is not enough locking.

The unlocked hmm->dead simply can't save this. Frankly, every time I
look at a struct with 'dead' in it, I find races like this.

Thus we should put the mmget_not_zero() back in.
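
Something along these lines around the hmm call (a sketch of the
suggestion only, untested):

	if (!mmget_not_zero(mm))
		return -EINVAL;

	down_read(&mm->mmap_sem);
	ret = hmm_range_dma_map(range, device, &umem_odp->dma_list[idx], true);
	/* hmm drops mmap_sem itself on -EAGAIN */
	if (ret != -EAGAIN)
		up_read(&mm->mmap_sem);
	mmput(mm);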

I saw some other funky looking stuff in hmm as well..

> Hence it is safe to take mmap_sem and it is safe to call into hmm; if
> the mm has been killed it will return -EFAULT and this will propagate
> to RDMA.
 
> As for per_mm, I removed the per_mm->mm = NULL from release so that
> it is always safe to use that field even in face of racing mm
> "killing".

Yes, that certainly wasn't good.

> > > -	 * An array of the pages included in the on-demand paging umem.
> > > -	 * Indices of pages that are currently not mapped into the device will
> > > -	 * contain NULL.
> > > +	 * An array of the pages included in the on-demand paging umem. Indices
> > > +	 * of pages that are currently not mapped into the device will contain
> > > +	 * 0.
> > >  	 */
> > > -	struct page		**page_list;
> > > +	uint64_t *pfns;
> > 
> > Are these actually pfns, or are they mangled with some shift? (what is range->pfn_shift?)
> 
> They are not plain pfns, they have flags (hence range->pfn_shift) in
> the bottom bits. I just do not have a better name for this.

I think you need to have a better name then

Jason
Jason Gunthorpe May 22, 2019, 8:12 p.m. UTC | #7
On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:

>  static void put_per_mm(struct ib_umem_odp *umem_odp)
>  {
>  	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
>  	up_write(&per_mm->umem_rwsem);
>  
>  	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> -	mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> +	hmm_mirror_unregister(&per_mm->mirror);
>  	put_pid(per_mm->tgid);
> -	mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> +
> +	kfree(per_mm);

Notice that mmu_notifier only uses SRCU to fence in-progress ops
callbacks, so I think hmm internally has the bug that this ODP
approach prevents.

hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
struct, use container_of in the mmu_notifier callbacks, and use the
otherwise vestigial kref_get_unless_zero() to bail:

From 0cb536dc0150ba964a1d655151d7b7a84d0f915a Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Wed, 22 May 2019 16:52:52 -0300
Subject: [PATCH] hmm: Fix use after free with struct hmm in the mmu notifiers

mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
system will continue to reference hmm->mn until the srcu grace period
expires.

         CPU0                                     CPU1
                                               __mmu_notifier_invalidate_range_start()
                                                 srcu_read_lock
                                                 hlist_for_each ()
                                                   // mn == hmm->mn
hmm_mirror_unregister()
  hmm_put()
    hmm_free()
      mmu_notifier_unregister_no_release()
         hlist_del_init_rcu(hmm-mn->list)
			                           mn->ops->invalidate_range_start(mn, range);
					             mm_get_hmm()
      mm->hmm = NULL;
      kfree(hmm)
                                                     mutex_lock(&hmm->lock);

Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
existing. Get the now-safe hmm struct through container_of and directly
check kref_get_unless_zero to lock it against free.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h |  1 +
 mm/hmm.c            | 25 +++++++++++++++++++------
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 51ec27a8466816..8b91c90d3b88cb 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -102,6 +102,7 @@ struct hmm {
 	struct mmu_notifier	mmu_notifier;
 	struct rw_semaphore	mirrors_sem;
 	wait_queue_head_t	wq;
+	struct rcu_head		rcu;
 	long			notifiers;
 	bool			dead;
 };
diff --git a/mm/hmm.c b/mm/hmm.c
index 816c2356f2449f..824e7e160d8167 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	return NULL;
 }
 
+static void hmm_fee_rcu(struct rcu_head *rcu)
+{
+	kfree(container_of(rcu, struct hmm, rcu));
+}
+
 static void hmm_free(struct kref *kref)
 {
 	struct hmm *hmm = container_of(kref, struct hmm, kref);
@@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
 		mm->hmm = NULL;
 	spin_unlock(&mm->page_table_lock);
 
-	kfree(hmm);
+	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
 }
 
 static inline void hmm_put(struct hmm *hmm)
@@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
 
 static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	struct hmm *hmm = mm_get_hmm(mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_range *range;
 
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return;
+
 	/* Report this HMM as dying. */
 	hmm->dead = true;
 
@@ -194,13 +203,15 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_update update;
 	struct hmm_range *range;
 	int ret = 0;
 
-	VM_BUG_ON(!hmm);
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return 0;
 
 	update.start = nrange->start;
 	update.end = nrange->end;
@@ -248,9 +259,11 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 
-	VM_BUG_ON(!hmm);
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return;
 
 	mutex_lock(&hmm->lock);
 	hmm->notifiers--;
Ralph Campbell May 22, 2019, 9:12 p.m. UTC | #8
On 5/22/19 1:12 PM, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> 
>>   static void put_per_mm(struct ib_umem_odp *umem_odp)
>>   {
>>   	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
>> @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
>>   	up_write(&per_mm->umem_rwsem);
>>   
>>   	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
>> -	mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
>> +	hmm_mirror_unregister(&per_mm->mirror);
>>   	put_pid(per_mm->tgid);
>> -	mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
>> +
>> +	kfree(per_mm);
> 
> Notice that mmu_notifier only uses SRCU to fence in-progress ops
> callbacks, so I think hmm internally has the bug that this ODP
> approach prevents.
> 
> hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
> struct, use container_of in the mmu_notifier callbacks, and use the
> otherwise vestigal kref_get_unless_zero() to bail:

You might also want to look at my patch where
I try to fix some of these same issues (5/5).

https://marc.info/?l=linux-mm&m=155718572908765&w=2


>  From 0cb536dc0150ba964a1d655151d7b7a84d0f915a Mon Sep 17 00:00:00 2001
> From: Jason Gunthorpe <jgg@mellanox.com>
> Date: Wed, 22 May 2019 16:52:52 -0300
> Subject: [PATCH] hmm: Fix use after free with struct hmm in the mmu notifiers
> 
> mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> system will continue to reference hmm->mn until the srcu grace period
> expires.
> 
>           CPU0                                     CPU1
>                                                 __mmu_notifier_invalidate_range_start()
>                                                   srcu_read_lock
>                                                   hlist_for_each ()
>                                                     // mn == hmm->mn
> hmm_mirror_unregister()
>    hmm_put()
>      hmm_free()
>        mmu_notifier_unregister_no_release()
>           hlist_del_init_rcu(hmm-mn->list)
> 			                           mn->ops->invalidate_range_start(mn, range);
> 					             mm_get_hmm()
>        mm->hmm = NULL;
>        kfree(hmm)
>                                                       mutex_lock(&hmm->lock);
> 
> Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> existing. Get the now-safe hmm struct through container_of and directly
> check kref_get_unless_zero to lock it against free.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>   include/linux/hmm.h |  1 +
>   mm/hmm.c            | 25 +++++++++++++++++++------
>   2 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 51ec27a8466816..8b91c90d3b88cb 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -102,6 +102,7 @@ struct hmm {
>   	struct mmu_notifier	mmu_notifier;
>   	struct rw_semaphore	mirrors_sem;
>   	wait_queue_head_t	wq;
> +	struct rcu_head		rcu;
>   	long			notifiers;
>   	bool			dead;
>   };
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 816c2356f2449f..824e7e160d8167 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   	return NULL;
>   }
>   
> +static void hmm_fee_rcu(struct rcu_head *rcu)
> +{
> +	kfree(container_of(rcu, struct hmm, rcu));
> +}
> +
>   static void hmm_free(struct kref *kref)
>   {
>   	struct hmm *hmm = container_of(kref, struct hmm, kref);
> @@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
>   		mm->hmm = NULL;
>   	spin_unlock(&mm->page_table_lock);
>   
> -	kfree(hmm);
> +	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
>   }
>   
>   static inline void hmm_put(struct hmm *hmm)
> @@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
>   
>   static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   {
> -	struct hmm *hmm = mm_get_hmm(mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   	struct hmm_mirror *mirror;
>   	struct hmm_range *range;
>   
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
> +
>   	/* Report this HMM as dying. */
>   	hmm->dead = true;
>   
> @@ -194,13 +203,15 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   			const struct mmu_notifier_range *nrange)
>   {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   	struct hmm_mirror *mirror;
>   	struct hmm_update update;
>   	struct hmm_range *range;
>   	int ret = 0;
>   
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return 0;
>   
>   	update.start = nrange->start;
>   	update.end = nrange->end;
> @@ -248,9 +259,11 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>   			const struct mmu_notifier_range *nrange)
>   {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
>   
>   	mutex_lock(&hmm->lock);
>   	hmm->notifiers--;
>
Jerome Glisse May 22, 2019, 9:49 p.m. UTC | #9
On Wed, May 22, 2019 at 04:22:19PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> 
> > > > +long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
> > > > +			       struct hmm_range *range)
> > > >  {
> > > > +	struct device *device = umem_odp->umem.context->device->dma_device;
> > > > +	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > > >  	struct ib_umem *umem = &umem_odp->umem;
> > > > -	struct task_struct *owning_process  = NULL;
> > > > -	struct mm_struct *owning_mm = umem_odp->umem.owning_mm;
> > > > -	struct page       **local_page_list = NULL;
> > > > -	u64 page_mask, off;
> > > > -	int j, k, ret = 0, start_idx, npages = 0, page_shift;
> > > > -	unsigned int flags = 0;
> > > > -	phys_addr_t p = 0;
> > > > -
> > > > -	if (access_mask == 0)
> > > > +	struct mm_struct *mm = per_mm->mm;
> > > > +	unsigned long idx, npages;
> > > > +	long ret;
> > > > +
> > > > +	if (mm == NULL)
> > > > +		return -ENOENT;
> > > > +
> > > > +	/* Only drivers with invalidate support can use this function. */
> > > > +	if (!umem->context->invalidate_range)
> > > >  		return -EINVAL;
> > > >  
> > > > -	if (user_virt < ib_umem_start(umem) ||
> > > > -	    user_virt + bcnt > ib_umem_end(umem))
> > > > -		return -EFAULT;
> > > > +	/* Sanity checks. */
> > > > +	if (range->default_flags == 0)
> > > > +		return -EINVAL;
> > > >  
> > > > -	local_page_list = (struct page **)__get_free_page(GFP_KERNEL);
> > > > -	if (!local_page_list)
> > > > -		return -ENOMEM;
> > > > +	if (range->start < ib_umem_start(umem) ||
> > > > +	    range->end > ib_umem_end(umem))
> > > > +		return -EINVAL;
> > > >  
> > > > -	page_shift = umem->page_shift;
> > > > -	page_mask = ~(BIT(page_shift) - 1);
> > > > -	off = user_virt & (~page_mask);
> > > > -	user_virt = user_virt & page_mask;
> > > > -	bcnt += off; /* Charge for the first page offset as well. */
> > > > +	idx = (range->start - ib_umem_start(umem)) >> umem->page_shift;
> > > 
> > > Is this math OK? What is supposed to happen if the range->start is not
> > > page aligned to the internal page size?
> > 
> > range->start is aligned on a 1 << page_shift boundary within pagefault_mr,
> > thus the above math is OK. We can add a BUG_ON() and comments if you
> > want.
> 
> OK
> 
> > > > +	range->pfns = &umem_odp->pfns[idx];
> > > > +	range->pfn_shift = ODP_FLAGS_BITS;
> > > > +	range->values = odp_hmm_values;
> > > > +	range->flags = odp_hmm_flags;
> > > >  
> > > >  	/*
> > > > -	 * owning_process is allowed to be NULL, this means somehow the mm is
> > > > -	 * existing beyond the lifetime of the originating process.. Presumably
> > > > -	 * mmget_not_zero will fail in this case.
> > > > +	 * If mm is dying just bail out early without trying to take mmap_sem.
> > > > +	 * Note that this might race with mm destruction but that is fine as
> > > > +	 * the mm is properly refcounted and so are all HMM structures.
> > > >  	 */
> > > > -	owning_process = get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID);
> > > > -	if (!owning_process || !mmget_not_zero(owning_mm)) {
> > > 
> > > But we are not in a HMM context here, and per_mm is not a HMM
> > > structure. 
> > > 
> > > So why is mm suddenly guarenteed valid? It was a bug report that
> > > triggered the race the mmget_not_zero is fixing, so I need a better
> > > explanation why it is now safe. From what I see the hmm_range_fault
> > > is doing stuff like find_vma without an active mmget??
> > 
> > So the mm struct can not go away as long as we hold a reference on
> > the hmm struct and we hold a reference on it through both hmm_mirror
> > and hmm_range struct. So struct mm can not go away and thus it is
> > safe to try to take its mmap_sem.
> 
> This was always true here, though, so long as the umem_odp exists the
> the mm has a grab on it. But a grab is not a get..
> 
> The point here was the old code needed an mmget() in order to do
> get_user_pages_remote()
> 
> If hmm does not need an external mmget() then fine, we delete this
> stuff and rely on hmm.
> 
> But I don't think that is true as we have:
> 
>           CPU 0                                           CPU1
>                                                        mmput()
>                        				        __mmput()
> 							 exit_mmap()
> down_read(&mm->mmap_sem);
> hmm_range_dma_map(range, device,..
>   ret = hmm_range_fault(range, block);
>      if (hmm->mm == NULL || hmm->dead)
> 							   mmu_notifier_release()
> 							     hmm->dead = true
>      vma = find_vma(hmm->mm, start);
>         .. rb traversal ..                                 while (vma) remove_vma()
> 
> *goes boom*
> 
> I think this is violating the basic constraint of the mm by acting on
> a mm's VMA's without holding a mmget() to prevent concurrent
> destruction.
> 
> In other words, mmput() destruction does not respect the mmap_sem - so
> holding the mmap sem alone is not enough locking.
> 
> The unlocked hmm->dead simply can't save this. Frankly every time I
> look at a struct with 'dead' in it, I find races like this.
> 
> Thus we should put the mmget_notzero back in.

So for some reason I thought exit_mmap() was setting mm_rb to an
empty node and flushing the vmacache so that find_vma() would fail.
That might have been in some patch that never went upstream.

Note that right before find_vma() there is also the range->valid
check, which will also intercept the mm release.

Anyway, the easy fix is to take a reference on the mm user in
range_register.
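
Roughly something like this, as a sketch only; it assumes the 5.2-era
hmm_range_register()/hmm_range_unregister() entry points and abridges
the existing bodies to the relevant lines:

  int hmm_range_register(struct hmm_range *range, struct mm_struct *mm,
                         unsigned long start, unsigned long end,
                         unsigned page_shift)
  {
          /* ... existing validation and range->hmm setup ... */

          /*
           * Pin the mm user count so exit_mmap() cannot tear down the
           * VMAs while a range is registered; hmm_range_fault() can
           * then do find_vma() under mmap_sem without racing mmput().
           */
          if (!mmget_not_zero(mm))
                  return -EFAULT;

          /* ... existing list insertion and notifier accounting ... */
          return 0;
  }

  void hmm_range_unregister(struct hmm_range *range)
  {
          /* ... existing teardown ... */
          mmput(range->hmm->mm);
  }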

> 
> I saw some other funky looking stuff in hmm as well..
> 
> > Hence it is safe to take mmap_sem and it is safe to call in hmm, if
> > mm have been kill it will return EFAULT and this will propagate to
> > RDMA.
>  
> > As per_mm i removed the per_mm->mm = NULL from release so that it is
> > always safe to use that field even in face of racing mm "killing".
> 
> Yes, that certainly wasn't good.
> 
> > > > -	 * An array of the pages included in the on-demand paging umem.
> > > > -	 * Indices of pages that are currently not mapped into the device will
> > > > -	 * contain NULL.
> > > > +	 * An array of the pages included in the on-demand paging umem. Indices
> > > > +	 * of pages that are currently not mapped into the device will contain
> > > > +	 * 0.
> > > >  	 */
> > > > -	struct page		**page_list;
> > > > +	uint64_t *pfns;
> > > 
> > > Are these actually pfns, or are they mangled with some shift? (what is range->pfn_shift?)
> > 
> > They are not pfns they have flags (hence range->pfn_shift) at the
> > bottoms i just do not have a better name for this.
> 
> I think you need to have a better name then

Suggestion? I have no idea for a better name; it has the pfn value
in it.
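
For what it is worth, with that layout (pfn in the high bits, flag
bits in the low range->pfn_shift bits) decoding boils down to
something like this; the helper names are illustrative, not the HMM
API:

  static inline unsigned long odp_entry_to_pfn(uint64_t entry,
                                               uint8_t pfn_shift)
  {
          return entry >> pfn_shift;
  }

  static inline uint64_t odp_entry_flags(uint64_t entry, uint8_t pfn_shift)
  {
          /* Everything below pfn_shift is flag bits. */
          return entry & (((uint64_t)1 << pfn_shift) - 1);
  }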

Cheers,
Jérôme
Jerome Glisse May 22, 2019, 10:04 p.m. UTC | #10
On Wed, May 22, 2019 at 05:12:47PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> 
> >  static void put_per_mm(struct ib_umem_odp *umem_odp)
> >  {
> >  	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
> >  	up_write(&per_mm->umem_rwsem);
> >  
> >  	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> > -	mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> > +	hmm_mirror_unregister(&per_mm->mirror);
> >  	put_pid(per_mm->tgid);
> > -	mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> > +
> > +	kfree(per_mm);
> 
> Notice that mmu_notifier only uses SRCU to fence in-progress ops
> callbacks, so I think hmm internally has the bug that this ODP
> approach prevents.
> 
> hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
> struct, use container_of in the mmu_notifier callbacks, and use the
> otherwise vestigal kref_get_unless_zero() to bail:
> 
> From 0cb536dc0150ba964a1d655151d7b7a84d0f915a Mon Sep 17 00:00:00 2001
> From: Jason Gunthorpe <jgg@mellanox.com>
> Date: Wed, 22 May 2019 16:52:52 -0300
> Subject: [PATCH] hmm: Fix use after free with struct hmm in the mmu notifiers
> 
> mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> system will continue to reference hmm->mn until the srcu grace period
> expires.
> 
>          CPU0                                     CPU1
>                                                __mmu_notifier_invalidate_range_start()
>                                                  srcu_read_lock
>                                                  hlist_for_each ()
>                                                    // mn == hmm->mn
> hmm_mirror_unregister()
>   hmm_put()
>     hmm_free()
>       mmu_notifier_unregister_no_release()
>          hlist_del_init_rcu(hmm-mn->list)
> 			                           mn->ops->invalidate_range_start(mn, range);
> 					             mm_get_hmm()
>       mm->hmm = NULL;
>       kfree(hmm)
>                                                      mutex_lock(&hmm->lock);
> 
> Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> existing. Get the now-safe hmm struct through container_of and directly
> check kref_get_unless_zero to lock it against free.

It is already handled, badly, with BUG_ON(); I just need to convert
those to return and to use mmu_notifier_call_srcu() to free the hmm
struct.

The race is avoided because mm->hmm will either be NULL or point to
another hmm struct before an existing hmm is freed. Also
range_start/range_end could use kref_get_unless_zero(), but right now
it is a BUG_ON if that turns out to be NULL; it should just return
on NULL.

> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  include/linux/hmm.h |  1 +
>  mm/hmm.c            | 25 +++++++++++++++++++------
>  2 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 51ec27a8466816..8b91c90d3b88cb 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -102,6 +102,7 @@ struct hmm {
>  	struct mmu_notifier	mmu_notifier;
>  	struct rw_semaphore	mirrors_sem;
>  	wait_queue_head_t	wq;
> +	struct rcu_head		rcu;
>  	long			notifiers;
>  	bool			dead;
>  };
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 816c2356f2449f..824e7e160d8167 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  	return NULL;
>  }
>  
> +static void hmm_free_rcu(struct rcu_head *rcu)
> +{
> +	kfree(container_of(rcu, struct hmm, rcu));
> +}
> +
>  static void hmm_free(struct kref *kref)
>  {
>  	struct hmm *hmm = container_of(kref, struct hmm, kref);
> @@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
>  		mm->hmm = NULL;
>  	spin_unlock(&mm->page_table_lock);
>  
> -	kfree(hmm);
> +	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>  }
>  
>  static inline void hmm_put(struct hmm *hmm)
> @@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
>  
>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
> -	struct hmm *hmm = mm_get_hmm(mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  	struct hmm_mirror *mirror;
>  	struct hmm_range *range;
>  
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
> +
>  	/* Report this HMM as dying. */
>  	hmm->dead = true;
>  
> @@ -194,13 +203,15 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  			const struct mmu_notifier_range *nrange)
>  {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  	struct hmm_mirror *mirror;
>  	struct hmm_update update;
>  	struct hmm_range *range;
>  	int ret = 0;
>  
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return 0;
>  
>  	update.start = nrange->start;
>  	update.end = nrange->end;
> @@ -248,9 +259,11 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>  			const struct mmu_notifier_range *nrange)
>  {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
>  
>  	mutex_lock(&hmm->lock);
>  	hmm->notifiers--;
> -- 
> 2.21.0
>
Jerome Glisse May 22, 2019, 10:06 p.m. UTC | #11
On Wed, May 22, 2019 at 02:12:31PM -0700, Ralph Campbell wrote:
> 
> On 5/22/19 1:12 PM, Jason Gunthorpe wrote:
> > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > 
> > >   static void put_per_mm(struct ib_umem_odp *umem_odp)
> > >   {
> > >   	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > > @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
> > >   	up_write(&per_mm->umem_rwsem);
> > >   	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> > > -	mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> > > +	hmm_mirror_unregister(&per_mm->mirror);
> > >   	put_pid(per_mm->tgid);
> > > -	mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> > > +
> > > +	kfree(per_mm);
> > 
> > Notice that mmu_notifier only uses SRCU to fence in-progress ops
> > callbacks, so I think hmm internally has the bug that this ODP
> > approach prevents.
> > 
> > hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
> > struct, use container_of in the mmu_notifier callbacks, and use the
> > otherwise vestigal kref_get_unless_zero() to bail:
> 
> You might also want to look at my patch where
> I try to fix some of these same issues (5/5).
> 
> https://marc.info/?l=linux-mm&m=155718572908765&w=2

I need to review the patchset but I do not want to invert the
referencing, i.e. having the mm hold a reference on hmm. I will
review tomorrow; I wanted to do that today but did not have time.
Jason Gunthorpe May 22, 2019, 10:39 p.m. UTC | #12
On Wed, May 22, 2019 at 06:04:20PM -0400, Jerome Glisse wrote:
> On Wed, May 22, 2019 at 05:12:47PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > 
> > >  static void put_per_mm(struct ib_umem_odp *umem_odp)
> > >  {
> > >  	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > > @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
> > >  	up_write(&per_mm->umem_rwsem);
> > >  
> > >  	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> > > -	mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> > > +	hmm_mirror_unregister(&per_mm->mirror);
> > >  	put_pid(per_mm->tgid);
> > > -	mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> > > +
> > > +	kfree(per_mm);
> > 
> > Notice that mmu_notifier only uses SRCU to fence in-progress ops
> > callbacks, so I think hmm internally has the bug that this ODP
> > approach prevents.
> > 
> > hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
> > struct, use container_of in the mmu_notifier callbacks, and use the
> > otherwise vestigal kref_get_unless_zero() to bail:
> > 
> > From 0cb536dc0150ba964a1d655151d7b7a84d0f915a Mon Sep 17 00:00:00 2001
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > Date: Wed, 22 May 2019 16:52:52 -0300
> > Subject: [PATCH] hmm: Fix use after free with struct hmm in the mmu notifiers
> > 
> > mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> > system will continue to reference hmm->mn until the srcu grace period
> > expires.
> > 
> >          CPU0                                     CPU1
> >                                                __mmu_notifier_invalidate_range_start()
> >                                                  srcu_read_lock
> >                                                  hlist_for_each ()
> >                                                    // mn == hmm->mn
> > hmm_mirror_unregister()
> >   hmm_put()
> >     hmm_free()
> >       mmu_notifier_unregister_no_release()
> >          hlist_del_init_rcu(hmm-mn->list)
> > 			                           mn->ops->invalidate_range_start(mn, range);
> > 					             mm_get_hmm()
> >       mm->hmm = NULL;
> >       kfree(hmm)
> >                                                      mutex_lock(&hmm->lock);
> > 
> > Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> > existing. Get the now-safe hmm struct through container_of and directly
> > check kref_get_unless_zero to lock it against free.
> 
> It is already badly handled with BUG_ON()

You can't crash the kernel because userspace forced a race, and no it
isn't handled today because there is no RCU locking in mm_get_hmm nor
is there a kfree_rcu for the struct hmm to make the
kref_get_unless_zero work without use-after-free.

> i just need to convert those to return and to use
> mmu_notifier_call_srcu() to free hmm struct.

Isn't that what this patch does?

> The way race is avoided is because mm->hmm will either be NULL or
> point to another hmm struct before an existing hmm is free. 

There is no locking on mm->hmm so it is useless to prevent races.

> Also if range_start/range_end use kref_get_unless_zero() but right
> now this is BUG_ON if it turn out to be NULL, it should just return
> on NULL.

Still needs rcu.

Also the container_of is necessary to avoid some race where you could
be doing:

                  CPU0                                     CPU1                         CPU2
                                                       hlist_for_each ()
       mmu_notifier_unregister_no_release(hmm1)             
       spin_lock(&mm->page_table_lock);                                
       mm->hmm = NULL
       spin_unlock(&mm->page_table_lock);                                                                                      
                                                      				 hmm2 = hmm_get_or_create()
                                                        mn == hmm1->mn
                                                        mn->ops->invalidate_range_start(mn, range)
							  mm_get_hmm() == hmm2
                                                      hlist_for_each cont'd
                                                        mn == hmm2->mn
                                                        mn->ops->invalidate_range_start(mn, range)
							  mm_get_hmm() == hmm2

Now we called the same notifier twice on hmm2. Ooops.

There is no reason to risk this confusion just to avoid container_of.

So we agree this patch is necessary? Can you test it an ack it please?

Jason
Jerome Glisse May 22, 2019, 10:42 p.m. UTC | #13
On Wed, May 22, 2019 at 07:39:06PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 06:04:20PM -0400, Jerome Glisse wrote:
> > On Wed, May 22, 2019 at 05:12:47PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > 
> > > >  static void put_per_mm(struct ib_umem_odp *umem_odp)
> > > >  {
> > > >  	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > > > @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
> > > >  	up_write(&per_mm->umem_rwsem);
> > > >  
> > > >  	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> > > > -	mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> > > > +	hmm_mirror_unregister(&per_mm->mirror);
> > > >  	put_pid(per_mm->tgid);
> > > > -	mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> > > > +
> > > > +	kfree(per_mm);
> > > 
> > > Notice that mmu_notifier only uses SRCU to fence in-progress ops
> > > callbacks, so I think hmm internally has the bug that this ODP
> > > approach prevents.
> > > 
> > > hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
> > > struct, use container_of in the mmu_notifier callbacks, and use the
> > > otherwise vestigal kref_get_unless_zero() to bail:
> > > 
> > > From 0cb536dc0150ba964a1d655151d7b7a84d0f915a Mon Sep 17 00:00:00 2001
> > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > Date: Wed, 22 May 2019 16:52:52 -0300
> > > Subject: [PATCH] hmm: Fix use after free with struct hmm in the mmu notifiers
> > > 
> > > mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> > > system will continue to reference hmm->mn until the srcu grace period
> > > expires.
> > > 
> > >          CPU0                                     CPU1
> > >                                                __mmu_notifier_invalidate_range_start()
> > >                                                  srcu_read_lock
> > >                                                  hlist_for_each ()
> > >                                                    // mn == hmm->mn
> > > hmm_mirror_unregister()
> > >   hmm_put()
> > >     hmm_free()
> > >       mmu_notifier_unregister_no_release()
> > >          hlist_del_init_rcu(hmm-mn->list)
> > > 			                           mn->ops->invalidate_range_start(mn, range);
> > > 					             mm_get_hmm()
> > >       mm->hmm = NULL;
> > >       kfree(hmm)
> > >                                                      mutex_lock(&hmm->lock);
> > > 
> > > Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> > > existing. Get the now-safe hmm struct through container_of and directly
> > > check kref_get_unless_zero to lock it against free.
> > 
> > It is already badly handled with BUG_ON()
> 
> You can't crash the kernel because userspace forced a race, and no it
> isn't handled today because there is no RCU locking in mm_get_hmm nor
> is there a kfree_rcu for the struct hmm to make the
> kref_get_unless_zero work without use-after-free.
> 
> > i just need to convert those to return and to use
> > mmu_notifier_call_srcu() to free hmm struct.
> 
> Isn't that what this patch does?

Yes, but the other chunks just need to replace BUG_ON with return.

> 
> > The way race is avoided is because mm->hmm will either be NULL or
> > point to another hmm struct before an existing hmm is free. 
> 
> There is no locking on mm->hmm so it is useless to prevent races.

There is locking on mm->hmm

> 
> > Also if range_start/range_end use kref_get_unless_zero() but right
> > now this is BUG_ON if it turn out to be NULL, it should just return
> > on NULL.
> 
> Still needs rcu.
> 
> Also the container_of is necessary to avoid some race where you could
> be doing:
> 
>                   CPU0                                     CPU1                         CPU2
>                                                        hlist_for_each ()
>        mmu_notifier_unregister_no_release(hmm1)             
>        spin_lock(&mm->page_table_lock);                                
>        mm->hmm = NULL
>        spin_unlock(&mm->page_table_lock);                                                                                      
>                                                       				 hmm2 = hmm_get_or_create()
>                                                         mn == hmm1->mn
>                                                         mn->ops->invalidate_range_start(mn, range)
> 							  mm_get_mm() == hmm2
>                                                       hist_for_each con't
>                                                         mn == hmm2->mn
>                                                         mn->ops->invalidate_range_start(mn, range)
> 							  mm_get_mm() == hmm2
> 
> Now we called the same notifier twice on hmm2. Ooops.
> 
> There is no reason to risk this confusion just to avoid container_of.
> 
> So we agree this patch is necessary? Can you test it an ack it please?

A slightly different patch than this one is necessary; I will work
on it tomorrow.

Cheers,
Jérôme
Jason Gunthorpe May 22, 2019, 10:43 p.m. UTC | #14
On Wed, May 22, 2019 at 05:49:18PM -0400, Jerome Glisse wrote:
> > > > So why is mm suddenly guarenteed valid? It was a bug report that
> > > > triggered the race the mmget_not_zero is fixing, so I need a better
> > > > explanation why it is now safe. From what I see the hmm_range_fault
> > > > is doing stuff like find_vma without an active mmget??
> > > 
> > > So the mm struct can not go away as long as we hold a reference on
> > > the hmm struct and we hold a reference on it through both hmm_mirror
> > > and hmm_range struct. So struct mm can not go away and thus it is
> > > safe to try to take its mmap_sem.
> > 
> > This was always true here, though, so long as the umem_odp exists the
> > the mm has a grab on it. But a grab is not a get..
> > 
> > The point here was the old code needed an mmget() in order to do
> > get_user_pages_remote()
> > 
> > If hmm does not need an external mmget() then fine, we delete this
> > stuff and rely on hmm.
> > 
> > But I don't think that is true as we have:
> > 
> >           CPU 0                                           CPU1
> >                                                        mmput()
> >                        				        __mmput()
> > 							 exit_mmap()
> > down_read(&mm->mmap_sem);
> > hmm_range_dma_map(range, device,..
> >   ret = hmm_range_fault(range, block);
> >      if (hmm->mm == NULL || hmm->dead)
> > 							   mmu_notifier_release()
> > 							     hmm->dead = true
> >      vma = find_vma(hmm->mm, start);
> >         .. rb traversal ..                                 while (vma) remove_vma()
> > 
> > *goes boom*
> > 
> > I think this is violating the basic constraint of the mm by acting on
> > a mm's VMA's without holding a mmget() to prevent concurrent
> > destruction.
> > 
> > In other words, mmput() destruction does not respect the mmap_sem - so
> > holding the mmap sem alone is not enough locking.
> > 
> > The unlocked hmm->dead simply can't save this. Frankly every time I
> > look at a struct with 'dead' in it, I find races like this.
> > 
> > Thus we should put the mmget_notzero back in.
> 
> So for some reason i thought exit_mmap() was setting the mm_rb
> to empty node and flushing vmacache so that find_vma() would
> fail.

It would still be racy without locks.

> Note that right before find_vma() there is also range->valid
> check which will also intercept mm release.

There is no locking on range->valid so it just moves the race
around. You can't solve races with unlocked/non-atomic variables.

> Anyway the easy fix is to get ref on mm user in range_register.

Yes a mmget_not_zero inside range_register would be fine.

How do you want to handle that patch?

> > I saw some other funky looking stuff in hmm as well..
> > 
> > > Hence it is safe to take mmap_sem and it is safe to call in hmm, if
> > > mm have been kill it will return EFAULT and this will propagate to
> > > RDMA.
> >  
> > > As per_mm i removed the per_mm->mm = NULL from release so that it is
> > > always safe to use that field even in face of racing mm "killing".
> > 
> > Yes, that certainly wasn't good.
> > 
> > > > > -	 * An array of the pages included in the on-demand paging umem.
> > > > > -	 * Indices of pages that are currently not mapped into the device will
> > > > > -	 * contain NULL.
> > > > > +	 * An array of the pages included in the on-demand paging umem. Indices
> > > > > +	 * of pages that are currently not mapped into the device will contain
> > > > > +	 * 0.
> > > > >  	 */
> > > > > -	struct page		**page_list;
> > > > > +	uint64_t *pfns;
> > > > 
> > > > Are these actually pfns, or are they mangled with some shift? (what is range->pfn_shift?)
> > > 
> > > They are not pfns they have flags (hence range->pfn_shift) at the
> > > bottoms i just do not have a better name for this.
> > 
> > I think you need to have a better name then
> 
> Suggestion ? i have no idea for a better name, it has pfn value
> in it.

pfn_flags?

Jason
Jason Gunthorpe May 22, 2019, 10:52 p.m. UTC | #15
On Wed, May 22, 2019 at 06:42:11PM -0400, Jerome Glisse wrote:

> > > The way race is avoided is because mm->hmm will either be NULL or
> > > point to another hmm struct before an existing hmm is free. 
> > 
> > There is no locking on mm->hmm so it is useless to prevent races.
> 
> There is locking on mm->hmm

Not in mm_get_hmm()

> > So we agree this patch is necessary? Can you test it an ack it please?
> 
> A slightly different patch than this one is necessary i will work on
> it tomorrow.

I think if you want something different you should give feedback on
this patch. You haven't raised any defects with it.

Jason
Jason Gunthorpe May 22, 2019, 11:57 p.m. UTC | #16
On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:

> > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > (prefetch and not and different sizes). Seems to work ok.
> > 
> > Urk, it already doesn't apply to the rdma tree :(
> > 
> > The conflicts are a little more extensive than I'd prefer to handle..
> > Can I ask you to rebase it on top of this branch please:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > 
> > Specifically it conflicts with this patch:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34

There is at least one more serious blocker here:

config ARCH_HAS_HMM_MIRROR
        bool
        default y
        depends on (X86_64 || PPC64)
        depends on MMU && 64BIT

I can't lose ARM64 support for ODP by merging this; that is too
serious a regression.

Can you fix it?

Thanks,
Jason
Jerome Glisse May 23, 2019, 3:04 p.m. UTC | #17
On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> 
> > > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > > (prefetch and not and different sizes). Seems to work ok.
> > > 
> > > Urk, it already doesn't apply to the rdma tree :(
> > > 
> > > The conflicts are a little more extensive than I'd prefer to handle..
> > > Can I ask you to rebase it on top of this branch please:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > 
> > > Specifically it conflicts with this patch:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> 
> There is at least one more serious blocker here:
> 
> config ARCH_HAS_HMM_MIRROR
>         bool
>         default y
>         depends on (X86_64 || PPC64)
>         depends on MMU && 64BIT
> 
> I can't loose ARM64 support for ODP by merging this, that is too
> serious of a regression.
> 
> Can you fix it?

5.2 already has a patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM). I need to update nouveau
in 5.3 so that I can drop the old ARCH_HAS_HMM and then convert
core mm in 5.4 to use ARCH_HAS_HMM_MIRROR and ARCH_HAS_HMM_DEVICE
instead of ARCH_HAS_HMM.

Adding ARM64 to ARCH_HAS_HMM_MIRROR should not be an issue; I would
need access to an ARM64 machine to test, as I did not want to enable
it without testing.

So it seems ODP will have to wait for 5.4. I will re-spin the ODP
patch once I am done reviewing Ralph's changes and yours for 5.3.

Cheers,
Jérôme
Jason Gunthorpe May 23, 2019, 3:41 p.m. UTC | #18
On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > 
> > > > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > 
> > > > Urk, it already doesn't apply to the rdma tree :(
> > > > 
> > > > The conflicts are a little more extensive than I'd prefer to handle..
> > > > Can I ask you to rebase it on top of this branch please:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > 
> > > > Specifically it conflicts with this patch:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > 
> > There is at least one more serious blocker here:
> > 
> > config ARCH_HAS_HMM_MIRROR
> >         bool
> >         default y
> >         depends on (X86_64 || PPC64)
> >         depends on MMU && 64BIT
> > 
> > I can't loose ARM64 support for ODP by merging this, that is too
> > serious of a regression.
> > 
> > Can you fix it?
> 
> 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau

Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used anywhere?

> in 5.3 so that i can drop the old ARCH_HAS_HMM and then convert
> core mm in 5.4 to use ARCH_HAS_HMM_MIRROR and ARCH_HAS_HMM_DEVICE
> instead of ARCH_HAS_HMM

My problem is that ODP needs HMM_MIRROR, which needs HMM & ARCH_HAS_HMM
- and then even if that is fixed we still have ARCH_HAS_HMM_MIRROR
not covering ARM64..

Can we broaden HMM_MIRROR to all arches? I would very much prefer
that.

> So it seems it will have to wait 5.4 for ODP. I will re-spin the
> patch for ODP once i am done reviewing Ralph changes and yours
> for 5.3.

I think we are still OK for 5.3.

If mm takes the fixup patches so hmm mirror is as reliable as ODP's
existing stuff, and patch from you to enable ARM64, then we can
continue to merge into 5.3

So, let us try to get acks on those other threads..

Jason
Jerome Glisse May 23, 2019, 3:52 p.m. UTC | #19
On Thu, May 23, 2019 at 12:41:49PM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> > On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > 
> > > > > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > > 
> > > > > Urk, it already doesn't apply to the rdma tree :(
> > > > > 
> > > > > The conflicts are a little more extensive than I'd prefer to handle..
> > > > > Can I ask you to rebase it on top of this branch please:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > > 
> > > > > Specifically it conflicts with this patch:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > > 
> > > There is at least one more serious blocker here:
> > > 
> > > config ARCH_HAS_HMM_MIRROR
> > >         bool
> > >         default y
> > >         depends on (X86_64 || PPC64)
> > >         depends on MMU && 64BIT
> > > 
> > > I can't loose ARM64 support for ODP by merging this, that is too
> > > serious of a regression.
> > > 
> > > Can you fix it?
> > 
> > 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> > ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau
> 
> Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used anywhere?

Yes, this is a multi-step update: first add the new Kconfig in
release n, update the drivers in release n+1, update the core Kconfig
in release n+2.

So we are in release n (5.2); in 5.3 I will update nouveau and amdgpu
so that in 5.4 I can remove the old ARCH_HAS_HMM.

> > in 5.3 so that i can drop the old ARCH_HAS_HMM and then convert
> > core mm in 5.4 to use ARCH_HAS_HMM_MIRROR and ARCH_HAS_HMM_DEVICE
> > instead of ARCH_HAS_HMM
> 
> My problem is that ODP needs HMM_MIRROR which needs HMM & ARCH_HAS_HMM
> - and then even if fixed we still have the ARCH_HAS_HMM_MIRROR
> restricted to ARM64..
> 
> Can we broaden HMM_MIRROR to all arches? I would very much prefer
> that.

Ignore ARCH_HAS_HMM, it will be removed in 5.4; all that will matter
for ODP is ARCH_HAS_HMM_MIRROR, which should be enabled for ARM64 as
ARM64 has everything needed for that. I just did not add ARM64 to
ARCH_HAS_HMM_MIRROR because I did not have hardware to test it on.

So in 5.3 I will update nouveau and amdgpu to use ARCH_HAS_HMM_DEVICE
and ARCH_HAS_HMM_MIRROR. In 5.4 I will update mm/Kconfig to remove
ARCH_HAS_HMM.
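
The Kconfig side of it would be something along these lines (untested,
as said, for lack of hardware):

  config ARCH_HAS_HMM_MIRROR
          bool
          default y
          depends on (X86_64 || PPC64 || ARM64)
          depends on MMU && 64BIT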

> 
> > So it seems it will have to wait 5.4 for ODP. I will re-spin the
> > patch for ODP once i am done reviewing Ralph changes and yours
> > for 5.3.
> 
> I think we are still OK for 5.3.

I can not update mm/Kconfig in 5.3, so any Kconfig update will be
in 5.4.

> 
> If mm takes the fixup patches so hmm mirror is as reliable as ODP's
> existing stuff, and patch from you to enable ARM64, then we can
> continue to merge into 5.3
> 
> So, let us try to get acks on those other threads..

I will be merging your patchset and Ralph's and repost; they only
need minor changes, mostly because you can not update the driver API
in just one release: first add the new API in release n, then replace
the old API usage in release n+1, then remove the old API in n+2.

Cheers,
Jérôme
Jason Gunthorpe May 23, 2019, 4:34 p.m. UTC | #20
On Thu, May 23, 2019 at 11:52:08AM -0400, Jerome Glisse wrote:
> On Thu, May 23, 2019 at 12:41:49PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> > > On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > > 
> > > > > > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > > > 
> > > > > > Urk, it already doesn't apply to the rdma tree :(
> > > > > > 
> > > > > > The conflicts are a little more extensive than I'd prefer to handle..
> > > > > > Can I ask you to rebase it on top of this branch please:
> > > > > > 
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > > > 
> > > > > > Specifically it conflicts with this patch:
> > > > > > 
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > > > 
> > > > There is at least one more serious blocker here:
> > > > 
> > > > config ARCH_HAS_HMM_MIRROR
> > > >         bool
> > > >         default y
> > > >         depends on (X86_64 || PPC64)
> > > >         depends on MMU && 64BIT
> > > > 
> > > > I can't loose ARM64 support for ODP by merging this, that is too
> > > > serious of a regression.
> > > > 
> > > > Can you fix it?
> > > 
> > > 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> > > ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau
> > 
> > Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used anywhere?
> 
> Yes this is multi-step update, first add the new Kconfig release n,
> update driver in release n+1, update core Kconfig in release n+2
> 
> So we are in release n (5.2), in 5.3 i will update nouveau and amdgpu
> so that in 5.4 in ca remove the old ARCH_HAS_HMM

Why don't you just send the patch for both parts to mm or to DRM?

This is very normal - as long as the resulting conflicts are small
there is no reason not to do this. Can you share the combined patch?

> > If mm takes the fixup patches so hmm mirror is as reliable as ODP's
> > existing stuff, and patch from you to enable ARM64, then we can
> > continue to merge into 5.3
> > 
> > So, let us try to get acks on those other threads..
> 
> I will be merging your patchset and Ralph and repost, they are only
> minor change mostly that you can not update the driver API in just
> one release.

Of course you can, we do it all the time. It requires some
co-ordination, but as long as the merge conflicts are not big it is
fine.

Merge the driver API change and the call site updates to -mm and
refrain from merging horrendously conflicting patches through DRM.

In the case of the changes in my HMM RFC it is something like 2
lines in DRM that need touching, no problem at all.

If you want help I can volunteer make a hmm PR for Linus just for this
during the merge window - but Andrew would need to agree and ack the
patches.

Jason
Jerome Glisse May 23, 2019, 5:33 p.m. UTC | #21
On Thu, May 23, 2019 at 01:34:29PM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 11:52:08AM -0400, Jerome Glisse wrote:
> > On Thu, May 23, 2019 at 12:41:49PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> > > > On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > > > 
> > > > > > > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > > > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > > > > 
> > > > > > > Urk, it already doesn't apply to the rdma tree :(
> > > > > > > 
> > > > > > > The conflicts are a little more extensive than I'd prefer to handle..
> > > > > > > Can I ask you to rebase it on top of this branch please:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > > > > 
> > > > > > > Specifically it conflicts with this patch:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > > > > 
> > > > > There is at least one more serious blocker here:
> > > > > 
> > > > > config ARCH_HAS_HMM_MIRROR
> > > > >         bool
> > > > >         default y
> > > > >         depends on (X86_64 || PPC64)
> > > > >         depends on MMU && 64BIT
> > > > > 
> > > > > I can't loose ARM64 support for ODP by merging this, that is too
> > > > > serious of a regression.
> > > > > 
> > > > > Can you fix it?
> > > > 
> > > > 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> > > > ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau
> > > 
> > > Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used anywhere?
> > 
> > Yes this is multi-step update, first add the new Kconfig release n,
> > update driver in release n+1, update core Kconfig in release n+2
> > 
> > So we are in release n (5.2), in 5.3 i will update nouveau and amdgpu
> > so that in 5.4 in ca remove the old ARCH_HAS_HMM
> 
> Why don't you just send the patch for both parts to mm or to DRM?
> 
> This is very normal - as long as the resulting conflicts would be
> small during there is no reason not to do this. Can you share the
> combined patch?

This was tested in the past and resulted in failure. So for now I am
taking the simplest and easiest path with the least burden for every
maintainer. It only makes my own life more complex.

Note that mm is not a git tree and thus I can not play any git tricks
to help in this endeavor.

> > > If mm takes the fixup patches so hmm mirror is as reliable as ODP's
> > > existing stuff, and patch from you to enable ARM64, then we can
> > > continue to merge into 5.3
> > > 
> > > So, let us try to get acks on those other threads..
> > 
> > I will be merging your patchset and Ralph and repost, they are only
> > minor change mostly that you can not update the driver API in just
> > one release.
> 
> Of course you can, we do it all the time. It requires some
> co-ordination, but as long as the merge conflicts are not big it is
> fine.
> 
> Merge the driver API change and the call site updates to -mm and
> refain from merging horrendously conflicting patches through DRM.
> 
> In the case of the changes in my HMM RFC it is something like 2
> lines in DRM that need touching, no problem at all.
> 
> If you want help I can volunteer make a hmm PR for Linus just for this
> during the merge window - but Andrew would need to agree and ack the
> patches.

This was tested in the past and I do not want to go over this issue
again (or re-iterate the long email discussion associated with that).
It failed and it put the burden on every maintainer. So it is easier
to do the multi-step thing.

You can take a peek at Ralph's patchset and yours merged into one,
with minor changes, here:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3

I am about to start testing it with nouveau, amdgpu and RDMA.

Cheers,
Jérôme
Jason Gunthorpe May 23, 2019, 5:55 p.m. UTC | #22
On Thu, May 23, 2019 at 01:33:03PM -0400, Jerome Glisse wrote:
> On Thu, May 23, 2019 at 01:34:29PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 23, 2019 at 11:52:08AM -0400, Jerome Glisse wrote:
> > > On Thu, May 23, 2019 at 12:41:49PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> > > > > On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > > > > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > > > > 
> > > > > > > > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > > > > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > > > > > 
> > > > > > > > Urk, it already doesn't apply to the rdma tree :(
> > > > > > > > 
> > > > > > > > The conflicts are a little more extensive than I'd prefer to handle..
> > > > > > > > Can I ask you to rebase it on top of this branch please:
> > > > > > > > 
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > > > > > 
> > > > > > > > Specifically it conflicts with this patch:
> > > > > > > > 
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > > > > > 
> > > > > > There is at least one more serious blocker here:
> > > > > > 
> > > > > > config ARCH_HAS_HMM_MIRROR
> > > > > >         bool
> > > > > >         default y
> > > > > >         depends on (X86_64 || PPC64)
> > > > > >         depends on MMU && 64BIT
> > > > > > 
> > > > > > I can't loose ARM64 support for ODP by merging this, that is too
> > > > > > serious of a regression.
> > > > > > 
> > > > > > Can you fix it?
> > > > > 
> > > > > 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> > > > > ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau
> > > > 
> > > > Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used anywhere?
> > > 
> > > Yes this is multi-step update, first add the new Kconfig release n,
> > > update driver in release n+1, update core Kconfig in release n+2
> > > 
> > > So we are in release n (5.2), in 5.3 i will update nouveau and amdgpu
> > > so that in 5.4 in ca remove the old ARCH_HAS_HMM
> > 
> > Why don't you just send the patch for both parts to mm or to DRM?
> > 
> > This is very normal - as long as the resulting conflicts would be
> > small during there is no reason not to do this. Can you share the
> > combined patch?
> 
> This was tested in the past an resulted in failure. So for now i am
> taking the simplest and easiest path with the least burden for every
> maintainer. It only complexify my life.

I don't know what you tried to do in the past, but it happens all the
time, every merge cycle with success. Not everything can be done, but
changing the signature of one function with one call site should
really not be a problem.

> Note that mm is not a git tree and thus i can not play any git trick
> to help in this endeavor.

I am aware..

> > > > If mm takes the fixup patches so hmm mirror is as reliable as ODP's
> > > > existing stuff, and patch from you to enable ARM64, then we can
> > > > continue to merge into 5.3
> > > > 
> > > > So, let us try to get acks on those other threads..
> > > 
> > > I will be merging your patchset and Ralph and repost, they are only
> > > minor change mostly that you can not update the driver API in just
> > > one release.
> > 
> > Of course you can, we do it all the time. It requires some
> > co-ordination, but as long as the merge conflicts are not big it is
> > fine.
> > 
> > Merge the driver API change and the call site updates to -mm and
> > refain from merging horrendously conflicting patches through DRM.
> > 
> > In the case of the changes in my HMM RFC it is something like 2
> > lines in DRM that need touching, no problem at all.
> > 
> > If you want help I can volunteer make a hmm PR for Linus just for this
> > during the merge window - but Andrew would need to agree and ack the
> > patches.
> 
> This was tested in the past and i do not want to go over this issue
> again (or re-iterate the long emails discussion associated with that).
> It failed and it put the burden on every maintainers. So it is easier
> to do the multi-step thing.
> 
> You can take a peak at Ralph patchset and yours into one with minor
> changes here:
> 
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3

Okay..

This patch needs to use down_read(&mm->mmap_sem) not READ_ONCE:

 mm/hmm: do not try to create hmm struct from within hmm_range_register()
 Driver should never call hmm_range_register() without a valid and active
 registered hmm_mirror and thus without a valid and active hmm struct. So
 if that happens just return -EFAULT.

Otherwise it is inconsistent with the locking scheme and has a
use-after-free race.

I was not sure if the lock could be obtained safely here so I
preferred to use the no-lock alternative of passing in mirror. I still
think you should just change the single call site and send it to -mm.
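
To make the comparison concrete, the no-lock alternative looks roughly
like this; a sketch of what I had in the RFC, not merged code, and
only the first argument and the hmm lookup change:

  int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror,
                         unsigned long start, unsigned long end,
                         unsigned page_shift)
  {
          /*
           * The mirror holds a reference on its hmm for its whole
           * lifetime, so the valid hmm is reached through the caller's
           * registration without touching mm->hmm and without any lock
           * or READ_ONCE().
           */
          range->hmm = mirror->hmm;
          kref_get(&range->hmm->kref);

          /* ... rest of the registration is unchanged ... */
          return 0;
  }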

Thank you for all the other fixes, they look great.

Regards,
Jason
Jerome Glisse May 23, 2019, 6:24 p.m. UTC | #23
On Thu, May 23, 2019 at 02:55:46PM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 01:33:03PM -0400, Jerome Glisse wrote:
> > On Thu, May 23, 2019 at 01:34:29PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 23, 2019 at 11:52:08AM -0400, Jerome Glisse wrote:
> > > > On Thu, May 23, 2019 at 12:41:49PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> > > > > > On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > > > > > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > > > > > 
> > > > > > > > > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > > > > > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > > > > > > 
> > > > > > > > > Urk, it already doesn't apply to the rdma tree :(
> > > > > > > > > 
> > > > > > > > > The conflicts are a little more extensive than I'd prefer to handle..
> > > > > > > > > Can I ask you to rebase it on top of this branch please:
> > > > > > > > > 
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > > > > > > 
> > > > > > > > > Specifically it conflicts with this patch:
> > > > > > > > > 
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > > > > > > 
> > > > > > > There is at least one more serious blocker here:
> > > > > > > 
> > > > > > > config ARCH_HAS_HMM_MIRROR
> > > > > > >         bool
> > > > > > >         default y
> > > > > > >         depends on (X86_64 || PPC64)
> > > > > > >         depends on MMU && 64BIT
> > > > > > > 
> > > > > > > I can't loose ARM64 support for ODP by merging this, that is too
> > > > > > > serious of a regression.
> > > > > > > 
> > > > > > > Can you fix it?
> > > > > > 
> > > > > > 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> > > > > > ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau
> > > > > 
> > > > > Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used anywhere?
> > > > 
> > > > Yes this is multi-step update, first add the new Kconfig release n,
> > > > update driver in release n+1, update core Kconfig in release n+2
> > > > 
> > > > So we are in release n (5.2), in 5.3 i will update nouveau and amdgpu
> > > > so that in 5.4 in ca remove the old ARCH_HAS_HMM
> > > 
> > > Why don't you just send the patch for both parts to mm or to DRM?
> > > 
> > > This is very normal - as long as the resulting conflicts would be
> > > small during there is no reason not to do this. Can you share the
> > > combined patch?
> > 
> > This was tested in the past an resulted in failure. So for now i am
> > taking the simplest and easiest path with the least burden for every
> > maintainer. It only complexify my life.
> 
> I don't know what you tried to do in the past, but it happens all the
> time, every merge cycle with success. Not everything can be done, but
> changing the signature of one function with one call site should
> really not be a problem.
> 
> > Note that mm is not a git tree and thus i can not play any git trick
> > to help in this endeavor.
> 
> I am aware..
> 
> > > > > If mm takes the fixup patches so hmm mirror is as reliable as ODP's
> > > > > existing stuff, and patch from you to enable ARM64, then we can
> > > > > continue to merge into 5.3
> > > > > 
> > > > > So, let us try to get acks on those other threads..
> > > > 
> > > > I will be merging your patchset and Ralph and repost, they are only
> > > > minor change mostly that you can not update the driver API in just
> > > > one release.
> > > 
> > > Of course you can, we do it all the time. It requires some
> > > co-ordination, but as long as the merge conflicts are not big it is
> > > fine.
> > > 
> > > Merge the driver API change and the call site updates to -mm and
> > > refain from merging horrendously conflicting patches through DRM.
> > > 
> > > In the case of the changes in my HMM RFC it is something like 2
> > > lines in DRM that need touching, no problem at all.
> > > 
> > > If you want help I can volunteer make a hmm PR for Linus just for this
> > > during the merge window - but Andrew would need to agree and ack the
> > > patches.
> > 
> > This was tested in the past and i do not want to go over this issue
> > again (or re-iterate the long emails discussion associated with that).
> > It failed and it put the burden on every maintainers. So it is easier
> > to do the multi-step thing.
> > 
> > You can take a peak at Ralph patchset and yours into one with minor
> > changes here:
> > 
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3
> 
> Okay..
> 
> This patch needs to use down_read(&mm->mmap_sem) not READ_ONCE:
> 
>  mm/hmm: do not try to create hmm struct from within hmm_range_register()
>  Driver should never call hmm_range_register() without a valid and active
>  registered hmm_mirror and thus without a valid and active hmm struct. So
>  if that happens just return -EFAULT.
> 
> Otherwise it is inconsisent with the locking scheme and has a use
> after free race. 

I can not take mmap_sem in range_register. The READ_ONCE is fine and
there is no race, as we do take a reference on the hmm struct; thus
if we are successful at taking that reference then the hmm struct
will remain valid. Also this API has always been used with an active
mirror and thus an active hmm, and thus mm->hmm will remain stable
for all calls to hmm_range_register(); it would be a bug otherwise.

The READ_ONCE() is really just a safety net; it is not required, and
once the conversion to hmm_mirror is done it will go away.

> 
> I was not sure if the lock could be obtained safely here so I
> preferred to use the no-lock alternative of passing in mirror. I still
> think you should just change the single call site and sent to -mm.

It is too painful with unmerged drivers like amdgpu that are
targeting 5.3; those target the API that is in 5.2.

Cheers,
Jérôme
Jason Gunthorpe May 23, 2019, 7:10 p.m. UTC | #24
On Thu, May 23, 2019 at 02:24:58PM -0400, Jerome Glisse wrote:
> I can not take mmap_sem in range_register, the READ_ONCE is fine and
> they are no race as we do take a reference on the hmm struct thus

Of course there are use after free races with a READ_ONCE scheme, I
shouldn't have to explain this.

If you cannot take the read mmap sem (why not?), then please use my
version and push the update to the driver through -mm..

Jason
Jerome Glisse May 23, 2019, 7:39 p.m. UTC | #25
On Thu, May 23, 2019 at 04:10:38PM -0300, Jason Gunthorpe wrote:
> 
> On Thu, May 23, 2019 at 02:24:58PM -0400, Jerome Glisse wrote:
> > I can not take mmap_sem in range_register; the READ_ONCE is fine and
> > there is no race as we do take a reference on the hmm struct, thus
> 
> Of course there are use after free races with a READ_ONCE scheme, I
> shouldn't have to explain this.

Well, I can not think of any problem here: the mm->hmm can not
change while the driver is calling hmm_range_register(), so if you
want I can remove the READ_ONCE(); this does not change anything.


> If you cannot take the read mmap sem (why not?), then please use my
> version and push the update to the driver through -mm..

Please see previous threads on why it was a failure.

Cheers,
Jérôme
Jason Gunthorpe May 23, 2019, 7:47 p.m. UTC | #26
On Thu, May 23, 2019 at 03:39:59PM -0400, Jerome Glisse wrote:
> On Thu, May 23, 2019 at 04:10:38PM -0300, Jason Gunthorpe wrote:
> > 
> > On Thu, May 23, 2019 at 02:24:58PM -0400, Jerome Glisse wrote:
> > > I can not take mmap_sem in range_register; the READ_ONCE is fine and
> > > there is no race as we do take a reference on the hmm struct, thus
> > 
> > Of course there are use after free races with a READ_ONCE scheme, I
> > shouldn't have to explain this.
> 
> Well, I can not think of any problem here: the mm->hmm can not
> change while the driver is calling hmm_range_register()

Oh of course! It is so confusing because the new code makes it look like
it could be changing...

> so if you want I can remove the READ_ONCE(); this does not change
> anything.

Please just write it as:
  
  /* The caller must hold a valid struct hmm_mirror to call this api,
   * and a valid hmm_mirror guarantees mm->hmm is valid.
   */
  range->hmm = mm->hmm;
  kref_get(&range->hmm->kref);

All the READ_ONCE/kref_get_not_zero/!hmm just obfuscates the reality
that hmm is non-NULL and constant here or the caller is using the API
wrong.

kref_get will reliably oops or WARN_ON if it is misused, which is a
fine amount of debugging.

Jason
Christoph Hellwig May 24, 2019, 6:40 a.m. UTC | #27
On Thu, May 23, 2019 at 04:10:38PM -0300, Jason Gunthorpe wrote:
> 
> On Thu, May 23, 2019 at 02:24:58PM -0400, Jerome Glisse wrote:
> > I can not take mmap_sem in range_register; the READ_ONCE is fine and
> > there is no race as we do take a reference on the hmm struct, thus
> 
> Of course there are use after free races with a READ_ONCE scheme, I
> shouldn't have to explain this.
> 
> If you cannot take the read mmap sem (why not?), then please use my
> version and push the update to the driver through -mm..

I think it would really help if we queue up these changes in a git tree
that can be pulled into the driver trees.  Given that you've been
doing so much work to actually make it usable I'd nominate rdma for the
"lead" tree.
Jason Gunthorpe May 24, 2019, 12:44 p.m. UTC | #28
On Thu, May 23, 2019 at 11:40:51PM -0700, Christoph Hellwig wrote:
> On Thu, May 23, 2019 at 04:10:38PM -0300, Jason Gunthorpe wrote:
> > 
> > On Thu, May 23, 2019 at 02:24:58PM -0400, Jerome Glisse wrote:
> > > I can not take mmap_sem in range_register; the READ_ONCE is fine and
> > > there is no race as we do take a reference on the hmm struct, thus
> > 
> > Of course there are use after free races with a READ_ONCE scheme, I
> > shouldn't have to explain this.
> > 
> > If you cannot take the read mmap sem (why not?), then please use my
> > version and push the update to the driver through -mm..
> 
> I think it would really help if we queue up these changes in a git tree
> that can be pulled into the driver trees.  Given that you've been
> doing so much work to actually make it usable I'd nominate rdma for the
> "lead" tree.

Sure, I'm willing to do that. RDMA has experience successfully running
shared git trees with netdev. It can work very well, but requires
discipline and understanding of the limitations.

I really want to see the complete HMM solution from Jerome (ie the
kconfig fixes, arm64, api fixes, etc) in one cohesive view, not
forced to be sprinkled across multiple kernel releases to work around
a submission process/coordination problem.

Now that -mm merged the basic hmm API skeleton I think running like
this would get us quickly to the place we all want: comprehensive in tree
users of hmm.

Andrew, would this be acceptable to you?

Dave, would you be willing to merge a clean HMM tree into DRM if it is
required for DRM driver work in 5.3?

I'm fine to merge a tree like this for RDMA, we already do this
pattern with netdev.

Background: The issue that is motivating this is we want to make
changes to some of the API's for hmm, which mean changes in existing
DRM, changes in to-be-accepted RDMA code, and to-be-accepted DRM
driver code. Coordinating the mm/hmm.c, RDMA and DRM changes is best
done with the proven shared git tree pattern. As CH explains I would
run a clean/minimal hmm tree that can be merged into driver trees as
required, and I will commit to sending a PR to Linus for this tree
very early in the merge window so that driver PR's are 'clean'.

The tree will only contain uncontroversial hmm related commits, bug
fixes, etc.

Obviously I will also commit to providing review for patches flowing
through this tree.

Regards,
Jason
(rdma subsystem co-maintainer, FWIW)
Daniel Vetter May 24, 2019, 4:27 p.m. UTC | #29
On Fri, May 24, 2019 at 09:44:55AM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 11:40:51PM -0700, Christoph Hellwig wrote:
> > On Thu, May 23, 2019 at 04:10:38PM -0300, Jason Gunthorpe wrote:
> > > 
> > > On Thu, May 23, 2019 at 02:24:58PM -0400, Jerome Glisse wrote:
> > > > I can not take mmap_sem in range_register; the READ_ONCE is fine and
> > > > there is no race as we do take a reference on the hmm struct, thus
> > > 
> > > Of course there are use after free races with a READ_ONCE scheme, I
> > > shouldn't have to explain this.
> > > 
> > > If you cannot take the read mmap sem (why not?), then please use my
> > > version and push the update to the driver through -mm..
> > 
> > I think it would really help if we queue up these changes in a git tree
> > that can be pulled into the driver trees.  Given that you've been
> > doing so much work to actually make it usable I'd nominate rdma for the
> > "lead" tree.
> 
> Sure, I'm willing to do that. RDMA has experience successfully running
> shared git trees with netdev. It can work very well, but requires
> discipline and understanding of the limitations.
> 
> I really want to see the complete HMM solution from Jerome (ie the
> kconfig fixes, arm64, api fixes, etc) in one cohesive view, not
> forced to be sprinkled across multiple kernel releases to work around
> a submission process/coordination problem.
> 
> Now that -mm merged the basic hmm API skeleton I think running like
> this would get us quickly to the place we all want: comprehensive in tree
> users of hmm.
> 
> Andrew, would this be acceptable to you?
> 
> Dave, would you be willing to merge a clean HMM tree into DRM if it is
> required for DRM driver work in 5.3?
> 
> I'm fine to merge a tree like this for RDMA, we already do this
> pattern with netdev.
> 
> Background: The issue that is motivating this is we want to make
> changes to some of the API's for hmm, which mean changes in existing
> DRM, changes in to-be-accepted RDMA code, and to-be-accepted DRM
> driver code. Coordinating the mm/hmm.c, RDMA and DRM changes is best
> done with the proven shared git tree pattern. As CH explains I would
> run a clean/minimal hmm tree that can be merged into driver trees as
> required, and I will commit to sending a PR to Linus for this tree
> very early in the merge window so that driver PR's are 'clean'.
> 
> The tree will only contain uncontroversial hmm related commits, bug
> fixes, etc.
> 
> Obviously I will also commit to providing review for patches flowing
> through this tree.

Sure topic branch sounds fine, we do that all the time with various
subsystems all over. We have ready made scripts for topic branches and
applying pulls from all over, so we can even soak test everything in our
integration tree. In case there's conflicts or just to make sure
everything works, before we bake the topic branch into permanent history
(the main drm.git repo just can't be rebased, too much going on and too
many people involved).

If Jerome is ok with wrestling with our scripting we could even pull these
updates in while the hmm.git tree is evolving.

Cheers, Daniel
(drm co-maintainer fwiw)
Jason Gunthorpe May 24, 2019, 4:53 p.m. UTC | #30
On Fri, May 24, 2019 at 06:27:09PM +0200, Daniel Vetter wrote:
> Sure topic branch sounds fine, we do that all the time with various
> subsystems all over. We have ready made scripts for topic branches and
> applying pulls from all over, so we can even soak test everything in our
> integration tree. In case there's conflicts or just to make sure
> everything works, before we bake the topic branch into permanent history
> (the main drm.git repo just can't be rebased, too much going on and too
> many people involved).

We don't rebase rdma.git either, for the same reasons, and neither does
netdev.

So the usual flow for a shared topic branch is also no-rebase -
testing/etc needs to be done before things get applied to it.

Cheers,
Jason
Daniel Vetter May 24, 2019, 4:59 p.m. UTC | #31
On Fri, May 24, 2019 at 6:53 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, May 24, 2019 at 06:27:09PM +0200, Daniel Vetter wrote:
> > Sure topic branch sounds fine, we do that all the time with various
> > subsystems all over. We have ready made scripts for topic branches and
> > applying pulls from all over, so we can even soak test everything in our
> > integration tree. In case there's conflicts or just to make sure
> > everything works, before we bake the topic branch into permanent history
> > (the main drm.git repo just can't be rebased, too much going on and too
> > many people involved).
>
> We don't rebase rdma.git either, for the same reasons, and neither does
> netdev.
>
> So the usual flow for a shared topic branch is also no-rebase -
> testing/etc needs to be done before things get applied to it.

Rebasing before it gets baked into any tree is still ok. And for
something like this we do need a test branch first, which might need a
fixup patch squashed in. On the drm side we have a drm-local
integration tree for this stuff (like linux-next, but without all the
other stuff that's not relevant for graphics). But yeah that's just
details, easy to figure out.
-Daniel
Andrew Morton May 25, 2019, 10:52 p.m. UTC | #32
On Fri, 24 May 2019 09:44:55 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:

> Now that -mm merged the basic hmm API skeleton I think running like
> this would get us quickly to the place we all want: comprehensive in tree
> users of hmm.
> 
> Andrew, would this be acceptable to you?

Sure.  Please take care not to permit this to reduce the amount of
exposure and review which the core HMM pieces get.
Jason Gunthorpe May 27, 2019, 7:12 p.m. UTC | #33
On Sat, May 25, 2019 at 03:52:10PM -0700, Andrew Morton wrote:
> On Fri, 24 May 2019 09:44:55 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > Now that -mm merged the basic hmm API skeleton I think running like
> > this would get us quickly to the place we all want: comprehensive in tree
> > users of hmm.
> > 
> > Andrew, would this be acceptable to you?
> 
> Sure.  Please take care not to permit this to reduce the amount of
> exposure and review which the core HMM pieces get.

Certainly, thanks all

Jerome: I started a HMM branch on v5.2-rc2 in the rdma.git here:

git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm
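
To base the series on it (or to pull it into another tree), something
along these lines should work; illustrative commands only, and the
local branch name is just an example:

    # fetch the hmm topic branch and start a series on top of it
    git fetch git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git hmm
    git checkout -b hmm-cross-tree FETCH_HEAD

    # a driver tree would instead merge it, without rebasing
    git merge --no-ff FETCH_HEAD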

Please send a series with the initial cross tree stuff:
 - kconfig fixing patches
 - The full removal of all the 'temporary for merging' APIs
 - Fixing the API of hmm_range_register to accept a mirror

When these are ready I'll send a hmm PR to DRM so everyone is on the
same API page.

I'll also move the hugetlb patch that Andrew picked up into this git
so we don't have a merge conflict risk.

In parallel let us also finish revising the mirror API and going
through the ODP stuff.

Regards,
Jason
Jason Gunthorpe June 6, 2019, 3:25 p.m. UTC | #34
On Mon, May 27, 2019 at 04:12:47PM -0300, Jason Gunthorpe wrote:
> On Sat, May 25, 2019 at 03:52:10PM -0700, Andrew Morton wrote:
> > On Fri, 24 May 2019 09:44:55 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > > Now that -mm merged the basic hmm API skeleton I think running like
> > > this would get us quickly to the place we all want: comprehensive in tree
> > > users of hmm.
> > > 
> > > Andrew, would this be acceptable to you?
> > 
> > Sure.  Please take care not to permit this to reduce the amount of
> > exposure and review which the core HMM pieces get.
> 
> Certainly, thanks all
> 
> Jerome: I started a HMM branch on v5.2-rc2 in the rdma.git here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm

I did a first round of collecting patches for hmm.git.

Andrew, I'm checking linux-next and to stay co-ordinated, I see the
patches below are in your tree and now also in hmm.git. Can you please
drop them from your tree? 

5b693741de2ace mm/hmm.c: suppress compilation warnings when CONFIG_HUGETLB_PAGE is not set
b2870fb882599a mm/hmm.c: only set FAULT_FLAG_ALLOW_RETRY for non-blocking
dff7babf8ae9f1 mm/hmm.c: support automatic NUMA balancing

I checked that the other two patches in -next also touching hmm.c are
best suited to go through your tree:

a76b9b318a7180 mm/devm_memremap_pages: fix final page put race
fc64c058d01b98 mm/memremap: rename and consolidate SECTION_SIZE

StephenR: Can you pick up the hmm branch from rdma.git for linux-next for
this cycle? As above we are moving the patches from -mm to hmm.git, so
there will be a conflict in -next until Andrew adjusts his tree,
thanks!

Regards,
Jason
(hashes are from today's linux-next)
Stephen Rothwell June 6, 2019, 7:53 p.m. UTC | #35
Hi Jason,

On Thu, 6 Jun 2019 15:25:49 +0000 Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Mon, May 27, 2019 at 04:12:47PM -0300, Jason Gunthorpe wrote:
> > On Sat, May 25, 2019 at 03:52:10PM -0700, Andrew Morton wrote:  
> > > On Fri, 24 May 2019 09:44:55 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >   
> > > > Now that -mm merged the basic hmm API skeleton I think running like
> > > > this would get us quickly to the place we all want: comprehensive in tree
> > > > users of hmm.
> > > > 
> > > > Andrew, would this be acceptable to you?  
> > > 
> > > Sure.  Please take care not to permit this to reduce the amount of
> > > exposure and review which the core HMM pieces get.  
> > 
> > Certainly, thanks all
> > 
> > Jerome: I started a HMM branch on v5.2-rc2 in the rdma.git here:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm  
> 
> I did a first round of collecting patches for hmm.git
> 
> Andrew, I'm checking linux-next and to stay co-ordinated, I see the
> patches below are in your tree and now also in hmm.git. Can you please
> drop them from your tree? 
> 
> 5b693741de2ace mm/hmm.c: suppress compilation warnings when CONFIG_HUGETLB_PAGE is not set
> b2870fb882599a mm/hmm.c: only set FAULT_FLAG_ALLOW_RETRY for non-blocking
> dff7babf8ae9f1 mm/hmm.c: support automatic NUMA balancing
> 
> I checked that the other two patches in -next also touching hmm.c are
> best suited to go through your tree:
> 
> a76b9b318a7180 mm/devm_memremap_pages: fix final page put race
> fc64c058d01b98 mm/memremap: rename and consolidate SECTION_SIZE
> 
> StephenR: Can you pick up the hmm branch from rdma.git for linux-next for
> this cycle? As above we are moving the patches from -mm to hmm.git, so
> there will be a conflict in -next until Andrew adjusts his tree,
> thanks!

I have added the hmm branch from today with currently just you as the
contact.  I also removed the three commits above from Andrew's tree.

Thanks for adding your subsystem tree as a participant of linux-next.  As
you may know, this is not a judgement of your code.  The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window. 

You will need to ensure that the patches/commits in your tree/series have
been:
     * submitted under GPL v2 (or later) and include the Contributor's
        Signed-off-by,
     * posted to the relevant mailing list,
     * reviewed by you (or another maintainer of your subsystem tree),
     * successfully unit tested, and 
     * destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch).  It is allowed to be rebased if you deem it necessary.