Message ID | 20191015204814.30099-3-rcampbell@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Series | HMM tests and minor fixes |
On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> Allow hmm_range_fault() to return success (0) when the CPU pagetable
> entry points to the special shared zero page.
> The caller can then handle the zero page by possibly clearing device
> private memory instead of DMAing a zero page.
>
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 5df0dbf77e89..f62b119722a3 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		return -EBUSY;
>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
>  		*pfn = range->values[HMM_PFN_SPECIAL];
> -		return -EFAULT;
> +		if (!is_zero_pfn(pte_pfn(pte)))
> +			return -EFAULT;
> +		return 0;

Does it make sense to return HMM_PFN_SPECIAL in this case? Does the
zero pfn have a struct page? Does it need mandatory special treatment?

i.e., the base behavior without any driver code should be to dma from
the zero memory. A fancy driver should be able to detect the zero and
do something else.

I'm not clear on what the two existing users do with PFN_SPECIAL.
Nouveau looks like it uses the same value as error; I can't guess what
amdgpu does with its magic constant.

Jason
On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> Allow hmm_range_fault() to return success (0) when the CPU pagetable
> entry points to the special shared zero page.
> The caller can then handle the zero page by possibly clearing device
> private memory instead of DMAing a zero page.

I do not understand why you are talking about DMA. The GPU can work on
main memory, and migrating to GPU memory is optional and should not
involve this function at all.

>
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>

NAK, please keep the semantic or change it fully. See the alternative
below.

> ---
>  mm/hmm.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 5df0dbf77e89..f62b119722a3 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		return -EBUSY;
>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
>  		*pfn = range->values[HMM_PFN_SPECIAL];
> -		return -EFAULT;
> +		if (!is_zero_pfn(pte_pfn(pte)))
> +			return -EFAULT;
> +		return 0;

An acceptable change would be to turn the branch into:
	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
		if (!is_zero_pfn(pte_pfn(pte))) {
			*pfn = range->values[HMM_PFN_SPECIAL];
			return -EFAULT;
		}
		/* Fall-through for the zero pfn (if a write was needed, the
		 * hmm_pte_need_fault() above would have caught it).
		 */
	}
On 10/21/19 11:08 AM, Jason Gunthorpe wrote:
> On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
>> Allow hmm_range_fault() to return success (0) when the CPU pagetable
>> entry points to the special shared zero page.
>> The caller can then handle the zero page by possibly clearing device
>> private memory instead of DMAing a zero page.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Jason Gunthorpe <jgg@mellanox.com>
>> ---
>>  mm/hmm.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index 5df0dbf77e89..f62b119722a3 100644
>> --- a/mm/hmm.c
>> +++ b/mm/hmm.c
>> @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>>  		return -EBUSY;
>>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
>>  		*pfn = range->values[HMM_PFN_SPECIAL];
>> -		return -EFAULT;
>> +		if (!is_zero_pfn(pte_pfn(pte)))
>> +			return -EFAULT;
>> +		return 0;
>
> Does it make sense to return HMM_PFN_SPECIAL in this case? Does the
> zero pfn have a struct page? Does it need mandatory special treatment?

The zero pfn does not have a struct page, so it needs special
treatment: see nouveau_dmem_convert_pfn(), where it calls
hmm_device_entry_to_page(). If HMM ever ends up supporting VM_PFNMAP,
there would need to be a way to distinguish pfns with and without a
backing struct page too.

> i.e., the base behavior without any driver code should be to dma from
> the zero memory. A fancy driver should be able to detect the zero and
> do something else.

Correct.

> I'm not clear on what the two existing users do with PFN_SPECIAL.
> Nouveau looks like it uses the same value as error; I can't guess what
> amdgpu does with its magic constant.
>
> Jason

I doubt the zero pfn case is being handled correctly in amd/nouveau.
I made the change above while explicitly testing for it in the patch
adding HMM tests.
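(For reference: hmm_device_entry_to_page() returns NULL for an entry
without a backing struct page, so a driver consuming hmm_range_fault()
output has to screen the array first. The sketch below is hypothetical,
not the actual nouveau code; demo_convert_pfns() is an invented name,
and it assumes the v5.4-era struct hmm_range with its pfns[] and
values[] arrays.)

	static int demo_convert_pfns(struct hmm_range *range)
	{
		unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
		unsigned long i;

		for (i = 0; i < npages; i++) {
			struct page *page;

			/* Holes and special entries (e.g. the zero page) have
			 * no struct page behind them, so skip them here. */
			if (range->pfns[i] == range->values[HMM_PFN_NONE] ||
			    range->pfns[i] == range->values[HMM_PFN_SPECIAL])
				continue;

			page = hmm_device_entry_to_page(range, range->pfns[i]);
			if (!page)
				return -EFAULT;
			/* page can now be dma-mapped for device access */
		}
		return 0;
	}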
On 10/21/19 11:49 AM, Jerome Glisse wrote:
> On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
>> Allow hmm_range_fault() to return success (0) when the CPU pagetable
>> entry points to the special shared zero page.
>> The caller can then handle the zero page by possibly clearing device
>> private memory instead of DMAing a zero page.
>
> I do not understand why you are talking about DMA. The GPU can work on
> main memory, and migrating to GPU memory is optional and should not
> involve this function at all.

Good point. This is the device accessing the zero page over PCIe or
another bus, not migrating a zero page to device private memory.
I'll update the wording.

>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Jason Gunthorpe <jgg@mellanox.com>
>
> NAK, please keep the semantic or change it fully. See the alternative
> below.
>
>> ---
>>  mm/hmm.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index 5df0dbf77e89..f62b119722a3 100644
>> --- a/mm/hmm.c
>> +++ b/mm/hmm.c
>> @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>>  		return -EBUSY;
>>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
>>  		*pfn = range->values[HMM_PFN_SPECIAL];
>> -		return -EFAULT;
>> +		if (!is_zero_pfn(pte_pfn(pte)))
>> +			return -EFAULT;
>> +		return 0;
>
> An acceptable change would be to turn the branch into:
> 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> 		if (!is_zero_pfn(pte_pfn(pte))) {
> 			*pfn = range->values[HMM_PFN_SPECIAL];
> 			return -EFAULT;
> 		}
> 		/* Fall-through for the zero pfn (if a write was needed, the
> 		 * hmm_pte_need_fault() above would have caught it).
> 		 */
> 	}

Except this will return the zero pfn with no indication that it is
special (i.e., it doesn't have a struct page).
On Mon, Oct 21, 2019 at 01:54:15PM -0700, Ralph Campbell wrote:
>
> On 10/21/19 11:49 AM, Jerome Glisse wrote:
> > On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> > > Allow hmm_range_fault() to return success (0) when the CPU pagetable
> > > entry points to the special shared zero page.
> > > The caller can then handle the zero page by possibly clearing device
> > > private memory instead of DMAing a zero page.
> >
> > I do not understand why you are talking about DMA. The GPU can work on
> > main memory, and migrating to GPU memory is optional and should not
> > involve this function at all.
>
> Good point. This is the device accessing the zero page over PCIe or
> another bus, not migrating a zero page to device private memory.
> I'll update the wording.
>
> > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > > Cc: Jason Gunthorpe <jgg@mellanox.com>
> >
> > NAK, please keep the semantic or change it fully. See the alternative
> > below.
> >
> > > ---
> > >  mm/hmm.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > index 5df0dbf77e89..f62b119722a3 100644
> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> > >  		return -EBUSY;
> > >  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > >  		*pfn = range->values[HMM_PFN_SPECIAL];
> > > -		return -EFAULT;
> > > +		if (!is_zero_pfn(pte_pfn(pte)))
> > > +			return -EFAULT;
> > > +		return 0;
> >
> > An acceptable change would be to turn the branch into:
> > 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > 		if (!is_zero_pfn(pte_pfn(pte))) {
> > 			*pfn = range->values[HMM_PFN_SPECIAL];
> > 			return -EFAULT;
> > 		}
> > 		/* Fall-through for the zero pfn (if a write was needed, the
> > 		 * hmm_pte_need_fault() above would have caught it).
> > 		 */
> > 	}
>
> Except this will return the zero pfn with no indication that it is
> special (i.e., it doesn't have a struct page).

That is fine, the device driver should not do anything with it, i.e.,
if the device driver wanted to write, then the write-fault test would
return true and it would fault.

Note that the driver should not dereference the struct page.

Cheers,
Jérôme
On Mon, Oct 21, 2019 at 10:45:49PM -0400, Jerome Glisse wrote:
> On Mon, Oct 21, 2019 at 01:54:15PM -0700, Ralph Campbell wrote:
> >
> > On 10/21/19 11:49 AM, Jerome Glisse wrote:
> > > On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> > > > Allow hmm_range_fault() to return success (0) when the CPU pagetable
> > > > entry points to the special shared zero page.
> > > > The caller can then handle the zero page by possibly clearing device
> > > > private memory instead of DMAing a zero page.
> > >
> > > I do not understand why you are talking about DMA. The GPU can work on
> > > main memory, and migrating to GPU memory is optional and should not
> > > involve this function at all.
> >
> > Good point. This is the device accessing the zero page over PCIe or
> > another bus, not migrating a zero page to device private memory.
> > I'll update the wording.
> >
> > > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > > > Cc: Jason Gunthorpe <jgg@mellanox.com>
> > >
> > > NAK, please keep the semantic or change it fully. See the alternative
> > > below.
> > >
> > > > ---
> > > >  mm/hmm.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > > index 5df0dbf77e89..f62b119722a3 100644
> > > > --- a/mm/hmm.c
> > > > +++ b/mm/hmm.c
> > > > @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> > > >  		return -EBUSY;
> > > >  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > > >  		*pfn = range->values[HMM_PFN_SPECIAL];
> > > > -		return -EFAULT;
> > > > +		if (!is_zero_pfn(pte_pfn(pte)))
> > > > +			return -EFAULT;
> > > > +		return 0;
> > >
> > > An acceptable change would be to turn the branch into:
> > > 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > > 		if (!is_zero_pfn(pte_pfn(pte))) {
> > > 			*pfn = range->values[HMM_PFN_SPECIAL];
> > > 			return -EFAULT;
> > > 		}
> > > 		/* Fall-through for the zero pfn (if a write was needed, the
> > > 		 * hmm_pte_need_fault() above would have caught it).
> > > 		 */
> > > 	}
> >
> > Except this will return the zero pfn with no indication that it is
> > special (i.e., it doesn't have a struct page).
>
> That is fine, the device driver should not do anything with it, i.e.,
> if the device driver wanted to write, then the write-fault test would
> return true and it would fault.
>
> Note that the driver should not dereference the struct page.

Can this thing be dma mapped for read?

Jason
On Tue, Oct 22, 2019 at 03:05:18PM +0000, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 10:45:49PM -0400, Jerome Glisse wrote:
> > On Mon, Oct 21, 2019 at 01:54:15PM -0700, Ralph Campbell wrote:
> > >
> > > On 10/21/19 11:49 AM, Jerome Glisse wrote:
> > > > On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> > > > > Allow hmm_range_fault() to return success (0) when the CPU pagetable
> > > > > entry points to the special shared zero page.
> > > > > The caller can then handle the zero page by possibly clearing device
> > > > > private memory instead of DMAing a zero page.
> > > >
> > > > I do not understand why you are talking about DMA. The GPU can work on
> > > > main memory, and migrating to GPU memory is optional and should not
> > > > involve this function at all.
> > >
> > > Good point. This is the device accessing the zero page over PCIe or
> > > another bus, not migrating a zero page to device private memory.
> > > I'll update the wording.
> > >
> > > > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > > > > Cc: Jason Gunthorpe <jgg@mellanox.com>
> > > >
> > > > NAK, please keep the semantic or change it fully. See the alternative
> > > > below.
> > > >
> > > > > ---
> > > > >  mm/hmm.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > > > index 5df0dbf77e89..f62b119722a3 100644
> > > > > --- a/mm/hmm.c
> > > > > +++ b/mm/hmm.c
> > > > > @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> > > > >  		return -EBUSY;
> > > > >  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > > > >  		*pfn = range->values[HMM_PFN_SPECIAL];
> > > > > -		return -EFAULT;
> > > > > +		if (!is_zero_pfn(pte_pfn(pte)))
> > > > > +			return -EFAULT;
> > > > > +		return 0;
> > > >
> > > > An acceptable change would be to turn the branch into:
> > > > 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > > > 		if (!is_zero_pfn(pte_pfn(pte))) {
> > > > 			*pfn = range->values[HMM_PFN_SPECIAL];
> > > > 			return -EFAULT;
> > > > 		}
> > > > 		/* Fall-through for the zero pfn (if a write was needed, the
> > > > 		 * hmm_pte_need_fault() above would have caught it).
> > > > 		 */
> > > > 	}
> > >
> > > Except this will return the zero pfn with no indication that it is
> > > special (i.e., it doesn't have a struct page).
> >
> > That is fine, the device driver should not do anything with it, i.e.,
> > if the device driver wanted to write, then the write-fault test would
> > return true and it would fault.
> >
> > Note that the driver should not dereference the struct page.
>
> Can this thing be dma mapped for read?

Yes it can, the zero page is just a regular page (AFAIK on all
architectures). So a device can dma map it for read only; there is no
reason to treat it any differently.

The HMM_PFN_SPECIAL value is only (as documented in the header) for
ptes inserted with insert_pfn() or insert_page(), i.e., ptes inserted
in a vma with the MIXEDMAP or PFNMAP flag. While HMM catches those vmas
early on and backs off, it can still race with some driver setting the
vma flag and installing special ptes afterward, hence why special ptes
go through this special path.

The zero page being a special pte is just an exception, i.e., it is the
only special pte allowed in a vma that does not have the MIXEDMAP or
PFNMAP flag set.

Cheers,
Jérôme
On Tue, Oct 22, 2019 at 01:06:31PM -0400, Jerome Glisse wrote:
> > > That is fine, the device driver should not do anything with it,
> > > i.e., if the device driver wanted to write, then the write-fault
> > > test would return true and it would fault.
> > >
> > > Note that the driver should not dereference the struct page.
> >
> > Can this thing be dma mapped for read?
>
> Yes it can, the zero page is just a regular page (AFAIK on all
> architectures). So a device can dma map it for read only; there is no
> reason to treat it any differently.
>
> The HMM_PFN_SPECIAL value is only (as documented in the header) for
> ptes inserted with insert_pfn() or insert_page(), i.e., ptes inserted
> in a vma with the MIXEDMAP or PFNMAP flag. While HMM catches those
> vmas early on and backs off, it can still race with some driver
> setting the vma flag and installing special ptes afterward, hence why
> special ptes go through this special path.
>
> The zero page being a special pte is just an exception, i.e., it is
> the only special pte allowed in a vma that does not have the MIXEDMAP
> or PFNMAP flag set.

Just to be clear then, the correct behavior is to return the zero page
pfn as HMM_PFN_VALID, and the driver should treat it the same as any
memory page and dma map it?

Smart drivers can test somehow for pfn == zero_page and optimize?

Jason
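(The "smart driver" test Jason is asking about would look roughly like
the following hypothetical fragment: demo_device_clear() and dev_addr
are invented for illustration, and hmm_device_entry_to_pfn() is assumed
from the v5.4-era API.)

	unsigned long pfn = hmm_device_entry_to_pfn(range, range->pfns[i]);

	if (is_zero_pfn(pfn)) {
		/* Skip the bus transfer: zero-fill the device-side copy
		 * instead of DMAing a page full of zeroes. */
		demo_device_clear(dev, dev_addr, PAGE_SIZE);
	} else {
		dma_addr_t dma = dma_map_page(dev, pfn_to_page(pfn), 0,
					      PAGE_SIZE, DMA_TO_DEVICE);
		/* ... hand dma to the device ... */
	}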
On Tue, Oct 22, 2019 at 05:09:19PM +0000, Jason Gunthorpe wrote:
> On Tue, Oct 22, 2019 at 01:06:31PM -0400, Jerome Glisse wrote:
> > > > That is fine, the device driver should not do anything with it,
> > > > i.e., if the device driver wanted to write, then the write-fault
> > > > test would return true and it would fault.
> > > >
> > > > Note that the driver should not dereference the struct page.
> > >
> > > Can this thing be dma mapped for read?
> >
> > Yes it can, the zero page is just a regular page (AFAIK on all
> > architectures). So a device can dma map it for read only; there is
> > no reason to treat it any differently.
> >
> > The HMM_PFN_SPECIAL value is only (as documented in the header) for
> > ptes inserted with insert_pfn() or insert_page(), i.e., ptes
> > inserted in a vma with the MIXEDMAP or PFNMAP flag. While HMM
> > catches those vmas early on and backs off, it can still race with
> > some driver setting the vma flag and installing special ptes
> > afterward, hence why special ptes go through this special path.
> >
> > The zero page being a special pte is just an exception, i.e., it is
> > the only special pte allowed in a vma that does not have the
> > MIXEDMAP or PFNMAP flag set.
>
> Just to be clear then, the correct behavior is to return the zero page
> pfn as HMM_PFN_VALID, and the driver should treat it the same as any
> memory page and dma map it?

Yes, correct.

> Smart drivers can test somehow for pfn == zero_page and optimize?

There is nothing to optimize here; I do not know of any hardware that
has a special page table entry that makes all memory accesses return
zero.

What was confusing in Ralph's commit message is that he was conflating
memory migration, which is a totally different code path, with this
code. When doing memory migration it is easy to program the DMA engine
to zero out the destination memory (a common feature found on various
devices), and that optimization is allowed by the migrate code.

So I cannot think of any reason why distinguishing the zero page in
this code would help. Maybe I missed some new feature in the mmu of
some new hardware.

Cheers,
Jérôme
On Tue, Oct 22, 2019 at 01:30:26PM -0400, Jerome Glisse wrote:
> > Smart drivers can test somehow for pfn == zero_page and optimize?
>
> There is nothing to optimize here; I do not know of any hardware that
> has a special page table entry that makes all memory accesses return
> zero.

Presumably any GPU could globally dedicate one page of internal memory
as a zero page and remap the CPU zero page to that internal memory
page? This is basically how the CPU zero page works.

I suspect mlx5 could do the same with its internal memory, but the
internal memory is too limited to make this worthwhile.

mlx5 also has a special 'zero MR' that always reads as zero (and
discards writes), but it doesn't quite fit well into the ODP flow.

Jason
On Tue, Oct 22, 2019 at 05:41:11PM +0000, Jason Gunthorpe wrote:
> On Tue, Oct 22, 2019 at 01:30:26PM -0400, Jerome Glisse wrote:
> >
> > > Smart drivers can test somehow for pfn == zero_page and optimize?
> >
> > There is nothing to optimize here; I do not know of any hardware
> > that has a special page table entry that makes all memory accesses
> > return zero.
>
> Presumably any GPU could globally dedicate one page of internal memory
> as a zero page and remap the CPU zero page to that internal memory
> page? This is basically how the CPU zero page works.

Yes, that would work too, but I do not know of any upstream driver that
does that.

> I suspect mlx5 could do the same with its internal memory, but the
> internal memory is too limited to make this worthwhile.
>
> mlx5 also has a special 'zero MR' that always reads as zero (and
> discards writes), but it doesn't quite fit well into the ODP flow.

Well, you can always ask your beloved hardware engineers for new stuff;
they never say no, right? :)

Cheers,
Jérôme
diff --git a/mm/hmm.c b/mm/hmm.c
index 5df0dbf77e89..f62b119722a3 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		return -EBUSY;
 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
 		*pfn = range->values[HMM_PFN_SPECIAL];
-		return -EFAULT;
+		if (!is_zero_pfn(pte_pfn(pte)))
+			return -EFAULT;
+		return 0;
 	}
 	*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;
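(With this version of the patch, a read-only zero-page mapping no
longer fails the whole fault. A hypothetical caller, assuming the
v5.4-era hmm_range_fault() signature that returns a negative errno or
the number of valid pages, would see the HMM_PFN_SPECIAL value in the
output array:)

	long ret;
	unsigned long i;

	ret = hmm_range_fault(range, 0);
	if (ret < 0)
		return ret;

	for (i = 0; i < (range->end - range->start) >> PAGE_SHIFT; i++) {
		if (range->pfns[i] == range->values[HMM_PFN_SPECIAL]) {
			/* Zero page: reads are all zeroes, so e.g. clear
			 * the device-side copy instead of setting up DMA. */
		}
	}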