[v5,01/13] mm: add zone device coherent type memory support

Message ID 20220531200041.24904-2-alex.sierra@amd.com (mailing list archive)
State Superseded, archived
Series Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

Commit Message

Sierra Guiza, Alejandro (Alex) May 31, 2022, 8 p.m. UTC
Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
[hch: rebased ontop of the refcount changes,
      removed is_dev_private_or_coherent_page]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/memremap.h | 19 +++++++++++++++++++
 mm/memcontrol.c          |  7 ++++---
 mm/memory-failure.c      |  8 ++++++--
 mm/memremap.c            | 10 ++++++++++
 mm/migrate_device.c      | 16 +++++++---------
 mm/rmap.c                |  5 +++--
 6 files changed, 49 insertions(+), 16 deletions(-)

Comments

David Hildenbrand June 17, 2022, 9:40 a.m. UTC | #1
On 31.05.22 22:00, Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
> 
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> [hch: rebased ontop of the refcount changes,
>       removed is_dev_private_or_coherent_page]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/memremap.h | 19 +++++++++++++++++++
>  mm/memcontrol.c          |  7 ++++---
>  mm/memory-failure.c      |  8 ++++++--
>  mm/memremap.c            | 10 ++++++++++
>  mm/migrate_device.c      | 16 +++++++---------
>  mm/rmap.c                |  5 +++--
>  6 files changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 8af304f6b504..9f752ebed613 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -41,6 +41,13 @@ struct vmem_altmap {
>   * A more complete discussion of unaddressable memory may be found in
>   * include/linux/hmm.h and Documentation/vm/hmm.rst.
>   *
> + * MEMORY_DEVICE_COHERENT:
> + * Device memory that is cache coherent from device and CPU point of view. This
> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> + * type. Any page of a process can be migrated to such memory. However no one

"Any page" might not be right, I'm pretty sure. I'm thinking of special pages
like the vdso, the shared zeropage, ... pinned pages ...

> + * should be allowed to pin such memory so that it can always be evicted.
> + *
>   * MEMORY_DEVICE_FS_DAX:
>   * Host memory that has similar access semantics as System RAM i.e. DMA
>   * coherent and supports page pinning. In support of coordinating page
> @@ -61,6 +68,7 @@ struct vmem_altmap {
>  enum memory_type {
>  	/* 0 is reserved to catch uninitialized type fields */
>  	MEMORY_DEVICE_PRIVATE = 1,
> +	MEMORY_DEVICE_COHERENT,
>  	MEMORY_DEVICE_FS_DAX,
>  	MEMORY_DEVICE_GENERIC,
>  	MEMORY_DEVICE_PCI_P2PDMA,
> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)

In general, this LGTM, and it should be correct with PageAnonExclusive I think.


However, where exactly is pinning forbidden?
Sierra Guiza, Alejandro (Alex) June 17, 2022, 5:20 p.m. UTC | #2
On 6/17/2022 4:40 AM, David Hildenbrand wrote:
> On 31.05.22 22:00, Alex Sierra wrote:
>> Device memory that is cache coherent from device and CPU point of view.
>> This is used on platforms that have an advanced system bus (like CAPI
>> or CXL). Any page of a process can be migrated to such memory. However,
>> no one should be allowed to pin such memory so that it can always be
>> evicted.
>>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>> [hch: rebased ontop of the refcount changes,
>>        removed is_dev_private_or_coherent_page]
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   include/linux/memremap.h | 19 +++++++++++++++++++
>>   mm/memcontrol.c          |  7 ++++---
>>   mm/memory-failure.c      |  8 ++++++--
>>   mm/memremap.c            | 10 ++++++++++
>>   mm/migrate_device.c      | 16 +++++++---------
>>   mm/rmap.c                |  5 +++--
>>   6 files changed, 49 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 8af304f6b504..9f752ebed613 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>    * A more complete discussion of unaddressable memory may be found in
>>    * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>    *
>> + * MEMORY_DEVICE_COHERENT:
>> + * Device memory that is cache coherent from device and CPU point of view. This
>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>> + * type. Any page of a process can be migrated to such memory. However no one
> Any page might not be right, I'm pretty sure. ... just thinking about special pages
> like vdso, shared zeropage, ... pinned pages ...

Hi David,

Yes, I think you're right. This type does not cover all special pages.
I need to correct that in the cover letter.
Pinned pages are allowed as long as they're not long-term pinned.

Regards,
Alex Sierra

>
>> + * should be allowed to pin such memory so that it can always be evicted.
>> + *
>>    * MEMORY_DEVICE_FS_DAX:
>>    * Host memory that has similar access semantics as System RAM i.e. DMA
>>    * coherent and supports page pinning. In support of coordinating page
>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>   enum memory_type {
>>   	/* 0 is reserved to catch uninitialized type fields */
>>   	MEMORY_DEVICE_PRIVATE = 1,
>> +	MEMORY_DEVICE_COHERENT,
>>   	MEMORY_DEVICE_FS_DAX,
>>   	MEMORY_DEVICE_GENERIC,
>>   	MEMORY_DEVICE_PCI_P2PDMA,
>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>
>
> However, where exactly is pinning forbidden?

Long-term pinning is forbidden since it would interfere with the device
memory manager owning the device-coherent pages (e.g. evictions in TTM).
However, normal pinning is allowed on this device type.

Regards,
Alex Sierra

>
David Hildenbrand June 17, 2022, 5:33 p.m. UTC | #3
On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>> On 31.05.22 22:00, Alex Sierra wrote:
>>> Device memory that is cache coherent from device and CPU point of view.
>>> This is used on platforms that have an advanced system bus (like CAPI
>>> or CXL). Any page of a process can be migrated to such memory. However,
>>> no one should be allowed to pin such memory so that it can always be
>>> evicted.
>>>
>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>> [hch: rebased ontop of the refcount changes,
>>>        removed is_dev_private_or_coherent_page]
>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>> ---
>>>   include/linux/memremap.h | 19 +++++++++++++++++++
>>>   mm/memcontrol.c          |  7 ++++---
>>>   mm/memory-failure.c      |  8 ++++++--
>>>   mm/memremap.c            | 10 ++++++++++
>>>   mm/migrate_device.c      | 16 +++++++---------
>>>   mm/rmap.c                |  5 +++--
>>>   6 files changed, 49 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index 8af304f6b504..9f752ebed613 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>    * A more complete discussion of unaddressable memory may be found in
>>>    * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>    *
>>> + * MEMORY_DEVICE_COHERENT:
>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>> + * type. Any page of a process can be migrated to such memory. However no one
>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>> like vdso, shared zeropage, ... pinned pages ...
> 

Well, you cannot migrate long-term pinned pages, that's what I meant :)

>>
>>> + * should be allowed to pin such memory so that it can always be evicted.
>>> + *
>>>    * MEMORY_DEVICE_FS_DAX:
>>>    * Host memory that has similar access semantics as System RAM i.e. DMA
>>>    * coherent and supports page pinning. In support of coordinating page
>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>   enum memory_type {
>>>   	/* 0 is reserved to catch uninitialized type fields */
>>>   	MEMORY_DEVICE_PRIVATE = 1,
>>> +	MEMORY_DEVICE_COHERENT,
>>>   	MEMORY_DEVICE_FS_DAX,
>>>   	MEMORY_DEVICE_GENERIC,
>>>   	MEMORY_DEVICE_PCI_P2PDMA,
>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>
>>
>> However, where exactly is pinning forbidden?
> 
> Long-term pinning is forbidden since it would interfere with the device 
> memory manager owning the
> device-coherent pages (e.g. evictions in TTM). However, normal pinning 
> is allowed on this device type.

I don't see updates to folio_is_pinnable() in this patch.

So wouldn't try_grab_folio() simply pin these pages? What am I missing?
Sierra Guiza, Alejandro (Alex) June 17, 2022, 7:27 p.m. UTC | #4
On 6/17/2022 12:33 PM, David Hildenbrand wrote:
> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>> Device memory that is cache coherent from device and CPU point of view.
>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>> no one should be allowed to pin such memory so that it can always be
>>>> evicted.
>>>>
>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>>> [hch: rebased ontop of the refcount changes,
>>>>         removed is_dev_private_or_coherent_page]
>>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>> ---
>>>>    include/linux/memremap.h | 19 +++++++++++++++++++
>>>>    mm/memcontrol.c          |  7 ++++---
>>>>    mm/memory-failure.c      |  8 ++++++--
>>>>    mm/memremap.c            | 10 ++++++++++
>>>>    mm/migrate_device.c      | 16 +++++++---------
>>>>    mm/rmap.c                |  5 +++--
>>>>    6 files changed, 49 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>> index 8af304f6b504..9f752ebed613 100644
>>>> --- a/include/linux/memremap.h
>>>> +++ b/include/linux/memremap.h
>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>     * A more complete discussion of unaddressable memory may be found in
>>>>     * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>     *
>>>> + * MEMORY_DEVICE_COHERENT:
>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>> like vdso, shared zeropage, ... pinned pages ...
> Well, you cannot migrate long term pages, that's what I meant :)
>
>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>> + *
>>>>     * MEMORY_DEVICE_FS_DAX:
>>>>     * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>     * coherent and supports page pinning. In support of coordinating page
>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>    enum memory_type {
>>>>    	/* 0 is reserved to catch uninitialized type fields */
>>>>    	MEMORY_DEVICE_PRIVATE = 1,
>>>> +	MEMORY_DEVICE_COHERENT,
>>>>    	MEMORY_DEVICE_FS_DAX,
>>>>    	MEMORY_DEVICE_GENERIC,
>>>>    	MEMORY_DEVICE_PCI_P2PDMA,
>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>
>>>
>>> However, where exactly is pinning forbidden?
>> Long-term pinning is forbidden since it would interfere with the device
>> memory manager owning the
>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>> is allowed on this device type.
> I don't see updates to folio_is_pinnable() in this patch.
Device coherent type pages should return true here, as they are pinnable 
pages.
>
> So wouldn't try_grab_folio() simply pin these pages? What am I missing?

As far as I understand, this returns NULL for long-term pinned pages.
Otherwise their refcount gets incremented.

Regards,
Alex Sierra

>
David Hildenbrand June 17, 2022, 9:19 p.m. UTC | #5
On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>> no one should be allowed to pin such memory so that it can always be
>>>>> evicted.
>>>>>
>>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>>>> [hch: rebased ontop of the refcount changes,
>>>>>         removed is_dev_private_or_coherent_page]
>>>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>> ---
>>>>>    include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>    mm/memcontrol.c          |  7 ++++---
>>>>>    mm/memory-failure.c      |  8 ++++++--
>>>>>    mm/memremap.c            | 10 ++++++++++
>>>>>    mm/migrate_device.c      | 16 +++++++---------
>>>>>    mm/rmap.c                |  5 +++--
>>>>>    6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>> --- a/include/linux/memremap.h
>>>>> +++ b/include/linux/memremap.h
>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>     * A more complete discussion of unaddressable memory may be found in
>>>>>     * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>     *
>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>> like vdso, shared zeropage, ... pinned pages ...
>> Well, you cannot migrate long term pages, that's what I meant :)
>>
>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>> + *
>>>>>     * MEMORY_DEVICE_FS_DAX:
>>>>>     * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>     * coherent and supports page pinning. In support of coordinating page
>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>    enum memory_type {
>>>>>    	/* 0 is reserved to catch uninitialized type fields */
>>>>>    	MEMORY_DEVICE_PRIVATE = 1,
>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>    	MEMORY_DEVICE_FS_DAX,
>>>>>    	MEMORY_DEVICE_GENERIC,
>>>>>    	MEMORY_DEVICE_PCI_P2PDMA,
>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>
>>>>
>>>> However, where exactly is pinning forbidden?
>>> Long-term pinning is forbidden since it would interfere with the device
>>> memory manager owning the
>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>> is allowed on this device type.
>> I don't see updates to folio_is_pinnable() in this patch.
> Device coherent type pages should return true here, as they are pinnable 
> pages.

That function is only called for long-term pinnings in try_grab_folio().

>>
>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
> 
> As far as I understand this return NULL for long term pin pages. 
> Otherwise they get refcount incremented.

I don't follow.

You're saying

a) folio_is_pinnable() returns true for device coherent pages

and that

b) device coherent pages don't get long-term pinned


Yet, the code says

struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_folio(page, refs);
	else if (flags & FOLL_PIN) {
		struct folio *folio;

		/*
		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
		 * right zone, so fail and let the caller fall back to the slow
		 * path.
		 */
		if (unlikely((flags & FOLL_LONGTERM) &&
			     !is_pinnable_page(page)))
			return NULL;
		...
		return folio;
	}
}


What prevents these pages from getting long-term pinned as stated in this patch?

I am probably missing something important.
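
The concern can be made concrete with a small userspace model. This is only
a sketch under stated assumptions: the flag values, the `movable_zone` field,
`MEMORY_DEVICE_NONE`, `try_grab_page_model()` and both `is_pinnable_page_*`
helpers are hypothetical stand-ins, not the kernel's definitions. It shows
that if the pinnable check knows nothing about MEMORY_DEVICE_COHERENT, a
FOLL_PIN | FOLL_LONGTERM request on such a page succeeds, whereas a check
that rejects the type would make the fast path return NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag values are illustrative only; they are not the kernel's. */
#define FOLL_GET      0x1u
#define FOLL_PIN      0x2u
#define FOLL_LONGTERM 0x4u

enum memory_type {
	MEMORY_DEVICE_NONE = 0,		/* hypothetical: ordinary system RAM */
	MEMORY_DEVICE_PRIVATE = 1,
	MEMORY_DEVICE_COHERENT,
};

/* Hypothetical, heavily simplified page descriptor. */
struct page {
	enum memory_type type;
	bool movable_zone;	/* stands in for the zone/migratetype checks */
	int refcount;
};

/* Model of a pinnable check with no device-coherent handling. */
static bool is_pinnable_page_unpatched(const struct page *page)
{
	return !page->movable_zone;
}

/* Hypothetical variant that also rejects device-coherent pages. */
static bool is_pinnable_page_patched(const struct page *page)
{
	if (page->type == MEMORY_DEVICE_COHERENT)
		return false;
	return !page->movable_zone;
}

/* Mirrors the control flow of the quoted try_grab_folio() excerpt. */
static struct page *try_grab_page_model(struct page *page, unsigned int flags,
					bool (*pinnable)(const struct page *))
{
	if (flags & FOLL_GET) {
		page->refcount++;
		return page;
	}
	if (flags & FOLL_PIN) {
		/* Only long-term pins consult the pinnable check. */
		if ((flags & FOLL_LONGTERM) && !pinnable(page))
			return NULL;
		page->refcount++;
		return page;
	}
	return NULL;
}
```

In this model a short-term FOLL_PIN succeeds with either helper, which
matches the statement that normal pinning is allowed; only the long-term
case differs between the two variants.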
Oded Gabbay June 18, 2022, 9:32 a.m. UTC | #6
On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
<alex.sierra@amd.com> wrote:
>
>
> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
> > On 31.05.22 22:00, Alex Sierra wrote:
> >> Device memory that is cache coherent from device and CPU point of view.
> >> This is used on platforms that have an advanced system bus (like CAPI
> >> or CXL). Any page of a process can be migrated to such memory. However,
> >> no one should be allowed to pin such memory so that it can always be
> >> evicted.
> >>
> >> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> >> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> >> [hch: rebased ontop of the refcount changes,
> >>        removed is_dev_private_or_coherent_page]
> >> Signed-off-by: Christoph Hellwig <hch@lst.de>
> >> ---
> >>   include/linux/memremap.h | 19 +++++++++++++++++++
> >>   mm/memcontrol.c          |  7 ++++---
> >>   mm/memory-failure.c      |  8 ++++++--
> >>   mm/memremap.c            | 10 ++++++++++
> >>   mm/migrate_device.c      | 16 +++++++---------
> >>   mm/rmap.c                |  5 +++--
> >>   6 files changed, 49 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> index 8af304f6b504..9f752ebed613 100644
> >> --- a/include/linux/memremap.h
> >> +++ b/include/linux/memremap.h
> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
> >>    * A more complete discussion of unaddressable memory may be found in
> >>    * include/linux/hmm.h and Documentation/vm/hmm.rst.
> >>    *
> >> + * MEMORY_DEVICE_COHERENT:
> >> + * Device memory that is cache coherent from device and CPU point of view. This
> >> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> >> + * type. Any page of a process can be migrated to such memory. However no one
> > Any page might not be right, I'm pretty sure. ... just thinking about special pages
> > like vdso, shared zeropage, ... pinned pages ...
>
> Hi David,
>
> Yes, I think you're right. This type does not cover all special pages.
> I need to correct that on the cover letter.
> Pinned pages are allowed as long as they're not long term pinned.
>
> Regards,
> Alex Sierra

What if I want to hotplug this device's coherent memory, but I do
*not* want the OS to migrate any page to it?
I want to fully control what resides on this memory, as I can consider
this memory "expensive", i.e. I don't have a lot of it, I want to use it
for specific purposes, and I don't want the OS to start using it when
there is some memory pressure in the system.

Oded

>
> >
> >> + * should be allowed to pin such memory so that it can always be evicted.
> >> + *
> >>    * MEMORY_DEVICE_FS_DAX:
> >>    * Host memory that has similar access semantics as System RAM i.e. DMA
> >>    * coherent and supports page pinning. In support of coordinating page
> >> @@ -61,6 +68,7 @@ struct vmem_altmap {
> >>   enum memory_type {
> >>      /* 0 is reserved to catch uninitialized type fields */
> >>      MEMORY_DEVICE_PRIVATE = 1,
> >> +    MEMORY_DEVICE_COHERENT,
> >>      MEMORY_DEVICE_FS_DAX,
> >>      MEMORY_DEVICE_GENERIC,
> >>      MEMORY_DEVICE_PCI_P2PDMA,
> >> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
> > In general, this LGTM, and it should be correct with PageAnonExclusive I think.
> >
> >
> > However, where exactly is pinning forbidden?
>
> Long-term pinning is forbidden since it would interfere with the device
> memory manager owning the
> device-coherent pages (e.g. evictions in TTM). However, normal pinning
> is allowed on this device type.
>
> Regards,
> Alex Sierra
>
> >
Alistair Popple June 20, 2022, 12:17 a.m. UTC | #7
Oded Gabbay <oded.gabbay@gmail.com> writes:

> On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
> <alex.sierra@amd.com> wrote:
>>
>>
>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>> > On 31.05.22 22:00, Alex Sierra wrote:
>> >> Device memory that is cache coherent from device and CPU point of view.
>> >> This is used on platforms that have an advanced system bus (like CAPI
>> >> or CXL). Any page of a process can be migrated to such memory. However,
>> >> no one should be allowed to pin such memory so that it can always be
>> >> evicted.
>> >>
>> >> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> >> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> >> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>> >> [hch: rebased ontop of the refcount changes,
>> >>        removed is_dev_private_or_coherent_page]
>> >> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> >> ---
>> >>   include/linux/memremap.h | 19 +++++++++++++++++++
>> >>   mm/memcontrol.c          |  7 ++++---
>> >>   mm/memory-failure.c      |  8 ++++++--
>> >>   mm/memremap.c            | 10 ++++++++++
>> >>   mm/migrate_device.c      | 16 +++++++---------
>> >>   mm/rmap.c                |  5 +++--
>> >>   6 files changed, 49 insertions(+), 16 deletions(-)
>> >>
>> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> >> index 8af304f6b504..9f752ebed613 100644
>> >> --- a/include/linux/memremap.h
>> >> +++ b/include/linux/memremap.h
>> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
>> >>    * A more complete discussion of unaddressable memory may be found in
>> >>    * include/linux/hmm.h and Documentation/vm/hmm.rst.
>> >>    *
>> >> + * MEMORY_DEVICE_COHERENT:
>> >> + * Device memory that is cache coherent from device and CPU point of view. This
>> >> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>> >> + * type. Any page of a process can be migrated to such memory. However no one
>> > Any page might not be right, I'm pretty sure. ... just thinking about special pages
>> > like vdso, shared zeropage, ... pinned pages ...
>>
>> Hi David,
>>
>> Yes, I think you're right. This type does not cover all special pages.
>> I need to correct that on the cover letter.
>> Pinned pages are allowed as long as they're not long term pinned.
>>
>> Regards,
>> Alex Sierra
>
> What if I want to hotplug this device's coherent memory, but I do
> *not* want the OS
> to migrate any page to it ?
> I want to fully-control what resides on this memory, as I can consider
> this memory
> "expensive". i.e. I don't have a lot of it, I want to use it for
> specific purposes and
> I don't want the OS to start using it when there is some memory pressure in
> the system.

This is exactly what MEMORY_DEVICE_COHERENT is for. Device coherent
pages are only allocated by a device driver and exposed to user-space by
a driver migrating pages to them with migrate_vma. The OS can't just
start using them due to memory pressure for example.
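
As a rough illustration of that separation, here is a toy userspace model,
not kernel code. `struct memory_model`, `os_alloc_page()` and
`driver_migrate_to_device()` are hypothetical names; the last one merely
stands in for a driver-initiated migrate_vma path. The point is that the
general allocation path fails under pressure rather than falling back to
device-coherent pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for two disjoint page pools. */
struct memory_model {
	size_t system_free;	/* pages the general allocator may hand out */
	size_t device_free;	/* device-coherent pages, owned by the driver */
};

/* General-purpose allocation: draws from system memory only. */
static bool os_alloc_page(struct memory_model *m)
{
	if (m->system_free == 0)
		return false;	/* memory pressure: no fallback to device pages */
	m->system_free--;
	return true;
}

/*
 * Driver-initiated migration of one in-use system page to device memory.
 * Assumes the caller has such a page to migrate.
 */
static bool driver_migrate_to_device(struct memory_model *m)
{
	if (m->device_free == 0)
		return false;
	m->device_free--;
	m->system_free++;	/* the source system page is released */
	return true;
}
```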

 - Alistair

> Oded
>
>>
>> >
>> >> + * should be allowed to pin such memory so that it can always be evicted.
>> >> + *
>> >>    * MEMORY_DEVICE_FS_DAX:
>> >>    * Host memory that has similar access semantics as System RAM i.e. DMA
>> >>    * coherent and supports page pinning. In support of coordinating page
>> >> @@ -61,6 +68,7 @@ struct vmem_altmap {
>> >>   enum memory_type {
>> >>      /* 0 is reserved to catch uninitialized type fields */
>> >>      MEMORY_DEVICE_PRIVATE = 1,
>> >> +    MEMORY_DEVICE_COHERENT,
>> >>      MEMORY_DEVICE_FS_DAX,
>> >>      MEMORY_DEVICE_GENERIC,
>> >>      MEMORY_DEVICE_PCI_P2PDMA,
>> >> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>> > In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>> >
>> >
>> > However, where exactly is pinning forbidden?
>>
>> Long-term pinning is forbidden since it would interfere with the device
>> memory manager owning the
>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>> is allowed on this device type.
>>
>> Regards,
>> Alex Sierra
>>
>> >
Oded Gabbay June 20, 2022, 6:01 a.m. UTC | #8
On Mon, Jun 20, 2022 at 3:33 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Oded Gabbay <oded.gabbay@gmail.com> writes:
>
> > On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
> > <alex.sierra@amd.com> wrote:
> >>
> >>
> >> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
> >> > On 31.05.22 22:00, Alex Sierra wrote:
> >> >> Device memory that is cache coherent from device and CPU point of view.
> >> >> This is used on platforms that have an advanced system bus (like CAPI
> >> >> or CXL). Any page of a process can be migrated to such memory. However,
> >> >> no one should be allowed to pin such memory so that it can always be
> >> >> evicted.
> >> >>
> >> >> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> >> >> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >> >> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> >> >> [hch: rebased ontop of the refcount changes,
> >> >>        removed is_dev_private_or_coherent_page]
> >> >> Signed-off-by: Christoph Hellwig <hch@lst.de>
> >> >> ---
> >> >>   include/linux/memremap.h | 19 +++++++++++++++++++
> >> >>   mm/memcontrol.c          |  7 ++++---
> >> >>   mm/memory-failure.c      |  8 ++++++--
> >> >>   mm/memremap.c            | 10 ++++++++++
> >> >>   mm/migrate_device.c      | 16 +++++++---------
> >> >>   mm/rmap.c                |  5 +++--
> >> >>   6 files changed, 49 insertions(+), 16 deletions(-)
> >> >>
> >> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> >> index 8af304f6b504..9f752ebed613 100644
> >> >> --- a/include/linux/memremap.h
> >> >> +++ b/include/linux/memremap.h
> >> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
> >> >>    * A more complete discussion of unaddressable memory may be found in
> >> >>    * include/linux/hmm.h and Documentation/vm/hmm.rst.
> >> >>    *
> >> >> + * MEMORY_DEVICE_COHERENT:
> >> >> + * Device memory that is cache coherent from device and CPU point of view. This
> >> >> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> >> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> >> >> + * type. Any page of a process can be migrated to such memory. However no one
> >> > Any page might not be right, I'm pretty sure. ... just thinking about special pages
> >> > like vdso, shared zeropage, ... pinned pages ...
> >>
> >> Hi David,
> >>
> >> Yes, I think you're right. This type does not cover all special pages.
> >> I need to correct that on the cover letter.
> >> Pinned pages are allowed as long as they're not long term pinned.
> >>
> >> Regards,
> >> Alex Sierra
> >
> > What if I want to hotplug this device's coherent memory, but I do
> > *not* want the OS
> > to migrate any page to it ?
> > I want to fully-control what resides on this memory, as I can consider
> > this memory
> > "expensive". i.e. I don't have a lot of it, I want to use it for
> > specific purposes and
> > I don't want the OS to start using it when there is some memory pressure in
> > the system.
>
> This is exactly what MEMORY_DEVICE_COHERENT is for. Device coherent
> pages are only allocated by a device driver and exposed to user-space by
> a driver migrating pages to them with migrate_vma. The OS can't just
> start using them due to memory pressure for example.
>
>  - Alistair
Thanks for the explanation.

I guess the commit message confused me a bit, especially these two sentences:

"Any page of a process can be migrated to such memory. However no one should be
allowed to pin such memory so that it can always be evicted."

I read them as if the OS is free to choose which pages are migrated to
this memory, and anything is eligible for migration to that memory (and
that's why we also don't allow it to pin memory there).

If we are not allowed to pin anything there, can the device driver
decide to disable any option for oversubscription of this memory area?

Let's assume the user uses this memory area for doing p2p with other
CXL devices. In that case, I wouldn't want the driver/OS to migrate
pages in and out of that memory...

So either I should let the user pin those pages, or prevent them from
oversubscribing this memory area (accidentally or not).

What do you think?

>
> > Oded
> >
> >>
> >> >
> >> >> + * should be allowed to pin such memory so that it can always be evicted.
> >> >> + *
> >> >>    * MEMORY_DEVICE_FS_DAX:
> >> >>    * Host memory that has similar access semantics as System RAM i.e. DMA
> >> >>    * coherent and supports page pinning. In support of coordinating page
> >> >> @@ -61,6 +68,7 @@ struct vmem_altmap {
> >> >>   enum memory_type {
> >> >>      /* 0 is reserved to catch uninitialized type fields */
> >> >>      MEMORY_DEVICE_PRIVATE = 1,
> >> >> +    MEMORY_DEVICE_COHERENT,
> >> >>      MEMORY_DEVICE_FS_DAX,
> >> >>      MEMORY_DEVICE_GENERIC,
> >> >>      MEMORY_DEVICE_PCI_P2PDMA,
> >> >> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
> >> > In general, this LGTM, and it should be correct with PageAnonExclusive I think.
> >> >
> >> >
> >> > However, where exactly is pinning forbidden?
> >>
> >> Long-term pinning is forbidden since it would interfere with the device
> >> memory manager owning the
> >> device-coherent pages (e.g. evictions in TTM). However, normal pinning
> >> is allowed on this device type.
> >>
> >> Regards,
> >> Alex Sierra
> >>
> >> >
Alistair Popple June 20, 2022, 8:13 a.m. UTC | #9
Oded Gabbay <oded.gabbay@gmail.com> writes:

> On Mon, Jun 20, 2022 at 3:33 AM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> Oded Gabbay <oded.gabbay@gmail.com> writes:
>>
>> > On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
>> > <alex.sierra@amd.com> wrote:
>> >>
>> >>
>> >> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>> >> > On 31.05.22 22:00, Alex Sierra wrote:
>> >> >> Device memory that is cache coherent from device and CPU point of view.
>> >> >> This is used on platforms that have an advanced system bus (like CAPI
>> >> >> or CXL). Any page of a process can be migrated to such memory. However,
>> >> >> no one should be allowed to pin such memory so that it can always be
>> >> >> evicted.
>> >> >>
>> >> >> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> >> >> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> >> >> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>> >> >> [hch: rebased ontop of the refcount changes,
>> >> >>        removed is_dev_private_or_coherent_page]
>> >> >> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> >> >> ---
>> >> >>   include/linux/memremap.h | 19 +++++++++++++++++++
>> >> >>   mm/memcontrol.c          |  7 ++++---
>> >> >>   mm/memory-failure.c      |  8 ++++++--
>> >> >>   mm/memremap.c            | 10 ++++++++++
>> >> >>   mm/migrate_device.c      | 16 +++++++---------
>> >> >>   mm/rmap.c                |  5 +++--
>> >> >>   6 files changed, 49 insertions(+), 16 deletions(-)
>> >> >>
>> >> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> >> >> index 8af304f6b504..9f752ebed613 100644
>> >> >> --- a/include/linux/memremap.h
>> >> >> +++ b/include/linux/memremap.h
>> >> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
>> >> >>    * A more complete discussion of unaddressable memory may be found in
>> >> >>    * include/linux/hmm.h and Documentation/vm/hmm.rst.
>> >> >>    *
>> >> >> + * MEMORY_DEVICE_COHERENT:
>> >> >> + * Device memory that is cache coherent from device and CPU point of view. This
>> >> >> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>> >> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>> >> >> + * type. Any page of a process can be migrated to such memory. However no one
>> >> > Any page might not be right, I'm pretty sure. ... just thinking about special pages
>> >> > like vdso, shared zeropage, ... pinned pages ...
>> >>
>> >> Hi David,
>> >>
>> >> Yes, I think you're right. This type does not cover all special pages.
>> >> I need to correct that on the cover letter.
>> >> Pinned pages are allowed as long as they're not long term pinned.
>> >>
>> >> Regards,
>> >> Alex Sierra
>> >
>> > What if I want to hotplug this device's coherent memory, but I do
>> > *not* want the OS
>> > to migrate any page to it ?
>> > I want to fully-control what resides on this memory, as I can consider
>> > this memory
>> > "expensive". i.e. I don't have a lot of it, I want to use it for
>> > specific purposes and
>> > I don't want the OS to start using it when there is some memory pressure in
>> > the system.
>>
>> This is exactly what MEMORY_DEVICE_COHERENT is for. Device coherent
>> pages are only allocated by a device driver and exposed to user-space by
>> a driver migrating pages to them with migrate_vma. The OS can't just
>> start using them due to memory pressure for example.
>>
>>  - Alistair
> Thanks for the explanation.
>
> I guess the commit message confused me a bit, especially these two sentences:
>
> "Any page of a process can be migrated to such memory. However no one should be
> allowed to pin such memory so that it can always be evicted."
>
> I read them as if the OS is free to choose which pages are migrated to
> this memory,
> and anything is eligible for migration to that memory (and that's why
> we also don't
> allow it to pin memory there).
>
> If we are not allowed to pin anything there, can the device driver
> decide to disable
> any option for oversubscription of this memory area ?

I'm not sure I follow your thinking on how oversubscription would work
here, however all allocations are controlled by the driver. So if a
device's coherent memory is full a driver would be unable to migrate
pages to that device until pages are freed by the OS due to being
unmapped or the driver evicts pages by migrating them back to normal CPU
memory.
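
A minimal userspace sketch of that allocation rule, assuming a fixed-size
pool: the names model_pool, model_migrate_to_device() and
model_evict_to_system() are hypothetical and are not kernel APIs. It only
illustrates that migration to the device fails while the pool is full and
succeeds again once a page has been evicted or freed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a driver-owned pool of device coherent pages. */
struct model_pool {
	size_t capacity;	/* total device coherent pages */
	size_t used;		/* pages currently holding migrated data */
};

/* The driver migrates one page to the device; fails if the pool is full. */
static bool model_migrate_to_device(struct model_pool *pool)
{
	if (pool->used == pool->capacity)
		return false;
	pool->used++;
	return true;
}

/*
 * The driver evicts one page back to normal CPU memory (or the page is
 * freed after being unmapped); either way a slot becomes available.
 */
static bool model_evict_to_system(struct model_pool *pool)
{
	if (pool->used == 0)
		return false;
	pool->used--;
	return true;
}
```

In this model nothing other than the driver's own calls changes the pool
occupancy, which matches the point that the OS does not place pages there on
its own.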

Pinning of pages is allowed, and could prevent such migrations. However,
this patch series prevents device coherent pages from being pinned
long-term (i.e. with FOLL_LONGTERM), so the driver should always be able to
evict pages eventually.

> Let's assume the user uses this memory area for doing p2p with other
> CXL devices.
> In that case, I wouldn't want the driver/OS to migrate pages in and
> out of that memory...

The OS will not migrate pages in or out (although it may free them if no
longer required), but a driver might choose to. So at the moment it's
really up to the driver to implement what you want in this regard.

> So either I should let the user pin those pages, or prevent him from
> doing (accidently or not)
> oversubscription in this memory area.

As noted above pages can be pinned, but not long-term.

 - Alistair

> wdyt ?
>
>>
>> > Oded
>> >
>> >>
>> >> >
>> >> >> + * should be allowed to pin such memory so that it can always be evicted.
>> >> >> + *
>> >> >>    * MEMORY_DEVICE_FS_DAX:
>> >> >>    * Host memory that has similar access semantics as System RAM i.e. DMA
>> >> >>    * coherent and supports page pinning. In support of coordinating page
>> >> >> @@ -61,6 +68,7 @@ struct vmem_altmap {
>> >> >>   enum memory_type {
>> >> >>      /* 0 is reserved to catch uninitialized type fields */
>> >> >>      MEMORY_DEVICE_PRIVATE = 1,
>> >> >> +    MEMORY_DEVICE_COHERENT,
>> >> >>      MEMORY_DEVICE_FS_DAX,
>> >> >>      MEMORY_DEVICE_GENERIC,
>> >> >>      MEMORY_DEVICE_PCI_P2PDMA,
>> >> >> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>> >> > In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>> >> >
>> >> >
>> >> > However, where exactly is pinning forbidden?
>> >>
>> >> Long-term pinning is forbidden since it would interfere with the device
>> >> memory manager owning the
>> >> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>> >> is allowed on this device type.
>> >>
>> >> Regards,
>> >> Alex Sierra
>> >>
>> >> >
Oded Gabbay June 20, 2022, 12:23 p.m. UTC | #10
On Mon, Jun 20, 2022 at 11:50 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Oded Gabbay <oded.gabbay@gmail.com> writes:
>
> > On Mon, Jun 20, 2022 at 3:33 AM Alistair Popple <apopple@nvidia.com> wrote:
> >>
> >>
> >> Oded Gabbay <oded.gabbay@gmail.com> writes:
> >>
> >> > On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
> >> > <alex.sierra@amd.com> wrote:
> >> >>
> >> >>
> >> >> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
> >> >> > On 31.05.22 22:00, Alex Sierra wrote:
> >> >> >> Device memory that is cache coherent from device and CPU point of view.
> >> >> >> This is used on platforms that have an advanced system bus (like CAPI
> >> >> >> or CXL). Any page of a process can be migrated to such memory. However,
> >> >> >> no one should be allowed to pin such memory so that it can always be
> >> >> >> evicted.
> >> >> >>
> >> >> >> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> >> >> >> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >> >> >> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> >> >> >> [hch: rebased ontop of the refcount changes,
> >> >> >>        removed is_dev_private_or_coherent_page]
> >> >> >> Signed-off-by: Christoph Hellwig <hch@lst.de>
> >> >> >> ---
> >> >> >>   include/linux/memremap.h | 19 +++++++++++++++++++
> >> >> >>   mm/memcontrol.c          |  7 ++++---
> >> >> >>   mm/memory-failure.c      |  8 ++++++--
> >> >> >>   mm/memremap.c            | 10 ++++++++++
> >> >> >>   mm/migrate_device.c      | 16 +++++++---------
> >> >> >>   mm/rmap.c                |  5 +++--
> >> >> >>   6 files changed, 49 insertions(+), 16 deletions(-)
> >> >> >>
> >> >> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> >> >> index 8af304f6b504..9f752ebed613 100644
> >> >> >> --- a/include/linux/memremap.h
> >> >> >> +++ b/include/linux/memremap.h
> >> >> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
> >> >> >>    * A more complete discussion of unaddressable memory may be found in
> >> >> >>    * include/linux/hmm.h and Documentation/vm/hmm.rst.
> >> >> >>    *
> >> >> >> + * MEMORY_DEVICE_COHERENT:
> >> >> >> + * Device memory that is cache coherent from device and CPU point of view. This
> >> >> >> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> >> >> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> >> >> >> + * type. Any page of a process can be migrated to such memory. However no one
> >> >> > Any page might not be right, I'm pretty sure. ... just thinking about special pages
> >> >> > like vdso, shared zeropage, ... pinned pages ...
> >> >>
> >> >> Hi David,
> >> >>
> >> >> Yes, I think you're right. This type does not cover all special pages.
> >> >> I need to correct that on the cover letter.
> >> >> Pinned pages are allowed as long as they're not long term pinned.
> >> >>
> >> >> Regards,
> >> >> Alex Sierra
> >> >
> >> > What if I want to hotplug this device's coherent memory, but I do
> >> > *not* want the OS
> >> > to migrate any page to it ?
> >> > I want to fully-control what resides on this memory, as I can consider
> >> > this memory
> >> > "expensive". i.e. I don't have a lot of it, I want to use it for
> >> > specific purposes and
> >> > I don't want the OS to start using it when there is some memory pressure in
> >> > the system.
> >>
> >> This is exactly what MEMORY_DEVICE_COHERENT is for. Device coherent
> >> pages are only allocated by a device driver and exposed to user-space by
> >> a driver migrating pages to them with migrate_vma. The OS can't just
> >> start using them due to memory pressure for example.
> >>
> >>  - Alistair
> > Thanks for the explanation.
> >
> > I guess the commit message confused me a bit, especially these two sentences:
> >
> > "Any page of a process can be migrated to such memory. However no one should be
> > allowed to pin such memory so that it can always be evicted."
> >
> > I read them as if the OS is free to choose which pages are migrated to
> > this memory,
> > and anything is eligible for migration to that memory (and that's why
> > we also don't
> > allow it to pin memory there).
> >
> > If we are not allowed to pin anything there, can the device driver
> > decide to disable
> > any option for oversubscription of this memory area ?
>
> I'm not sure I follow your thinking on how oversubscription would work
> here, however all allocations are controlled by the driver. So if a
> device's coherent memory is full a driver would be unable to migrate
> pages to that device until pages are freed by the OS due to being
> unmapped or the driver evicts pages by migrating them back to normal CPU
> memory.
>
> Pinning of pages is allowed, and could prevent such migrations. However
> this patch series prevents device coherent pages from being pinned
> longterm (ie. with FOLL_LONGTERM), so it should always be able to evict
> pages eventually.
>
> > Let's assume the user uses this memory area for doing p2p with other
> > CXL devices.
> > In that case, I wouldn't want the driver/OS to migrate pages in and
> > out of that memory...
>
> The OS will not migrate pages in or out (although it may free them if no
> longer required), but a driver might choose to. So at the moment it's
> really up to the driver to implement what you want in this regards.

I see.
In other words, we don't want to allow long-term pinning but
the driver can decide it doesn't want to evict pages out
of that memory, until they are freed.

Thanks,
Oded
>
> > So either I should let the user pin those pages, or prevent him from
> > doing (accidently or not)
> > oversubscription in this memory area.
>
> As noted above pages can be pinned, but not long-term.
>
>  - Alistair
>
> > wdyt ?
> >
> >>
> >> > Oded
> >> >
> >> >>
> >> >> >
> >> >> >> + * should be allowed to pin such memory so that it can always be evicted.
> >> >> >> + *
> >> >> >>    * MEMORY_DEVICE_FS_DAX:
> >> >> >>    * Host memory that has similar access semantics as System RAM i.e. DMA
> >> >> >>    * coherent and supports page pinning. In support of coordinating page
> >> >> >> @@ -61,6 +68,7 @@ struct vmem_altmap {
> >> >> >>   enum memory_type {
> >> >> >>      /* 0 is reserved to catch uninitialized type fields */
> >> >> >>      MEMORY_DEVICE_PRIVATE = 1,
> >> >> >> +    MEMORY_DEVICE_COHERENT,
> >> >> >>      MEMORY_DEVICE_FS_DAX,
> >> >> >>      MEMORY_DEVICE_GENERIC,
> >> >> >>      MEMORY_DEVICE_PCI_P2PDMA,
> >> >> >> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
> >> >> > In general, this LGTM, and it should be correct with PageAnonExclusive I think.
> >> >> >
> >> >> >
> >> >> > However, where exactly is pinning forbidden?
> >> >>
> >> >> Long-term pinning is forbidden since it would interfere with the device
> >> >> memory manager owning the
> >> >> device-coherent pages (e.g. evictions in TTM). However, normal pinning
> >> >> is allowed on this device type.
> >> >>
> >> >> Regards,
> >> >> Alex Sierra
> >> >>
> >> >> >
Felix Kuehling June 21, 2022, 11:25 a.m. UTC | #11
Am 6/17/22 um 23:19 schrieb David Hildenbrand:
> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>> evicted.
>>>>>>
>>>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>          removed is_dev_private_or_coherent_page]
>>>>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>>> ---
>>>>>>     include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>     mm/memcontrol.c          |  7 ++++---
>>>>>>     mm/memory-failure.c      |  8 ++++++--
>>>>>>     mm/memremap.c            | 10 ++++++++++
>>>>>>     mm/migrate_device.c      | 16 +++++++---------
>>>>>>     mm/rmap.c                |  5 +++--
>>>>>>     6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>> --- a/include/linux/memremap.h
>>>>>> +++ b/include/linux/memremap.h
>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>      * A more complete discussion of unaddressable memory may be found in
>>>>>>      * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>      *
>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>> like vdso, shared zeropage, ... pinned pages ...
>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>
>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>> + *
>>>>>>      * MEMORY_DEVICE_FS_DAX:
>>>>>>      * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>      * coherent and supports page pinning. In support of coordinating page
>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>     enum memory_type {
>>>>>>     	/* 0 is reserved to catch uninitialized type fields */
>>>>>>     	MEMORY_DEVICE_PRIVATE = 1,
>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>     	MEMORY_DEVICE_FS_DAX,
>>>>>>     	MEMORY_DEVICE_GENERIC,
>>>>>>     	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>
>>>>>
>>>>> However, where exactly is pinning forbidden?
>>>> Long-term pinning is forbidden since it would interfere with the device
>>>> memory manager owning the
>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>> is allowed on this device type.
>>> I don't see updates to folio_is_pinnable() in this patch.
>> Device coherent type pages should return true here, as they are pinnable
>> pages.
> That function is only called for long-term pinnings in try_grab_folio().
>
>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>> As far as I understand this return NULL for long term pin pages.
>> Otherwise they get refcount incremented.
> I don't follow.
>
> You're saying
>
> a) folio_is_pinnable() returns true for device coherent pages
>
> and that
>
> b) device coherent pages don't get long-term pinned
>
>
> Yet, the code says
>
> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> {
> 	if (flags & FOLL_GET)
> 		return try_get_folio(page, refs);
> 	else if (flags & FOLL_PIN) {
> 		struct folio *folio;
>
> 		/*
> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
> 		 * right zone, so fail and let the caller fall back to the slow
> 		 * path.
> 		 */
> 		if (unlikely((flags & FOLL_LONGTERM) &&
> 			     !is_pinnable_page(page)))
> 			return NULL;
> 		...
> 		return folio;
> 	}
> }
>
>
> What prevents these pages from getting long-term pinned as stated in this patch?

Long-term pinning is handled by __gup_longterm_locked, which migrates 
pages returned by __get_user_pages_locked that cannot be long-term 
pinned. try_grab_folio is OK to grab the pages. Anything that can't be 
long-term pinned will be migrated afterwards, and 
__get_user_pages_locked will be retried. The migration of 
DEVICE_COHERENT pages was implemented by Alistair in patch 5/13 
("mm/gup: migrate device coherent pages when pinning instead of failing").
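
The pin, check, migrate and retry flow described above can be modelled in
userspace roughly as follows. This is only a sketch under stated
assumptions: model_page, model_longterm_pinnable() and model_pin_longterm()
are hypothetical names, "migration" is reduced to changing the page type,
and none of this is the actual mm/gup.c code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page model: only the memory type and a pin count. */
enum model_mem { MODEL_SYSTEM_RAM, MODEL_DEVICE_COHERENT };

struct model_page {
	enum model_mem type;
	int pincount;
};

/* Assumption: device coherent pages must not stay long-term pinned. */
static bool model_longterm_pinnable(const struct model_page *p)
{
	return p->type != MODEL_DEVICE_COHERENT;
}

/*
 * Pin every page, then check the result. If any page cannot be long-term
 * pinned, drop all pins, migrate the offending pages to system RAM and
 * retry. Returns the number of passes that were needed.
 */
static int model_pin_longterm(struct model_page *pages, size_t n)
{
	int passes = 0;

	for (;;) {
		bool need_migrate = false;
		size_t i;

		passes++;
		for (i = 0; i < n; i++)
			pages[i].pincount++;		/* grab the pages */

		for (i = 0; i < n; i++)
			if (!model_longterm_pinnable(&pages[i]))
				need_migrate = true;
		if (!need_migrate)
			return passes;

		for (i = 0; i < n; i++) {
			pages[i].pincount--;		/* unpin before migrating */
			if (!model_longterm_pinnable(&pages[i]))
				pages[i].type = MODEL_SYSTEM_RAM;
		}
	}
}
```

With one device coherent page in the set, the model needs two passes: the
first one detects and migrates the page, the second one pins everything.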

Regards,
   Felix


>
> I am probably missing something important.
>
P.S.: I'm on vacation and looking at a tiny screen. Hope I didn't miss 
anything myself.
David Hildenbrand June 21, 2022, 11:32 a.m. UTC | #12
On 21.06.22 13:25, Felix Kuehling wrote:
> 
> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>> evicted.
>>>>>>>
>>>>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>>>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>          removed is_dev_private_or_coherent_page]
>>>>>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>>>> ---
>>>>>>>     include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>     mm/memcontrol.c          |  7 ++++---
>>>>>>>     mm/memory-failure.c      |  8 ++++++--
>>>>>>>     mm/memremap.c            | 10 ++++++++++
>>>>>>>     mm/migrate_device.c      | 16 +++++++---------
>>>>>>>     mm/rmap.c                |  5 +++--
>>>>>>>     6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>> --- a/include/linux/memremap.h
>>>>>>> +++ b/include/linux/memremap.h
>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>      * A more complete discussion of unaddressable memory may be found in
>>>>>>>      * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>      *
>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>
>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>> + *
>>>>>>>      * MEMORY_DEVICE_FS_DAX:
>>>>>>>      * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>      * coherent and supports page pinning. In support of coordinating page
>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>     enum memory_type {
>>>>>>>     	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>     	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>     	MEMORY_DEVICE_FS_DAX,
>>>>>>>     	MEMORY_DEVICE_GENERIC,
>>>>>>>     	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>
>>>>>>
>>>>>> However, where exactly is pinning forbidden?
>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>> memory manager owning the
>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>> is allowed on this device type.
>>>> I don't see updates to folio_is_pinnable() in this patch.
>>> Device coherent type pages should return true here, as they are pinnable
>>> pages.
>> That function is only called for long-term pinnings in try_grab_folio().
>>
>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>> As far as I understand this return NULL for long term pin pages.
>>> Otherwise they get refcount incremented.
>> I don't follow.
>>
>> You're saying
>>
>> a) folio_is_pinnable() returns true for device coherent pages
>>
>> and that
>>
>> b) device coherent pages don't get long-term pinned
>>
>>
>> Yet, the code says
>>
>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>> {
>> 	if (flags & FOLL_GET)
>> 		return try_get_folio(page, refs);
>> 	else if (flags & FOLL_PIN) {
>> 		struct folio *folio;
>>
>> 		/*
>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>> 		 * right zone, so fail and let the caller fall back to the slow
>> 		 * path.
>> 		 */
>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>> 			     !is_pinnable_page(page)))
>> 			return NULL;
>> 		...
>> 		return folio;
>> 	}
>> }
>>
>>
>> What prevents these pages from getting long-term pinned as stated in this patch?
> 
> Long-term pinning is handled by __gup_longterm_locked, which migrates 
> pages returned by __get_user_pages_locked that cannot be long-term 
> pinned. try_grab_folio is OK to grab the pages. Anything that can't be 
> long-term pinned will be migrated afterwards, and 
> __get_user_pages_locked will be retried. The migration of 
> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13 
> ("mm/gup: migrate device coherent pages when pinning instead of failing").

Thanks.

__gup_longterm_locked()->check_and_migrate_movable_pages()

Which checks folio_is_pinnable() and does nothing for a folio if it returns true.

Sorry to be dense here, but I don't see how what's stated in this patch
works without adjusting folio_is_pinnable().
Alistair Popple June 21, 2022, 11:55 a.m. UTC | #13
David Hildenbrand <david@redhat.com> writes:

> On 21.06.22 13:25, Felix Kuehling wrote:
>>
>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>> evicted.
>>>>>>>>
>>>>>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>>>>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>>>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>          removed is_dev_private_or_coherent_page]
>>>>>>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>>>>> ---
>>>>>>>>     include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>     mm/memcontrol.c          |  7 ++++---
>>>>>>>>     mm/memory-failure.c      |  8 ++++++--
>>>>>>>>     mm/memremap.c            | 10 ++++++++++
>>>>>>>>     mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>     mm/rmap.c                |  5 +++--
>>>>>>>>     6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>      * A more complete discussion of unaddressable memory may be found in
>>>>>>>>      * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>      *
>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>
>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>> + *
>>>>>>>>      * MEMORY_DEVICE_FS_DAX:
>>>>>>>>      * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>      * coherent and supports page pinning. In support of coordinating page
>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>     enum memory_type {
>>>>>>>>     	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>     	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>     	MEMORY_DEVICE_FS_DAX,
>>>>>>>>     	MEMORY_DEVICE_GENERIC,
>>>>>>>>     	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>
>>>>>>>
>>>>>>> However, where exactly is pinning forbidden?
>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>> memory manager owning the
>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>> is allowed on this device type.
>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>> Device coherent type pages should return true here, as they are pinnable
>>>> pages.
>>> That function is only called for long-term pinnings in try_grab_folio().
>>>
>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>> As far as I understand this return NULL for long term pin pages.
>>>> Otherwise they get refcount incremented.
>>> I don't follow.
>>>
>>> You're saying
>>>
>>> a) folio_is_pinnable() returns true for device coherent pages
>>>
>>> and that
>>>
>>> b) device coherent pages don't get long-term pinned
>>>
>>>
>>> Yet, the code says
>>>
>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>> {
>>> 	if (flags & FOLL_GET)
>>> 		return try_get_folio(page, refs);
>>> 	else if (flags & FOLL_PIN) {
>>> 		struct folio *folio;
>>>
>>> 		/*
>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>> 		 * right zone, so fail and let the caller fall back to the slow
>>> 		 * path.
>>> 		 */
>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>> 			     !is_pinnable_page(page)))
>>> 			return NULL;
>>> 		...
>>> 		return folio;
>>> 	}
>>> }
>>>
>>>
>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>
>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>> pages returned by __get_user_pages_locked that cannot be long-term
>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>> long-term pinned will be migrated afterwards, and
>> __get_user_pages_locked will be retried. The migration of
>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>
> Thanks.
>
> __gup_longterm_locked()->check_and_migrate_movable_pages()
>
> Which checks folio_is_pinnable() and doesn't do anything if set.
>
> Sorry to be dense here, but I don't see how what's stated in this patch
> works without adjusting folio_is_pinnable().

Ugh, I think you might be right about try_grab_folio().

We didn't update folio_is_pinnable() to include device coherent pages
because device coherent pages are pinnable. It is really just
FOLL_LONGTERM that we want to prevent here.

For normal PUP that is done by my change in
check_and_migrate_movable_pages() which migrates pages being pinned with
FOLL_LONGTERM. But I think I incorrectly assumed we would take the
pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
So I think the check in try_grab_folio() needs to be:

 		if (unlikely((flags & FOLL_LONGTERM) &&
 			     (!is_pinnable_page(page) || is_device_coherent_page(page))))
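
As a quick sanity check of that condition, here is a userspace model of the
predicate. It is a sketch only: sketch_page, the SKETCH_FOLL_* values and
sketch_fast_path_may_grab() are hypothetical stand-ins, not the kernel's
definitions, and the model assumes (as discussed above) that the existing
pinnable check returns true for device coherent pages.

```c
#include <stdbool.h>

/* Hypothetical flag values; the kernel's FOLL_* values may differ. */
#define SKETCH_FOLL_PIN		0x1u
#define SKETCH_FOLL_LONGTERM	0x2u

/* Hypothetical page model carrying the two properties the check uses. */
struct sketch_page {
	bool pinnable;		/* stand-in for is_pinnable_page() */
	bool device_coherent;	/* stand-in for is_device_coherent_page() */
};

/*
 * Returns false when the fast path should bail out and let the caller
 * fall back to the slow path: a long-term pin of a page that is either
 * not pinnable or device coherent.
 */
static bool sketch_fast_path_may_grab(const struct sketch_page *page,
				      unsigned int flags)
{
	if ((flags & SKETCH_FOLL_LONGTERM) &&
	    (!page->pinnable || page->device_coherent))
		return false;
	return true;
}
```

In this model a device coherent page can still be grabbed for a normal pin,
and only the FOLL_LONGTERM case is rejected on the fast path.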

 - Alistair
David Hildenbrand June 21, 2022, 12:25 p.m. UTC | #14
On 21.06.22 13:55, Alistair Popple wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>
>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>> evicted.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>>>>>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>>>>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>          removed is_dev_private_or_coherent_page]
>>>>>>>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>> ---
>>>>>>>>>     include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>     mm/memcontrol.c          |  7 ++++---
>>>>>>>>>     mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>     mm/memremap.c            | 10 ++++++++++
>>>>>>>>>     mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>     mm/rmap.c                |  5 +++--
>>>>>>>>>     6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>      * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>      * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>      *
>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>
>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>> + *
>>>>>>>>>      * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>      * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>      * coherent and supports page pinning. In support of coordinating page
>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>     enum memory_type {
>>>>>>>>>     	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>     	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>     	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>     	MEMORY_DEVICE_GENERIC,
>>>>>>>>>     	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>
>>>>>>>>
>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>> memory manager owning the
>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>> is allowed on this device type.
>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>> pages.
>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>
>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>> As far as I understand this return NULL for long term pin pages.
>>>>> Otherwise they get refcount incremented.
>>>> I don't follow.
>>>>
>>>> You're saying
>>>>
>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>
>>>> and that
>>>>
>>>> b) device coherent pages don't get long-term pinned
>>>>
>>>>
>>>> Yet, the code says
>>>>
>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>> {
>>>> 	if (flags & FOLL_GET)
>>>> 		return try_get_folio(page, refs);
>>>> 	else if (flags & FOLL_PIN) {
>>>> 		struct folio *folio;
>>>>
>>>> 		/*
>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>> 		 * path.
>>>> 		 */
>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>> 			     !is_pinnable_page(page)))
>>>> 			return NULL;
>>>> 		...
>>>> 		return folio;
>>>> 	}
>>>> }
>>>>
>>>>
>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>
>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>> pages returned by __get_user_pages_locked that cannot be long-term
>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>> long-term pinned will be migrated afterwards, and
>>> __get_user_pages_locked will be retried. The migration of
>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>
>> Thanks.
>>
>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>
>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>
>> Sorry to be dense here, but I don't see how what's stated in this patch
>> works without adjusting folio_is_pinnable().
> 
> Ugh, I think you might be right about try_grab_folio().
> 
> We didn't update folio_is_pinnable() to include device coherent pages
> because device coherent pages are pinnable. It is really just
> FOLL_LONGTERM that we want to prevent here.
> 
> For normal PUP that is done by my change in
> check_and_migrate_movable_pages() which migrates pages being pinned with
> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
> So I think the check in try_grab_folio() needs to be:

I think I said it already (and I might be wrong without reading the
code), but folio_is_pinnable() is *only* called for long-term pinnings.

It should actually be called folio_is_longterm_pinnable().

That's where that check should go, no?
Sierra Guiza, Alejandro (Alex) June 21, 2022, 4:08 p.m. UTC | #15
On 6/21/2022 7:25 AM, David Hildenbrand wrote:
> On 21.06.22 13:55, Alistair Popple wrote:
>> David Hildenbrand<david@redhat.com>  writes:
>>
>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>> On 6/17/22 at 23:19, David Hildenbrand wrote:
>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>> evicted.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>           removed is_dev_private_or_coherent_page]
>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>> ---
>>>>>>>>>>      include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>      mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>      mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>      mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>      mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>      mm/rmap.c                |  5 +++--
>>>>>>>>>>      6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>       * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>       * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>       *
>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>
>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>> + *
>>>>>>>>>>       * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>       * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>       * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>      enum memory_type {
>>>>>>>>>>      	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>      	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>      	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>      	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>      	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>> memory manager owning the
>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>> is allowed on this device type.
>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>> pages.
>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>
>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>> Otherwise they get refcount incremented.
>>>>> I don't follow.
>>>>>
>>>>> You're saying
>>>>>
>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>
>>>>> and that
>>>>>
>>>>> b) device coherent pages don't get long-term pinned
>>>>>
>>>>>
>>>>> Yet, the code says
>>>>>
>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>> {
>>>>> 	if (flags & FOLL_GET)
>>>>> 		return try_get_folio(page, refs);
>>>>> 	else if (flags & FOLL_PIN) {
>>>>> 		struct folio *folio;
>>>>>
>>>>> 		/*
>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>> 		 * path.
>>>>> 		 */
>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>> 			     !is_pinnable_page(page)))
>>>>> 			return NULL;
>>>>> 		...
>>>>> 		return folio;
>>>>> 	}
>>>>> }
>>>>>
>>>>>
>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>> long-term pinned will be migrated afterwards, and
>>>> __get_user_pages_locked will be retried. The migration of
>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>> Thanks.
>>>
>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>
>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>
>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>> works without adjusting folio_is_pinnable().
>> Ugh, I think you might be right about try_grab_folio().
>>
>> We didn't update folio_is_pinnable() to include device coherent pages
>> because device coherent pages are pinnable. It is really just
>> FOLL_LONGTERM that we want to prevent here.
>>
>> For normal PUP that is done by my change in
>> check_and_migrate_movable_pages() which migrates pages being pinned with
>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>> So I think the check in try_grab_folio() needs to be:
> I think I said it already (and I might be wrong without reading the
> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>
> It should actually be called folio_is_longterm_pinnable().
>
> That's where that check should go, no?

David, I think you're right. We didn't catch this because the LONGTERM
gup test we added to hmm-test only calls pin_user_pages. Apparently
try_grab_folio is called only from the fast callers (e.g.
pin_user_pages_fast/get_user_pages_fast). I have added a conditional
similar to what Alistair proposed, returning NULL on LONGTERM &&
(coherent_pages || !folio_is_pinnable) in try_grab_folio. I also added
a new gup test with LONGTERM set that calls pin_user_pages_fast.
Returning NULL under this condition does cause the migration from
device to system memory.
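
The fallback described above can be sketched as a small userspace model
in C. Everything here is hypothetical (struct model_page and the
model_* helpers are invented names), and the slow path is heavily
simplified: the real one pins, checks, unpins, migrates and retries,
whereas this sketch just migrates and then pins.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical memory locations for the modeled page. */
enum model_location { MODEL_SYSTEM_RAM, MODEL_DEVICE_COHERENT };

struct model_page {
	enum model_location where;
	int pincount;
};

/* Models the fast path: refuses long-term pins of device coherent pages. */
static bool model_pin_fast(struct model_page *page, bool longterm)
{
	if (longterm && page->where == MODEL_DEVICE_COHERENT)
		return false;	/* stands in for try_grab_folio() returning NULL */
	page->pincount++;
	return true;
}

/* Models the slow path: move the page to system memory first, then pin. */
static void model_pin_slow(struct model_page *page, bool longterm)
{
	if (longterm && page->where == MODEL_DEVICE_COHERENT)
		page->where = MODEL_SYSTEM_RAM;	/* stands in for migration */
	page->pincount++;
}

/* Models a fast-pin entry point that falls back when the fast path fails. */
static void model_pin_user_page(struct model_page *page, bool longterm)
{
	if (!model_pin_fast(page, longterm))
		model_pin_slow(page, longterm);
}
```

With the long-term flag set, a page that starts in device coherent
memory ends up pinned in system memory; without it, the page is pinned
where it is.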

Actually, I'm having a different problem with a call to
PageAnonExclusive from try_to_migrate_one during a page fault in an HMM
test that first migrates pages to device private memory and then forks
to mark these pages as COW. It appears to trip the first check,
VM_BUG_ON_PGFLAGS(!PageAnon(page), page).

Regards,
Alex Sierra
David Hildenbrand June 21, 2022, 4:16 p.m. UTC | #16
On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>> On 21.06.22 13:55, Alistair Popple wrote:
>>> David Hildenbrand<david@redhat.com>  writes:
>>>
>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>> On 6/17/22 at 23:19, David Hildenbrand wrote:
>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>> evicted.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>           removed is_dev_private_or_coherent_page]
>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>> ---
>>>>>>>>>>>      include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>      mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>      mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>      mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>      mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>      mm/rmap.c                |  5 +++--
>>>>>>>>>>>      6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>       * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>       * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>       *
>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>
>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>> + *
>>>>>>>>>>>       * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>       * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>       * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>      enum memory_type {
>>>>>>>>>>>      	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>      	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>      	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>      	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>      	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>> memory manager owning the
>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>> is allowed on this device type.
>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>> pages.
>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>
>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>>> Otherwise they get refcount incremented.
>>>>>> I don't follow.
>>>>>>
>>>>>> You're saying
>>>>>>
>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>
>>>>>> and that
>>>>>>
>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>
>>>>>>
>>>>>> Yet, the code says
>>>>>>
>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>> {
>>>>>> 	if (flags & FOLL_GET)
>>>>>> 		return try_get_folio(page, refs);
>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>> 		struct folio *folio;
>>>>>>
>>>>>> 		/*
>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>> 		 * path.
>>>>>> 		 */
>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>> 			     !is_pinnable_page(page)))
>>>>>> 			return NULL;
>>>>>> 		...
>>>>>> 		return folio;
>>>>>> 	}
>>>>>> }
>>>>>>
>>>>>>
>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>> long-term pinned will be migrated afterwards, and
>>>>> __get_user_pages_locked will be retried. The migration of
>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>> Thanks.
>>>>
>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>
>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>
>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>> works without adjusting folio_is_pinnable().
>>> Ugh, I think you might be right about try_grab_folio().
>>>
>>> We didn't update folio_is_pinnable() to include device coherent pages
>>> because device coherent pages are pinnable. It is really just
>>> FOLL_LONGTERM that we want to prevent here.
>>>
>>> For normal PUP that is done by my change in
>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>> So I think the check in try_grab_folio() needs to be:
>> I think I said it already (and I might be wrong without reading the
>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>
>> It should actually be called folio_is_longterm_pinnable().
>>
>> That's where that check should go, no?
> 
> David, I think you're right. We didn't catch this because the LONGTERM
> gup test we added to hmm-test only calls pin_user_pages. Apparently
> try_grab_folio is called only from the fast callers (e.g.
> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
> similar to what Alistair proposed, returning NULL on LONGTERM &&
> (coherent_pages || !folio_is_pinnable) in try_grab_folio. I also added
> a new gup test with LONGTERM set that calls pin_user_pages_fast.
> Returning NULL under this condition does cause the migration from
> device to system memory.
> 

Why can't coherent memory simply put its checks into
folio_is_pinnable()? I don't get why we have to do things differently
here.

> Actually, I'm having a different problem with a call to
> PageAnonExclusive from try_to_migrate_one during a page fault in an HMM
> test that first migrates pages to device private memory and then forks
> to mark these pages as COW. It appears to trip the first check,
> VM_BUG_ON_PGFLAGS(!PageAnon(page), page).

With or without this series? A backtrace would be great.
Alistair Popple June 22, 2022, 12:16 a.m. UTC | #17
David Hildenbrand <david@redhat.com> writes:

> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>
>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>> David Hildenbrand<david@redhat.com>  writes:
>>>>
>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>> On 6/17/22 at 23:19, David Hildenbrand wrote:
>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>> evicted.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>           removed is_dev_private_or_coherent_page]
>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>>> ---
>>>>>>>>>>>>      include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>      mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>>      mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>>      mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>>      mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>>      mm/rmap.c                |  5 +++--
>>>>>>>>>>>>      6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>       * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>       * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>       *
>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>
>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>> + *
>>>>>>>>>>>>       * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>       * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>       * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>      enum memory_type {
>>>>>>>>>>>>      	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>      	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>      	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>      	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>      	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>> memory manager owning the
>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>> is allowed on this device type.
>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>> pages.
>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>
>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>>>> Otherwise they get refcount incremented.
>>>>>>> I don't follow.
>>>>>>>
>>>>>>> You're saying
>>>>>>>
>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>
>>>>>>> and that
>>>>>>>
>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>
>>>>>>>
>>>>>>> Yet, the code says
>>>>>>>
>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>> {
>>>>>>> 	if (flags & FOLL_GET)
>>>>>>> 		return try_get_folio(page, refs);
>>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>>> 		struct folio *folio;
>>>>>>>
>>>>>>> 		/*
>>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>>> 		 * path.
>>>>>>> 		 */
>>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>> 			     !is_pinnable_page(page)))
>>>>>>> 			return NULL;
>>>>>>> 		...
>>>>>>> 		return folio;
>>>>>>> 	}
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>> long-term pinned will be migrated afterwards, and
>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>> Thanks.
>>>>>
>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>
>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>
>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>> works without adjusting folio_is_pinnable().
>>>> Ugh, I think you might be right about try_grab_folio().
>>>>
>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>> because device coherent pages are pinnable. It is really just
>>>> FOLL_LONGTERM that we want to prevent here.
>>>>
>>>> For normal PUP that is done by my change in
>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>> So I think the check in try_grab_folio() needs to be:
>>> I think I said it already (and I might be wrong without reading the
>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>
>>> It should actually be called folio_is_longterm_pinnable().
>>>
>>> That's where that check should go, no?
>>
>> David, I think you're right. We didn't catch this because the LONGTERM
>> gup test we added to hmm-test only calls pin_user_pages. Apparently
>> try_grab_folio is called only from the fast callers (e.g.
>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
>> similar to what Alistair proposed, returning NULL on LONGTERM &&
>> (coherent_pages || !folio_is_pinnable) in try_grab_folio. I also added
>> a new gup test with LONGTERM set that calls pin_user_pages_fast.
>> Returning NULL under this condition does cause the migration from
>> device to system memory.
>>
>
> Why can't coherent memory simply put its checks into
> folio_is_pinnable()? I don't get why we have to do things differently
> here.

I'd made the reasonable assumption that
folio_is_pinnable()/is_pinnable_page() were used to check if the
folio/page is pinnable or not regardless of FOLL_LONGTERM. Looking at
the code more closely though I see both are actually only used on paths
checking for FOLL_LONGTERM pinning.

So I agree - we should rename these to
folio_is_longterm_pinnable()/is_longterm_pinnable_page() and add the
check for coherent pages there. Thanks for pointing that out.
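
As a rough userspace sketch of what that could look like (a model only,
not the actual patch: struct folio_model and every model_* name are
hypothetical, and "otherwise_unpinnable" merely stands in for whatever
conditions the existing helper already checks):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the folio properties relevant to long-term pinning. */
struct folio_model {
	bool otherwise_unpinnable;	/* stands in for the existing checks */
	bool device_coherent;		/* models a device coherent folio */
};

/*
 * Models the renamed helper: device coherent folios are never long-term
 * pinnable, in addition to whatever the existing checks already reject.
 */
static bool model_folio_is_longterm_pinnable(const struct folio_model *folio)
{
	if (folio->device_coherent)
		return false;
	return !folio->otherwise_unpinnable;
}

/* Fast path: may the pin proceed without falling back to the slow path? */
static bool model_fast_path_may_pin(const struct folio_model *folio,
				    bool longterm)
{
	return !longterm || model_folio_is_longterm_pinnable(folio);
}

/* Slow path: must the folio be migrated before it is long-term pinned? */
static bool model_slow_path_must_migrate(const struct folio_model *folio,
					 bool longterm)
{
	return longterm && !model_folio_is_longterm_pinnable(folio);
}
```

Putting the coherent check in the one helper means both paths agree
without either call site needing its own special case.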

 - Alistair

>> Actually, I'm having a different problem with a call to
>> PageAnonExclusive from try_to_migrate_one during a page fault in an HMM
>> test that first migrates pages to device private memory and then forks
>> to mark these pages as COW. It appears to trip the first check,
>> VM_BUG_ON_PGFLAGS(!PageAnon(page), page).
>
> With or without this series? A backtrace would be great.
Sierra Guiza, Alejandro (Alex) June 22, 2022, 11:06 p.m. UTC | #18
On 6/21/2022 7:16 PM, Alistair Popple wrote:
> David Hildenbrand <david@redhat.com> writes:
>
>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>> David Hildenbrand<david@redhat.com>  writes:
>>>>>
>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>> On 6/17/22 at 23:19, David Hildenbrand wrote:
>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>            removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>       include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>>       mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>>>       mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>>>       mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>>>       mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>>>       mm/rmap.c                |  5 +++--
>>>>>>>>>>>>>       6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>        * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>>        * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>        *
>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>>
>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>>        * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>>        * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>>        * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>>       enum memory_type {
>>>>>>>>>>>>>       	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>>       	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>>       	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>>       	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>>       	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>>> memory manager owning the
>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>>> is allowed on this device type.
>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>>> pages.
>>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>>
>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>>>>> Otherwise they get refcount incremented.
>>>>>>>> I don't follow.
>>>>>>>>
>>>>>>>> You're saying
>>>>>>>>
>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>>
>>>>>>>> and that
>>>>>>>>
>>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>>
>>>>>>>>
>>>>>>>> Yet, the code says
>>>>>>>>
>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>>> {
>>>>>>>> 	if (flags & FOLL_GET)
>>>>>>>> 		return try_get_folio(page, refs);
>>>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>>>> 		struct folio *folio;
>>>>>>>>
>>>>>>>> 		/*
>>>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>>>> 		 * path.
>>>>>>>> 		 */
>>>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>>> 			     !is_pinnable_page(page)))
>>>>>>>> 			return NULL;
>>>>>>>> 		...
>>>>>>>> 		return folio;
>>>>>>>> 	}
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>>> long-term pinned will be migrated afterwards, and
>>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>>> Thanks.
>>>>>>
>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>>
>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>>
>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>>> works without adjusting folio_is_pinnable().
>>>>> Ugh, I think you might be right about try_grab_folio().
>>>>>
>>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>>> because device coherent pages are pinnable. It is really just
>>>>> FOLL_LONGTERM that we want to prevent here.
>>>>>
>>>>> For normal PUP that is done by my change in
>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>>> So I think the check in try_grab_folio() needs to be:
>>>> I think I said it already (and I might be wrong without reading the
>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>>
>>>> It should actually be called folio_is_longterm_pinnable().
>>>>
>>>> That's where that check should go, no?
>>> David, I think you're right. We didn't catch this since the LONGTERM gup
>>> test we added to hmm-test only calls to pin_user_pages. Apparently
>>> try_grab_folio is called only from fast callers (ex.
>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
>>> similar to what Alistair has proposed to return null on LONGTERM &&
>>> (coherent_pages || folio_is_pinnable) at try_grab_folio. Also a new gup
>>> test was added with LONGTERM set that calls pin_user_pages_fast.
>>> Returning null under this condition it does causes the migration from
>>> dev to system memory.
>>>
>> Why can't coherent memory simply put its checks into
>> folio_is_pinnable()? I don't get it why we have to do things differently
>> here.
> I'd made the reasonable assumption that
> folio_is_pinnable()/is_pinnable_page() were used to check if the
> folio/page is pinnable or not regardless of FOLL_LONGTERM. Looking at
> the code more closely though I see both are actually only used on paths
> checking for FOLL_LONGTERM pinning.
>
> So I agree - we should rename these
> folio_is_longterm_pinnable()/is_longterm_pinnable_page() and add the
> check for coherent pages there. Thanks for pointing that out.
>
>   - Alistair

Will do in the next patch series.
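
For illustration only, the direction agreed above could be modelled in
userspace roughly as below. This is a sketch under assumptions, not the
follow-up patch: struct model_folio, the MODEL_FOLL_* values and both
helper names are made up for the example. Only the ordering of the checks
is taken from this discussion: the long-term check rejects coherent device
pages, while plain pins are still allowed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this model; NOT the kernel's FOLL_* bits. */
#define MODEL_FOLL_PIN      0x1u
#define MODEL_FOLL_LONGTERM 0x2u

/* Hypothetical stand-in for the few folio properties the check looks at. */
struct model_folio {
	bool device_coherent;	/* backed by MEMORY_DEVICE_COHERENT memory */
	bool zone_movable;	/* backed by ZONE_MOVABLE memory */
};

/* Model of the renamed helper: may this folio be pinned with FOLL_LONGTERM? */
static bool model_folio_is_longterm_pinnable(const struct model_folio *folio)
{
	/* Coherent device memory must stay evictable, so no long-term pins. */
	if (folio->device_coherent)
		return false;
	/* Assumed background rule: movable-zone memory is not long-term pinnable. */
	return !folio->zone_movable;
}

/*
 * Model of the fast-path decision. NULL means "do not grab here, let the
 * caller fall back to the slow path". Only the pin case is modelled.
 */
static const struct model_folio *
model_try_grab_folio(const struct model_folio *folio, unsigned int flags)
{
	if (!(flags & MODEL_FOLL_PIN))
		return NULL;
	if ((flags & MODEL_FOLL_LONGTERM) &&
	    !model_folio_is_longterm_pinnable(folio))
		return NULL;
	return folio;
}
```

With this shape a coherent folio is still returned for a plain pin, and
only the pin + long-term combination is refused.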

Regards,
Alex Sierra

>>> Actually, Im having different problems with a call to PageAnonExclusive
>>> from try_to_migrate_one during page fault from a HMM test that first
>>> migrate pages to device private and forks to mark as COW these pages.
>>> Apparently is catching the first BUG VM_BUG_ON_PGFLAGS(!PageAnon(page),
>>> page)
>> With or without this series? A backtrace would be great.
Sierra Guiza, Alejandro (Alex) June 22, 2022, 11:16 p.m. UTC | #19
On 6/21/2022 11:16 AM, David Hildenbrand wrote:
> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>> David Hildenbrand<david@redhat.com>  writes:
>>>>
>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>> evicted.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>            removed is_dev_private_or_coherent_page]
>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>>> ---
>>>>>>>>>>>>       include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>       mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>>       mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>>       mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>>       mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>>       mm/rmap.c                |  5 +++--
>>>>>>>>>>>>       6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>        * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>        * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>        *
>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>
>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>> + *
>>>>>>>>>>>>        * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>        * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>        * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>       enum memory_type {
>>>>>>>>>>>>       	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>       	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>       	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>       	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>       	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>> memory manager owning the
>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>> is allowed on this device type.
>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>> pages.
>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>
>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>>>> Otherwise they get refcount incremented.
>>>>>>> I don't follow.
>>>>>>>
>>>>>>> You're saying
>>>>>>>
>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>
>>>>>>> and that
>>>>>>>
>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>
>>>>>>>
>>>>>>> Yet, the code says
>>>>>>>
>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>> {
>>>>>>> 	if (flags & FOLL_GET)
>>>>>>> 		return try_get_folio(page, refs);
>>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>>> 		struct folio *folio;
>>>>>>>
>>>>>>> 		/*
>>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>>> 		 * path.
>>>>>>> 		 */
>>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>> 			     !is_pinnable_page(page)))
>>>>>>> 			return NULL;
>>>>>>> 		...
>>>>>>> 		return folio;
>>>>>>> 	}
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>> long-term pinned will be migrated afterwards, and
>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>> Thanks.
>>>>>
>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>
>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>
>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>> works without adjusting folio_is_pinnable().
>>>> Ugh, I think you might be right about try_grab_folio().
>>>>
>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>> because device coherent pages are pinnable. It is really just
>>>> FOLL_LONGTERM that we want to prevent here.
>>>>
>>>> For normal PUP that is done by my change in
>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>> So I think the check in try_grab_folio() needs to be:
>>> I think I said it already (and I might be wrong without reading the
>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>
>>> It should actually be called folio_is_longterm_pinnable().
>>>
>>> That's where that check should go, no?
>> David, I think you're right. We didn't catch this since the LONGTERM gup
>> test we added to hmm-test only calls to pin_user_pages. Apparently
>> try_grab_folio is called only from fast callers (ex.
>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
>> similar to what Alistair has proposed to return null on LONGTERM &&
>> (coherent_pages || folio_is_pinnable) at try_grab_folio. Also a new gup
>> test was added with LONGTERM set that calls pin_user_pages_fast.
>> Returning null under this condition it does causes the migration from
>> dev to system memory.
>>
> Why can't coherent memory simply put its checks into
> folio_is_pinnable()? I don't get it why we have to do things differently
> here.
>
>> Actually, Im having different problems with a call to PageAnonExclusive
>> from try_to_migrate_one during page fault from a HMM test that first
>> migrate pages to device private and forks to mark as COW these pages.
>> Apparently is catching the first BUG VM_BUG_ON_PGFLAGS(!PageAnon(page),
>> page)
> With or without this series? A backtrace would be great.

Here's the backtrace. It happens in an hmm-test added in this patch
series. However, I have tried to isolate this BUG by adding just the COW
test with device private memory only. It only shows up with the
following sequence: allocate anonymous memory -> migrate it to device
private memory -> fork -> access the parent's anonymous memory (which is
supposed to trigger a page fault and a migration back to system memory).
Just for the record, if the child is terminated before the parent's
memory is accessed, the problem does not occur.

Patch name for this test: "tools: add selftests to hmm for COW in device
memory"
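
The order of operations can be sketched in plain userspace C as below.
This is not the selftest itself: migrate_to_device_stub() is a
hypothetical placeholder that does nothing (the real test asks the
test_hmm driver to migrate the range), and cow_after_migrate() is a name
invented for the example. The sketch only shows the sequencing, including
keeping the child alive while the parent touches the memory.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* HYPOTHETICAL placeholder for "migrate this range to device private memory". */
static int migrate_to_device_stub(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return 0;
}

/* Returns 0 if the whole sequence ran, -1 on any error. */
static int cow_after_migrate(size_t npages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	int hold[2], status = -1, ret = -1;
	size_t len, nints, i;
	pid_t pid;
	int *buf;

	if (page_size <= 0 || npages == 0)
		return -1;
	len = npages * (size_t)page_size;
	nints = len / sizeof(int);

	/* 1. Allocate and populate anonymous memory. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	for (i = 0; i < nints; i++)
		buf[i] = (int)i;

	/* 2. Migrate to device private memory (stubbed out here). */
	if (migrate_to_device_stub(buf, len) != 0 || pipe(hold) != 0)
		goto out_unmap;

	/* 3. Fork, so that the pages become COW-shared with the child. */
	pid = fork();
	if (pid < 0) {
		close(hold[0]);
		close(hold[1]);
		goto out_unmap;
	}
	if (pid == 0) {
		char c;

		/* Child: stay alive until the parent closes the write end. */
		close(hold[1]);
		_exit(read(hold[0], &c, 1) < 0 ? 1 : 0);
	}

	/* 4. Parent writes to its memory while the child still exists. */
	close(hold[0]);
	for (i = 0; i < nints; i++)
		buf[i] = -(int)i;
	close(hold[1]);

	if (waitpid(pid, &status, 0) == pid &&
	    WIFEXITED(status) && WEXITSTATUS(status) == 0)
		ret = 0;
out_unmap:
	munmap(buf, len);
	return ret;
}
```

With the stub in place this never touches device memory, so it cannot
reproduce the BUG by itself. It is only meant to pin down the ordering
described above.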

[  528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0
[  528.739585] #PF: supervisor read access in kernel mode
[  528.745324] #PF: error_code(0x0000) - not-present page
[  528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
[  528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257
[  528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021
[  528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
[  528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29 
c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74 
0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
[  528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
[  528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
[  528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
[  528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
[  528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
[  528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
[  528.850865] FS:  00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
[  528.859891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
[  528.874275] PKRU: 55555554
[  528.877286] Call Trace:
[  528.880016]  <TASK>
[  528.882356]  ? lock_is_held_type+0xdf/0x130
[  528.887033]  rmap_walk_anon+0x167/0x410
[  528.891316]  try_to_migrate+0x90/0xd0
[  528.895405]  ? try_to_unmap_one+0xe10/0xe10
[  528.900074]  ? anon_vma_ctor+0x50/0x50
[  528.904260]  ? put_anon_vma+0x10/0x10
[  528.908347]  ? invalid_mkclean_vma+0x20/0x20
[  528.913114]  migrate_vma_setup+0x5f4/0x750
[  528.917691]  dmirror_devmem_fault+0x8c/0x250 [test_hmm]
[  528.923532]  do_swap_page+0xac0/0xe50
[  528.927623]  ? __lock_acquire+0x4b2/0x1ac0
[  528.932199]  __handle_mm_fault+0x949/0x1440
[  528.936876]  handle_mm_fault+0x13f/0x3e0
[  528.941256]  do_user_addr_fault+0x215/0x740
[  528.945928]  exc_page_fault+0x75/0x280
[  528.950115]  asm_exc_page_fault+0x27/0x30
[  528.954593] RIP: 0033:0x40366b
[  528.958001] Code: 00 48 89 85 d8 fe ff ff eb 2a 48 8b 85 d0 fe ff ff 
48 8d 14 85 00 00 00 00 48 8b 85 d8 fe ff ff 48 01 d0 48 8b 95 d0 fe ff 
ff <89> 10 48 83 85 d0 fe ff ff 01 48 8b 85 40 ff ff ff 48 c1 e8 02 48
[  528.978973] RSP: 002b:00007fffffffe280 EFLAGS: 00010206
[  528.984806] RAX: 00007ffff7ff4000 RBX: 0000000000000000 RCX: 0000000000000000
[  528.992774] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007ffff77ee968
[  529.000742] RBP: 00007fffffffe430 R08: 00007ffff7fdb740 R09: 0000000000000000
[  529.008709] R10: 00007ffff7fdba10 R11: 0000000000000246 R12: 0000000000400e30
[  529.016675] R13: 00007fffffffe630 R14: 0000000000000000 R15: 0000000000000000
[  529.024638]  </TASK>
[  529.027074] Modules linked in: test_hmm xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter ip6table_filter ip6_tables iptable_filter k10temp ip_tables x_tables i2c_piix4 [last unloaded: test_hmm]
[  529.053595] CR2: ffffea1fffffffc0
[  529.057296] ---[ end trace 0000000000000000 ]---
[  529.197816] RIP: 0010:try_to_migrate_one+0x21a/0x1000
[  529.197823] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29 
c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74 
0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
[  529.197826] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
[  529.197828] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
[  529.197830] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
[  529.197831] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
[  529.197832] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
[  529.197833] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
[  529.197835] FS:  00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
[  529.197837] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  529.197839] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
[  529.197840] PKRU: 55555554
[  529.197841] note: hmm-tests[18275] exited with preempt_count 1

Regards,
Alex Sierra

>
David Hildenbrand June 23, 2022, 7:57 a.m. UTC | #20
On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>> David Hildenbrand<david@redhat.com>  writes:
>>>>>
>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>            removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>       include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>>       mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>>>       mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>>>       mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>>>       mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>>>       mm/rmap.c                |  5 +++--
>>>>>>>>>>>>>       6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>        * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>>        * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>        *
>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>>
>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>>        * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>>        * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>>        * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>>       enum memory_type {
>>>>>>>>>>>>>       	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>>       	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>>       	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>>       	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>>       	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>>> memory manager owning the
>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>>> is allowed on this device type.
>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>>> pages.
>>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>>
>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>>>>> Otherwise they get refcount incremented.
>>>>>>>> I don't follow.
>>>>>>>>
>>>>>>>> You're saying
>>>>>>>>
>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>>
>>>>>>>> and that
>>>>>>>>
>>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>>
>>>>>>>>
>>>>>>>> Yet, the code says
>>>>>>>>
>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>>> {
>>>>>>>> 	if (flags & FOLL_GET)
>>>>>>>> 		return try_get_folio(page, refs);
>>>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>>>> 		struct folio *folio;
>>>>>>>>
>>>>>>>> 		/*
>>>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>>>> 		 * path.
>>>>>>>> 		 */
>>>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>>> 			     !is_pinnable_page(page)))
>>>>>>>> 			return NULL;
>>>>>>>> 		...
>>>>>>>> 		return folio;
>>>>>>>> 	}
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>>> long-term pinned will be migrated afterwards, and
>>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>>> Thanks.
>>>>>>
>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>>
>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>>
>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>>> works without adjusting folio_is_pinnable().
>>>>> Ugh, I think you might be right about try_grab_folio().
>>>>>
>>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>>> because device coherent pages are pinnable. It is really just
>>>>> FOLL_LONGTERM that we want to prevent here.
>>>>>
>>>>> For normal PUP that is done by my change in
>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>>> So I think the check in try_grab_folio() needs to be:
>>>> I think I said it already (and I might be wrong without reading the
>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>>
>>>> It should actually be called folio_is_longterm_pinnable().
>>>>
>>>> That's where that check should go, no?
>>> David, I think you're right. We didn't catch this since the LONGTERM gup
>>> test we added to hmm-test only calls to pin_user_pages. Apparently
>>> try_grab_folio is called only from fast callers (ex.
>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
>>> similar to what Alistair has proposed to return null on LONGTERM &&
>>> (coherent_pages || folio_is_pinnable) at try_grab_folio. Also a new gup
>>> test was added with LONGTERM set that calls pin_user_pages_fast.
>>> Returning null under this condition it does causes the migration from
>>> dev to system memory.
>>>
>> Why can't coherent memory simply put its checks into
>> folio_is_pinnable()? I don't get it why we have to do things differently
>> here.
>>
>>> Actually, Im having different problems with a call to PageAnonExclusive
>>> from try_to_migrate_one during page fault from a HMM test that first
>>> migrate pages to device private and forks to mark as COW these pages.
>>> Apparently is catching the first BUG VM_BUG_ON_PGFLAGS(!PageAnon(page),
>>> page)
>> With or without this series? A backtrace would be great.
> 
> Here's the back trace. This happens in a hmm-test added in this patch 
> series. However, I have tried to isolate this BUG by just adding the COW 
> test with private device memory only. This is only present as follows. 
> Allocate anonymous mem->Migrate to private device memory->fork->try to 
> access to parent's anonymous memory (which will suppose to trigger a 
> page fault and migration to system mem). Just for the record, if the 
> child is terminated before the parent's memory is accessed, this problem 
> is not present.


The only usage of PageAnonExclusive() in try_to_migrate_one() is:

anon_exclusive = folio_test_anon(folio) &&
		 PageAnonExclusive(subpage);

Which can only possibly fail if subpage is not actually part of the folio.


I see some controversial code in the if (folio_is_zone_device(folio)) case later:

			 * The assignment to subpage above was computed from a
			 * swap PTE which results in an invalid pointer.
			 * Since only PAGE_SIZE pages can currently be
			 * migrated, just set it to page. This will need to be
			 * changed when hugepage migrations to device private
			 * memory are supported.
			 */
			subpage = &folio->page;

There we have our invalid pointer hint.
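
To make the failure mode concrete, here is a small userspace model of the
index arithmetic. It is only a sketch: struct model_folio and the helper
names are invented for the example, and the pfn value used for the
non-present case in any caller is arbitrary. The point is that a pfn
decoded from a non-present (device private) entry as if it were a present
PTE need not fall inside the folio, so a subpage derived from it is
garbage unless the code falls back to the folio's only page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in: a folio covers nr_pages pfns starting at first_pfn. */
struct model_folio {
	uint64_t first_pfn;
	uint64_t nr_pages;
};

/* Would indexing the folio with this pfn stay inside the folio? */
static bool model_subpage_index_valid(const struct model_folio *folio,
				      uint64_t pfn)
{
	return pfn >= folio->first_pfn &&
	       pfn - folio->first_pfn < folio->nr_pages;
}

/*
 * Pick the subpage index up front. For a non-present entry the pfn must not
 * be trusted, and since only single-page folios are migrated in that case,
 * index 0 (the folio's own page) is the only sensible answer.
 */
static uint64_t model_pick_subpage_index(const struct model_folio *folio,
					 bool pte_present, uint64_t pte_pfn)
{
	if (!pte_present)
		return 0;
	return pte_pfn - folio->first_pfn;
}
```

For a one-page folio, an arbitrary out-of-range pfn fails the validity
check, while the non-present branch still yields index 0.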

I don't see how it could have worked if the child quit, though? Maybe
just pure luck?


Does the following fix your issue:



From 09750c714739ef3ca317b4aec82bf20283c8fd2d Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Thu, 23 Jun 2022 09:38:45 +0200
Subject: [PATCH] mm/rmap: fix dereferencing invalid subpage pointer in
 try_to_migrate_one()

The subpage we calculate is an invalid pointer for device private pages,
because device private pages are mapped via non-present device private
entries, not ordinary present PTEs.

Let's just not compute broken pointers and fixup later. Move the proper
assignment of the correct subpage to the beginning of the function and
assert that we really only have a single page in our folio.

This currently results in a BUG when trying to compute anon_exclusive,
because:

[  528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0
[  528.739585] #PF: supervisor read access in kernel mode
[  528.745324] #PF: error_code(0x0000) - not-present page
[  528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
[  528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257
[  528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021
[  528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
[  528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
[  528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
[  528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
[  528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
[  528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
[  528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
[  528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
[  528.850865] FS:  00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
[  528.859891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
[  528.874275] PKRU: 55555554
[  528.877286] Call Trace:
[  528.880016]  <TASK>
[  528.882356]  ? lock_is_held_type+0xdf/0x130
[  528.887033]  rmap_walk_anon+0x167/0x410
[  528.891316]  try_to_migrate+0x90/0xd0
[  528.895405]  ? try_to_unmap_one+0xe10/0xe10
[  528.900074]  ? anon_vma_ctor+0x50/0x50
[  528.904260]  ? put_anon_vma+0x10/0x10
[  528.908347]  ? invalid_mkclean_vma+0x20/0x20
[  528.913114]  migrate_vma_setup+0x5f4/0x750
[  528.917691]  dmirror_devmem_fault+0x8c/0x250 [test_hmm]
[  528.923532]  do_swap_page+0xac0/0xe50
[  528.927623]  ? __lock_acquire+0x4b2/0x1ac0
[  528.932199]  __handle_mm_fault+0x949/0x1440
[  528.936876]  handle_mm_fault+0x13f/0x3e0
[  528.941256]  do_user_addr_fault+0x215/0x740
[  528.945928]  exc_page_fault+0x75/0x280
[  528.950115]  asm_exc_page_fault+0x27/0x30
[  528.954593] RIP: 0033:0x40366b
...

Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Reported-by: Sierra Guiza, Alejandro (Alex) <alex.sierra@amd.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..746c05acad27 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1899,8 +1899,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
 
-		subpage = folio_page(folio,
-				pte_pfn(*pvmw.pte) - folio_pfn(folio));
+		if (folio_is_zone_device(folio)) {
+			/*
+			 * Our PTE is a non-present device exclusive entry and
+			 * calculating the subpage as for the common case would
+			 * result in an invalid pointer.
+			 *
+			 * Since only PAGE_SIZE pages can currently be
+			 * migrated, just set it to page. This will need to be
+			 * changed when hugepage migrations to device private
+			 * memory are supported.
+			 */
+			VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
+			subpage = &folio->page;
+		} else {
+			subpage = folio_page(folio,
+					pte_pfn(*pvmw.pte) - folio_pfn(folio));
+		}
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
@@ -1993,15 +2008,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			/*
 			 * No need to invalidate here it will synchronize on
 			 * against the special swap migration pte.
-			 *
-			 * The assignment to subpage above was computed from a
-			 * swap PTE which results in an invalid pointer.
-			 * Since only PAGE_SIZE pages can currently be
-			 * migrated, just set it to page. This will need to be
-			 * changed when hugepage migrations to device private
-			 * memory are supported.
 			 */
-			subpage = &folio->page;
 		} else if (PageHWPoison(subpage)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
Sierra Guiza, Alejandro (Alex) June 23, 2022, 6:20 p.m. UTC | #21
On 6/23/2022 2:57 AM, David Hildenbrand wrote:
> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>>> David Hildenbrand<david@redhat.com>  writes:
>>>>>>
>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>>             removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>        include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>>>        mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>>>>        mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>>>>        mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>>>>        mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>>>>        mm/rmap.c                |  5 +++--
>>>>>>>>>>>>>>        6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>>         * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>>>         * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>>         *
>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>>>
>>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>         * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>>>         * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>>>         * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>>>        enum memory_type {
>>>>>>>>>>>>>>        	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>>>        	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>>>        	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>>>        	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>>>        	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>>>> memory manager owning the
>>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>>>> is allowed on this device type.
>>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>>>> pages.
>>>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>>>
>>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>>>>>> Otherwise they get refcount incremented.
>>>>>>>>> I don't follow.
>>>>>>>>>
>>>>>>>>> You're saying
>>>>>>>>>
>>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>>>
>>>>>>>>> and that
>>>>>>>>>
>>>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yet, the code says
>>>>>>>>>
>>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>>>> {
>>>>>>>>> 	if (flags & FOLL_GET)
>>>>>>>>> 		return try_get_folio(page, refs);
>>>>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>>>>> 		struct folio *folio;
>>>>>>>>>
>>>>>>>>> 		/*
>>>>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>>>>> 		 * path.
>>>>>>>>> 		 */
>>>>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>>>> 			     !is_pinnable_page(page)))
>>>>>>>>> 			return NULL;
>>>>>>>>> 		...
>>>>>>>>> 		return folio;
>>>>>>>>> 	}
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>>>> long-term pinned will be migrated afterwards, and
>>>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>>>> Thanks.
>>>>>>>
>>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>>>
>>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>>>
>>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>>>> works without adjusting folio_is_pinnable().
>>>>>> Ugh, I think you might be right about try_grab_folio().
>>>>>>
>>>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>>>> because device coherent pages are pinnable. It is really just
>>>>>> FOLL_LONGTERM that we want to prevent here.
>>>>>>
>>>>>> For normal PUP that is done by my change in
>>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>>>> So I think the check in try_grab_folio() needs to be:
>>>>> I think I said it already (and I might be wrong without reading the
>>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>>>
>>>>> It should actually be called folio_is_longterm_pinnable().
>>>>>
>>>>> That's where that check should go, no?
>>>> David, I think you're right. We didn't catch this since the LONGTERM gup
>>>> test we added to hmm-test only calls to pin_user_pages. Apparently
>>>> try_grab_folio is called only from fast callers (ex.
>>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
>>>> similar to what Alistair has proposed to return null on LONGTERM &&
>>>> (coherent_pages || folio_is_pinnable) at try_grab_folio. Also a new gup
>>>> test was added with LONGTERM set that calls pin_user_pages_fast.
>>>> Returning null under this condition it does causes the migration from
>>>> dev to system memory.
>>>>
>>> Why can't coherent memory simply put its checks into
>>> folio_is_pinnable()? I don't get it why we have to do things differently
>>> here.
>>>
>>>> Actually, Im having different problems with a call to PageAnonExclusive
>>>> from try_to_migrate_one during page fault from a HMM test that first
>>>> migrate pages to device private and forks to mark as COW these pages.
>>>> Apparently is catching the first BUG VM_BUG_ON_PGFLAGS(!PageAnon(page),
>>>> page)
>>> With or without this series? A backtrace would be great.
>> Here's the back trace. This happens in a hmm-test added in this patch
>> series. However, I have tried to isolate this BUG by just adding the COW
>> test with private device memory only. This is only present as follows.
>> Allocate anonymous mem->Migrate to private device memory->fork->try to
>> access to parent's anonymous memory (which will suppose to trigger a
>> page fault and migration to system mem). Just for the record, if the
>> child is terminated before the parent's memory is accessed, this problem
>> is not present.
>
> The only usage of PageAnonExclusive() in try_to_migrate_one() is:
>
> anon_exclusive = folio_test_anon(folio) &&
> 		 PageAnonExclusive(subpage);
>
> Which can only possibly fail if subpage is not actually part of the folio.
>
>
> I see some controversial code in the the if (folio_is_zone_device(folio)) case later:
>
> 			 * The assignment to subpage above was computed from a
> 			 * swap PTE which results in an invalid pointer.
> 			 * Since only PAGE_SIZE pages can currently be
> 			 * migrated, just set it to page. This will need to be
> 			 * changed when hugepage migrations to device private
> 			 * memory are supported.
> 			 */
> 			subpage = &folio->page;
>
> There we have our invalid pointer hint.
>
> I don't see how it could have worked if the child quit, though? Maybe
> just pure luck?
>
>
> Does the following fix your issue:

Yes, it fixed the issue. Thanks. Should we include this patch in this 
series or send it separately?

Regards,
Alex Sierra
David Hildenbrand June 23, 2022, 6:21 p.m. UTC | #22
On 23.06.22 20:20, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/23/2022 2:57 AM, David Hildenbrand wrote:
>> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>>>> David Hildenbrand<david@redhat.com>  writes:
>>>>>>>
>>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>>>             removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>        include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>>>>        mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>>>>>        mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>>>>>        mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>>>>>        mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>>>>>        mm/rmap.c                |  5 +++--
>>>>>>>>>>>>>>>        6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>>>         * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>>>>         * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>>>         *
>>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>>>>
>>>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>         * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>>>>         * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>>>>         * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>>>>        enum memory_type {
>>>>>>>>>>>>>>>        	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>>>>        	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>>>>        	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>>>>        	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>>>>        	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>>>>> memory manager owning the
>>>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>>>>> is allowed on this device type.
>>>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>>>>> pages.
>>>>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>>>>
>>>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>>>>>>> Otherwise they get refcount incremented.
>>>>>>>>>> I don't follow.
>>>>>>>>>>
>>>>>>>>>> You're saying
>>>>>>>>>>
>>>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>>>>
>>>>>>>>>> and that
>>>>>>>>>>
>>>>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yet, the code says
>>>>>>>>>>
>>>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>>>>> {
>>>>>>>>>> 	if (flags & FOLL_GET)
>>>>>>>>>> 		return try_get_folio(page, refs);
>>>>>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>>>>>> 		struct folio *folio;
>>>>>>>>>>
>>>>>>>>>> 		/*
>>>>>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>>>>>> 		 * path.
>>>>>>>>>> 		 */
>>>>>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>>>>> 			     !is_pinnable_page(page)))
>>>>>>>>>> 			return NULL;
>>>>>>>>>> 		...
>>>>>>>>>> 		return folio;
>>>>>>>>>> 	}
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>>>>> long-term pinned will be migrated afterwards, and
>>>>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>>>>
>>>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>>>>
>>>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>>>>> works without adjusting folio_is_pinnable().
>>>>>>> Ugh, I think you might be right about try_grab_folio().
>>>>>>>
>>>>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>>>>> because device coherent pages are pinnable. It is really just
>>>>>>> FOLL_LONGTERM that we want to prevent here.
>>>>>>>
>>>>>>> For normal PUP that is done by my change in
>>>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>>>>> So I think the check in try_grab_folio() needs to be:
>>>>>> I think I said it already (and I might be wrong without reading the
>>>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>>>>
>>>>>> It should actually be called folio_is_longterm_pinnable().
>>>>>>
>>>>>> That's where that check should go, no?
>>>>> David, I think you're right. We didn't catch this since the LONGTERM gup
>>>>> test we added to hmm-test only calls to pin_user_pages. Apparently
>>>>> try_grab_folio is called only from fast callers (ex.
>>>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
>>>>> similar to what Alistair has proposed to return null on LONGTERM &&
>>>>> (coherent_pages || folio_is_pinnable) at try_grab_folio. Also a new gup
>>>>> test was added with LONGTERM set that calls pin_user_pages_fast.
>>>>> Returning null under this condition it does causes the migration from
>>>>> dev to system memory.
>>>>>
>>>> Why can't coherent memory simply put its checks into
>>>> folio_is_pinnable()? I don't get it why we have to do things differently
>>>> here.
>>>>
>>>>> Actually, Im having different problems with a call to PageAnonExclusive
>>>>> from try_to_migrate_one during page fault from a HMM test that first
>>>>> migrate pages to device private and forks to mark as COW these pages.
>>>>> Apparently is catching the first BUG VM_BUG_ON_PGFLAGS(!PageAnon(page),
>>>>> page)
>>>> With or without this series? A backtrace would be great.
>>> Here's the back trace. This happens in a hmm-test added in this patch
>>> series. However, I have tried to isolate this BUG by just adding the COW
>>> test with private device memory only. This is only present as follows.
>>> Allocate anonymous mem->Migrate to private device memory->fork->try to
>>> access to parent's anonymous memory (which will suppose to trigger a
>>> page fault and migration to system mem). Just for the record, if the
>>> child is terminated before the parent's memory is accessed, this problem
>>> is not present.
>>
>> The only usage of PageAnonExclusive() in try_to_migrate_one() is:
>>
>> anon_exclusive = folio_test_anon(folio) &&
>> 		 PageAnonExclusive(subpage);
>>
>> Which can only possibly fail if subpage is not actually part of the folio.
>>
>>
>> I see some controversial code in the the if (folio_is_zone_device(folio)) case later:
>>
>> 			 * The assignment to subpage above was computed from a
>> 			 * swap PTE which results in an invalid pointer.
>> 			 * Since only PAGE_SIZE pages can currently be
>> 			 * migrated, just set it to page. This will need to be
>> 			 * changed when hugepage migrations to device private
>> 			 * memory are supported.
>> 			 */
>> 			subpage = &folio->page;
>>
>> There we have our invalid pointer hint.
>>
>> I don't see how it could have worked if the child quit, though? Maybe
>> just pure luck?
>>
>>
>> Does the following fix your issue:
> 
> Yes, it fixed the issue. Thanks. Should we include this patch in this 
> patch series or separated?
> 
> Regards,
> Alex Sierra

I'll send it right away "officially" so we can get it into 5.19. Can I
add your tested-by?
Sierra Guiza, Alejandro (Alex) June 24, 2022, 4:13 p.m. UTC | #23
On 6/23/2022 1:21 PM, David Hildenbrand wrote:
> On 23.06.22 20:20, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/23/2022 2:57 AM, David Hildenbrand wrote:
>>> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>>>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>>>>> David Hildenbrand<david@redhat.com>  writes:
>>>>>>>>
>>>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>>>>              removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>         include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>>>>>         mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>>>>>>         mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>>>>>>         mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>>>>>>         mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>>>>>>         mm/rmap.c                |  5 +++--
>>>>>>>>>>>>>>>>         6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>>>>          * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>>>>>          * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>>>>          *
>>>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>          * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>>>>>          * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>>>>>          * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>>>>>         enum memory_type {
>>>>>>>>>>>>>>>>         	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>>>>>         	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>>>>>         	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>>>>>         	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>>>>>         	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>>>>>> memory manager owning the
>>>>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>>>>>> is allowed on this device type.
>>>>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>>>>>> pages.
>>>>>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>>>>>
>>>>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>>>>>> As far as I understand this return NULL for long term pin pages.
>>>>>>>>>>>> Otherwise they get refcount incremented.
>>>>>>>>>>> I don't follow.
>>>>>>>>>>>
>>>>>>>>>>> You're saying
>>>>>>>>>>>
>>>>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>>>>>
>>>>>>>>>>> and that
>>>>>>>>>>>
>>>>>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yet, the code says
>>>>>>>>>>>
>>>>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>>>>>> {
>>>>>>>>>>> 	if (flags & FOLL_GET)
>>>>>>>>>>> 		return try_get_folio(page, refs);
>>>>>>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>>>>>>> 		struct folio *folio;
>>>>>>>>>>>
>>>>>>>>>>> 		/*
>>>>>>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>>>>>>> 		 * path.
>>>>>>>>>>> 		 */
>>>>>>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>>>>>> 			     !is_pinnable_page(page)))
>>>>>>>>>>> 			return NULL;
>>>>>>>>>>> 		...
>>>>>>>>>>> 		return folio;
>>>>>>>>>>> 	}
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>>>>>> long-term pinned will be migrated afterwards, and
>>>>>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>>>>>
>>>>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>>>>>
>>>>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>>>>>> works without adjusting folio_is_pinnable().
>>>>>>>> Ugh, I think you might be right about try_grab_folio().
>>>>>>>>
>>>>>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>>>>>> because device coherent pages are pinnable. It is really just
>>>>>>>> FOLL_LONGTERM that we want to prevent here.
>>>>>>>>
>>>>>>>> For normal PUP that is done by my change in
>>>>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>>>>>> So I think the check in try_grab_folio() needs to be:
>>>>>>> I think I said it already (and I might be wrong without reading the
>>>>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>>>>>
>>>>>>> It should actually be called folio_is_longterm_pinnable().
>>>>>>>
>>>>>>> That's where that check should go, no?
>>>>>> David, I think you're right. We didn't catch this since the LONGTERM gup
>>>>>> test we added to hmm-test only calls to pin_user_pages. Apparently
>>>>>> try_grab_folio is called only from fast callers (ex.
>>>>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
>>>>>> similar to what Alistair has proposed to return null on LONGTERM &&
>>>>>> (coherent_pages || folio_is_pinnable) at try_grab_folio. Also a new gup
>>>>>> test was added with LONGTERM set that calls pin_user_pages_fast.
>>>>>> Returning null under this condition it does causes the migration from
>>>>>> dev to system memory.
>>>>>>
>>>>> Why can't coherent memory simply put its checks into
>>>>> folio_is_pinnable()? I don't get it why we have to do things differently
>>>>> here.
>>>>>
>>>>>> Actually, Im having different problems with a call to PageAnonExclusive
>>>>>> from try_to_migrate_one during page fault from a HMM test that first
>>>>>> migrate pages to device private and forks to mark as COW these pages.
>>>>>> Apparently is catching the first BUG VM_BUG_ON_PGFLAGS(!PageAnon(page),
>>>>>> page)
>>>>> With or without this series? A backtrace would be great.
>>>> Here's the back trace. This happens in a hmm-test added in this patch
>>>> series. However, I have tried to isolate this BUG by just adding the COW
>>>> test with private device memory only. This is only present as follows.
>>>> Allocate anonymous mem->Migrate to private device memory->fork->try to
>>>> access to parent's anonymous memory (which will suppose to trigger a
>>>> page fault and migration to system mem). Just for the record, if the
>>>> child is terminated before the parent's memory is accessed, this problem
>>>> is not present.
>>> The only usage of PageAnonExclusive() in try_to_migrate_one() is:
>>>
>>> anon_exclusive = folio_test_anon(folio) &&
>>> 		 PageAnonExclusive(subpage);
>>>
>>> Which can only possibly fail if subpage is not actually part of the folio.
>>>
>>>
>>> I see some controversial code in the if (folio_is_zone_device(folio)) case later:
>>>
>>> 			 * The assignment to subpage above was computed from a
>>> 			 * swap PTE which results in an invalid pointer.
>>> 			 * Since only PAGE_SIZE pages can currently be
>>> 			 * migrated, just set it to page. This will need to be
>>> 			 * changed when hugepage migrations to device private
>>> 			 * memory are supported.
>>> 			 */
>>> 			subpage = &folio->page;
>>>
>>> There we have our invalid pointer hint.
>>>
>>> I don't see how it could have worked if the child quit, though? Maybe
>>> just pure luck?
>>>
>>>
>>> Does the following fix your issue:
>> Yes, it fixed the issue. Thanks. Should we include this patch in this
>> patch series or separated?
>>
>> Regards,
>> Alex Sierra
> I'll send it right away "officially" so we can get it into 5.19. Can I
> add your tested-by?

Of course.

Alex Sierra

>
>
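
To make the long-term pinning question in this thread concrete, here is a minimal userspace sketch of the rule David suggests: put the device coherent check into the long-term pinnable test itself, so that only FOLL_LONGTERM pins are refused while ordinary pins still succeed. This is a model under stated assumptions, not the kernel implementation: every model_* name and MODEL_* constant is hypothetical, and the flag values are made up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the discussion; none of this is kernel code. */
enum model_page_kind {
	MODEL_PAGE_SYSTEM_RAM,
	MODEL_PAGE_DEVICE_COHERENT,
};

/* Stand-ins for FOLL_PIN and FOLL_LONGTERM; the values are made up. */
#define MODEL_FOLL_PIN      0x1u
#define MODEL_FOLL_LONGTERM 0x2u

/*
 * Models a long-term pinnable check that rejects device coherent pages.
 * The real check also considers other conditions (for example CMA or
 * ZONE_MOVABLE placement), which this sketch ignores.
 */
bool model_is_longterm_pinnable(enum model_page_kind kind)
{
	return kind != MODEL_PAGE_DEVICE_COHERENT;
}

/*
 * Models the decision quoted from try_grab_folio(): a FOLL_LONGTERM pin
 * of a page that is not long-term pinnable is refused, so the caller has
 * to fall back to a path that can migrate the page first.
 */
bool model_fast_pin_allowed(enum model_page_kind kind, unsigned int flags)
{
	if ((flags & MODEL_FOLL_LONGTERM) && !model_is_longterm_pinnable(kind))
		return false;
	return true;
}

/* Small self-check covering the three cases discussed in the thread. */
void model_pin_self_check(void)
{
	/* A normal pin of a coherent page is fine. */
	assert(model_fast_pin_allowed(MODEL_PAGE_DEVICE_COHERENT,
				      MODEL_FOLL_PIN));
	/* A long-term pin of a coherent page is refused. */
	assert(!model_fast_pin_allowed(MODEL_PAGE_DEVICE_COHERENT,
				       MODEL_FOLL_PIN | MODEL_FOLL_LONGTERM));
	/* A long-term pin of ordinary RAM is allowed in this model. */
	assert(model_fast_pin_allowed(MODEL_PAGE_SYSTEM_RAM,
				      MODEL_FOLL_PIN | MODEL_FOLL_LONGTERM));
}
```

Keeping the coherent check inside the single "long-term pinnable" predicate means every caller that already consults that predicate inherits the behavior, which is the point David raises about not treating coherent memory differently in try_grab_folio().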

Patch

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8af304f6b504..9f752ebed613 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,6 +41,13 @@  struct vmem_altmap {
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.rst.
  *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
+ *
  * MEMORY_DEVICE_FS_DAX:
  * Host memory that has similar access semantics as System RAM i.e. DMA
  * coherent and supports page pinning. In support of coordinating page
@@ -61,6 +68,7 @@  struct vmem_altmap {
 enum memory_type {
 	/* 0 is reserved to catch uninitialized type fields */
 	MEMORY_DEVICE_PRIVATE = 1,
+	MEMORY_DEVICE_COHERENT,
 	MEMORY_DEVICE_FS_DAX,
 	MEMORY_DEVICE_GENERIC,
 	MEMORY_DEVICE_PCI_P2PDMA,
@@ -143,6 +151,17 @@  static inline bool folio_is_device_private(const struct folio *folio)
 	return is_device_private_page(&folio->page);
 }
 
+static inline bool is_device_coherent_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_COHERENT;
+}
+
+static inline bool folio_is_device_coherent(const struct folio *folio)
+{
+	return is_device_coherent_page(&folio->page);
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe6..93f80d7ca148 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5665,8 +5665,8 @@  static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
- *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
- *     (so ZONE_DEVICE page and thus not on the lru).
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
+ *   thus not on the lru.
  *     For now we such page is charge like a regular page would be as for all
  *     intent and purposes it is just special memory taking the place of a
  *     regular page.
@@ -5704,7 +5704,8 @@  static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 */
 		if (page_memcg(page) == mc.from) {
 			ret = MC_TARGET_PAGE;
-			if (is_device_private_page(page))
+			if (is_device_private_page(page) ||
+			    is_device_coherent_page(page))
 				ret = MC_TARGET_DEVICE;
 			if (target)
 				target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b85661cbdc4a..0b6a0a01ee09 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1683,12 +1683,16 @@  static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto unlock;
 	}
 
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_COHERENT:
 		/*
-		 * TODO: Handle HMM pages which may need coordination
+		 * TODO: Handle device pages which may need coordination
 		 * with device-side memory.
 		 */
 		goto unlock;
+	default:
+		break;
 	}
 
 	/*
diff --git a/mm/memremap.c b/mm/memremap.c
index 2b92e97cb25b..dbd2631b3520 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -315,6 +315,16 @@  void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 			return ERR_PTR(-EINVAL);
 		}
 		break;
+	case MEMORY_DEVICE_COHERENT:
+		if (!pgmap->ops->page_free) {
+			WARN(1, "Missing page_free method\n");
+			return ERR_PTR(-EINVAL);
+		}
+		if (!pgmap->owner) {
+			WARN(1, "Missing owner\n");
+			return ERR_PTR(-EINVAL);
+		}
+		break;
 	case MEMORY_DEVICE_FS_DAX:
 		if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
 			WARN(1, "File system DAX not supported\n");
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 5052093d0262..a4847ad65da3 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -518,7 +518,7 @@  EXPORT_SYMBOL(migrate_vma_setup);
  *     handle_pte_fault()
  *       do_anonymous_page()
  * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * private or coherent page.
  */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
@@ -594,11 +594,8 @@  static void migrate_vma_insert_page(struct migrate_vma *migrate,
 						page_to_pfn(page));
 		entry = swp_entry_to_pte(swp_entry);
 	} else {
-		/*
-		 * For now we only support migrating to un-addressable device
-		 * memory.
-		 */
-		if (is_zone_device_page(page)) {
+		if (is_zone_device_page(page) &&
+		    !is_device_coherent_page(page)) {
 			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
 			goto abort;
 		}
@@ -701,10 +698,11 @@  void migrate_vma_pages(struct migrate_vma *migrate)
 
 		mapping = page_mapping(page);
 
-		if (is_device_private_page(newpage)) {
+		if (is_device_private_page(newpage) ||
+		    is_device_coherent_page(newpage)) {
 			/*
-			 * For now only support private anonymous when migrating
-			 * to un-addressable device memory.
+			 * For now only support anonymous memory migrating to
+			 * device private or coherent memory.
 			 */
 			if (mapping) {
 				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..04fac1af870b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1957,7 +1957,7 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
-		if (folio_is_zone_device(folio)) {
+		if (folio_is_device_private(folio)) {
 			unsigned long pfn = folio_pfn(folio);
 			swp_entry_t entry;
 			pte_t swp_pte;
@@ -2131,7 +2131,8 @@  void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 					TTU_SYNC)))
 		return;
 
-	if (folio_is_zone_device(folio) && !folio_is_device_private(folio))
+	if (folio_is_zone_device(folio) &&
+	    (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
 		return;
 
 	/*
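
The two mm/rmap.c hunks above change which ZONE_DEVICE folios are eligible for migration and which of them take the device-private branch. The following userspace sketch models just those two predicates as they read after this patch. It is an illustration under assumptions: the model_* names and the struct are hypothetical, and only the enum ordering is copied from the memremap.h hunk.

```c
#include <stdbool.h>

/* Mirrors the ordering of enum memory_type after this patch (model only). */
enum model_memory_type {
	MODEL_DEVICE_PRIVATE = 1,
	MODEL_DEVICE_COHERENT,
	MODEL_DEVICE_FS_DAX,
	MODEL_DEVICE_GENERIC,
	MODEL_DEVICE_PCI_P2PDMA,
};

/* Hypothetical stand-in for a folio; type is only meaningful if zone_device. */
struct model_folio {
	bool zone_device;
	enum model_memory_type type;
};

bool model_is_device_private(const struct model_folio *f)
{
	return f->zone_device && f->type == MODEL_DEVICE_PRIVATE;
}

bool model_is_device_coherent(const struct model_folio *f)
{
	return f->zone_device && f->type == MODEL_DEVICE_COHERENT;
}

/*
 * Models the early return in try_to_migrate(): ZONE_DEVICE folios are
 * skipped unless they are device private or device coherent.
 */
bool model_try_to_migrate_proceeds(const struct model_folio *f)
{
	if (f->zone_device &&
	    (!model_is_device_private(f) && !model_is_device_coherent(f)))
		return false;
	return true;
}

/*
 * Models the condition in try_to_migrate_one(): after this patch the
 * special branch is taken for device private folios only, so coherent
 * folios follow the same path as ordinary memory.
 */
bool model_takes_device_private_branch(const struct model_folio *f)
{
	return model_is_device_private(f);
}
```

In this model a coherent folio is migratable but does not enter the device-private branch, which matches the intent of narrowing folio_is_zone_device() to folio_is_device_private() in try_to_migrate_one().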