diff mbox series

[v2,16/18] mm/memremap_pages: Support initializing pages to a zero reference count

Message ID 166329940343.2786261.6047770378829215962.stgit@dwillia2-xfh.jf.intel.com (mailing list archive)
State New
Headers show
Series Fix the DAX-gup mistake | expand

Commit Message

Dan Williams Sept. 16, 2022, 3:36 a.m. UTC
The initial memremap_pages() implementation inherited the
__init_single_page() default of pages starting life with an elevated
reference count. This originally allowed for the page->pgmap pointer to
alias with the storage for page->lru since a page was only allowed to be
on an lru list when its reference count was zero.

Since then, 'struct page' definition cleanups have arranged for
dedicated space for the ZONE_DEVICE page metadata, and the
MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
page->_refcount transition to route the page to free_zone_device_page()
and not the core-mm page-free. With those cleanups in place and with
filesystem-dax and device-dax now converted to take and drop references
at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
and MEMORY_DEVICE_GENERIC reference counts at 0.

MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
pages start life at _refcount 1, so make that the default if
pgmap->init_mode is left at zero.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c     |    1 +
 drivers/nvdimm/pmem.c    |    2 ++
 include/linux/dax.h      |    2 +-
 include/linux/memremap.h |    5 +++++
 mm/memremap.c            |   15 ++++++++++-----
 mm/page_alloc.c          |    2 ++
 6 files changed, 21 insertions(+), 6 deletions(-)

Comments

Jason Gunthorpe Sept. 21, 2022, 3:24 p.m. UTC | #1
On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> The initial memremap_pages() implementation inherited the
> __init_single_page() default of pages starting life with an elevated
> reference count. This originally allowed for the page->pgmap pointer to
> alias with the storage for page->lru since a page was only allowed to be
> on an lru list when its reference count was zero.
> 
> Since then, 'struct page' definition cleanups have arranged for
> dedicated space for the ZONE_DEVICE page metadata, and the
> MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> page->_refcount transition to route the page to free_zone_device_page()
> and not the core-mm page-free. With those cleanups in place and with
> filesystem-dax and device-dax now converted to take and drop references
> at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> and MEMORY_DEVICE_GENERIC reference counts at 0.
> 
> MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> pages start life at _refcount 1, so make that the default if
> pgmap->init_mode is left at zero.

I'm shocked to read this - how does it make any sense?

dev_pagemap_ops->page_free() is only called on the 1->0 transition, so
any driver which implements it must be expecting pages to have a 0
refcount.

Looking around everything but only fsdax_pagemap_ops implements
page_free()

So, how does it work? Surely the instant the page map is created all
the pages must be considered 'free', and after page_free() is called I
would also expect the page to be considered free.

How on earth can a free'd page have both a 0 and 1 refcount??

eg look at the simple hmm_test, it threads pages on to the
mdevice->free_pages list immediately after memremap_pages and then
again inside page_free() - it is completely wrong that they would have
different refcounts while on the free_pages list.

I would expect that after the page is removed from the free_pages list
it will have its recount set to 1 to make it non-free then it will go
through the migration.

Alistair how should the refcounting be working here in hmm_test?

Jason
Dan Williams Sept. 21, 2022, 11:45 p.m. UTC | #2
Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> > The initial memremap_pages() implementation inherited the
> > __init_single_page() default of pages starting life with an elevated
> > reference count. This originally allowed for the page->pgmap pointer to
> > alias with the storage for page->lru since a page was only allowed to be
> > on an lru list when its reference count was zero.
> > 
> > Since then, 'struct page' definition cleanups have arranged for
> > dedicated space for the ZONE_DEVICE page metadata, and the
> > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> > page->_refcount transition to route the page to free_zone_device_page()
> > and not the core-mm page-free. With those cleanups in place and with
> > filesystem-dax and device-dax now converted to take and drop references
> > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> > and MEMORY_DEVICE_GENERIC reference counts at 0.
> > 
> > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> > pages start life at _refcount 1, so make that the default if
> > pgmap->init_mode is left at zero.
> 
> I'm shocked to read this - how does it make any sense?

I think what happened is that since memremap_pages() historically
produced pages with an elevated reference count that GPU drivers skipped
taking a reference on first allocation and just passed along an elevated
reference count page to the first user.

So either we keep that assumption or update all users to be prepared for
idle pages coming out of memremap_pages().

This is all in reaction to the "set_page_count(page, 1);" in
free_zone_device_page(). Which I am happy to get rid of but need from
help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
memremap_pages() starting all pages at reference count 0.

> dev_pagemap_ops->page_free() is only called on the 1->0 transition, so
> any driver which implements it must be expecting pages to have a 0
> refcount.
> 
> Looking around everything but only fsdax_pagemap_ops implements
> page_free()

Right.

> So, how does it work? Surely the instant the page map is created all
> the pages must be considered 'free', and after page_free() is called I
> would also expect the page to be considered free.

The GPU drivers need to increment reference counts when they hand out
the page rather than reuse the reference count that they get by default.

> How on earth can a free'd page have both a 0 and 1 refcount??

This is residual wonkiness from memremap_pages() handing out pages with
elevated reference counts at the outset.

> eg look at the simple hmm_test, it threads pages on to the
> mdevice->free_pages list immediately after memremap_pages and then
> again inside page_free() - it is completely wrong that they would have
> different refcounts while on the free_pages list.

I do not see any page_ref_inc() in that test, only put_page() so it is
assuming non-idle pages at the outset.

> I would expect that after the page is removed from the free_pages list
> it will have its recount set to 1 to make it non-free then it will go
> through the migration.
> 
> Alistair how should the refcounting be working here in hmm_test?
> 
> Jason
Alistair Popple Sept. 22, 2022, 12:03 a.m. UTC | #3
Dan Williams <dan.j.williams@intel.com> writes:

> Jason Gunthorpe wrote:
>> On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
>> > The initial memremap_pages() implementation inherited the
>> > __init_single_page() default of pages starting life with an elevated
>> > reference count. This originally allowed for the page->pgmap pointer to
>> > alias with the storage for page->lru since a page was only allowed to be
>> > on an lru list when its reference count was zero.
>> >
>> > Since then, 'struct page' definition cleanups have arranged for
>> > dedicated space for the ZONE_DEVICE page metadata, and the
>> > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
>> > page->_refcount transition to route the page to free_zone_device_page()
>> > and not the core-mm page-free. With those cleanups in place and with
>> > filesystem-dax and device-dax now converted to take and drop references
>> > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
>> > and MEMORY_DEVICE_GENERIC reference counts at 0.
>> >
>> > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
>> > pages start life at _refcount 1, so make that the default if
>> > pgmap->init_mode is left at zero.
>>
>> I'm shocked to read this - how does it make any sense?
>
> I think what happened is that since memremap_pages() historically
> produced pages with an elevated reference count that GPU drivers skipped
> taking a reference on first allocation and just passed along an elevated
> reference count page to the first user.
>
> So either we keep that assumption or update all users to be prepared for
> idle pages coming out of memremap_pages().
>
> This is all in reaction to the "set_page_count(page, 1);" in
> free_zone_device_page(). Which I am happy to get rid of but need from
> help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> memremap_pages() starting all pages at reference count 0.

This is all rather good timing - This week I've been in the middle of
getting a series together which fixes this among other things. So I'm
all for fixing it and can help with that - my motivation was I needed to
be able to tell if a page is free or not with get_page_unless_zero()
etc. which doesn't work at the moment because free device
private/coherent pages have an elevated refcount.

 - Alistair

>> dev_pagemap_ops->page_free() is only called on the 1->0 transition, so
>> any driver which implements it must be expecting pages to have a 0
>> refcount.
>>
>> Looking around everything but only fsdax_pagemap_ops implements
>> page_free()
>
> Right.
>
>> So, how does it work? Surely the instant the page map is created all
>> the pages must be considered 'free', and after page_free() is called I
>> would also expect the page to be considered free.
>
> The GPU drivers need to increment reference counts when they hand out
> the page rather than reuse the reference count that they get by default.
>
>> How on earth can a free'd page have both a 0 and 1 refcount??
>
> This is residual wonkiness from memremap_pages() handing out pages with
> elevated reference counts at the outset.
>
>> eg look at the simple hmm_test, it threads pages on to the
>> mdevice->free_pages list immediately after memremap_pages and then
>> again inside page_free() - it is completely wrong that they would have
>> different refcounts while on the free_pages list.
>
> I do not see any page_ref_inc() in that test, only put_page() so it is
> assuming non-idle pages at the outset.
>
>> I would expect that after the page is removed from the free_pages list
>> it will have its recount set to 1 to make it non-free then it will go
>> through the migration.
>>
>> Alistair how should the refcounting be working here in hmm_test?
>>
>> Jason
Jason Gunthorpe Sept. 22, 2022, 12:04 a.m. UTC | #4
On Wed, Sep 21, 2022 at 04:45:22PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> > > The initial memremap_pages() implementation inherited the
> > > __init_single_page() default of pages starting life with an elevated
> > > reference count. This originally allowed for the page->pgmap pointer to
> > > alias with the storage for page->lru since a page was only allowed to be
> > > on an lru list when its reference count was zero.
> > > 
> > > Since then, 'struct page' definition cleanups have arranged for
> > > dedicated space for the ZONE_DEVICE page metadata, and the
> > > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> > > page->_refcount transition to route the page to free_zone_device_page()
> > > and not the core-mm page-free. With those cleanups in place and with
> > > filesystem-dax and device-dax now converted to take and drop references
> > > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> > > and MEMORY_DEVICE_GENERIC reference counts at 0.
> > > 
> > > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> > > pages start life at _refcount 1, so make that the default if
> > > pgmap->init_mode is left at zero.
> > 
> > I'm shocked to read this - how does it make any sense?
> 
> I think what happened is that since memremap_pages() historically
> produced pages with an elevated reference count that GPU drivers skipped
> taking a reference on first allocation and just passed along an elevated
> reference count page to the first user.
> 
> So either we keep that assumption or update all users to be prepared for
> idle pages coming out of memremap_pages().
> 
> This is all in reaction to the "set_page_count(page, 1);" in
> free_zone_device_page(). Which I am happy to get rid of but need from
> help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> memremap_pages() starting all pages at reference count 0.

But, but this is all racy, it can't do this:

+	if (pgmap->ops && pgmap->ops->page_free)
+		pgmap->ops->page_free(page);
 
 	/*
+	 * Reset the page count to the @init_mode value to prepare for
+	 * handing out the page again.
 	 */
+	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
+		set_page_count(page, 1);

after the fact! Something like that hmm_test has already threaded the
"freed" page into the free list via ops->page_free(), it can't have a
0 ref count and be on the free list, even temporarily :(

Maybe it nees to be re-ordered?

> > How on earth can a free'd page have both a 0 and 1 refcount??
> 
> This is residual wonkiness from memremap_pages() handing out pages with
> elevated reference counts at the outset.

I think the answer to my question is the above troubled code where we
still set the page refcount back to 1 even in the page_free path, so
there is some consistency "a freed paged may have a refcount of 1" for
the driver.

So, I guess this patch makes sense but I would put more noise around
INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
explicit constant) and alert people that they need to fix their stuff
to get rid of it.

We should definately try to fix hmm_test as well so people have a good
reference code to follow in fixing the other drivers :(

Jason
John Hubbard Sept. 22, 2022, 12:13 a.m. UTC | #5
On 9/21/22 16:45, Dan Williams wrote:
>> I'm shocked to read this - how does it make any sense?
> 
> I think what happened is that since memremap_pages() historically
> produced pages with an elevated reference count that GPU drivers skipped
> taking a reference on first allocation and just passed along an elevated
> reference count page to the first user.
> 
> So either we keep that assumption or update all users to be prepared for
> idle pages coming out of memremap_pages().
> 
> This is all in reaction to the "set_page_count(page, 1);" in
> free_zone_device_page(). Which I am happy to get rid of but need from
> help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> memremap_pages() starting all pages at reference count 0.
> 

Just one tiny thing to contribute to this difficult story: I think that
we can make this slightly clearer by saying things like this:

"The device driver is the allocator for device pages. And allocators
keep pages at a refcount == 0, until they hand out the pages in response
to allocation requests."

To me at least, this makes it easier to see why pages are 0 or > 0 
refcounts. In case that helps at all.


thanks,
Dan Williams Sept. 22, 2022, 12:34 a.m. UTC | #6
Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 04:45:22PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> > > > The initial memremap_pages() implementation inherited the
> > > > __init_single_page() default of pages starting life with an elevated
> > > > reference count. This originally allowed for the page->pgmap pointer to
> > > > alias with the storage for page->lru since a page was only allowed to be
> > > > on an lru list when its reference count was zero.
> > > > 
> > > > Since then, 'struct page' definition cleanups have arranged for
> > > > dedicated space for the ZONE_DEVICE page metadata, and the
> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> > > > page->_refcount transition to route the page to free_zone_device_page()
> > > > and not the core-mm page-free. With those cleanups in place and with
> > > > filesystem-dax and device-dax now converted to take and drop references
> > > > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> > > > and MEMORY_DEVICE_GENERIC reference counts at 0.
> > > > 
> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> > > > pages start life at _refcount 1, so make that the default if
> > > > pgmap->init_mode is left at zero.
> > > 
> > > I'm shocked to read this - how does it make any sense?
> > 
> > I think what happened is that since memremap_pages() historically
> > produced pages with an elevated reference count that GPU drivers skipped
> > taking a reference on first allocation and just passed along an elevated
> > reference count page to the first user.
> > 
> > So either we keep that assumption or update all users to be prepared for
> > idle pages coming out of memremap_pages().
> > 
> > This is all in reaction to the "set_page_count(page, 1);" in
> > free_zone_device_page(). Which I am happy to get rid of but need from
> > help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> > memremap_pages() starting all pages at reference count 0.
> 
> But, but this is all racy, it can't do this:
> 
> +	if (pgmap->ops && pgmap->ops->page_free)
> +		pgmap->ops->page_free(page);
>  
>  	/*
> +	 * Reset the page count to the @init_mode value to prepare for
> +	 * handing out the page again.
>  	 */
> +	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
> +		set_page_count(page, 1);
> 
> after the fact! Something like that hmm_test has already threaded the
> "freed" page into the free list via ops->page_free(), it can't have a
> 0 ref count and be on the free list, even temporarily :(
> 
> Maybe it nees to be re-ordered?
> 
> > > How on earth can a free'd page have both a 0 and 1 refcount??
> > 
> > This is residual wonkiness from memremap_pages() handing out pages with
> > elevated reference counts at the outset.
> 
> I think the answer to my question is the above troubled code where we
> still set the page refcount back to 1 even in the page_free path, so
> there is some consistency "a freed paged may have a refcount of 1" for
> the driver.
> 
> So, I guess this patch makes sense but I would put more noise around
> INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
> explicit constant) and alert people that they need to fix their stuff
> to get rid of it.

Sounds reasonable.

> We should definately try to fix hmm_test as well so people have a good
> reference code to follow in fixing the other drivers :(

Oh, that's a good idea. I can probably fix that up and leave it to the
GPU driver folks to catch up with that example so we can kill off
INIT_PAGEMAP_BUSY.
Alistair Popple Sept. 22, 2022, 1:36 a.m. UTC | #7
Dan Williams <dan.j.williams@intel.com> writes:

> Jason Gunthorpe wrote:
>> On Wed, Sep 21, 2022 at 04:45:22PM -0700, Dan Williams wrote:
>> > Jason Gunthorpe wrote:
>> > > On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
>> > > > The initial memremap_pages() implementation inherited the
>> > > > __init_single_page() default of pages starting life with an elevated
>> > > > reference count. This originally allowed for the page->pgmap pointer to
>> > > > alias with the storage for page->lru since a page was only allowed to be
>> > > > on an lru list when its reference count was zero.
>> > > >
>> > > > Since then, 'struct page' definition cleanups have arranged for
>> > > > dedicated space for the ZONE_DEVICE page metadata, and the
>> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
>> > > > page->_refcount transition to route the page to free_zone_device_page()
>> > > > and not the core-mm page-free. With those cleanups in place and with
>> > > > filesystem-dax and device-dax now converted to take and drop references
>> > > > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
>> > > > and MEMORY_DEVICE_GENERIC reference counts at 0.
>> > > >
>> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
>> > > > pages start life at _refcount 1, so make that the default if
>> > > > pgmap->init_mode is left at zero.
>> > >
>> > > I'm shocked to read this - how does it make any sense?
>> >
>> > I think what happened is that since memremap_pages() historically
>> > produced pages with an elevated reference count that GPU drivers skipped
>> > taking a reference on first allocation and just passed along an elevated
>> > reference count page to the first user.
>> >
>> > So either we keep that assumption or update all users to be prepared for
>> > idle pages coming out of memremap_pages().
>> >
>> > This is all in reaction to the "set_page_count(page, 1);" in
>> > free_zone_device_page(). Which I am happy to get rid of but need from
>> > help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
>> > memremap_pages() starting all pages at reference count 0.
>>
>> But, but this is all racy, it can't do this:
>>
>> +	if (pgmap->ops && pgmap->ops->page_free)
>> +		pgmap->ops->page_free(page);
>>
>>  	/*
>> +	 * Reset the page count to the @init_mode value to prepare for
>> +	 * handing out the page again.
>>  	 */
>> +	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
>> +		set_page_count(page, 1);
>>
>> after the fact! Something like that hmm_test has already threaded the
>> "freed" page into the free list via ops->page_free(), it can't have a
>> 0 ref count and be on the free list, even temporarily :(
>>
>> Maybe it nees to be re-ordered?
>>
>> > > How on earth can a free'd page have both a 0 and 1 refcount??
>> >
>> > This is residual wonkiness from memremap_pages() handing out pages with
>> > elevated reference counts at the outset.
>>
>> I think the answer to my question is the above troubled code where we
>> still set the page refcount back to 1 even in the page_free path, so
>> there is some consistency "a freed paged may have a refcount of 1" for
>> the driver.
>>
>> So, I guess this patch makes sense but I would put more noise around
>> INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
>> explicit constant) and alert people that they need to fix their stuff
>> to get rid of it.
>
> Sounds reasonable.
>
>> We should definately try to fix hmm_test as well so people have a good
>> reference code to follow in fixing the other drivers :(
>
> Oh, that's a good idea. I can probably fix that up and leave it to the
> GPU driver folks to catch up with that example so we can kill off
> INIT_PAGEMAP_BUSY.

I'm hoping to send my series that fixes up all drivers using device
coherent/private later this week or early next. So you could also just
wait for that and remove INIT_PAGEMAP_BUSY entirely.

 - Alistair
Dan Williams Sept. 22, 2022, 2:34 a.m. UTC | #8
Alistair Popple wrote:
> 
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > Jason Gunthorpe wrote:
> >> On Wed, Sep 21, 2022 at 04:45:22PM -0700, Dan Williams wrote:
> >> > Jason Gunthorpe wrote:
> >> > > On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> >> > > > The initial memremap_pages() implementation inherited the
> >> > > > __init_single_page() default of pages starting life with an elevated
> >> > > > reference count. This originally allowed for the page->pgmap pointer to
> >> > > > alias with the storage for page->lru since a page was only allowed to be
> >> > > > on an lru list when its reference count was zero.
> >> > > >
> >> > > > Since then, 'struct page' definition cleanups have arranged for
> >> > > > dedicated space for the ZONE_DEVICE page metadata, and the
> >> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> >> > > > page->_refcount transition to route the page to free_zone_device_page()
> >> > > > and not the core-mm page-free. With those cleanups in place and with
> >> > > > filesystem-dax and device-dax now converted to take and drop references
> >> > > > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> >> > > > and MEMORY_DEVICE_GENERIC reference counts at 0.
> >> > > >
> >> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> >> > > > pages start life at _refcount 1, so make that the default if
> >> > > > pgmap->init_mode is left at zero.
> >> > >
> >> > > I'm shocked to read this - how does it make any sense?
> >> >
> >> > I think what happened is that since memremap_pages() historically
> >> > produced pages with an elevated reference count that GPU drivers skipped
> >> > taking a reference on first allocation and just passed along an elevated
> >> > reference count page to the first user.
> >> >
> >> > So either we keep that assumption or update all users to be prepared for
> >> > idle pages coming out of memremap_pages().
> >> >
> >> > This is all in reaction to the "set_page_count(page, 1);" in
> >> > free_zone_device_page(). Which I am happy to get rid of but need from
> >> > help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> >> > memremap_pages() starting all pages at reference count 0.
> >>
> >> But, but this is all racy, it can't do this:
> >>
> >> +	if (pgmap->ops && pgmap->ops->page_free)
> >> +		pgmap->ops->page_free(page);
> >>
> >>  	/*
> >> +	 * Reset the page count to the @init_mode value to prepare for
> >> +	 * handing out the page again.
> >>  	 */
> >> +	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
> >> +		set_page_count(page, 1);
> >>
> >> after the fact! Something like that hmm_test has already threaded the
> >> "freed" page into the free list via ops->page_free(), it can't have a
> >> 0 ref count and be on the free list, even temporarily :(
> >>
> >> Maybe it nees to be re-ordered?
> >>
> >> > > How on earth can a free'd page have both a 0 and 1 refcount??
> >> >
> >> > This is residual wonkiness from memremap_pages() handing out pages with
> >> > elevated reference counts at the outset.
> >>
> >> I think the answer to my question is the above troubled code where we
> >> still set the page refcount back to 1 even in the page_free path, so
> >> there is some consistency "a freed paged may have a refcount of 1" for
> >> the driver.
> >>
> >> So, I guess this patch makes sense but I would put more noise around
> >> INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
> >> explicit constant) and alert people that they need to fix their stuff
> >> to get rid of it.
> >
> > Sounds reasonable.
> >
> >> We should definately try to fix hmm_test as well so people have a good
> >> reference code to follow in fixing the other drivers :(
> >
> > Oh, that's a good idea. I can probably fix that up and leave it to the
> > GPU driver folks to catch up with that example so we can kill off
> > INIT_PAGEMAP_BUSY.
> 
> I'm hoping to send my series that fixes up all drivers using device
> coherent/private later this week or early next. So you could also just
> wait for that and remove INIT_PAGEMAP_BUSY entirely.

Oh, perfect, thanks!
Alistair Popple Sept. 26, 2022, 6:17 a.m. UTC | #9
Dan Williams <dan.j.williams@intel.com> writes:

[...]

>> >> > > How on earth can a free'd page have both a 0 and 1 refcount??
>> >> >
>> >> > This is residual wonkiness from memremap_pages() handing out pages with
>> >> > elevated reference counts at the outset.
>> >>
>> >> I think the answer to my question is the above troubled code where we
>> >> still set the page refcount back to 1 even in the page_free path, so
>> >> there is some consistency "a freed paged may have a refcount of 1" for
>> >> the driver.
>> >>
>> >> So, I guess this patch makes sense but I would put more noise around
>> >> INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
>> >> explicit constant) and alert people that they need to fix their stuff
>> >> to get rid of it.
>> >
>> > Sounds reasonable.
>> >
>> >> We should definately try to fix hmm_test as well so people have a good
>> >> reference code to follow in fixing the other drivers :(
>> >
>> > Oh, that's a good idea. I can probably fix that up and leave it to the
>> > GPU driver folks to catch up with that example so we can kill off
>> > INIT_PAGEMAP_BUSY.
>>
>> I'm hoping to send my series that fixes up all drivers using device
>> coherent/private later this week or early next. So you could also just
>> wait for that and remove INIT_PAGEMAP_BUSY entirely.
>
> Oh, perfect, thanks!

See
https://lore.kernel.org/linux-mm/3d74bb439723c7e46cbe47d1711795308aee4ae3.1664171943.git-series.apopple@nvidia.com/

I already had this in a series because the change was motivated by a
later patch there, but it's a standalone change and there's no reason it
couldn't be split out into it's own patch if that's better for you.

 - Alistair
diff mbox series

Patch

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 7f306939807e..8a7281d16c99 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -460,6 +460,7 @@  int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	pgmap->init_mode = INIT_PAGEMAP_IDLE;
 	if (dev_dax->align > PAGE_SIZE)
 		pgmap->vmemmap_shift =
 			order_base_2(dev_dax->align >> PAGE_SHIFT);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 7e88cd242380..9c98dcb9f33d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -529,6 +529,7 @@  static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		pmem->pgmap.init_mode = INIT_PAGEMAP_IDLE;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pfn_sb = nd_pfn->pfn_sb;
@@ -543,6 +544,7 @@  static int pmem_attach_disk(struct device *dev,
 		pmem->pgmap.range.end = res->end;
 		pmem->pgmap.nr_range = 1;
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		pmem->pgmap.init_mode = INIT_PAGEMAP_IDLE;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pmem->pfn_flags |= PFN_MAP;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3a27fecf072a..b9fdd8951e06 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -235,7 +235,7 @@  static inline void dax_unlock_mapping_entry(struct address_space *mapping,
  */
 static inline bool dax_page_idle(struct page *page)
 {
-	return page_ref_count(page) == 1;
+	return page_ref_count(page) == 0;
 }
 
 bool dax_alive(struct dax_device *dax_dev);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5d30eec3bf1..9f1a57efd371 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -116,6 +116,7 @@  struct dev_pagemap_ops {
  *	representation. A bigger value will set up compound struct pages
  *	of the requested order value.
  * @ops: method table
+ * @init_mode: initial reference count mode
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
  *	foreign ZONE_DEVICE memory is accessed.
@@ -131,6 +132,10 @@  struct dev_pagemap {
 	unsigned int flags;
 	unsigned long vmemmap_shift;
 	const struct dev_pagemap_ops *ops;
+	enum {
+		INIT_PAGEMAP_BUSY = 0, /* default / historical */
+		INIT_PAGEMAP_IDLE,
+	} init_mode;
 	void *owner;
 	int nr_range;
 	union {
diff --git a/mm/memremap.c b/mm/memremap.c
index 83c5e6fafd84..b6a7a95339b3 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -467,8 +467,10 @@  EXPORT_SYMBOL_GPL(get_dev_pagemap_many);
 
 void free_zone_device_page(struct page *page)
 {
-	if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
-		return;
+	struct dev_pagemap *pgmap = page->pgmap;
+
+	/* wake filesystem 'break dax layouts' waiters */
+	wake_up_var(page);
 
 	mem_cgroup_uncharge(page_folio(page));
 
@@ -503,12 +505,15 @@  void free_zone_device_page(struct page *page)
 	 * to clear page->mapping.
 	 */
 	page->mapping = NULL;
-	page->pgmap->ops->page_free(page);
+	if (pgmap->ops && pgmap->ops->page_free)
+		pgmap->ops->page_free(page);
 
 	/*
-	 * Reset the page count to 1 to prepare for handing out the page again.
+	 * Reset the page count to the @init_mode value to prepare for
+	 * handing out the page again.
 	 */
-	set_page_count(page, 1);
+	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
+		set_page_count(page, 1);
 }
 
 #ifdef CONFIG_FS_DAX
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..8ee52992055b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6719,6 +6719,8 @@  static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 {
 
 	__init_single_page(page, pfn, zone_idx, nid);
+	if (pgmap->init_mode == INIT_PAGEMAP_IDLE)
+		set_page_count(page, 0);
 
 	/*
 	 * Mark page reserved as it will need to wait for onlining