
[PATCHv3,13/41] truncate: make sure invalidate_mapping_pages() can discard huge pages

Message ID 20160915115523.29737-14-kirill.shutemov@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Kirill A. Shutemov Sept. 15, 2016, 11:54 a.m. UTC
invalidate_inode_page() has an expectation about the page_count() of the
page -- if it's not 2 (one for the caller, one for the radix tree), the
page will not be dropped. That condition is almost never met for THPs --
tail pages are pinned by the pagevec.

Let's drop them before calling invalidate_inode_page().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/truncate.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
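
[Editor's note: a minimal sketch of the refcount expectation described
above, paraphrasing the __remove_mapping() logic of this kernel era --
not a verbatim copy, and the helper name is invented for illustration:]

	/* Sketch: why page_count() must be 2 for the page to be dropped. */
	static int sketch_drop_from_page_cache(struct page *page)
	{
		/*
		 * Freeze the refcount at exactly 2: one reference held
		 * by the caller, one by the radix tree. Any extra pin --
		 * e.g. a pagevec still holding head/tail pages of a THP --
		 * makes the freeze fail, and the page stays in the cache.
		 */
		if (!page_ref_freeze(page, 2))
			return 0;
		/* ... delete from the radix tree, then free the page ... */
		return 1;
	}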

Comments

Jan Kara Oct. 11, 2016, 3:58 p.m. UTC | #1
On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> invalidate_inode_page() has expectation about page_count() of the page
> -- if it's not 2 (one to caller, one to radix-tree), it will not be
> dropped. That condition almost never met for THPs -- tail pages are
> pinned to the pagevec.
> 
> Let's drop them, before calling invalidate_inode_page().
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/truncate.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index a01cce450a26..ce904e4b1708 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  				/* 'end' is in the middle of THP */
>  				if (index ==  round_down(end, HPAGE_PMD_NR))
>  					continue;
> +				/*
> +				 * invalidate_inode_page() expects
> +				 * page_count(page) == 2 to drop page from page
> +				 * cache -- drop tail pages references.
> +				 */
> +				get_page(page);
> +				pagevec_release(&pvec);

I'm not quite sure why this is needed. When you have a multi-order entry in
the radix tree for your huge page, you should not get more entries in
the pagevec for that huge page. What am I missing?

								Honza
Kirill A. Shutemov Oct. 11, 2016, 9:53 p.m. UTC | #2
On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > invalidate_inode_page() has expectation about page_count() of the page
> > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > dropped. That condition almost never met for THPs -- tail pages are
> > pinned to the pagevec.
> > 
> > Let's drop them, before calling invalidate_inode_page().
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/truncate.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index a01cce450a26..ce904e4b1708 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> >  				/* 'end' is in the middle of THP */
> >  				if (index ==  round_down(end, HPAGE_PMD_NR))
> >  					continue;
> > +				/*
> > +				 * invalidate_inode_page() expects
> > +				 * page_count(page) == 2 to drop page from page
> > +				 * cache -- drop tail pages references.
> > +				 */
> > +				get_page(page);
> > +				pagevec_release(&pvec);
> 
> I'm not quite sure why this is needed. When you have multiorder entry in
> the radix tree for your huge page, then you should not get more entries in
> the pagevec for your huge page. What do I miss?

For compatibility reasons, find_get_entries() (which is called by
pagevec_lookup_entries()) collects all subpages of a huge page in the
range (head and tails). See patch [07/41].

So a huge page that is fully within the range will be pinned up to
PAGEVEC_SIZE times.
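
[Editor's note: an illustration of the behaviour described above,
assuming the find_get_entries() semantics introduced by patch 07/41 of
this series -- a sketch, not code taken from the series:]

	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;

	pagevec_init(&pvec, 0);
	/* index falls in a range fully covered by one 2MB THP */
	pvec.nr = find_get_entries(mapping, index, PAGEVEC_SIZE,
				   pvec.pages, indices);
	/*
	 * pvec.pages[] may now be head, tail, tail, ... of the same
	 * compound page; each entry holds a reference, so the THP is
	 * pinned up to PAGEVEC_SIZE times until pagevec_release().
	 */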
Jan Kara Oct. 12, 2016, 6:43 a.m. UTC | #3
On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > invalidate_inode_page() has expectation about page_count() of the page
> > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > dropped. That condition almost never met for THPs -- tail pages are
> > > pinned to the pagevec.
> > > 
> > > Let's drop them, before calling invalidate_inode_page().
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > >  mm/truncate.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > index a01cce450a26..ce904e4b1708 100644
> > > --- a/mm/truncate.c
> > > +++ b/mm/truncate.c
> > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > >  				/* 'end' is in the middle of THP */
> > >  				if (index ==  round_down(end, HPAGE_PMD_NR))
> > >  					continue;
> > > +				/*
> > > +				 * invalidate_inode_page() expects
> > > +				 * page_count(page) == 2 to drop page from page
> > > +				 * cache -- drop tail pages references.
> > > +				 */
> > > +				get_page(page);
> > > +				pagevec_release(&pvec);
> > 
> > I'm not quite sure why this is needed. When you have multiorder entry in
> > the radix tree for your huge page, then you should not get more entries in
> > the pagevec for your huge page. What do I miss?
> 
> For compatibility reason find_get_entries() (which is called by
> pagevec_lookup_entries()) collects all subpages of huge page in the range
> (head/tails). See patch [07/41]
> 
> So huge page, which is fully in the range it will be pinned up to
> PAGEVEC_SIZE times.

Yeah, I see. But then wouldn't it be cleaner to provide an iteration method
that adds each radix tree entry (regardless of its order) to the pagevec
only once, and then use it in the places where we care, instead of the
strange dances you do here?

Ultimately we could convert all the places to use these new iteration
methods, but I don't see that as immediately necessary, and maybe there are
places where getting all the subpages in the pagevec actually makes life
simpler for us (please point me to one if you know of such a place).

On a somewhat unrelated note: I've noticed that you don't invalidate
a huge page when only part of it should be invalidated. That actually
breaks some assumptions filesystems make. In particular, the direct IO
code assumes that if you do

	filemap_write_and_wait_range(inode, start, end);
	invalidate_inode_pages2_range(inode, start, end);

all the page cache covering start-end *will* be invalidated. Your skipping
of partial pages breaks this assumption and thus can introduce consistency
issues (e.g. a write done using direct IO won't be seen by a following
buffered read).
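
[Editor's note: roughly the pattern in generic_file_direct_write() of
this era -- paraphrased and simplified, error handling elided:]

	written = filemap_write_and_wait_range(mapping, pos, end);
	if (written)
		goto out;
	/*
	 * If any cached page covering [pos, end] survives the call
	 * below, a later buffered read can return stale data instead
	 * of what the direct IO write put on disk.
	 */
	written = invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);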

								Honza
Kirill A. Shutemov Oct. 24, 2016, 10:41 a.m. UTC | #4
On Wed, Oct 12, 2016 at 08:43:20AM +0200, Jan Kara wrote:
> On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> > On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > > invalidate_inode_page() has expectation about page_count() of the page
> > > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > > dropped. That condition almost never met for THPs -- tail pages are
> > > > pinned to the pagevec.
> > > > 
> > > > Let's drop them, before calling invalidate_inode_page().
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > ---
> > > >  mm/truncate.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > > 
> > > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > > index a01cce450a26..ce904e4b1708 100644
> > > > --- a/mm/truncate.c
> > > > +++ b/mm/truncate.c
> > > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > >  				/* 'end' is in the middle of THP */
> > > >  				if (index ==  round_down(end, HPAGE_PMD_NR))
> > > >  					continue;
> > > > +				/*
> > > > +				 * invalidate_inode_page() expects
> > > > +				 * page_count(page) == 2 to drop page from page
> > > > +				 * cache -- drop tail pages references.
> > > > +				 */
> > > > +				get_page(page);
> > > > +				pagevec_release(&pvec);
> > > 
> > > I'm not quite sure why this is needed. When you have multiorder entry in
> > > the radix tree for your huge page, then you should not get more entries in
> > > the pagevec for your huge page. What do I miss?
> > 
> > For compatibility reason find_get_entries() (which is called by
> > pagevec_lookup_entries()) collects all subpages of huge page in the range
> > (head/tails). See patch [07/41]
> > 
> > So huge page, which is fully in the range it will be pinned up to
> > PAGEVEC_SIZE times.
> 
> Yeah, I see. But then won't it be cleaner to provide iteration method that
> would add to pagevec each radix tree entry (regardless of its order) only
> once and then use it in places where we care? Instead of strange dances
> like you do here?

Maybe. It would require doubling the number of find_get_* helpers, or
adding a flag to each of them. We have too many already.

And the multi-order entry interface for the radix tree has not yet settled
in. I would rather defer such a rework until it has fully taken shape.

Let's come back to this later.

> Ultimately we could convert all the places to use these new iteration
> methods but I don't see that as immediately necessary and maybe there are
> places where getting all the subpages in the pagevec actually makes life
> simpler for us (please point me if you know about such place).

I did it the way I did to avoid having to evaluate each use of
find_get_*() one by one. I guessed that most callers of find_get_page()
would be confused by getting the head page instead of the relevant subpage.
Maybe I was wrong and it would have been easier to make the callers work
with that. I don't know...

> On a somewhat unrelated note: I've noticed that you don't invalidate
> a huge page when only part of it should be invalidated. That actually
> breaks some assumptions filesystems make. In particular direct IO code
> assumes that if you do
> 
> 	filemap_write_and_wait_range(inode, start, end);
> 	invalidate_inode_pages2_range(inode, start, end);
> 
> all the page cache covering start-end *will* be invalidated. Your skipping
> of partial pages breaks this assumption and thus can bring consistency
> issues (e.g. write done using direct IO won't be seen by following buffered
> read).

Actually, invalidate_inode_pages2_range() does invalidate the whole page if
part of it is in the range. I caught this problem during testing.
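
[Editor's note: an illustration of the stated semantics, not code from
the series; a 2MB THP caching file pages 0-511 of "mapping" is assumed:]

	/* Ask to invalidate only the first 4KB page of the THP: */
	invalidate_inode_pages2_range(mapping, 0, 0);
	/*
	 * Per the statement above, the whole compound page is dropped,
	 * not just the first subpage, so the direct IO invariant Jan
	 * describes still holds.
	 */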

Patch

diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..ce904e4b1708 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -504,10 +504,21 @@  unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				/* 'end' is in the middle of THP */
 				if (index ==  round_down(end, HPAGE_PMD_NR))
 					continue;
+				/*
+				 * invalidate_inode_page() expects
+				 * page_count(page) == 2 to drop page from page
+				 * cache -- drop tail pages references.
+				 */
+				get_page(page);
+				pagevec_release(&pvec);
 			}
 
 			ret = invalidate_inode_page(page);
 			unlock_page(page);
+
+			if (PageTransHuge(page))
+				put_page(page);
+
 			/*
 			 * Invalidation is a hint that the page is no longer
 			 * of interest and try to speed up its reclaim.
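
[Editor's note: the THP reference flow through this hunk, under the
pagevec pinning discussed earlier in the thread; counts are
illustrative:]

	/*
	 * head page refcount: 1 (radix tree) + k (pagevec pins on the
	 * head/tail pages)
	 *
	 * get_page(page);          -> +1, keeps the THP alive below
	 * pagevec_release(&pvec);  -> -k, drops every pagevec pin
	 *
	 * page_count(page) is now 2 (radix tree + our get_page), which
	 * is exactly what invalidate_inode_page() requires in order to
	 * drop the page from the page cache.
	 *
	 * invalidate_inode_page(page);
	 * unlock_page(page);
	 * put_page(page);          -> drop the temporary reference
	 */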