Message ID | 20160915115523.29737-14-kirill.shutemov@linux.intel.com (mailing list archive)
---|---
State | New, archived
On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> invalidate_inode_page() has an expectation about page_count() of the
> page -- if it's not 2 (one to caller, one to radix-tree), it will not
> be dropped. That condition is almost never met for THPs -- tail pages
> are pinned to the pagevec.
>
> Let's drop them before calling invalidate_inode_page().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/truncate.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index a01cce450a26..ce904e4b1708 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  			/* 'end' is in the middle of THP */
>  			if (index == round_down(end, HPAGE_PMD_NR))
>  				continue;
> +			/*
> +			 * invalidate_inode_page() expects
> +			 * page_count(page) == 2 to drop page from page
> +			 * cache -- drop tail pages references.
> +			 */
> +			get_page(page);
> +			pagevec_release(&pvec);

I'm not quite sure why this is needed. When you have a multiorder
entry in the radix tree for your huge page, then you should not get
more entries in the pagevec for your huge page. What do I miss?

								Honza
On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > invalidate_inode_page() has an expectation about page_count() of
> > the page -- if it's not 2 (one to caller, one to radix-tree), it
> > will not be dropped. That condition is almost never met for THPs
> > -- tail pages are pinned to the pagevec.
> >
> > Let's drop them before calling invalidate_inode_page().
> > [...]
> > +			/*
> > +			 * invalidate_inode_page() expects
> > +			 * page_count(page) == 2 to drop page from page
> > +			 * cache -- drop tail pages references.
> > +			 */
> > +			get_page(page);
> > +			pagevec_release(&pvec);
>
> I'm not quite sure why this is needed. When you have a multiorder
> entry in the radix tree for your huge page, then you should not get
> more entries in the pagevec for your huge page. What do I miss?

For compatibility reasons find_get_entries() (which is called by
pagevec_lookup_entries()) collects all subpages of a huge page in the
range (head/tails). See patch [07/41].

So a huge page which is fully in the range will be pinned up to
PAGEVEC_SIZE times.
On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > invalidate_inode_page() has an expectation about page_count() of
> > > the page -- if it's not 2 (one to caller, one to radix-tree), it
> > > will not be dropped. That condition is almost never met for THPs
> > > -- tail pages are pinned to the pagevec.
> > > [...]
> >
> > I'm not quite sure why this is needed. When you have a multiorder
> > entry in the radix tree for your huge page, then you should not
> > get more entries in the pagevec for your huge page. What do I
> > miss?
>
> For compatibility reasons find_get_entries() (which is called by
> pagevec_lookup_entries()) collects all subpages of a huge page in
> the range (head/tails). See patch [07/41].
>
> So a huge page which is fully in the range will be pinned up to
> PAGEVEC_SIZE times.

Yeah, I see. But then won't it be cleaner to provide an iteration
method that would add each radix tree entry (regardless of its order)
to the pagevec only once, and then use it in the places where we care,
instead of strange dances like you do here?

Ultimately we could convert all the places to use these new iteration
methods, but I don't see that as immediately necessary, and maybe
there are places where getting all the subpages in the pagevec
actually makes life simpler for us (please point me to such a place if
you know about one).

On a somewhat unrelated note: I've noticed that you don't invalidate a
huge page when only part of it should be invalidated. That actually
breaks some assumptions filesystems make. In particular, the direct IO
code assumes that if you do

	filemap_write_and_wait_range(inode, start, end);
	invalidate_inode_pages2_range(inode, start, end);

all the page cache covering start-end *will* be invalidated. Your
skipping of partial pages breaks this assumption and thus can bring
consistency issues (e.g. a write done using direct IO won't be seen by
a following buffered read).

								Honza
On Wed, Oct 12, 2016 at 08:43:20AM +0200, Jan Kara wrote:
> On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> > On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > > I'm not quite sure why this is needed. When you have a
> > > multiorder entry in the radix tree for your huge page, then you
> > > should not get more entries in the pagevec for your huge page.
> > > What do I miss?
> >
> > For compatibility reasons find_get_entries() (which is called by
> > pagevec_lookup_entries()) collects all subpages of a huge page in
> > the range (head/tails). See patch [07/41].
> >
> > So a huge page which is fully in the range will be pinned up to
> > PAGEVEC_SIZE times.
>
> Yeah, I see. But then won't it be cleaner to provide an iteration
> method that would add each radix tree entry (regardless of its
> order) to the pagevec only once, and then use it in the places where
> we care, instead of strange dances like you do here?

Maybe. It would require doubling the number of find_get_* helpers, or
an additional flag in each. We have too many already. And the
multi-order entries interface for the radix tree has not yet settled
in. I would rather defer such a rework until it is fully shaped. Let's
come back to this later.

> Ultimately we could convert all the places to use these new
> iteration methods, but I don't see that as immediately necessary,
> and maybe there are places where getting all the subpages in the
> pagevec actually makes life simpler for us (please point me to such
> a place if you know about one).

I did it this way so as not to have to evaluate each use of
find_get_*() one-by-one. I guessed that most callers of
find_get_page() would be confused by getting the head page instead of
the relevant subpage. Maybe I was wrong and it would have been easier
to make the callers work with that. I don't know...

> On a somewhat unrelated note: I've noticed that you don't invalidate
> a huge page when only part of it should be invalidated. That
> actually breaks some assumptions filesystems make. In particular,
> the direct IO code assumes that if you do
>
> 	filemap_write_and_wait_range(inode, start, end);
> 	invalidate_inode_pages2_range(inode, start, end);
>
> all the page cache covering start-end *will* be invalidated. Your
> skipping of partial pages breaks this assumption and thus can bring
> consistency issues (e.g. a write done using direct IO won't be seen
> by a following buffered read).

Actually, invalidate_inode_pages2_range() does invalidate the whole
page if part of it is in the range. I caught this problem during
testing.
diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..ce904e4b1708 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			/* 'end' is in the middle of THP */
 			if (index == round_down(end, HPAGE_PMD_NR))
 				continue;
+			/*
+			 * invalidate_inode_page() expects
+			 * page_count(page) == 2 to drop page from page
+			 * cache -- drop tail pages references.
+			 */
+			get_page(page);
+			pagevec_release(&pvec);
 		}
 
 		ret = invalidate_inode_page(page);
 		unlock_page(page);
+
+		if (PageTransHuge(page))
+			put_page(page);
+
 		/*
 		 * Invalidation is a hint that the page is no longer
 		 * of interest and try to speed up its reclaim.
invalidate_inode_page() has an expectation about page_count() of the
page -- if it's not 2 (one to caller, one to radix-tree), it will not
be dropped. That condition is almost never met for THPs -- tail pages
are pinned to the pagevec.

Let's drop them before calling invalidate_inode_page().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/truncate.c | 11 +++++++++++
 1 file changed, 11 insertions(+)