[v3,0/2] mm,thp: Add filemap_huge_fault() for THP

Message ID 20190731082513.16957-1-william.kucharski@oracle.com (mailing list archive)

Message

William Kucharski July 31, 2019, 8:25 a.m. UTC
This set of patches is the first step towards a mechanism for automatically
mapping read-only text areas of appropriate size and alignment to THPs
whenever possible.

For now, the central routine, filemap_huge_fault(), and various support
routines are only included if the experimental kernel configuration option

	RO_EXEC_FILEMAP_HUGE_FAULT_THP

is enabled.

This is because filemap_huge_fault() is dependent upon the
address_space_operations vector readpage() pointing to a routine that will
read and fill an entire large page at a time without polluting the page
cache with PAGESIZE entries for the large page being mapped or performing
readahead that would pollute the page cache entries for succeeding large
pages. Unfortunately, there is no good way to determine how many bytes
were read by readpage(). At present, if filemap_huge_fault() were to call
a conventional readpage() routine, it would only fill the first PAGESIZE
bytes of the large page, which is definitely NOT the desired behavior.

However, by making the code available now it is hoped that filesystem
maintainers who have pledged to provide such a mechanism will do so more
rapidly.

The first part of the patch adds an order field to __page_cache_alloc(),
allowing callers to directly request page cache pages of various sizes.
This code was provided by Matthew Wilcox.

The second part of the patch implements the filemap_huge_fault() mechanism
as described above.

Changes since v2:
1. FGP changes were pulled out to enable submission as an independent
   patch
2. Inadvertent tab spacing and comment changes were reverted

Changes since v1:
1. Fix improperly generated patch for v1 PATCH 1/2

Matthew Wilcox (1):
  mm: Allow the page cache to allocate large pages

William Kucharski (1):
  Add filemap_huge_fault() to attempt to satisfy page faults on
    memory-mapped read-only text pages using THP when possible.

 fs/afs/dir.c            |   2 +-
 fs/btrfs/compression.c  |   2 +-
 fs/cachefiles/rdwr.c    |   4 +-
 fs/ceph/addr.c          |   2 +-
 fs/ceph/file.c          |   2 +-
 include/linux/huge_mm.h |  16 +-
 include/linux/mm.h      |   6 +
 include/linux/pagemap.h |  10 +-
 mm/Kconfig              |  15 ++
 mm/filemap.c            | 320 ++++++++++++++++++++++++++++++++++++++--
 mm/huge_memory.c        |   3 +
 mm/mmap.c               |  36 ++++-
 mm/readahead.c          |   2 +-
 mm/rmap.c               |   8 +
 net/ceph/pagelist.c     |   4 +-
 net/ceph/pagevec.c      |   2 +-
 16 files changed, 401 insertions(+), 33 deletions(-)

Comments

Song Liu July 31, 2019, 8:35 a.m. UTC | #1
> On Jul 31, 2019, at 1:25 AM, William Kucharski <william.kucharski@oracle.com> wrote:
> 
> This set of patches is the first step towards a mechanism for automatically
> mapping read-only text areas of appropriate size and alignment to THPs
> whenever possible.
> 
> For now, the central routine, filemap_huge_fault(), and various support
> routines are only included if the experimental kernel configuration option
> 
> 	RO_EXEC_FILEMAP_HUGE_FAULT_THP
> 
> is enabled.
> 
> This is because filemap_huge_fault() is dependent upon the
> address_space_operations vector readpage() pointing to a routine that will
> read and fill an entire large page at a time without polluting the page
> cache with PAGESIZE entries for the large page being mapped or performing
> readahead that would pollute the page cache entries for succeeding large
> pages. Unfortunately, there is no good way to determine how many bytes
> were read by readpage(). At present, if filemap_huge_fault() were to call
> a conventional readpage() routine, it would only fill the first PAGESIZE
> bytes of the large page, which is definitely NOT the desired behavior.
> 
> However, by making the code available now it is hoped that filesystem
> maintainers who have pledged to provide such a mechanism will do so more
> rapidly.

Could you please explain how to test/try this? Would it automatically map
all executables to THPs? 

Thanks,
Song

William Kucharski July 31, 2019, 8:58 a.m. UTC | #2
On 7/31/19 2:35 AM, Song Liu wrote:

> Could you please explain how to test/try this? Would it automatically map
> all executables to THPs?

Until there is filesystem support you can't actually try this, though I have 
tested it through some hacks during development and am also working on some 
other methods to be able to test this before large page filesystem read support 
is in place.

The end goal is that if enabled, when a fault occurs for an RO executable where 
the faulting address lies within a vma properly aligned/sized for the fault to 
be satisfied by mapping a THP, and the kernel can allocate a THP, the fault WILL 
be satisfied by mapping the THP.
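
A rough userspace sketch of that eligibility test follows. It is not taken
from the patch: struct sketch_vma, sketch_can_map_thp() and the 2 MiB
SKETCH_HPAGE_PMD_SIZE are hypothetical names and values chosen for
illustration, and the requirement that the backing file offset also be huge
page aligned is an assumption about what "properly aligned" would need to
include.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed PMD-sized huge page of 2 MiB (x86-64 with 4 KiB base pages). */
#define SKETCH_HPAGE_PMD_SIZE (UINT64_C(1) << 21)
#define SKETCH_HPAGE_PMD_MASK (~(SKETCH_HPAGE_PMD_SIZE - 1))

/* Hypothetical stand-in for the few vma fields this check cares about. */
struct sketch_vma {
	uint64_t start;      /* first mapped address (inclusive) */
	uint64_t end;        /* end of the mapping (exclusive) */
	uint64_t file_off;   /* file offset mapped at 'start', in bytes */
	bool read_only_exec; /* mapping is a read-only executable one */
};

/* Return true if a fault at 'addr' could be satisfied by one huge page. */
static bool sketch_can_map_thp(const struct sketch_vma *vma, uint64_t addr)
{
	uint64_t haddr, off;

	assert(vma != NULL);
	if (!vma->read_only_exec)
		return false;
	if (addr < vma->start || addr >= vma->end)
		return false;

	/* Round the faulting address down to a huge page boundary. */
	haddr = addr & SKETCH_HPAGE_PMD_MASK;

	/* The whole huge page has to lie inside the vma. */
	if (haddr < vma->start || vma->end - haddr < SKETCH_HPAGE_PMD_SIZE)
		return false;

	/* Assumption: the file offset backing haddr must be aligned too. */
	off = vma->file_off + (haddr - vma->start);
	return (off & (SKETCH_HPAGE_PMD_SIZE - 1)) == 0;
}
```

With a vma covering 0x400000-0x800000 at file offset 0, a fault anywhere in
the range passes; shift the vma start to 0x401000 and every fault fails
either the containment or the offset check.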

It's not expected that all executables nor even all pages of all executables 
would be THP-mapped, just those executables and ranges where alignment and size 
permit. Future optimizations may include fine-tuning these checks to try to 
better determine whether an application would actually benefit from THP mapping.

From some quick and dirty experiments I performed, I've seen that there are a 
surprising number of applications that may end up with THP-mapped pages, 
including Perl, Chrome and Firefox.

However I don't yet know what the actual vs. theoretical benefits would be.

     -- Bill

Dave Chinner July 31, 2019, 10:20 a.m. UTC | #3
On Wed, Jul 31, 2019 at 02:25:11AM -0600, William Kucharski wrote:
> This set of patches is the first step towards a mechanism for automatically
> mapping read-only text areas of appropriate size and alignment to THPs
> whenever possible.
> 
> For now, the central routine, filemap_huge_fault(), and various support
> routines are only included if the experimental kernel configuration option
> 
> 	RO_EXEC_FILEMAP_HUGE_FAULT_THP
> 
> is enabled.
> 
> This is because filemap_huge_fault() is dependent upon the
> address_space_operations vector readpage() pointing to a routine that will
> read and fill an entire large page at a time without polluting the page
> cache with PAGESIZE entries

How is the readpage code supposed to stuff a THP page into a bio?

i.e. Do bio's support huge pages, and if not, what is needed to
stuff a huge page in a bio chain?

Once you can answer that question, you should be able to easily
convert the iomap_readpage/iomap_readpage_actor code to support THP
pages without having to care about much else as iomap_readpage()
is already coded in a way that will iterate IO over the entire THP
for you....

Cheers,

Dave.

Matthew Wilcox (Oracle) July 31, 2019, 11:32 a.m. UTC | #4
On Wed, Jul 31, 2019 at 08:20:53PM +1000, Dave Chinner wrote:
> On Wed, Jul 31, 2019 at 02:25:11AM -0600, William Kucharski wrote:
> > This set of patches is the first step towards a mechanism for automatically
> > mapping read-only text areas of appropriate size and alignment to THPs
> > whenever possible.
> > 
> > For now, the central routine, filemap_huge_fault(), and various support
> > routines are only included if the experimental kernel configuration option
> > 
> > 	RO_EXEC_FILEMAP_HUGE_FAULT_THP
> > 
> > is enabled.
> > 
> > This is because filemap_huge_fault() is dependent upon the
> > address_space_operations vector readpage() pointing to a routine that will
> > read and fill an entire large page at a time without polluting the page
> > cache with PAGESIZE entries
> 
> How is the readpage code supposed to stuff a THP page into a bio?
> 
> i.e. Do bio's support huge pages, and if not, what is needed to
> stuff a huge page in a bio chain?

I believe that the current BIO code (after Ming Lei's multipage patches
from late last year / earlier this year) is capable of handling a
PMD-sized page.

> Once you can answer that question, you should be able to easily
> convert the iomap_readpage/iomap_readpage_actor code to support THP
> pages without having to care about much else as iomap_readpage()
> is already coded in a way that will iterate IO over the entire THP
> for you....

Christoph drafted a patch which illustrates the changes needed to the
iomap code.  The biggest problem is:

struct iomap_page {
        atomic_t                read_count;
        atomic_t                write_count;
        DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
};

All of a sudden that needs to go from a single unsigned long bitmap (or
two on 64kB page size machines) to 512 bytes on x86 and even larger on,
eg, POWER.
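
The arithmetic behind those numbers can be sketched as below. This is a
standalone userspace calculation, not kernel code: sketch_sector_bitmap_bytes()
and SKETCH_SECTOR_SIZE are made-up names, and the rounding up to whole
unsigned longs is meant to mirror what DECLARE_BITMAP() does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_SECTOR_SIZE 512u

/*
 * Bytes of bitmap needed to keep one uptodate bit per 512-byte sector of
 * a page, rounded up to a whole number of unsigned longs.
 */
static size_t sketch_sector_bitmap_bytes(size_t page_size)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
	size_t bits, longs;

	assert(page_size % SKETCH_SECTOR_SIZE == 0);
	bits = page_size / SKETCH_SECTOR_SIZE;
	longs = (bits + bits_per_long - 1) / bits_per_long;
	return longs * sizeof(unsigned long);
}
```

A 4kB page needs 8 bits (one unsigned long), a 64kB page needs 128 bits
(16 bytes), and a 2MB page needs 4096 bits, i.e. the 512 bytes quoted above.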

It's egregious because no sane filesystem is going to fragment a PMD
sized page into that number of discontiguous blocks, so we never need
to allocate the 520 byte data structure this suddenly becomes.  It'd be
nice to have a more efficient data structure (maybe that tracks uptodate
by extent instead of by individual sector?)  But I don't understand the
iomap layer at all, and I never understood buggerheads, so I don't have
a useful contribution here.

Dave Chinner July 31, 2019, 10:19 p.m. UTC | #5
On Wed, Jul 31, 2019 at 04:32:21AM -0700, Matthew Wilcox wrote:
> On Wed, Jul 31, 2019 at 08:20:53PM +1000, Dave Chinner wrote:
> > On Wed, Jul 31, 2019 at 02:25:11AM -0600, William Kucharski wrote:
> > > This set of patches is the first step towards a mechanism for automatically
> > > mapping read-only text areas of appropriate size and alignment to THPs
> > > whenever possible.
> > > 
> > > For now, the central routine, filemap_huge_fault(), and various support
> > > routines are only included if the experimental kernel configuration option
> > > 
> > > 	RO_EXEC_FILEMAP_HUGE_FAULT_THP
> > > 
> > > is enabled.
> > > 
> > > This is because filemap_huge_fault() is dependent upon the
> > > address_space_operations vector readpage() pointing to a routine that will
> > > read and fill an entire large page at a time without polluting the page
> > > cache with PAGESIZE entries
> > 
> > How is the readpage code supposed to stuff a THP page into a bio?
> > 
> > i.e. Do bio's support huge pages, and if not, what is needed to
> > stuff a huge page in a bio chain?
> 
> I believe that the current BIO code (after Ming Lei's multipage patches
> from late last year / earlier this year) is capable of handling a
> PMD-sized page.
> 
> > Once you can answer that question, you should be able to easily
> > convert the iomap_readpage/iomap_readpage_actor code to support THP
> > pages without having to care about much else as iomap_readpage()
> > is already coded in a way that will iterate IO over the entire THP
> > for you....
> 
> Christoph drafted a patch which illustrates the changes needed to the
> iomap code.  The biggest problem is:
> 
> struct iomap_page {
>         atomic_t                read_count;
>         atomic_t                write_count;
>         DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
> };
> 
> All of a sudden that needs to go from a single unsigned long bitmap (or
> two on 64kB page size machines) to 512 bytes on x86 and even larger on,
> eg, POWER.

The struct iomap_page is dynamically allocated, so the bitmap itself
can be sized appropriate to the size of the page the structure is
being allocated for. The current code is simple because we have a
bound PAGE_SIZE so the structure size is always small.

Making it dynamically sized would also reduce the size of the bitmap
because it only needs to track filesystem blocks, not sectors. The
fact it is hard coded means it has to support the worst case of
tracking uptodate state for 512 byte block sizes, hence the "128
bits on 64k pages" static size.

i.e. huge pages on a 4k block size filesystem only require 512
*bits* for a 2MB page, not 512 * 8 bits.  And when I get back to the
64k block size on 4k page size support for XFS+iomap, that will go
down even further. i.e. the huge page will only have to track 32
filesystem blocks, not 512, and we're back to fitting in the
existing static iomap_page....
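
To make the numbers concrete, here is a tiny userspace sketch of the
per-block calculation. sketch_blocks_per_page() is a hypothetical helper,
not an iomap function; it only assumes one uptodate bit per filesystem block.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of uptodate bits needed when tracking one bit per filesystem
 * block rather than one bit per 512-byte sector.
 */
static size_t sketch_blocks_per_page(size_t page_size, size_t block_size)
{
	assert(block_size != 0 && page_size % block_size == 0);
	return page_size / block_size;
}
```

For a 2MB page this gives 512 bits at a 4k block size and 32 bits at a 64k
block size; only the 512 byte block size case reaches the full 4096 bits.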

So, yeah, I think the struct iomap_page needs to be dynamically
sized to support 2MB (or larger) pages effectively.

/me wonders what is necessary for page invalidation to work
correctly for these huge pages. e.g. someone does a direct IO
write to a range within a cached read only huge page....

Which reminds me, I bet there are assumptions in some of the iomap
code (or surrounding filesystem code) that assume if filesystem
block size = PAGE_SIZE there will be no iomap_page attached to the
page. And that if there is an iomap_page attached, then the block
size is < PAGE_SIZE. And don't make assumptions about block size
being <= PAGE_SIZE, as I have a patchset to support block size >
PAGE_SIZE for the iomap and XFS code which I'll be getting back to
Real Soon.

> It's egregious because no sane filesystem is going to fragment a PMD
> sized page into that number of discontiguous blocks,

It's not whether a sane filesystem will do that, the reality is that
it can happen and so it needs to work. Anyone using 512 byte block
size filesystems and expecting PMD sized pages to be *efficient* has
rocks in their head. We just need to make it work.

> so we never need
> to allocate the 520 byte data structure this suddenly becomes.  It'd be
> nice to have a more efficient data structure (maybe that tracks uptodate
> by extent instead of by individual sector?)

Extents can still get fragmented, and we have to support the worst
case fragmentation that can occur. Which is single filesystem
blocks. And that fragmentation can change during the life of the
page (punch out blocks, allocate different ones, COW, etc) so we
have to allocate the worst case up front even if we rarely (if
ever!) need it.

> But I don't understand the
> iomap layer at all, and I never understood buggerheads, so I don't have
> a useful contribution here.

iomap is a whole lot easier - the only thing we need to track at the
"page cache" level is which parts of the page contain valid data and
that's what the struct iomap_page is for when more than one bit of
uptodate information needs to be stored. The iomap infrastructure
does everything else through the filesystem and so only requires the
caching layer to track the valid data ranges in each page...

IOWs, all we need to worry about for PMD faults in iomap is getting
the page sizes right, iterating IO ranges to fill/write back full
PMD pages and tracking uptodate state in the page on a filesystem
block granularity. Everything else should just work....
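
As a rough illustration of block-granularity tracking with a dynamically
sized structure, a userspace sketch might look like the following. None of
this is the real iomap code: struct sketch_iomap_page and the sketch_iop_*()
helpers are hypothetical, calloc() stands in for a kernel allocator, and the
read/write counters of the real struct iomap_page are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-page state: one uptodate bit per filesystem block. */
struct sketch_iomap_page {
	size_t nr_blocks;         /* filesystem blocks covered by the page */
	unsigned long uptodate[]; /* bitmap sized when the struct is allocated */
};

/* Allocate state for a page of page_size bytes with block_size blocks. */
static struct sketch_iomap_page *
sketch_iop_alloc(size_t page_size, size_t block_size)
{
	struct sketch_iomap_page *iop;
	size_t nr_blocks, nr_longs;

	assert(block_size != 0 && page_size % block_size == 0);
	nr_blocks = page_size / block_size;
	nr_longs = (nr_blocks + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	/* calloc() zeroes the bitmap, so no block starts out uptodate. */
	iop = calloc(1, sizeof(*iop) + nr_longs * sizeof(unsigned long));
	if (iop)
		iop->nr_blocks = nr_blocks;
	return iop;
}

/* Mark 'count' blocks starting at block 'first' as containing valid data. */
static void sketch_iop_set_range_uptodate(struct sketch_iomap_page *iop,
					  size_t first, size_t count)
{
	assert(first <= iop->nr_blocks && count <= iop->nr_blocks - first);
	for (size_t i = first; i < first + count; i++)
		iop->uptodate[i / SKETCH_BITS_PER_LONG] |=
			1UL << (i % SKETCH_BITS_PER_LONG);
}

/* The page as a whole is uptodate only once every block is. */
static bool sketch_iop_fully_uptodate(const struct sketch_iomap_page *iop)
{
	for (size_t i = 0; i < iop->nr_blocks; i++)
		if (!(iop->uptodate[i / SKETCH_BITS_PER_LONG] &
		      (1UL << (i % SKETCH_BITS_PER_LONG))))
			return false;
	return true;
}
```

Allocating for a 2MB page with 4k blocks yields a 512-bit bitmap; filling
it in two 256-block reads leaves the page fully uptodate only after the
second range is marked.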

Cheers,

Dave.