Message ID: 20190731082513.16957-1-william.kucharski@oracle.com
Series: mm,thp: Add filemap_huge_fault() for THP
> On Jul 31, 2019, at 1:25 AM, William Kucharski <william.kucharski@oracle.com> wrote:
>
> This set of patches is the first step towards a mechanism for automatically
> mapping read-only text areas of appropriate size and alignment to THPs
> whenever possible.
>
> For now, the central routine, filemap_huge_fault(), and various support
> routines are only included if the experimental kernel configuration option
>
>     RO_EXEC_FILEMAP_HUGE_FAULT_THP
>
> is enabled.
>
> This is because filemap_huge_fault() is dependent upon the
> address_space_operations vector readpage() pointing to a routine that will
> read and fill an entire large page at a time without polluting the page
> cache with PAGESIZE entries for the large page being mapped or performing
> readahead that would pollute the page cache entries for succeeding large
> pages. Unfortunately, there is no good way to determine how many bytes
> were read by readpage(). At present, if filemap_huge_fault() were to call
> a conventional readpage() routine, it would only fill the first PAGESIZE
> bytes of the large page, which is definitely NOT the desired behavior.
>
> However, by making the code available now it is hoped that filesystem
> maintainers who have pledged to provide such a mechanism will do so more
> rapidly.

Could you please explain how to test/try this? Would it automatically map
all executables to THPs?

Thanks,
Song
On 7/31/19 2:35 AM, Song Liu wrote:

> Could you please explain how to test/try this? Would it automatically map
> all executables to THPs?

Until there is filesystem support you can't actually try this, though I
have tested it through some hacks during development and am also working
on some other methods to be able to test this before large page filesystem
read support is in place.

The end goal is that if enabled, when a fault occurs for an RO executable
where the faulting address lies within a vma properly aligned/sized for
the fault to be satisfied by mapping a THP, and the kernel can allocate a
THP, the fault WILL be satisfied by mapping the THP.

It's not expected that all executables, or even all pages of all
executables, would be THP-mapped, just those executables and ranges where
alignment and size permit. Future optimizations may include fine-tuning
these checks to try to better determine whether an application would
actually benefit from THP mapping.

From some quick and dirty experiments I performed, I've seen that there
are a surprising number of applications that may end up with THP-mapped
pages, including Perl, Chrome and Firefox. However, I don't yet know what
the actual vs. theoretical benefits would be.

-- Bill
On Wed, Jul 31, 2019 at 02:25:11AM -0600, William Kucharski wrote:
> This set of patches is the first step towards a mechanism for automatically
> mapping read-only text areas of appropriate size and alignment to THPs
> whenever possible.
>
> For now, the central routine, filemap_huge_fault(), and various support
> routines are only included if the experimental kernel configuration option
>
>     RO_EXEC_FILEMAP_HUGE_FAULT_THP
>
> is enabled.
>
> This is because filemap_huge_fault() is dependent upon the
> address_space_operations vector readpage() pointing to a routine that will
> read and fill an entire large page at a time without polluting the page
> cache with PAGESIZE entries

How is the readpage code supposed to stuff a THP page into a bio?

i.e. Do bios support huge pages, and if not, what is needed to stuff a
huge page in a bio chain?

Once you can answer that question, you should be able to easily convert
the iomap_readpage/iomap_readpage_actor code to support THP pages without
having to care about much else, as iomap_readpage() is already coded in a
way that will iterate IO over the entire THP for you....

Cheers,

Dave.
On Wed, Jul 31, 2019 at 08:20:53PM +1000, Dave Chinner wrote:
> On Wed, Jul 31, 2019 at 02:25:11AM -0600, William Kucharski wrote:
> > This set of patches is the first step towards a mechanism for automatically
> > mapping read-only text areas of appropriate size and alignment to THPs
> > whenever possible.
> >
> > For now, the central routine, filemap_huge_fault(), and various support
> > routines are only included if the experimental kernel configuration option
> >
> >     RO_EXEC_FILEMAP_HUGE_FAULT_THP
> >
> > is enabled.
> >
> > This is because filemap_huge_fault() is dependent upon the
> > address_space_operations vector readpage() pointing to a routine that will
> > read and fill an entire large page at a time without polluting the page
> > cache with PAGESIZE entries
>
> How is the readpage code supposed to stuff a THP page into a bio?
>
> i.e. Do bios support huge pages, and if not, what is needed to stuff a
> huge page in a bio chain?

I believe that the current BIO code (after Ming Lei's multipage patches
from late last year / earlier this year) is capable of handling a
PMD-sized page.

> Once you can answer that question, you should be able to easily convert
> the iomap_readpage/iomap_readpage_actor code to support THP pages without
> having to care about much else, as iomap_readpage() is already coded in a
> way that will iterate IO over the entire THP for you....

Christoph drafted a patch which illustrates the changes needed to the
iomap code. The biggest problem is:

	struct iomap_page {
		atomic_t read_count;
		atomic_t write_count;
		DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
	};

All of a sudden that needs to go from a single unsigned long bitmap (or
two on 64kB page size machines) to 512 bytes on x86 and even larger on,
eg, POWER. It's egregious because no sane filesystem is going to fragment
a PMD sized page into that number of discontiguous blocks, so we never
need to allocate the 520 byte data structure this suddenly becomes.

It'd be nice to have a more efficient data structure (maybe that tracks
uptodate by extent instead of by individual sector?) But I don't
understand the iomap layer at all, and I never understood buggerheads, so
I don't have a useful contribution here.
On Wed, Jul 31, 2019 at 04:32:21AM -0700, Matthew Wilcox wrote:
> On Wed, Jul 31, 2019 at 08:20:53PM +1000, Dave Chinner wrote:
> > On Wed, Jul 31, 2019 at 02:25:11AM -0600, William Kucharski wrote:
> > > [...]
> >
> > How is the readpage code supposed to stuff a THP page into a bio?
> >
> > i.e. Do bios support huge pages, and if not, what is needed to stuff a
> > huge page in a bio chain?
>
> I believe that the current BIO code (after Ming Lei's multipage patches
> from late last year / earlier this year) is capable of handling a
> PMD-sized page.
>
> > Once you can answer that question, you should be able to easily convert
> > the iomap_readpage/iomap_readpage_actor code to support THP pages without
> > having to care about much else, as iomap_readpage() is already coded in a
> > way that will iterate IO over the entire THP for you....
>
> Christoph drafted a patch which illustrates the changes needed to the
> iomap code. The biggest problem is:
>
> 	struct iomap_page {
> 		atomic_t read_count;
> 		atomic_t write_count;
> 		DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
> 	};
>
> All of a sudden that needs to go from a single unsigned long bitmap (or
> two on 64kB page size machines) to 512 bytes on x86 and even larger on,
> eg, POWER.

The struct iomap_page is dynamically allocated, so the bitmap itself can
be sized appropriately for the size of the page the structure is being
allocated for. The current code is simple because we have a bound
PAGE_SIZE, so the structure size is always small.

Making it dynamically sized would also reduce the size of the bitmap,
because it only needs to track filesystem blocks, not sectors. The fact
it is hard coded means it has to support the worst case of tracking
uptodate state for 512 byte block sizes, hence the "128 bits on 64k
pages" static size.

i.e. huge pages on a 4k block size filesystem only require 512 *bits*
for a 2MB page, not 512 * 8 bits. And when I get back to the 64k block
size on 4k page size support for XFS+iomap, that will go down even
further. i.e. the huge page will only have to track 32 filesystem
blocks, not 512, and we're back to fitting in the existing static
iomap_page....

So, yeah, I think the struct iomap_page needs to be dynamically sized to
support 2MB (or larger) pages effectively.

/me wonders what is necessary for page invalidation to work correctly
for these huge pages, e.g. someone does a direct IO write to a range
within a cached read-only huge page....

Which reminds me, I bet there are assumptions in some of the iomap code
(or surrounding filesystem code) that if the filesystem block size equals
PAGE_SIZE there will be no iomap_page attached to the page, and that if
there is an iomap_page attached, then the block size is < PAGE_SIZE.

And don't make assumptions about block size being <= PAGE_SIZE, as I
have a patchset to support block size > PAGE_SIZE for the iomap and XFS
code which I'll be getting back to Real Soon.

> It's egregious because no sane filesystem is going to fragment a PMD
> sized page into that number of discontiguous blocks,

It's not whether a sane filesystem will do that; the reality is that it
can happen, and so it needs to work. Anyone using 512 byte block size
filesystems and expecting PMD-sized pages to be *efficient* has rocks in
their head. We just need to make it work.

> so we never need to allocate the 520 byte data structure this suddenly
> becomes. It'd be nice to have a more efficient data structure (maybe
> that tracks uptodate by extent instead of by individual sector?)

Extents can still get fragmented, and we have to support the worst case
fragmentation that can occur, which is single filesystem blocks. And that
fragmentation can change during the life of the page (punch out blocks,
allocate different ones, COW, etc), so we have to allocate the worst case
up front even if we rarely (if ever!) need it.

> But I don't understand the iomap layer at all, and I never understood
> buggerheads, so I don't have a useful contribution here.

iomap is a whole lot easier - the only thing we need to track at the
"page cache" level is which parts of the page contain valid data, and
that's what the struct iomap_page is for when more than one bit of
uptodate information needs to be stored. The iomap infrastructure does
everything else through the filesystem and so only requires the caching
layer to track the valid data ranges in each page...

IOWs, all we need to worry about for PMD faults in iomap is getting the
page sizes right, iterating IO ranges to fill/write back full PMD pages,
and tracking uptodate state in the page on a filesystem block
granularity. Everything else should just work....

Cheers,

Dave.