[v2,0/3] Enable THP for text section of non-shmem files

Message ID 20190614182204.2673660-1-songliubraving@fb.com (mailing list archive)

Message

Song Liu June 14, 2019, 6:22 p.m. UTC
This set follows up on the discussion at LSF/MM 2019. The motivation is to put
the text section of an application in THP, and thus reduce the iTLB miss rate
and improve performance. Both Facebook and Oracle showed strong interest in
this feature.

To make reviews easier, this set aims for a minimal viable product. The current
version of the work does not change any file-system-specific code. This comes
with some limitations (discussed later).

This set enables an application to "hugify" its text section by simply
running something like:

          madvise(0x600000, 0x800000, MADV_HUGEPAGE);

Before this call, the /proc/<pid>/maps looks like:

    00400000-074d0000 r-xp 00000000 00:27 2006927     app

After this call, part of the text section is split out and mapped to THP:

    00400000-00425000 r-xp 00000000 00:27 2006927     app
    00600000-00e00000 r-xp 00200000 00:27 2006927     app   <<< on THP
    00e00000-074d0000 r-xp 00a00000 00:27 2006927     app
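
As an illustration of the call above, here is a minimal user-space sketch that
trims an address range to huge-page boundaries before passing it to madvise().
It assumes 2 MiB PMD-sized huge pages (as on x86-64) and assumes the range
should be huge-page aligned, which matches the addresses in the example but is
not stated in this cover letter. The helper names hpage_align() and hugify()
are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2 MiB PMD-sized huge pages (x86-64). */
#define HPAGE_SIZE (2UL << 20)

/*
 * Shrink [start, end) to the largest huge-page-aligned sub-range.
 * Stores the aligned start in *out_start and returns the length in
 * bytes, or 0 if no aligned huge page fits inside the range.
 */
static size_t hpage_align(uintptr_t start, uintptr_t end, uintptr_t *out_start)
{
    uintptr_t s = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1); /* round up */
    uintptr_t e = end & ~(HPAGE_SIZE - 1);                      /* round down */

    *out_start = s;
    return e > s ? (size_t)(e - s) : 0;
}

/*
 * Ask the kernel to back the aligned part of [start, end) with THP.
 * Returns -1 if the range holds no aligned huge page; otherwise
 * returns whatever madvise() returns (0 on success, -1 with errno set).
 */
static int hugify(uintptr_t start, uintptr_t end)
{
    uintptr_t s;
    size_t len = hpage_align(start, end, &s);

    if (len == 0)
        return -1;
    return madvise((void *)s, len, MADV_HUGEPAGE);
}
```

With the addresses from the maps output above, hpage_align(0x425000, 0xe00000,
&s) yields s = 0x600000 and a length of 0x800000, which is the region marked
"on THP".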

Limitations:

1. This only works for the text section (VMAs with VM_DENYWRITE).
2. Once the application puts its own pages in THP, the file becomes read-only.
   open(file, O_WRONLY) will fail with -ETXTBSY. To modify/update the file,
   it must be removed first. Here is an example case:

    root@virt-test:~/# ./app hugify
    ^C

    root@virt-test:~/# dd if=/dev/zero of=./app bs=1k count=2
    dd: failed to open './app': Text file busy

    root@virt-test:~/# cp app.backup app
    cp: cannot create regular file 'app': Text file busy

    root@virt-test:~/# rm app
    root@virt-test:~/# cp app.backup app
    root@virt-test:~/#
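
The same check can be done programmatically. Below is a small sketch, not part
of this series, that probes whether a file can currently be opened for write
and distinguishes the "Text file busy" case from other failures. The name
probe_write_open() and the enum values are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum write_probe {
    WRITE_OK = 0,          /* open for write succeeded */
    WRITE_TEXT_BUSY = 1,   /* open failed with ETXTBSY */
    WRITE_OTHER_ERROR = 2  /* open failed for another reason */
};

/*
 * Try to open path for writing without truncating or modifying it,
 * and classify the result. The descriptor is closed immediately.
 */
static enum write_probe probe_write_open(const char *path)
{
    int fd = open(path, O_WRONLY);

    if (fd >= 0) {
        close(fd);
        return WRITE_OK;
    }
    return errno == ETXTBSY ? WRITE_TEXT_BUSY : WRITE_OTHER_ERROR;
}
```

A deployment tool could use such a probe to decide whether to fall back to the
remove-then-copy sequence shown above.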

We gated this feature with an experimental config, READ_ONLY_THP_FOR_FS.
Once we get better support on the write path, we can remove the config and
enable it by default.

Tested cases:
1. Tested with btrfs and ext4.
2. Tested with a real-world application (a memcache-like caching service).
3. Tested with "THP aware uprobe":
   https://patchwork.kernel.org/project/linux-mm/list/?series=131339

Please share your comments and suggestions on this.

Thanks!

Changes v1 => v2:
1. Fixed a missing mem_cgroup_commit_charge() for non-shmem case.

Song Liu (3):
  mm: check compound_head(page)->mapping in filemap_fault()
  mm,thp: stats for file backed THP
  mm,thp: add read-only THP support for (non-shmem) FS

 fs/proc/meminfo.c      |   4 ++
 include/linux/fs.h     |   8 ++++
 include/linux/mmzone.h |   2 +
 mm/Kconfig             |  11 +++++
 mm/filemap.c           |   7 +--
 mm/khugepaged.c        | 106 +++++++++++++++++++++++++++++++++--------
 mm/rmap.c              |  12 +++--
 mm/vmstat.c            |   2 +
 8 files changed, 125 insertions(+), 27 deletions(-)

--
2.17.1

Comments

Andrew Morton June 18, 2019, 9:12 p.m. UTC | #1
On Fri, 14 Jun 2019 11:22:01 -0700 Song Liu <songliubraving@fb.com> wrote:

> Limitations:
> 
> 1. This only works for the text section (VMAs with VM_DENYWRITE).
> 2. Once the application puts its own pages in THP, the file becomes read-only.
>    open(file, O_WRONLY) will fail with -ETXTBSY. To modify/update the file,
>    it must be removed first.

Removed?  Even if the original mmap/madvise has gone away?  hm.

I'm wondering if this limitation can be abused in some fashion: mmap a
file to which you have read permissions, run madvise(MADV_HUGEPAGE) and
thus prevent the file's owner from being able to modify the file?  Or
something like that.  What are the issues and protections here?

Song Liu June 18, 2019, 9:48 p.m. UTC | #2
> On Jun 18, 2019, at 2:12 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Fri, 14 Jun 2019 11:22:01 -0700 Song Liu <songliubraving@fb.com> wrote:
> 
>> Limitations:
>> 
>> 1. This only works for the text section (VMAs with VM_DENYWRITE).
>> 2. Once the application puts its own pages in THP, the file becomes read-only.
>>   open(file, O_WRONLY) will fail with -ETXTBSY. To modify/update the file,
>>   it must be removed first.
> 
> Removed?  Even if the original mmap/madvise has gone away?  hm.

Yeah, it is not ideal. The THP holds a negative count on i_mmap_writable,
so the file cannot be opened for write.

> 
> I'm wondering if this limitation can be abused in some fashion: mmap a
> file to which you have read permissions, run madvise(MADV_HUGEPAGE) and
> thus prevent the file's owner from being able to modify the file?  Or
> something like that.  What are the issues and protections here?

In this case, the owner needs to make a copy of the file, and then remove
and replace the original file.

In this version, we wanted to either split the huge page on writes or fail
the write when we cannot split. However, the huge page information is only
available at page level, and on the write path, page-level information
is not available until write_begin(), so it is hard to stop writes at
an earlier stage. Therefore, in this version, we leverage i_mmap_writable,
which is at address_space level, so it is easier to stop writes to the
file.
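
To make the counter semantics concrete, here is a user-space model of a signed
counter of this kind: a positive value counts writable users, and a negative
value means writes are denied. This is only a sketch and not kernel code. The
names mapping_model and model_*() are hypothetical; the kernel helpers
mapping_map_writable() and mapping_deny_writable() in include/linux/fs.h follow
a similar inc-unless-negative / dec-unless-positive pattern, and the sketch
does not reproduce how this series uses the field.

```c
#include <errno.h>
#include <stdatomic.h>

/* Model of a counter: > 0 means writable users, < 0 means writes denied. */
struct mapping_model {
    atomic_int writable;
};

/* Increment *v unless it is negative; returns 1 if incremented, else 0. */
static int inc_unless_negative(atomic_int *v)
{
    int c = atomic_load(v);

    while (c >= 0) {
        /* On failure, c is reloaded with the current value and rechecked. */
        if (atomic_compare_exchange_weak(v, &c, c + 1))
            return 1;
    }
    return 0;
}

/* Decrement *v unless it is positive; returns 1 if decremented, else 0. */
static int dec_unless_positive(atomic_int *v)
{
    int c = atomic_load(v);

    while (c <= 0) {
        if (atomic_compare_exchange_weak(v, &c, c - 1))
            return 1;
    }
    return 0;
}

/* A writer registers itself; fails with -EPERM while writes are denied. */
static int model_map_writable(struct mapping_model *m)
{
    return inc_unless_negative(&m->writable) ? 0 : -EPERM;
}

/* A user denies writes; fails with -EBUSY while writers exist. */
static int model_deny_writable(struct mapping_model *m)
{
    return dec_unless_positive(&m->writable) ? 0 : -EBUSY;
}

/* Drop one earlier denial. */
static void model_allow_writable(struct mapping_model *m)
{
    atomic_fetch_add(&m->writable, 1);
}
```

In this model, a holder that never drops its denial keeps every later writer
out, which is the behavior questioned in this thread.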

This behavior is temporary and gated by the config, so I guess it is OK.
It works well for our use cases. Once we have better write support, we
can remove the limitation.

If this is too weird, I am also open to suggestions. 

Thanks,
Song

Song Liu June 19, 2019, 6:26 a.m. UTC | #3
> On Jun 18, 2019, at 2:12 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Fri, 14 Jun 2019 11:22:01 -0700 Song Liu <songliubraving@fb.com> wrote:
> 
>> Limitations:
>> 
>> 1. This only works for the text section (VMAs with VM_DENYWRITE).
>> 2. Once the application puts its own pages in THP, the file becomes read-only.
>>   open(file, O_WRONLY) will fail with -ETXTBSY. To modify/update the file,
>>   it must be removed first.
> 
> Removed?  Even if the original mmap/madvise has gone away?  hm.
> 
> I'm wondering if this limitation can be abused in some fashion: mmap a
> file to which you have read permissions, run madvise(MADV_HUGEPAGE) and
> thus prevent the file's owner from being able to modify the file?  Or
> something like that.  What are the issues and protections here?
> 

I found a better solution to this limitation. Please refer to changes
in v3 (especially 6/6). 

Thanks,
Song

Andrew Morton June 20, 2019, 1:13 a.m. UTC | #4
On Tue, 18 Jun 2019 21:48:16 +0000 Song Liu <songliubraving@fb.com> wrote:

> > I'm wondering if this limitation can be abused in some fashion: mmap a
> > file to which you have read permissions, run madvise(MADV_HUGEPAGE) and
> > thus prevent the file's owner from being able to modify the file?  Or
> > something like that.  What are the issues and protections here?
> 
> In this case, the owner needs to make a copy of the file, and then remove
> and replace the original file.
>
> In this version, we wanted to either split the huge page on writes or fail
> the write when we cannot split. However, the huge page information is only
> available at page level, and on the write path, page-level information
> is not available until write_begin(), so it is hard to stop writes at
> an earlier stage. Therefore, in this version, we leverage i_mmap_writable,
> which is at address_space level, so it is easier to stop writes to the
> file.
>
> This behavior is temporary and gated by the config, so I guess it is OK.
> It works well for our use cases. Once we have better write support, we
> can remove the limitation.
> 
> If this is too weird, I am also open to suggestions. 

Well, it's more than weird?  This permits user A to deny service to
user B?  User A can, maliciously or accidentally, prevent user B from
modifying a file which user B has permission to modify?  Such as, umm,
/etc/hosts?

Song Liu June 20, 2019, 2:04 a.m. UTC | #5
> On Jun 19, 2019, at 6:13 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Tue, 18 Jun 2019 21:48:16 +0000 Song Liu <songliubraving@fb.com> wrote:
> 
> Well, it's more than weird?  This permits user A to deny service to
> user B?  User A can, maliciously or accidentally, prevent user B from
> modifying a file which user B has permission to modify?  Such as, umm,
> /etc/hosts?

I have removed this behavior in v3. I think we really don't need this. 

Thanks,
Song