
[RFC] fs: introduce ST_HUGE flag and set it to tmpfs and hugetlbfs

Message ID 1523999293-94152-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived

Commit Message

Yang Shi April 17, 2018, 9:08 p.m. UTC
tmpfs THP has been supported since 4.8, so hugetlbfs is no longer the
only filesystem with huge page support. tmpfs can use huge pages via
THP when mounted with the "huge=" mount option.

When applications use huge pages on hugetlbfs, they just need to check
the filesystem magic number, but that is not enough for tmpfs. So,
introduce an ST_HUGE flag for statfs, reported when the super block has
SB_HUGE set, which indicates that huge pages are supported on that
filesystem.

Some applications could benefit from this change, for example QEMU.
When using an mmap'ed file as guest VM backing memory, QEMU typically
mmaps the file size plus one extra page. If the file is on hugetlbfs the
extra page is of huge page size (e.g. 2MB), but it is still 4KB on tmpfs
even though THP is enabled. tmpfs THP requires the VMA to be huge page
aligned, so if a 4KB page is used THP will not be used at all. The
/proc/meminfo fragment below shows the THP use of QEMU with a 4K page:

ShmemHugePages:   679936 kB
ShmemPmdMapped:        0 kB

With the ST_HUGE flag, QEMU can get huge pages, and /proc/meminfo then
looks like:

ShmemHugePages:    77824 kB
ShmemPmdMapped:     6144 kB

With this flag, applications can tell whether huge pages are supported
on the filesystem and optimize their behavior accordingly. Although
similar functionality can be implemented in applications by traversing
the mount options, it looks more convenient if the kernel provides such
a flag.

Even though ST_HUGE is set, f_bsize still returns 4KB for tmpfs since
a THP could be split, and it may also fall back to 4KB pages silently
if there are not enough huge pages.

Also set the flag for hugetlbfs to keep things consistent, so that
applications don't have to know which filesystem is in use in order to
use huge pages; they just need to check the ST_HUGE flag.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
---
 fs/hugetlbfs/inode.c   | 1 +
 fs/statfs.c            | 2 ++
 include/linux/fs.h     | 1 +
 include/linux/statfs.h | 1 +
 mm/shmem.c             | 8 ++++++++
 5 files changed, 13 insertions(+)
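
For illustration, a userspace check for the proposed flag might look like
the sketch below. It is only a sketch: ST_HUGE is not in any released
header, so the value 0x2000 is copied from the include/linux/statfs.h hunk
of this patch, and the helper names flags_have_st_huge() and
fs_reports_huge() are made up for the example. On a kernel without this
patch the bit is either clear or may carry a different meaning.

```c
#include <stdbool.h>
#include <sys/vfs.h>	/* statfs(2), struct statfs */

/* Assumption: value taken from this RFC patch, not from a released header. */
#ifndef ST_HUGE
#define ST_HUGE 0x2000
#endif

/* Pure helper: true if the proposed bit is set in an f_flags value. */
static bool flags_have_st_huge(unsigned long f_flags)
{
	return (f_flags & ST_HUGE) != 0;
}

/*
 * Returns 1 if the filesystem holding `path` reports the ST_HUGE bit,
 * 0 if it does not, and -1 if statfs() fails.
 */
static int fs_reports_huge(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;
	return flags_have_st_huge((unsigned long)sfs.f_flags) ? 1 : 0;
}
```

An application such as the QEMU case described above would call
fs_reports_huge() on the backing file's path and, on 1, pick a huge page
sized alignment instead of 4KB.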

Comments

Andrew Morton April 17, 2018, 9:31 p.m. UTC | #1
On Wed, 18 Apr 2018 05:08:13 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
> filesystem with huge page support anymore. tmpfs can use huge page via
> THP when mounting by "huge=" mount option.
> 
> When applications use huge page on hugetlbfs, it just need check the
> filesystem magic number, but it is not enough for tmpfs. So, introduce
> ST_HUGE flag to statfs if super block has SB_HUGE set which indicates
> huge page is supported on the specific filesystem.
> 
> Some applications could benefit from this change, for example QEMU.
> When use mmap file as guest VM backend memory, QEMU typically mmap the
> file size plus one extra page. If the file is on hugetlbfs the extra
> page is huge page size (i.e. 2MB), but it is still 4KB on tmpfs even
> though THP is enabled. tmpfs THP requires VMA is huge page aligned, so
> if 4KB page is used THP will not be used at all. The below /proc/meminfo
> fragment shows the THP use of QEMU with 4K page:
> 
> ShmemHugePages:   679936 kB
> ShmemPmdMapped:        0 kB
> 
> With ST_HUGE flag, QEMU can get huge page, then /proc/meminfo looks
> like:
> 
> ShmemHugePages:    77824 kB
> ShmemPmdMapped:     6144 kB
> 
> With this flag, the applications can know if huge page is supported on
> the filesystem then optimize the behavior of the applications
> accordingly. Although the similar function can be implemented in
> applications by traversing the mount options, it looks more convenient
> if kernel can provide such flag.
> 
> Even though ST_HUGE is set, f_bsize still returns 4KB for tmpfs since
> THP could be split, and it also may fall back to 4KB page silently if
> there is not enough huge page.
> 
> And, set the flag for hugetlbfs as well to keep the consistency, and the
> applications don't have to know what filesystem is used to use huge
> page, just need to check ST_HUGE flag.
> 

Patch is simple enough, although I'm having trouble forming an opinion
about it ;)

It will call for an update to the statfs(2) manpage.  I'm not sure
which of linux-man@vger.kernel.org, mtk.manpages@gmail.com and
linux-api@vger.kernel.org is best for that, so I'd cc all three...
Yang Shi April 17, 2018, 9:51 p.m. UTC | #2
On 4/17/18 2:31 PM, Andrew Morton wrote:
> On Wed, 18 Apr 2018 05:08:13 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>> Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
>> filesystem with huge page support anymore. tmpfs can use huge page via
>> THP when mounting by "huge=" mount option.
>>
>> When applications use huge page on hugetlbfs, it just need check the
>> filesystem magic number, but it is not enough for tmpfs. So, introduce
>> ST_HUGE flag to statfs if super block has SB_HUGE set which indicates
>> huge page is supported on the specific filesystem.
>>
>> Some applications could benefit from this change, for example QEMU.
>> When use mmap file as guest VM backend memory, QEMU typically mmap the
>> file size plus one extra page. If the file is on hugetlbfs the extra
>> page is huge page size (i.e. 2MB), but it is still 4KB on tmpfs even
>> though THP is enabled. tmpfs THP requires VMA is huge page aligned, so
>> if 4KB page is used THP will not be used at all. The below /proc/meminfo
>> fragment shows the THP use of QEMU with 4K page:
>>
>> ShmemHugePages:   679936 kB
>> ShmemPmdMapped:        0 kB
>>
>> With ST_HUGE flag, QEMU can get huge page, then /proc/meminfo looks
>> like:
>>
>> ShmemHugePages:    77824 kB
>> ShmemPmdMapped:     6144 kB
>>
>> With this flag, the applications can know if huge page is supported on
>> the filesystem then optimize the behavior of the applications
>> accordingly. Although the similar function can be implemented in
>> applications by traversing the mount options, it looks more convenient
>> if kernel can provide such flag.
>>
>> Even though ST_HUGE is set, f_bsize still returns 4KB for tmpfs since
>> THP could be split, and it also may fall back to 4KB page silently if
>> there is not enough huge page.
>>
>> And, set the flag for hugetlbfs as well to keep the consistency, and the
>> applications don't have to know what filesystem is used to use huge
>> page, just need to check ST_HUGE flag.
>>
> Patch is simple enough, although I'm having trouble forming an opinion
> about it ;)
>
> It will call for an update to the statfs(2) manpage.  I'm not sure
> which of linux-man@vger.kernel.org, mtk.manpages@gmail.com and
> linux-api@vger.kernel.org is best for that, so I'd cc all three...

Thanks, Andrew. Added cc to those 3 lists.
Matthew Wilcox April 17, 2018, 11:22 p.m. UTC | #3
On Wed, Apr 18, 2018 at 05:08:13AM +0800, Yang Shi wrote:
> When applications use huge page on hugetlbfs, it just need check the
> filesystem magic number, but it is not enough for tmpfs. So, introduce
> ST_HUGE flag to statfs if super block has SB_HUGE set which indicates
> huge page is supported on the specific filesystem.

Hm.  What's the plan for communicating support for page sizes other
than PMD page sizes?  I know ARM has several different page sizes,
as do PA-RISC and ia64.  Even x86 might support 1G page sizes through
tmpfs one day.
Yang Shi April 17, 2018, 11:37 p.m. UTC | #4
On 4/17/18 4:22 PM, Matthew Wilcox wrote:
> On Wed, Apr 18, 2018 at 05:08:13AM +0800, Yang Shi wrote:
>> When applications use huge page on hugetlbfs, it just need check the
>> filesystem magic number, but it is not enough for tmpfs. So, introduce
>> ST_HUGE flag to statfs if super block has SB_HUGE set which indicates
>> huge page is supported on the specific filesystem.
> Hm.  What's the plan for communicating support for page sizes other
> than PMD page sizes?  I know ARM has several different page sizes,
> as do PA-RISC and ia64.  Even x86 might support 1G page sizes through
> tmpfs one day.

For the THP page size, we already have
/sys/kernel/mm/transparent_hugepage/hpage_pmd_size exported.
Applications can read this to get the THP size. If PUD-size THP
support is added later, we can export hpage_pud_size.

Please see the below commit log for more details:

commit 49920d28781dcced10cd30cb9a938e7d045a1c94
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Dec 12 16:44:50 2016 -0800

     mm: make transparent hugepage size public

     Test programs want to know the size of a transparent hugepage. While it
     is commonly the same as the size of a hugetlbfs page (shown as
     Hugepagesize in /proc/meminfo), that is not always so: powerpc
     implements transparent hugepages in a different way from hugetlbfs
     pages, so it's coincidence when their sizes are the same; and x86 and
     others can support more than one hugetlbfs page size.

     Add /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to show the THP
     size in bytes - it's the same for Anonymous and Shmem hugepages.  Call
     it hpage_pmd_size (after HPAGE_PMD_SIZE) rather than hpage_size, in
     case some transparent support for pud and pgd pages is added later.


Thanks,
Yang
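
As a rough sketch of what reading that sysfs file could look like from an
application: the path is the one named in the message above, while the
helper names read_size_file() and thp_pmd_size() are hypothetical. The
file is absent on kernels built without THP, so 0 is used here to mean
"unknown".

```c
#include <stdio.h>

/* Path named in the message above. */
#define THP_PMD_SIZE_PATH \
	"/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"

/*
 * Reads a single decimal byte count from `path`.
 * Returns the value, or 0 if the file is missing or cannot be parsed.
 */
static unsigned long read_size_file(const char *path)
{
	unsigned long size = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	if (fscanf(f, "%lu", &size) != 1)
		size = 0;
	fclose(f);
	return size;
}

/* THP size in bytes, or 0 if the kernel does not export it. */
static unsigned long thp_pmd_size(void)
{
	return read_size_file(THP_PMD_SIZE_PATH);
}
```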
Christoph Hellwig April 18, 2018, 10:27 a.m. UTC | #5
On Wed, Apr 18, 2018 at 05:08:13AM +0800, Yang Shi wrote:
> Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
> filesystem with huge page support anymore. tmpfs can use huge page via
> THP when mounting by "huge=" mount option.
> 
> When applications use huge page on hugetlbfs, it just need check the
> filesystem magic number, but it is not enough for tmpfs. So, introduce
> ST_HUGE flag to statfs if super block has SB_HUGE set which indicates
> huge page is supported on the specific filesystem.
> 
> Some applications could benefit from this change, for example QEMU.
> When use mmap file as guest VM backend memory, QEMU typically mmap the
> file size plus one extra page. If the file is on hugetlbfs the extra
> page is huge page size (i.e. 2MB), but it is still 4KB on tmpfs even
> though THP is enabled. tmpfs THP requires VMA is huge page aligned, so
> if 4KB page is used THP will not be used at all. The below /proc/meminfo
> fragment shows the THP use of QEMU with 4K page:
> 
> ShmemHugePages:   679936 kB
> ShmemPmdMapped:        0 kB
> 
> With ST_HUGE flag, QEMU can get huge page, then /proc/meminfo looks
> like:
> 
> ShmemHugePages:    77824 kB
> ShmemPmdMapped:     6144 kB
> 
> With this flag, the applications can know if huge page is supported on
> the filesystem then optimize the behavior of the applications
> accordingly. Although the similar function can be implemented in
> applications by traversing the mount options, it looks more convenient
> if kernel can provide such flag.
> 
> Even though ST_HUGE is set, f_bsize still returns 4KB for tmpfs since
> THP could be split, and it also may fall back to 4KB page silently if
> there is not enough huge page.

Seems like you should report it through the st_blksize field of struct
stat, instead of introducing a not very useful binary field.
Yang Shi April 18, 2018, 6:18 p.m. UTC | #6
On 4/18/18 3:27 AM, Christoph Hellwig wrote:
> On Wed, Apr 18, 2018 at 05:08:13AM +0800, Yang Shi wrote:
>> Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
>> filesystem with huge page support anymore. tmpfs can use huge page via
>> THP when mounting by "huge=" mount option.
>>
>> When applications use huge page on hugetlbfs, it just need check the
>> filesystem magic number, but it is not enough for tmpfs. So, introduce
>> ST_HUGE flag to statfs if super block has SB_HUGE set which indicates
>> huge page is supported on the specific filesystem.
>>
>> Some applications could benefit from this change, for example QEMU.
>> When use mmap file as guest VM backend memory, QEMU typically mmap the
>> file size plus one extra page. If the file is on hugetlbfs the extra
>> page is huge page size (i.e. 2MB), but it is still 4KB on tmpfs even
>> though THP is enabled. tmpfs THP requires VMA is huge page aligned, so
>> if 4KB page is used THP will not be used at all. The below /proc/meminfo
>> fragment shows the THP use of QEMU with 4K page:
>>
>> ShmemHugePages:   679936 kB
>> ShmemPmdMapped:        0 kB
>>
>> With ST_HUGE flag, QEMU can get huge page, then /proc/meminfo looks
>> like:
>>
>> ShmemHugePages:    77824 kB
>> ShmemPmdMapped:     6144 kB
>>
>> With this flag, the applications can know if huge page is supported on
>> the filesystem then optimize the behavior of the applications
>> accordingly. Although the similar function can be implemented in
>> applications by traversing the mount options, it looks more convenient
>> if kernel can provide such flag.
>>
>> Even though ST_HUGE is set, f_bsize still returns 4KB for tmpfs since
>> THP could be split, and it also may fall back to 4KB page silently if
>> there is not enough huge page.
> Seems like you should report it through the st_blksize field of struct
> stat, instead of introducing a not very useful binary field.

Yes, thanks for the suggestion. I did think about it before I went with
the new flag. Unlike hugetlb, THP does *not* guarantee that huge pages
are used all the time; it may fall back to regular 4K pages or get split.
I'm not sure how applications use the f_bsize field. Changing it might
break existing applications, and applications might misuse the value in
ways that end up being counterproductive. So, IMHO, a new flag sounds
safer.

Yang
Mike Kravetz April 18, 2018, 8:26 p.m. UTC | #7
On 04/17/2018 02:08 PM, Yang Shi wrote:
> And, set the flag for hugetlbfs as well to keep the consistency, and the
> applications don't have to know what filesystem is used to use huge
> page, just need to check ST_HUGE flag.

For hugetlbfs, setting such a flag would be for consistency only.  mapping
hugetlbfs files REQUIRES huge page alignment and size.

If an application would want to take advantage of this flag for tmpfs, it
needs to map at a fixed address (MAP_FIXED) for huge page alignment.  So,
it will need to do one of the 'mmap tricks' to get a mapping at a suitably
aligned address.  

IIRC, there is code to 'suitably align' DAX mappings to appropriate huge page
boundaries.  Perhaps, something like this could be added for tmpfs mounted
with huge=?  Of course, this would not take into account 'length' but may
help some.
Yang Shi April 18, 2018, 8:53 p.m. UTC | #8
On 4/18/18 1:26 PM, Mike Kravetz wrote:
> On 04/17/2018 02:08 PM, Yang Shi wrote:
>> And, set the flag for hugetlbfs as well to keep the consistency, and the
>> applications don't have to know what filesystem is used to use huge
>> page, just need to check ST_HUGE flag.
> For hugetlbfs, setting such a flag would be for consistency only.  mapping
> hugetlbfs files REQUIRES huge page alignment and size.

Yes, applications don't have to read this flag if the underlying 
filesystem is hugetlbfs. The fs magic number is good enough.

>
> If an application would want to take advantage of this flag for tmpfs, it
> needs to map at a fixed address (MAP_FIXED) for huge page alignment.  So,
> it will need to do one of the 'mmap tricks' to get a mapping at a suitably
> aligned address.

It doesn't have to be MAP_FIXED, but it definitely has to be huge page
aligned. This flag is aimed at this case. With this flag, applications
can tell that the underlying tmpfs supports huge pages, and can then
deliberately mmap memory with huge page alignment.

>
> IIRC, there is code to 'suitably align' DAX mappings to appropriate huge page
> boundaries.  Perhaps, something like this could be added for tmpfs mounted
> with huge=?  Of course, this would not take into account 'length' but may
> help some.

Might be. However, THP already exports the huge page size via sysfs, so
applications can read it to get the alignment.

Thanks,
Yang

Christoph Hellwig April 19, 2018, 8:28 a.m. UTC | #9
On Wed, Apr 18, 2018 at 11:18:25AM -0700, Yang Shi wrote:
> Yes, thanks for the suggestion. I did think about it before I went with the
> new flag. Not like hugetlb, THP will *not* guarantee huge page is used all
> the time, it may fallback to regular 4K page or may get split. I'm not sure
> how the applications use f_bsize field, it might break existing applications
> and the value might be abused by applications to have counter optimization.
> So, IMHO, a new flag may sound safer.

But st_blksize isn't the block size, that is why I suggested it.  It is
the preferred I/O size, and various file systems can report way
larger values than the block size already.
Kirill A. Shutemov April 19, 2018, 9:01 a.m. UTC | #10
On Wed, Apr 18, 2018 at 01:26:35PM -0700, Mike Kravetz wrote:
> If an application would want to take advantage of this flag for tmpfs, it
> needs to map at a fixed address (MAP_FIXED) for huge page alignment.  So,
> it will need to do one of the 'mmap tricks' to get a mapping at a suitably
> aligned address.  

We don't need MAP_FIXED. We already have all required magic in
shmem_get_unmapped_area().
Kirill A. Shutemov April 19, 2018, 9:05 a.m. UTC | #11
On Thu, Apr 19, 2018 at 01:28:10AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 18, 2018 at 11:18:25AM -0700, Yang Shi wrote:
> > Yes, thanks for the suggestion. I did think about it before I went with the
> > new flag. Not like hugetlb, THP will *not* guarantee huge page is used all
> > the time, it may fallback to regular 4K page or may get split. I'm not sure
> > how the applications use f_bsize field, it might break existing applications
> > and the value might be abused by applications to have counter optimization.
> > So, IMHO, a new flag may sound safer.
> 
> But st_blksize isn't the block size, that is why I suggested it.  It is
> the preferred I/O size, and various file systems can report way
> larger values than the block size already.

I agree. This looks like a better fit.
Yang Shi April 20, 2018, 12:18 a.m. UTC | #12
On 4/19/18 1:28 AM, Christoph Hellwig wrote:
> On Wed, Apr 18, 2018 at 11:18:25AM -0700, Yang Shi wrote:
>> Yes, thanks for the suggestion. I did think about it before I went with the
>> new flag. Not like hugetlb, THP will *not* guarantee huge page is used all
>> the time, it may fallback to regular 4K page or may get split. I'm not sure
>> how the applications use f_bsize field, it might break existing applications
>> and the value might be abused by applications to have counter optimization.
>> So, IMHO, a new flag may sound safer.
> But st_blksize isn't the block size, that is why I suggested it.  It is
> the preferred I/O size, and various file systems can report way
> larger values than the block size already.

Thanks. If it is safe for applications, it can definitely return the
huge page size via st_blksize.

Is it also safe to return the huge page size via statfs->f_bsize? It
sounds like it does not have to be the physical block size either. The
man page says it is the "Optimal transfer block size".

Yang
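
For comparison, the two fields discussed above can be read side by side
from userspace. This is a minimal sketch; struct io_size_hints and
get_io_size_hints() are names invented for the example, and it only
reports what the running kernel returns, it does not imply either field
carries the huge page size today.

```c
#include <sys/stat.h>	/* stat(2), st_blksize */
#include <sys/vfs.h>	/* statfs(2), f_bsize */

struct io_size_hints {
	long st_blksize;	/* preferred I/O size from stat(2) */
	long f_bsize;		/* transfer block size from statfs(2) */
};

/* Fills `out` for `path`; returns 0 on success, -1 if either call fails. */
static int get_io_size_hints(const char *path, struct io_size_hints *out)
{
	struct stat st;
	struct statfs sfs;

	if (stat(path, &st) != 0 || statfs(path, &sfs) != 0)
		return -1;
	out->st_blksize = (long)st.st_blksize;
	out->f_bsize = (long)sfs.f_bsize;
	return 0;
}
```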

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b9a254d..3754b45 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1265,6 +1265,7 @@  static void init_once(void *foo)
 	sb->s_op = &hugetlbfs_ops;
 	sb->s_time_gran = 1;
 	sb->s_root = d_make_root(hugetlbfs_get_root(sb, &config));
+	sb->s_flags |= SB_HUGE;
 	if (!sb->s_root)
 		goto out_free;
 	return 0;
diff --git a/fs/statfs.c b/fs/statfs.c
index 5b2a24f..ac0403a 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -41,6 +41,8 @@  static int flags_by_sb(int s_flags)
 		flags |= ST_MANDLOCK;
 	if (s_flags & SB_RDONLY)
 		flags |= ST_RDONLY;
+	if (s_flags & SB_HUGE)
+		flags |= ST_HUGE;
 	return flags;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6baf76..df246e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1287,6 +1287,7 @@  struct fasync_struct {
 #define SB_SYNCHRONOUS	16	/* Writes are synced at once */
 #define SB_MANDLOCK	64	/* Allow mandatory locks on an FS */
 #define SB_DIRSYNC	128	/* Directory modifications are synchronous */
+#define SB_HUGE		256	/* Support hugepage/THP */
 #define SB_NOATIME	1024	/* Do not update access times. */
 #define SB_NODIRATIME	2048	/* Do not update directory access times */
 #define SB_SILENT	32768
diff --git a/include/linux/statfs.h b/include/linux/statfs.h
index 3142e98..79a634b 100644
--- a/include/linux/statfs.h
+++ b/include/linux/statfs.h
@@ -40,5 +40,6 @@  struct kstatfs {
 #define ST_NOATIME	0x0400	/* do not update access times */
 #define ST_NODIRATIME	0x0800	/* do not update directory access times */
 #define ST_RELATIME	0x1000	/* update atime relative to mtime/ctime */
+#define ST_HUGE		0x2000	/* support hugepage/thp */
 
 #endif
diff --git a/mm/shmem.c b/mm/shmem.c
index b859192..d5312ec 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3632,6 +3632,11 @@  static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
 	sbinfo->max_inodes  = config.max_inodes;
 	sbinfo->free_inodes = config.max_inodes - inodes;
 
+	if (sbinfo->huge > 0)
+		sb->s_flags |= SB_HUGE;
+	else
+		sb->s_flags &= ~SB_HUGE;
+
 	/*
 	 * Preserve previous mempolicy unless mpol remount option was specified.
 	 */
@@ -3804,6 +3809,9 @@  int shmem_fill_super(struct super_block *sb, void *data, int silent)
 	}
 	sb->s_export_op = &shmem_export_ops;
 	sb->s_flags |= SB_NOSEC;
+
+	if (sbinfo->huge > 0)
+		sb->s_flags |= SB_HUGE;
 #else
 	sb->s_flags |= SB_NOUSER;
 #endif