[v2,3/3] mm,thp: add read-only THP support for (non-shmem) FS

Message ID: 20190614182204.2673660-4-songliubraving@fb.com
Series: Enable THP for text section of non-shmem files

Commit Message

Song Liu June 14, 2019, 6:22 p.m. UTC
This patch is (hopefully) the first step toward enabling THP for
non-shmem filesystems.

It enables an application to map part of its text section with THP via
madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

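A minimal self-contained sketch of such a caller (assuming, for
illustration, that 0x600000 is a 2MB-aligned address inside the running
binary's text segment):

    #include <sys/mman.h>

    int main(void)
    {
            /* assumed 2MB-aligned, 2MB-sized chunk of our own text */
            if (madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE))
                    return 1;
            /* madvise() only registers the hint; khugepaged collapses
             * the range into a THP asynchronously, in the background */
            /* ... keep running hot code in the range ... */
            return 0;
    }
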
We tried to reuse the logic for THP on tmpfs. The following functions are
renamed to reflect the new functionality:

	collapse_shmem()	=>  collapse_file()
	khugepaged_scan_shmem()	=>  khugepaged_scan_file()

Currently, write is not supported for non-shmem THP. This is enforced by
taking a negative i_writecount. Therefore, if a file has THP pages in the
page cache, open() for write will fail. To update/modify the file, the
user needs to remove it first.

An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
feature.
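
To try the feature, the kernel needs something like the following
.config fragment (a sketch of the obvious configuration;
TRANSPARENT_HUGE_PAGECACHE, which the new option depends on, is derived
from THP support):

    CONFIG_TRANSPARENT_HUGEPAGE=y
    CONFIG_SHMEM=y
    CONFIG_READ_ONLY_THP_FOR_FS=y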

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/fs.h |   8 ++++
 mm/Kconfig         |  11 +++++
 mm/filemap.c       |   5 ++-
 mm/khugepaged.c    | 106 ++++++++++++++++++++++++++++++++++++---------
 mm/rmap.c          |  12 +++--
 5 files changed, 116 insertions(+), 26 deletions(-)

Comments

Rik van Riel June 17, 2019, 3:42 p.m. UTC | #1
On Fri, 2019-06-14 at 11:22 -0700, Song Liu wrote:
> 
> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> +	if (shmem_file(vma->vm_file) ||
> +	    (vma->vm_file && (vm_flags & VM_DENYWRITE))) {
> +#else
>  	if (shmem_file(vma->vm_file)) {
> +#endif
>  		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
>  			return false;

Future cleanup idea: would it be nice to hide the
above behind a "vma_can_have_file_thp" function or
similar?

That inline function could also have a comment explaining
why the check is the way it is.
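
For illustration, such a helper might look like this (a sketch derived
from the #ifdef block quoted above; the name and placement follow Rik's
suggestion and are not code from the patch):

	static inline bool vma_can_have_file_thp(struct vm_area_struct *vma,
						 unsigned long vm_flags)
	{
		/* tmpfs/shmem is always a candidate for file THP */
		if (shmem_file(vma->vm_file))
			return true;
		/*
		 * Other filesystems: read-only (VM_DENYWRITE) file
		 * mappings only, and only with READ_ONLY_THP_FOR_FS.
		 */
		return IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
		       vma->vm_file && (vm_flags & VM_DENYWRITE);
	}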

OTOH, I guess this series is just the first step towards
more complete functionality, and things are likely to change
again soon(ish).

> @@ -1628,14 +1692,14 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
>  			result = SCAN_EXCEED_NONE_PTE;
>  		} else {
>  			node = khugepaged_find_target_node();
> -			collapse_shmem(mm, mapping, start, hpage, node);
> +			collapse_file(vma, mapping, start, hpage, node);
>  		}
>  	}

If for some reason you end up posting a v3 of this
series, the s/_shmem/_file/ renaming could be broken
out into its own patch.

All the code looks good though.

Acked-by: Rik van Riel <riel@surriel.com>
Kirill A. Shutemov June 21, 2019, 12:58 p.m. UTC | #2
On Fri, Jun 14, 2019 at 11:22:04AM -0700, Song Liu wrote:
> [...]
> 
> Currently, write is not supported for non-shmem THP. This is enforced by
> taking a negative i_writecount. Therefore, if a file has THP pages in the
> page cache, open() for write will fail. To update/modify the file, the
> user needs to remove it first.
> 
> An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
> feature.

Please document explicitly that the feature opens a local DoS attack: any
user with read access to a file can block writes to it by using
MADV_HUGEPAGE on a range of the file.

As is, it should only be used with trusted userspace.
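
For concreteness, the blocked writer sees something like this (a sketch;
the path is made up, and it assumes khugepaged has already collapsed
part of the file into THPs):

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>

	int main(void)
	{
		int fd = open("/apps/victim-binary", O_WRONLY);

		/* i_writecount is negative while file THPs sit in the
		 * page cache, so get_write_access() fails the open */
		if (fd < 0 && errno == ETXTBSY)
			printf("write blocked by read-only THPs\n");
		return 0;
	}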

We might also want a mount option, in addition to the Kconfig option, to
enable the feature on a per-mount basis.
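
As an illustration only, such a per-mount gate might slot into
hugepage_vma_check() along these lines (SB_READ_ONLY_THP is an invented
flag; nothing like this exists in the series):

	/* hypothetical superblock flag set by a new mount option */
	static inline bool file_thp_allowed(struct vm_area_struct *vma)
	{
		if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
			return false;
		if (!vma->vm_file)
			return false;
		return file_inode(vma->vm_file)->i_sb->s_flags &
		       SB_READ_ONLY_THP;	/* invented */
	}
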
Song Liu June 21, 2019, 1:08 p.m. UTC | #3
Hi Kirill,

> On Jun 21, 2019, at 5:58 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Fri, Jun 14, 2019 at 11:22:04AM -0700, Song Liu wrote:
>> [...]
> 
> Please document explicitly that the feature opens a local DoS attack: any
> user with read access to a file can block writes to it by using
> MADV_HUGEPAGE on a range of the file.
> 
> As is, it should only be used with trusted userspace.
> 
> We might also want a mount option, in addition to the Kconfig option, to
> enable the feature on a per-mount basis.

This behavior has been removed in v3 through v5 of the series.

Thanks,
Song
Kirill A. Shutemov June 21, 2019, 1:11 p.m. UTC | #4
On Fri, Jun 21, 2019 at 01:08:39PM +0000, Song Liu wrote:
> 
> Hi Kirill,
> 
> > On Jun 21, 2019, at 5:58 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > [...]
> 
> This behavior has been removed in v3 through v5 of the series.

Yes, I've caught up with that. :P

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d..cda996ddaee1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2871,6 +2871,10 @@  static inline int get_write_access(struct inode *inode)
 {
 	return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
+static inline int __deny_write_access(struct inode *inode)
+{
+	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+}
 static inline int deny_write_access(struct file *file)
 {
 	struct inode *inode = file_inode(file);
@@ -2880,6 +2884,10 @@  static inline void put_write_access(struct inode * inode)
 {
 	atomic_dec(&inode->i_writecount);
 }
+static inline void __allow_write_access(struct inode *inode)
+{
+	atomic_inc(&inode->i_writecount);
+}
 static inline void allow_write_access(struct file *file)
 {
 	if (file)
diff --git a/mm/Kconfig b/mm/Kconfig
index f0c76ba47695..546d45d9bdab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -762,6 +762,17 @@  config GUP_BENCHMARK
 
 	  See tools/testing/selftests/vm/gup_benchmark.c
 
+config READ_ONLY_THP_FOR_FS
+	bool "Read-only THP for filesystems (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
+
+	help
+	  Allow khugepaged to put read-only file-backed pages in THP.
+
+	  This is marked experimental because it makes files with THP in
+	  the page cache read-only. To overwrite such a file, it needs to
+	  be truncated or removed first.
+
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
diff --git a/mm/filemap.c b/mm/filemap.c
index f5b79a43946d..966f24cee711 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -203,8 +203,9 @@  static void unaccount_page_cache_page(struct address_space *mapping,
 		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
 		if (PageTransHuge(page))
 			__dec_node_page_state(page, NR_SHMEM_THPS);
-	} else {
-		VM_BUG_ON_PAGE(PageTransHuge(page), page);
+	} else if (PageTransHuge(page)) {
+		__dec_node_page_state(page, NR_FILE_THPS);
+		__allow_write_access(mapping->host);
 	}
 
 	/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..1855ace48488 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -48,6 +48,7 @@  enum scan_result {
 	SCAN_CGROUP_CHARGE_FAIL,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_TRUNCATED,
+	SCAN_PAGE_HAS_PRIVATE,
 };
 
 #define CREATE_TRACE_POINTS
@@ -404,7 +405,13 @@  static bool hugepage_vma_check(struct vm_area_struct *vma,
 	    (vm_flags & VM_NOHUGEPAGE) ||
 	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 		return false;
+
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	if (shmem_file(vma->vm_file) ||
+	    (vma->vm_file && (vm_flags & VM_DENYWRITE))) {
+#else
 	if (shmem_file(vma->vm_file)) {
+#endif
 		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
 			return false;
 		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
@@ -456,8 +463,9 @@  int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 	unsigned long hstart, hend;
 
 	/*
-	 * khugepaged does not yet work on non-shmem files or special
-	 * mappings. And file-private shmem THP is not supported.
+	 * For non-shmem files, khugepaged only supports read-only
+	 * mappings. khugepaged does not yet work on special mappings.
+	 * And file-private shmem THP is not supported.
 	 */
 	if (!hugepage_vma_check(vma, vm_flags))
 		return 0;
@@ -1284,12 +1292,12 @@  static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 }
 
 /**
- * collapse_shmem - collapse small tmpfs/shmem pages into huge one.
+ * collapse_file - collapse filemap/tmpfs/shmem pages into a huge one.
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
  *  - scan page cache replacing old pages with the new one
- *    + swap in pages if necessary;
+ *    + swap/gup in pages if necessary;
  *    + fill in gaps;
  *    + keep old pages around in case rollback is required;
  *  - if replacing succeeds:
@@ -1301,10 +1309,11 @@  static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static void collapse_shmem(struct mm_struct *mm,
+static void collapse_file(struct vm_area_struct *vma,
 		struct address_space *mapping, pgoff_t start,
 		struct page **hpage, int node)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	gfp_t gfp;
 	struct page *new_page;
 	struct mem_cgroup *memcg;
@@ -1312,7 +1321,11 @@  static void collapse_shmem(struct mm_struct *mm,
 	LIST_HEAD(pagelist);
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
 	int nr_none = 0, result = SCAN_SUCCEED;
+	bool is_shmem = shmem_file(vma->vm_file);
 
+#ifndef CONFIG_READ_ONLY_THP_FOR_FS
+	VM_BUG_ON(!is_shmem);
+#endif
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
 	/* Only allocate from the target node */
@@ -1344,7 +1357,8 @@  static void collapse_shmem(struct mm_struct *mm,
 	} while (1);
 
 	__SetPageLocked(new_page);
-	__SetPageSwapBacked(new_page);
+	if (is_shmem)
+		__SetPageSwapBacked(new_page);
 	new_page->index = start;
 	new_page->mapping = mapping;
 
@@ -1359,7 +1373,7 @@  static void collapse_shmem(struct mm_struct *mm,
 		struct page *page = xas_next(&xas);
 
 		VM_BUG_ON(index != xas.xa_index);
-		if (!page) {
+		if (is_shmem && !page) {
 			/*
 			 * Stop if extent has been truncated or hole-punched,
 			 * and is now completely empty.
@@ -1380,7 +1394,7 @@  static void collapse_shmem(struct mm_struct *mm,
 			continue;
 		}
 
-		if (xa_is_value(page) || !PageUptodate(page)) {
+		if (is_shmem && (xa_is_value(page) || !PageUptodate(page))) {
 			xas_unlock_irq(&xas);
 			/* swap in or instantiate fallocated page */
 			if (shmem_getpage(mapping->host, index, &page,
@@ -1388,6 +1402,24 @@  static void collapse_shmem(struct mm_struct *mm,
 				result = SCAN_FAIL;
 				goto xa_unlocked;
 			}
+		} else if (!page || xa_is_value(page)) {
+			unsigned long vaddr;
+
+			VM_BUG_ON(is_shmem);
+
+			vaddr = vma->vm_start +
+				((index - vma->vm_pgoff) << PAGE_SHIFT);
+			xas_unlock_irq(&xas);
+			if (get_user_pages_remote(NULL, mm, vaddr, 1,
+					FOLL_FORCE, &page, NULL, NULL) != 1) {
+				result = SCAN_FAIL;
+				goto xa_unlocked;
+			}
+			lru_add_drain();
+			lock_page(page);
+		} else if (!PageUptodate(page) || PageDirty(page)) {
+			result = SCAN_FAIL;
+			goto xa_locked;
 		} else if (trylock_page(page)) {
 			get_page(page);
 			xas_unlock_irq(&xas);
@@ -1422,6 +1454,12 @@  static void collapse_shmem(struct mm_struct *mm,
 			goto out_unlock;
 		}
 
+		if (page_has_private(page) &&
+		    !try_to_release_page(page, GFP_KERNEL)) {
+			result = SCAN_PAGE_HAS_PRIVATE;
+			break;
+		}
+
 		if (page_mapped(page))
 			unmap_mapping_pages(mapping, index, 1, false);
 
@@ -1459,12 +1497,20 @@  static void collapse_shmem(struct mm_struct *mm,
 		goto xa_unlocked;
 	}
 
-	__inc_node_page_state(new_page, NR_SHMEM_THPS);
+	if (is_shmem)
+		__inc_node_page_state(new_page, NR_SHMEM_THPS);
+	else {
+		__inc_node_page_state(new_page, NR_FILE_THPS);
+		__deny_write_access(mapping->host);
+	}
+
 	if (nr_none) {
 		struct zone *zone = page_zone(new_page);
 
 		__mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
-		__mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
+		if (is_shmem)
+			__mod_node_page_state(zone->zone_pgdat, NR_SHMEM,
+					      nr_none);
 	}
 
 xa_locked:
@@ -1502,10 +1548,15 @@  static void collapse_shmem(struct mm_struct *mm,
 
 		SetPageUptodate(new_page);
 		page_ref_add(new_page, HPAGE_PMD_NR - 1);
-		set_page_dirty(new_page);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
+
+		if (is_shmem) {
+			set_page_dirty(new_page);
+			lru_cache_add_anon(new_page);
+		} else {
+			lru_cache_add_file(new_page);
+		}
 		count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
-		lru_cache_add_anon(new_page);
 
 		/*
 		 * Remove pte page tables, so we can re-fault the page as huge.
@@ -1520,7 +1571,9 @@  static void collapse_shmem(struct mm_struct *mm,
 		/* Something went wrong: roll back page cache changes */
 		xas_lock_irq(&xas);
 		mapping->nrpages -= nr_none;
-		shmem_uncharge(mapping->host, nr_none);
+
+		if (is_shmem)
+			shmem_uncharge(mapping->host, nr_none);
 
 		xas_set(&xas, start);
 		xas_for_each(&xas, page, end - 1) {
@@ -1560,7 +1613,7 @@  static void collapse_shmem(struct mm_struct *mm,
 	/* TODO: tracepoints */
 }
 
-static void khugepaged_scan_shmem(struct mm_struct *mm,
+static void khugepaged_scan_file(struct vm_area_struct *vma,
 		struct address_space *mapping,
 		pgoff_t start, struct page **hpage)
 {
@@ -1603,6 +1656,17 @@  static void khugepaged_scan_shmem(struct mm_struct *mm,
 			break;
 		}
 
+		if (page_has_private(page) && trylock_page(page)) {
+			int ret;
+
+			ret = try_to_release_page(page, GFP_KERNEL);
+			unlock_page(page);
+			if (!ret) {
+				result = SCAN_PAGE_HAS_PRIVATE;
+				break;
+			}
+		}
+
 		if (page_count(page) != 1 + page_mapcount(page)) {
 			result = SCAN_PAGE_COUNT;
 			break;
@@ -1628,14 +1692,14 @@  static void khugepaged_scan_shmem(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 		} else {
 			node = khugepaged_find_target_node();
-			collapse_shmem(mm, mapping, start, hpage, node);
+			collapse_file(vma, mapping, start, hpage, node);
 		}
 	}
 
 	/* TODO: tracepoints */
 }
 #else
-static void khugepaged_scan_shmem(struct mm_struct *mm,
+static void khugepaged_scan_file(struct vm_area_struct *vma,
 		struct address_space *mapping,
 		pgoff_t start, struct page **hpage)
 {
@@ -1710,17 +1774,19 @@  static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (shmem_file(vma->vm_file)) {
+			if (vma->vm_file) {
 				struct file *file;
 				pgoff_t pgoff = linear_page_index(vma,
 						khugepaged_scan.address);
-				if (!shmem_huge_enabled(vma))
+
+				if (shmem_file(vma->vm_file)
+				    && !shmem_huge_enabled(vma))
 					goto skip;
 				file = get_file(vma->vm_file);
 				up_read(&mm->mmap_sem);
 				ret = 1;
-				khugepaged_scan_shmem(mm, file->f_mapping,
-						pgoff, hpage);
+				khugepaged_scan_file(vma, file->f_mapping,
+						     pgoff, hpage);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..87cfa2c19eda 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1192,8 +1192,10 @@  void page_add_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
-		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
-		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
+		if (PageSwapBacked(page))
+			__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
+		else
+			__inc_node_page_state(page, NR_FILE_PMDMAPPED);
 	} else {
 		if (PageTransCompound(page) && page_mapping(page)) {
 			VM_WARN_ON_ONCE(!PageLocked(page));
@@ -1232,8 +1234,10 @@  static void page_remove_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			goto out;
-		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
-		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
+		if (PageSwapBacked(page))
+			__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
+		else
+			__dec_node_page_state(page, NR_FILE_PMDMAPPED);
 	} else {
 		if (!atomic_add_negative(-1, &page->_mapcount))
 			goto out;