
mm: thp: remove use_zero_page sysfs knob

Message ID 1532110430-115278-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived

Commit Message

Yang Shi July 20, 2018, 6:13 p.m. UTC
Digging into the original review suggests the use_zero_page sysfs knob
was added to ease testing and to give users a way to mitigate
refcounting overhead.

It has been a few years since the knob was first added, so I think we
can be confident that the feature is stable enough. And since commit
6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
the refcounting overhead has been reduced significantly.

Beyond that, the knob defaults to 1 (enabled), and I suspect very few
people ever turn it off.

So it no longer seems worth keeping this knob around.

Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  7 -------
 include/linux/huge_mm.h                    |  4 ----
 mm/huge_memory.c                           | 22 ++--------------------
 3 files changed, 2 insertions(+), 31 deletions(-)

Comments

Andrew Morton July 20, 2018, 7:32 p.m. UTC | #1
On Sat, 21 Jul 2018 02:13:50 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> Digging into the original review suggests the use_zero_page sysfs knob
> was added to ease testing and to give users a way to mitigate
> refcounting overhead.
> 
> It has been a few years since the knob was first added, so I think we
> can be confident that the feature is stable enough. And since commit
> 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
> the refcounting overhead has been reduced significantly.
> 
> Beyond that, the knob defaults to 1 (enabled), and I suspect very few
> people ever turn it off.
> 
> So it no longer seems worth keeping this knob around.

Probably OK.  Might not be OK, nobody knows.

It's been there for seven years so another six months won't kill us. 
How about as an intermediate step we add a printk("use_zero_page is
scheduled for removal.  Please contact linux-mm@kvack.org if you need
it").
David Rientjes July 20, 2018, 8:02 p.m. UTC | #2
On Fri, 20 Jul 2018, Andrew Morton wrote:

> > Digging into the original review suggests the use_zero_page sysfs knob
> > was added to ease testing and to give users a way to mitigate
> > refcounting overhead.
> > 
> > It has been a few years since the knob was first added, so I think we
> > can be confident that the feature is stable enough. And since commit
> > 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
> > the refcounting overhead has been reduced significantly.
> > 
> > Beyond that, the knob defaults to 1 (enabled), and I suspect very few
> > people ever turn it off.
> > 
> > So it no longer seems worth keeping this knob around.
> 
> Probably OK.  Might not be OK, nobody knows.
> 
> It's been there for seven years so another six months won't kill us. 
> How about as an intermediate step we add a printk("use_zero_page is
> scheduled for removal.  Please contact linux-mm@kvack.org if you need
> it").
> 

We disable the huge zero page through this interface; there were issues 
related to the huge zero page shrinker (probably best never to free a 
per-node huge zero page once allocated) and CVE-2017-1000405 for huge 
dirty COW.
Yang Shi July 20, 2018, 8:37 p.m. UTC | #3
On 7/20/18 1:02 PM, David Rientjes wrote:
> On Fri, 20 Jul 2018, Andrew Morton wrote:
>
>>> Digging into the original review suggests the use_zero_page sysfs knob
>>> was added to ease testing and to give users a way to mitigate
>>> refcounting overhead.
>>>
>>> It has been a few years since the knob was first added, so I think we
>>> can be confident that the feature is stable enough. And since commit
>>> 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
>>> the refcounting overhead has been reduced significantly.
>>>
>>> Beyond that, the knob defaults to 1 (enabled), and I suspect very few
>>> people ever turn it off.
>>>
>>> So it no longer seems worth keeping this knob around.
>> Probably OK.  Might not be OK, nobody knows.
>>
>> It's been there for seven years so another six months won't kill us.
>> How about as an intermediate step we add a printk("use_zero_page is
>> scheduled for removal.  Please contact linux-mm@kvack.org if you need
>> it").
>>
> We disable the huge zero page through this interface; there were issues
> related to the huge zero page shrinker (probably best never to free a
> per-node huge zero page once allocated) and CVE-2017-1000405 for huge
> dirty COW.

Thanks for the information. It looks like the CVE has been resolved by 
commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 ("mm, thp: Do not make 
page table dirty unconditionally in touch_p[mu]d()"), which is already 
in 4.15.

What was the shrinker-related issue? I assume it has been resolved, 
right?
David Rientjes July 20, 2018, 9:05 p.m. UTC | #4
On Fri, 20 Jul 2018, Yang Shi wrote:

> > We disable the huge zero page through this interface; there were issues
> > related to the huge zero page shrinker (probably best never to free a
> > per-node huge zero page once allocated) and CVE-2017-1000405 for huge
> > dirty COW.
> 
> Thanks for the information. It looks like the CVE has been resolved by
> commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 ("mm, thp: Do not make
> page table dirty unconditionally in touch_p[mu]d()"), which is already
> in 4.15.
> 

Users who run kernels earlier than 4.15 may choose to mitigate the CVE 
by using this tunable.  It's not something we need to keep permanently, 
but it is likely too early to remove it.

> What was the shrinker-related issue? I assume it has been resolved, right?
> 

The huge zero page can be reclaimed under memory pressure and, if it is, 
it is allocated again with gfp flags that attempt memory compaction, 
which can become expensive.  If we are constantly under memory pressure, 
it gets freed and reallocated millions of times, always trying to 
compact memory both directly and by kicking kcompactd in the background.

It likely should also be per node.
Kirill A. Shutemov July 20, 2018, 9:06 p.m. UTC | #5
On Sat, Jul 21, 2018 at 02:13:50AM +0800, Yang Shi wrote:
> Digging into the original review suggests the use_zero_page sysfs knob
> was added to ease testing and to give users a way to mitigate
> refcounting overhead.
> 
> It has been a few years since the knob was first added, so I think we
> can be confident that the feature is stable enough. And since commit
> 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
> the refcounting overhead has been reduced significantly.
> 
> Beyond that, the knob defaults to 1 (enabled), and I suspect very few
> people ever turn it off.
> 
> So it no longer seems worth keeping this knob around.

I don't think that keeping the knob around is a huge maintenance burden.
And since it helped to work around a security bug relatively recently, I
would rather keep it.
Yang Shi July 20, 2018, 11:49 p.m. UTC | #6
On 7/20/18 2:05 PM, David Rientjes wrote:
> On Fri, 20 Jul 2018, Yang Shi wrote:
>
>>> We disable the huge zero page through this interface; there were issues
>>> related to the huge zero page shrinker (probably best never to free a
>>> per-node huge zero page once allocated) and CVE-2017-1000405 for huge
>>> dirty COW.
>> Thanks for the information. It looks like the CVE has been resolved by
>> commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 ("mm, thp: Do not make
>> page table dirty unconditionally in touch_p[mu]d()"), which is already
>> in 4.15.
>>
> Users who run kernels earlier than 4.15 may choose to mitigate the CVE
> by using this tunable.  It's not something we need to keep permanently,
> but it is likely too early to remove it.

Yes, it might be good to keep it around for a while.

>
>> What was the shrinker-related issue? I assume it has been resolved, right?
>>
> The huge zero page can be reclaimed under memory pressure and, if it is,
> it is allocated again with gfp flags that attempt memory compaction,
> which can become expensive.  If we are constantly under memory pressure,
> it gets freed and reallocated millions of times, always trying to
> compact memory both directly and by kicking kcompactd in the background.

Even if we don't use the huge zero page, we may run into a similar 
issue under memory pressure. Disabling it just saves the cost of 
calling the huge zero page shrinker, and that cost does not sound high.

>
> It likely should also be per node.
Yang Shi July 20, 2018, 11:51 p.m. UTC | #7
On 7/20/18 2:06 PM, Kirill A. Shutemov wrote:
> On Sat, Jul 21, 2018 at 02:13:50AM +0800, Yang Shi wrote:
>> Digging into the original review suggests the use_zero_page sysfs knob
>> was added to ease testing and to give users a way to mitigate
>> refcounting overhead.
>>
>> It has been a few years since the knob was first added, so I think we
>> can be confident that the feature is stable enough. And since commit
>> 6fcb52a56ff60 ("thp: reduce usage of huge zero page's atomic counter"),
>> the refcounting overhead has been reduced significantly.
>>
>> Beyond that, the knob defaults to 1 (enabled), and I suspect very few
>> people ever turn it off.
>>
>> So it no longer seems worth keeping this knob around.
> I don't think that keeping the knob around is a huge maintenance burden.
> And since it helped to work around a security bug relatively recently, I
> would rather keep it.

I agree with keeping it for a while to let that security bug cool down; 
however, if there are no users anymore, it seems pointless to keep a 
dead knob around.

>
Matthew Wilcox (Oracle) July 22, 2018, 3:51 a.m. UTC | #8
On Fri, Jul 20, 2018 at 02:05:52PM -0700, David Rientjes wrote:
> The huge zero page can be reclaimed under memory pressure and, if it is,
> it is allocated again with gfp flags that attempt memory compaction,
> which can become expensive.  If we are constantly under memory pressure,
> it gets freed and reallocated millions of times, always trying to
> compact memory both directly and by kicking kcompactd in the background.
> 
> It likely should also be per node.

Have you benchmarked making the non-huge zero page per-node?
David Rientjes July 23, 2018, 8:28 p.m. UTC | #9
On Sat, 21 Jul 2018, Matthew Wilcox wrote:

> > The huge zero page can be reclaimed under memory pressure and, if it is,
> > it is allocated again with gfp flags that attempt memory compaction,
> > which can become expensive.  If we are constantly under memory pressure,
> > it gets freed and reallocated millions of times, always trying to
> > compact memory both directly and by kicking kcompactd in the background.
> > 
> > It likely should also be per node.
> 
> Have you benchmarked making the non-huge zero page per-node?
> 

Not since we disable it :)  I will, though.  The more concerning issue 
for us, modulo CVE-2017-1000405, is the cpu cost of constantly directly 
compacting memory to allocate the hzp in real time after it has been 
reclaimed.  We've observed this happening tens or hundreds of thousands 
of times on some systems.  It will be 2MB per node on x86 if the data 
suggests we should make it NUMA aware; I don't think that cost is too 
high to leave it persistently available even under memory pressure if 
use_zero_page is enabled.
David Rientjes July 23, 2018, 8:31 p.m. UTC | #10
On Fri, 20 Jul 2018, Yang Shi wrote:

> I agree with keeping it for a while to let that security bug cool down;
> however, if there are no users anymore, it seems pointless to keep a
> dead knob around.
> 

It's not a dead knob.  We use it, and for reasons other than 
CVE-2017-1000405.  To mitigate the cost of constantly compacting memory 
to allocate it after it has been freed due to memory pressure, we can 
either continue to disable it, allow it to be persistently available, or 
use a new value for use_zero_page to specify that it should be 
persistently available.
David Rientjes July 23, 2018, 9:33 p.m. UTC | #11
On Mon, 23 Jul 2018, David Rientjes wrote:

> > > The huge zero page can be reclaimed under memory pressure and, if it is,
> > > it is allocated again with gfp flags that attempt memory compaction,
> > > which can become expensive.  If we are constantly under memory pressure,
> > > it gets freed and reallocated millions of times, always trying to
> > > compact memory both directly and by kicking kcompactd in the background.
> > > 
> > > It likely should also be per node.
> > 
> > Have you benchmarked making the non-huge zero page per-node?
> > 
> 
> Not since we disable it :)  I will, though.  The more concerning issue
> for us, modulo CVE-2017-1000405, is the cpu cost of constantly directly
> compacting memory to allocate the hzp in real time after it has been
> reclaimed.  We've observed this happening tens or hundreds of thousands
> of times on some systems.  It will be 2MB per node on x86 if the data
> suggests we should make it NUMA aware; I don't think that cost is too
> high to leave it persistently available even under memory pressure if
> use_zero_page is enabled.
> 

Measuring access latency to 4GB of memory on Naples I observe ~6.7% 
slower access latency intrasocket and ~14% slower intersocket.

use_zero_page is currently a simple thp flag, meaning it rejects writes 
where val != !!val, so perhaps it would be best to overload it with 
additional options?  I can imagine 0x2 defining persistent allocation, so 
that the hzp is not freed when the refcount goes to 0, and 0x4 defining 
whether the hzp should be per node.  Implementing persistent allocation 
fixes our concern with it, so I'd like to start there.  Comments?
Yang Shi July 23, 2018, 9:49 p.m. UTC | #12
On 7/23/18 2:33 PM, David Rientjes wrote:
> On Mon, 23 Jul 2018, David Rientjes wrote:
>
>>>> The huge zero page can be reclaimed under memory pressure and, if it is,
>>>> it is allocated again with gfp flags that attempt memory compaction,
>>>> which can become expensive.  If we are constantly under memory pressure,
>>>> it gets freed and reallocated millions of times, always trying to
>>>> compact memory both directly and by kicking kcompactd in the background.
>>>>
>>>> It likely should also be per node.
>>> Have you benchmarked making the non-huge zero page per-node?
>>>
>> Not since we disable it :)  I will, though.  The more concerning issue
>> for us, modulo CVE-2017-1000405, is the cpu cost of constantly directly
>> compacting memory to allocate the hzp in real time after it has been
>> reclaimed.  We've observed this happening tens or hundreds of thousands
>> of times on some systems.  It will be 2MB per node on x86 if the data
>> suggests we should make it NUMA aware; I don't think that cost is too
>> high to leave it persistently available even under memory pressure if
>> use_zero_page is enabled.
>>
> Measuring access latency to 4GB of memory on Naples I observe ~6.7%
> slower access latency intrasocket and ~14% slower intersocket.
>
> use_zero_page is currently a simple thp flag, meaning it rejects writes
> where val != !!val, so perhaps it would be best to overload it with
> additional options?  I can imagine 0x2 defining persistent allocation, so
> that the hzp is not freed when the refcount goes to 0, and 0x4 defining
> whether the hzp should be per node.  Implementing persistent allocation
> fixes our concern with it, so I'd like to start there.  Comments?

Sounds worth trying to me :-)  It might be worth making it persistent by 
default. Keeping 2MB of memory unreclaimable does not sound harmful for 
use cases that prefer THP.
Yang Shi July 23, 2018, 9:52 p.m. UTC | #13
On 7/23/18 1:31 PM, David Rientjes wrote:
> On Fri, 20 Jul 2018, Yang Shi wrote:
>
>> I agree with keeping it for a while to let that security bug cool down;
>> however, if there are no users anymore, it seems pointless to keep a
>> dead knob around.
>>
> It's not a dead knob.  We use it, and for reasons other than
> CVE-2017-1000405.  To mitigate the cost of constantly compacting memory
> to allocate it after it has been freed due to memory pressure, we can
> either continue to disable it, allow it to be persistently available, or
> use a new value for use_zero_page to specify that it should be
> persistently available.

My understanding is that the cost of memory compaction is *not* unique 
to the huge zero page, right? It is expected whenever there is memory 
pressure, even with the huge zero page disabled.
David Rientjes July 23, 2018, 11:14 p.m. UTC | #14
On Mon, 23 Jul 2018, Yang Shi wrote:

> > > I agree with keeping it for a while to let that security bug cool
> > > down; however, if there are no users anymore, it seems pointless to
> > > keep a dead knob around.
> > > 
> > It's not a dead knob.  We use it, and for reasons other than
> > CVE-2017-1000405.  To mitigate the cost of constantly compacting memory
> > to allocate it after it has been freed due to memory pressure, we can
> > either continue to disable it, allow it to be persistently available, or
> > use a new value for use_zero_page to specify that it should be
> > persistently available.
> 
> My understanding is that the cost of memory compaction is *not* unique
> to the huge zero page, right? It is expected whenever there is memory
> pressure, even with the huge zero page disabled.
> 

It's caused by fragmentation, not necessarily memory pressure.  We've 
disabled it because compacting for tens of thousands of huge zero pages in 
the background has a noticeable impact on cpu.  Additionally, if the hzp 
cannot be allocated at runtime it increases the rss of applications that 
map it, making it unpredictable.  Making it persistent, as I've been 
suggesting, fixes these issues.
Kirill A. Shutemov July 24, 2018, 9:08 a.m. UTC | #15
On Mon, Jul 23, 2018 at 02:33:08PM -0700, David Rientjes wrote:
> On Mon, 23 Jul 2018, David Rientjes wrote:
> 
> > > > The huge zero page can be reclaimed under memory pressure and, if it is,
> > > > it is allocated again with gfp flags that attempt memory compaction,
> > > > which can become expensive.  If we are constantly under memory pressure,
> > > > it gets freed and reallocated millions of times, always trying to
> > > > compact memory both directly and by kicking kcompactd in the background.
> > > > 
> > > > It likely should also be per node.
> > > 
> > > Have you benchmarked making the non-huge zero page per-node?
> > > 
> > 
> > Not since we disable it :)  I will, though.  The more concerning issue
> > for us, modulo CVE-2017-1000405, is the cpu cost of constantly directly
> > compacting memory to allocate the hzp in real time after it has been
> > reclaimed.  We've observed this happening tens or hundreds of thousands
> > of times on some systems.  It will be 2MB per node on x86 if the data
> > suggests we should make it NUMA aware; I don't think that cost is too
> > high to leave it persistently available even under memory pressure if
> > use_zero_page is enabled.
> > 
> 
> Measuring access latency to 4GB of memory on Naples I observe ~6.7% 
> slower access latency intrasocket and ~14% slower intersocket.
> 
> use_zero_page is currently a simple thp flag, meaning it rejects writes
> where val != !!val, so perhaps it would be best to overload it with
> additional options?  I can imagine 0x2 defining persistent allocation, so
> that the hzp is not freed when the refcount goes to 0, and 0x4 defining
> whether the hzp should be per node.  Implementing persistent allocation
> fixes our concern with it, so I'd like to start there.  Comments?

Why not separate files?
David Rientjes July 24, 2018, 8:32 p.m. UTC | #16
On Tue, 24 Jul 2018, Kirill A. Shutemov wrote:

> > use_zero_page is currently a simple thp flag, meaning it rejects writes
> > where val != !!val, so perhaps it would be best to overload it with
> > additional options?  I can imagine 0x2 defining persistent allocation, so
> > that the hzp is not freed when the refcount goes to 0, and 0x4 defining
> > whether the hzp should be per node.  Implementing persistent allocation
> > fixes our concern with it, so I'd like to start there.  Comments?
> 
> Why not separate files?
> 

That works as well.  I'll write a patch for persistent allocation first to 
address our most immediate need.

Patch

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 7ab93a8..d471ce8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -148,13 +148,6 @@  madvise
 never
 	should be self-explanatory.
 
-By default kernel tries to use huge zero page on read page fault to
-anonymous mapping. It's possible to disable huge zero page by writing 0
-or enable it back by writing 1::
-
-	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
-	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
-
 Some userspace (such as a test program, or an optimized memory allocation
 library) may want to know the size (in bytes) of a transparent hugepage::
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a1262..0ea7808 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -58,7 +58,6 @@  enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
-	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
@@ -116,9 +115,6 @@  static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 	return false;
 }
 
-#define transparent_hugepage_use_zero_page()				\
-	(transparent_hugepage_flags &					\
-	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
 #ifdef CONFIG_DEBUG_VM
 #define transparent_hugepage_debug_cow()				\
 	(transparent_hugepage_flags &					\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a..0d4cf87 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -54,8 +54,7 @@ 
 	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
 #endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
-	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
-	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
+	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
 
 static struct shrinker deferred_split_shrinker;
 
@@ -273,21 +272,6 @@  static ssize_t defrag_store(struct kobject *kobj,
 static struct kobj_attribute defrag_attr =
 	__ATTR(defrag, 0644, defrag_show, defrag_store);
 
-static ssize_t use_zero_page_show(struct kobject *kobj,
-		struct kobj_attribute *attr, char *buf)
-{
-	return single_hugepage_flag_show(kobj, attr, buf,
-				TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
-}
-static ssize_t use_zero_page_store(struct kobject *kobj,
-		struct kobj_attribute *attr, const char *buf, size_t count)
-{
-	return single_hugepage_flag_store(kobj, attr, buf, count,
-				 TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
-}
-static struct kobj_attribute use_zero_page_attr =
-	__ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store);
-
 static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 		struct kobj_attribute *attr, char *buf)
 {
@@ -317,7 +301,6 @@  static ssize_t debug_cow_store(struct kobject *kobj,
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
-	&use_zero_page_attr.attr,
 	&hpage_pmd_size_attr.attr,
 #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
 	&shmem_enabled_attr.attr,
@@ -677,8 +660,7 @@  int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
 		return VM_FAULT_OOM;
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
-			!mm_forbids_zeropage(vma->vm_mm) &&
-			transparent_hugepage_use_zero_page()) {
+			!mm_forbids_zeropage(vma->vm_mm)) {
 		pgtable_t pgtable;
 		struct page *zero_page;
 		bool set;