Message ID | 20210612000714.775825-1-willy@infradead.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm: Mark idle page tracking as BROKEN | expand |
On Fri, Jun 11, 2021 at 6:08 PM Matthew Wilcox (Oracle) <willy@infradead.org> wrote: > > In discussion with other MM developers around how idle page tracking > should be fixed for transparent huge pages, several expressed the opinion > that it should be removed as it is inefficient at accomplishing the > job that it is supposed to, and we have better mechanisms (eg uffd) for > accomplishing the same goals these days. > > Mark the feature as BROKEN for now and we can remove it entirely in a > few months if nobody complains. It is not enabled by Android, ChromeOS, > Debian, Fedora or SUSE. Red Hat enabled it with RHEL-8.1 and UEK followed > suit, but I have been unable to find why RHEL enabled it. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Yu Zhao <yuzhao@google.com> It had been broken on arm64 (corrupting user data) until commit 07509e10dcc7 ("arm64: pgtable: Fix pte_accessible()") came along. It may also break functions that call pte/pmd_mkold() but not test_and_clear_young(), e.g., it breaks MADV_FREE because page_referenced() will return true upon seeing PageYoung(), which in turn makes the page reclaim reject the madvise()'ed pages. > --- > mm/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 02d44e3420f5..311b50bb92ce 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -772,7 +772,7 @@ config DEFERRED_STRUCT_PAGE_INIT > > config IDLE_PAGE_TRACKING > bool "Enable idle page tracking" > - depends on SYSFS && MMU > + depends on SYSFS && MMU && BROKEN > select PAGE_EXTENSION if !64BIT > help > This feature allows to estimate the amount of user pages that have > -- > 2.30.2 > >
On Sat, Jun 12, 2021 at 01:07:14AM +0100, Matthew Wilcox (Oracle) wrote: > In discussion with other MM developers around how idle page tracking > should be fixed for transparent huge pages, several expressed the opinion > that it should be removed as it is inefficient at accomplishing the > job that it is supposed to, and we have better mechanisms (eg uffd) for > accomplishing the same goals these days. > > Mark the feature as BROKEN for now and we can remove it entirely in a > few months if nobody complains. It is not enabled by Android, ChromeOS, > Debian, Fedora or SUSE. Red Hat enabled it with RHEL-8.1 and UEK followed > suit, but I have been unable to find why RHEL enabled it. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
From: SeongJae Park <sjpark@amazon.de> Hello Matthew, On Sat, 12 Jun 2021 01:07:14 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote: > In discussion with other MM developers around how idle page tracking > should be fixed for transparent huge pages, several expressed the opinion > that it should be removed as it is inefficient at accomplishing the > job that it is supposed to, and we have better mechanisms (eg uffd) for > accomplishing the same goals these days. I think the THP case[1] is an intended behavior[2]. Could you please share a link to the discussion or a detailed summary if possible? > > Mark the feature as BROKEN for now and we can remove it entirely in a > few months if nobody complains. It is not enabled by Android, ChromeOS, > Debian, Fedora or SUSE. Red Hat enabled it with RHEL-8.1 and UEK followed > suit, but I have been unable to find why RHEL enabled it. Amazon Linux is also using it[3], for DAMON[4]. In detail, DAMON doesn't use Idle Page Tracking but PG_Idle in kernel space, to avoid interfering the reclaim logic[5]. So, I'm ok with removing the Idle Page Tracking user space interface, but gonna be opposed to removing PG_Idle. Nevertheless, the interference is not a real problem to DAMON, as DAMON is aimed to provide just a reasonable quality of the monitoring, rather than strict correctness. Hence, if people think the interference is also not a problem for the reclaim logic (after all, it does nothing unless sysadmin manually turns it on in runtime, and can be turned off at anytime), I would simply update DAMON code to don't use PG_Idle, add warnings in the doc, and wouldn't be opposed to this change. But, if not, I think this change will be a big problem to us. If the problem is a lack of dedicated maintainer for Idle Page Tracking or PG_Idle, I can volunteer. [1] https://lore.kernel.org/linux-mm/YMGKVmt8trMJ9kOP@casper.infradead.org/ [2] https://lore.kernel.org/linux-mm/20210614081610.16123-1-sjpark@amazon.de/ [3] https://github.com/amazonlinux/linux/blob/amazon-5.10.y/master/mm/damon/Kconfig#L19 [4] https://lore.kernel.org/linux-mm/20210520075629.4332-1-sj38.park@gmail.com/ [5] https://lore.kernel.org/linux-mm/20210525154427.30921-1-sjpark@amazon.de/ Thanks, SeongJae Park [...]
On Mon, 14 Jun 2021 13:49:26 +0000 SeongJae Park <sj38.park@gmail.com> wrote: > From: SeongJae Park <sjpark@amazon.de> > > Hello Matthew, > > On Sat, 12 Jun 2021 01:07:14 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote: > > > In discussion with other MM developers around how idle page tracking > > should be fixed for transparent huge pages, several expressed the opinion > > that it should be removed as it is inefficient at accomplishing the > > job that it is supposed to, and we have better mechanisms (eg uffd) for > > accomplishing the same goals these days. > > I think the THP case[1] is an intended behavior[2]. Could you please share a > link to the discussion or a detailed summary if possible? > > > > > Mark the feature as BROKEN for now and we can remove it entirely in a > > few months if nobody complains. It is not enabled by Android, ChromeOS, > > Debian, Fedora or SUSE. Red Hat enabled it with RHEL-8.1 and UEK followed > > suit, but I have been unable to find why RHEL enabled it. > > Amazon Linux is also using it[3], for DAMON[4]. In detail, DAMON doesn't use > Idle Page Tracking but PG_Idle in kernel space, to avoid interfering the > reclaim logic[5]. So, I'm ok with removing the Idle Page Tracking user space > interface, but gonna be opposed to removing PG_Idle. > > Nevertheless, the interference is not a real problem to DAMON, as DAMON is > aimed to provide just a reasonable quality of the monitoring, rather than > strict correctness. Hence, if people think the interference is also not a > problem for the reclaim logic (after all, it does nothing unless sysadmin > manually turns it on in runtime, and can be turned off at anytime), I would > simply update DAMON code to don't use PG_Idle, add warnings in the doc, and > wouldn't be opposed to this change. Couldn't the DAMON patchset simply re-add PG_Idle? Perhaps with a new name which is more appropriate to the DAMON usage?
From: SeongJae Park <sjpark@amazon.de> On Mon, 14 Jun 2021 19:04:20 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Mon, 14 Jun 2021 13:49:26 +0000 SeongJae Park <sj38.park@gmail.com> wrote: > > > From: SeongJae Park <sjpark@amazon.de> > > > > Hello Matthew, > > > > On Sat, 12 Jun 2021 01:07:14 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote: > > > > > In discussion with other MM developers around how idle page tracking > > > should be fixed for transparent huge pages, several expressed the opinion > > > that it should be removed as it is inefficient at accomplishing the > > > job that it is supposed to, and we have better mechanisms (eg uffd) for > > > accomplishing the same goals these days. > > > > I think the THP case[1] is an intended behavior[2]. Could you please share a > > link to the discussion or a detailed summary if possible? > > > > > > > > Mark the feature as BROKEN for now and we can remove it entirely in a > > > few months if nobody complains. It is not enabled by Android, ChromeOS, > > > Debian, Fedora or SUSE. Red Hat enabled it with RHEL-8.1 and UEK followed > > > suit, but I have been unable to find why RHEL enabled it. > > > > Amazon Linux is also using it[3], for DAMON[4]. In detail, DAMON doesn't use > > Idle Page Tracking but PG_Idle in kernel space, to avoid interfering the > > reclaim logic[5]. So, I'm ok with removing the Idle Page Tracking user space > > interface, but gonna be opposed to removing PG_Idle. > > > > Nevertheless, the interference is not a real problem to DAMON, as DAMON is > > aimed to provide just a reasonable quality of the monitoring, rather than > > strict correctness. Hence, if people think the interference is also not a > > problem for the reclaim logic (after all, it does nothing unless sysadmin > > manually turns it on in runtime, and can be turned off at anytime), I would > > simply update DAMON code to don't use PG_Idle, add warnings in the doc, and > > wouldn't be opposed to this change. > > Couldn't the DAMON patchset simply re-add PG_Idle? Perhaps with a new > name which is more appropriate to the DAMON usage? Good point, that makes sense. Thank you, Andrew. Nevertheless, I still not fully understand why Idle Page Tracking is considered broken, and therefore worrying if I will end up reintroducing PG_Idle with DAMON patchset in the broken form. It would still be great if a link to the dicussion could be provided. Thanks, SeongJae Park
On 12.06.21 02:07, Matthew Wilcox (Oracle) wrote: I might be missing something important, so some questions/comments > In discussion with other MM developers around how idle page tracking > should be fixed for transparent huge pages, several expressed the opinion > that it should be removed as it is inefficient at accomplishing the > job that it is supposed to, and we have better mechanisms (eg uffd) for > accomplishing the same goals these days. 1. A link to that discussion would be nice. I am missing some important details in this patch description. 2. "should be fixed for transparent huge pages" -- has it always been like this or has the behavior changed at some point? Do the semantics, and how the feature is getting used, clearly identify this case that needs fixing as something that really has to be fixed? Or was it always like that and actually expected to work like that ("semtantics")? For example, just like for softdirty tracking, an over-indication might be just fine. More extreme, I think idle tracking can actually deal with an under-indication, if it's actually used for minor performance improvements (just like with DAEMON). Again, this should be clarified in this patch description. Because I read "This information can be useful for estimating the workload’s working set size, which, in turn, can be taken into account when configuring the workload parameters, setting memory cgroup limits, or deciding where to place the workload within a compute cluster." -- which doesn't sound like it has to be 100% correct in all of the cases. And "Since the idle memory tracking feature is based on the memory reclaimer logic, it only works with pages that are on an LRU list, other pages are silently ignored. That means it will ignore a user memory page if it is isolated, but since there are usually not many of them, it should not affect the overall result noticeably. In order not to stall scanning of the idle page bitmap, locked pages may be skipped too", so there are already special cases to deal with. 3. I don't see how the "better mechanisms" can actually be used to accomplish the same goal. You state "uffd", however, I don't see a way to actually achieve the same goal using uffd. In MISSING mode, you would have to zap/discard page content to get an access notification. in WP mode you really cannot catch reads. > > Mark the feature as BROKEN for now and we can remove it entirely in a > few months if nobody complains. It is not enabled by Android, ChromeOS, > Debian, Fedora or SUSE. Red Hat enabled it with RHEL-8.1 and UEK followed > suit, but I have been unable to find why RHEL enabled it. My fellow RH people tell me that we enabled in RHEL7 on customer demand and consequently enabled it in RHEL8. To me it feels like we might be removing a feature that is working just as expected because the semantics might not be 100% clear to everybody involved. Of course, I might be wrong, that's why I'm asking. I'd actually vote for documenting what's happening, just like we do for locked pages and !LRU pages ... unless there is really something heavily broken.
On Tue, Jun 15, 2021 at 09:41:37AM +0200, David Hildenbrand wrote: > On 12.06.21 02:07, Matthew Wilcox (Oracle) wrote: > > I might be missing something important, so some questions/comments > > > In discussion with other MM developers around how idle page tracking > > should be fixed for transparent huge pages, several expressed the opinion > > that it should be removed as it is inefficient at accomplishing the > > job that it is supposed to, and we have better mechanisms (eg uffd) for > > accomplishing the same goals these days. > > 1. A link to that discussion would be nice. I am missing some important > details in this patch description. It was on the phone in our bi-weekly THP call. As I recall, those present were Kirill, Yu Zhao, William Kucharski, Zi Yan, Vlastimil Babka. Song Liu sent apologies, and I think Mike Kravetz had a conflict. > 2. "should be fixed for transparent huge pages" -- has it always been like > this or has the behavior changed at some point? Do the semantics, and how > the feature is getting used, clearly identify this case that needs fixing as > something that really has to be fixed? Or was it always like that and > actually expected to work like that ("semtantics")? I don't know. I asked the others on the call and the answer I got was essentially "Just delete it". I'm kind of hoping the others speak up.
On Tue, Jun 15, 2021 at 8:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Jun 15, 2021 at 09:41:37AM +0200, David Hildenbrand wrote: > > On 12.06.21 02:07, Matthew Wilcox (Oracle) wrote: > > > > I might be missing something important, so some questions/comments > > > > > In discussion with other MM developers around how idle page tracking > > > should be fixed for transparent huge pages, several expressed the opinion > > > that it should be removed as it is inefficient at accomplishing the > > > job that it is supposed to, and we have better mechanisms (eg uffd) for > > > accomplishing the same goals these days. > > > > 1. A link to that discussion would be nice. I am missing some important > > details in this patch description. > > It was on the phone in our bi-weekly THP call. As I recall, those > present were Kirill, Yu Zhao, William Kucharski, Zi Yan, Vlastimil Babka. > Song Liu sent apologies, and I think Mike Kravetz had a conflict. > > > 2. "should be fixed for transparent huge pages" -- has it always been like > > this or has the behavior changed at some point? Do the semantics, and how > > the feature is getting used, clearly identify this case that needs fixing as > > something that really has to be fixed? Or was it always like that and > > actually expected to work like that ("semtantics")? > > I don't know. I asked the others on the call and the answer I got was > essentially "Just delete it". > > I'm kind of hoping the others speak up. I listed a couple of things when acking this patch. Being broken is not a problem as long as there are users who care about it. What made me think such users may not exist is that nobody ever complained about those things until we stumbled on them -- I'm not insisting on deleting this feature, just clarifying why I thought so. The real question is how well we want to support it. My understanding, based on the previous discussion, is "as-is". That is we simply follow the current "semantics" when converting it to using folios.
On 6/16/21 8:22 AM, Yu Zhao wrote: > On Tue, Jun 15, 2021 at 8:55 PM Matthew Wilcox <willy@infradead.org> wrote: >> >> >> I don't know. I asked the others on the call and the answer I got was >> essentially "Just delete it". >> >> I'm kind of hoping the others speak up. > > I listed a couple of things when acking this patch. Being broken is > not a problem as long as there are users who care about it. What made > me think such users may not exist is that nobody ever complained about > those things until we stumbled on them -- I'm not insisting on > deleting this feature, just clarifying why I thought so. Similar feelings here. On the call it looked like the feature was abandoned by its creators, and it wasn't clear if the distros that had it enabled did so due to reasons that still apply for future versions. Sending the proposal and getting a feedback that there are users is one of the expected valid outcomes. > The real question is how well we want to support it. My understanding, > based on the previous discussion, is "as-is". That is we simply follow > the current "semantics" when converting it to using folios. >
On 16.06.21 10:36, Vlastimil Babka wrote: > On 6/16/21 8:22 AM, Yu Zhao wrote: >> On Tue, Jun 15, 2021 at 8:55 PM Matthew Wilcox <willy@infradead.org> wrote: >>> >>> >>> I don't know. I asked the others on the call and the answer I got was >>> essentially "Just delete it". >>> >>> I'm kind of hoping the others speak up. >> >> I listed a couple of things when acking this patch. Being broken is >> not a problem as long as there are users who care about it. What made >> me think such users may not exist is that nobody ever complained about >> those things until we stumbled on them -- I'm not insisting on >> deleting this feature, just clarifying why I thought so. > > Similar feelings here. On the call it looked like the feature was abandoned by > its creators, and it wasn't clear if the distros that had it enabled did so due > to reasons that still apply for future versions. Sending the proposal and > getting a feedback that there are users is one of the expected valid outcomes. For us (RH) it will be very interesting to know the exact things that are "suboptimal" (I'm avoiding the terminology "broken" here), so we can actually evaluate if this might affect customers and might be worth "improving". That's why I was asking for more information (either to document, or to optimize). Because even if we (upstream) might decide to abandon the feature at some point, distros will still have it enabled. Thanks all!
On Wed, Jun 16, 2021 at 2:43 AM David Hildenbrand <david@redhat.com> wrote: > > On 16.06.21 10:36, Vlastimil Babka wrote: > > On 6/16/21 8:22 AM, Yu Zhao wrote: > >> On Tue, Jun 15, 2021 at 8:55 PM Matthew Wilcox <willy@infradead.org> wrote: > >>> > >>> > >>> I don't know. I asked the others on the call and the answer I got was > >>> essentially "Just delete it". > >>> > >>> I'm kind of hoping the others speak up. > >> > >> I listed a couple of things when acking this patch. Being broken is > >> not a problem as long as there are users who care about it. What made > >> me think such users may not exist is that nobody ever complained about > >> those things until we stumbled on them -- I'm not insisting on > >> deleting this feature, just clarifying why I thought so. > > > > Similar feelings here. On the call it looked like the feature was abandoned by > > its creators, and it wasn't clear if the distros that had it enabled did so due > > to reasons that still apply for future versions. Sending the proposal and > > getting a feedback that there are users is one of the expected valid outcomes. > > For us (RH) it will be very interesting to know the exact things that > are "suboptimal" (I'm avoiding the terminology "broken" here), so we can > actually evaluate if this might affect customers and might be worth > "improving". I consider the examples I gave in my first email breakages -- others broke/break the idle page tracking -- and I think it's safe to assume they will continue to happen. If you are really looking for improvements, the page compaction has always been a good example. For the idle page tracking, with physical memory as little as 4GB, it needs to go thru one million PFNs, no matter how many compound or buddy pages there are. For THPs, it will try to get_page_unless_zero() on tail pages, which always fails. This is why we discussed it in the meeting. What can't be improved is the memory locality of PFNs. They are not grouped by memcgs or processes. Two PFNs next to each other can be from two processes with two sets of five-level page tables. The cache misses simply outweigh any potential benefits one might get from this feature, speaking as one of the customers.
On 16.06.21 21:23, Yu Zhao wrote: > On Wed, Jun 16, 2021 at 2:43 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 16.06.21 10:36, Vlastimil Babka wrote: >>> On 6/16/21 8:22 AM, Yu Zhao wrote: >>>> On Tue, Jun 15, 2021 at 8:55 PM Matthew Wilcox <willy@infradead.org> wrote: >>>>> >>>>> >>>>> I don't know. I asked the others on the call and the answer I got was >>>>> essentially "Just delete it". >>>>> >>>>> I'm kind of hoping the others speak up. >>>> >>>> I listed a couple of things when acking this patch. Being broken is >>>> not a problem as long as there are users who care about it. What made >>>> me think such users may not exist is that nobody ever complained about >>>> those things until we stumbled on them -- I'm not insisting on >>>> deleting this feature, just clarifying why I thought so. >>> >>> Similar feelings here. On the call it looked like the feature was abandoned by >>> its creators, and it wasn't clear if the distros that had it enabled did so due >>> to reasons that still apply for future versions. Sending the proposal and >>> getting a feedback that there are users is one of the expected valid outcomes. >> >> For us (RH) it will be very interesting to know the exact things that >> are "suboptimal" (I'm avoiding the terminology "broken" here), so we can >> actually evaluate if this might affect customers and might be worth >> "improving". > > I consider the examples I gave in my first email breakages -- others > broke/break the idle page tracking -- and I think it's safe to assume > they will continue to happen. Right, just as with any other feature that has very bad (no?) upstream test coverage and doesn't immediately blow up if not done 100% right. So to summarize (thanks for the input!): 1. It was really broken om arm64 before we had 07509e10dcc7 ("arm64: pgtable: Fix pte_accessible()") but should be working now. 2. Functions that call pte/pmd_mkold() but not test_and_clear_young() are shaky. 3. MADV_FREE'ed pages won't actually get freed and treated as if they were reaccessed, because page_referenced() will return true upon seeing PageYoung(). 4. Huge page handling is suboptimal and requires proper care from user space to get it right: https://lore.kernel.org/linux-mm/20210614081610.16123-1-sjpark@amazon.de/ I suspect daemon will have similar interest in optimizing 2 and 3, right? > > If you are really looking for improvements, the page compaction has > always been a good example. For the idle page tracking, with physical > memory as little as 4GB, it needs to go thru one million PFNs, no > matter how many compound or buddy pages there are. For THPs, it will > try to get_page_unless_zero() on tail pages, which always fails. This > is why we discussed it in the meeting. Right, this sounds sub-optimal. > > What can't be improved is the memory locality of PFNs. They are not > grouped by memcgs or processes. Two PFNs next to each other can be > from two processes with two sets of five-level page tables. The cache > misses simply outweigh any potential benefits one might get from this > feature, speaking as one of the customers. Right.
diff --git a/mm/Kconfig b/mm/Kconfig index 02d44e3420f5..311b50bb92ce 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -772,7 +772,7 @@ config DEFERRED_STRUCT_PAGE_INIT config IDLE_PAGE_TRACKING bool "Enable idle page tracking" - depends on SYSFS && MMU + depends on SYSFS && MMU && BROKEN select PAGE_EXTENSION if !64BIT help This feature allows to estimate the amount of user pages that have
In discussion with other MM developers around how idle page tracking should be fixed for transparent huge pages, several expressed the opinion that it should be removed as it is inefficient at accomplishing the job that it is supposed to, and we have better mechanisms (eg uffd) for accomplishing the same goals these days. Mark the feature as BROKEN for now and we can remove it entirely in a few months if nobody complains. It is not enabled by Android, ChromeOS, Debian, Fedora or SUSE. Red Hat enabled it with RHEL-8.1 and UEK followed suit, but I have been unable to find why RHEL enabled it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> --- mm/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)