Message ID | 1547061285-100329-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
---|---
Series | mm: memcontrol: do memory reclaim when offlining
On Thu, Jan 10, 2019 at 03:14:40AM +0800, Yang Shi wrote:
> We have some usecases which create and remove memcgs very frequently,
> and the tasks in the memcg may just access the files which are unlikely
> accessed by anyone else. So, we prefer force_empty the memcg before
> rmdir'ing it to reclaim the page cache so that they don't get
> accumulated to incur unnecessary memory pressure. Since the memory
> pressure may incur direct reclaim to harm some latency sensitive
> applications.

We have kswapd for exactly this purpose. Can you lay out more details
on why that is not good enough, especially in conjunction with tuning
the watermark_scale_factor etc.?

We've been pretty adamant that users shouldn't use drop_caches for
performance for example, and that the need to do this usually is
indicative of a problem or suboptimal tuning in the VM subsystem.

How is this different?
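For reference, the kswapd watermark tuning mentioned here is normally done through the vm.watermark_scale_factor sysctl. A minimal sketch; the value shown is purely illustrative, and the right setting depends on the machine and workload:

    # Read the current setting; the default of 10 keeps a gap of roughly
    # 0.1% of node memory between the min/low/high watermarks.
    sysctl vm.watermark_scale_factor

    # Raise it so kswapd wakes up earlier and keeps reclaiming longer
    # before allocations fall into direct reclaim (200 = ~2%).
    sysctl -w vm.watermark_scale_factor=200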
On 1/9/19 11:32 AM, Johannes Weiner wrote:
> On Thu, Jan 10, 2019 at 03:14:40AM +0800, Yang Shi wrote:
>> We have some usecases which create and remove memcgs very frequently,
>> and the tasks in the memcg may just access the files which are unlikely
>> accessed by anyone else. So, we prefer force_empty the memcg before
>> rmdir'ing it to reclaim the page cache so that they don't get
>> accumulated to incur unnecessary memory pressure. Since the memory
>> pressure may incur direct reclaim to harm some latency sensitive
>> applications.
> We have kswapd for exactly this purpose. Can you lay out more details
> on why that is not good enough, especially in conjunction with tuning
> the watermark_scale_factor etc.?

watermark_scale_factor does help for some workloads in general. However,
in some of our workloads memcgs are created and start allocating memory
faster than kswapd can keep up with, and a tuning that works for one kind
of machine or workload may not work for others. We may also have
different kinds of workloads (for example, latency-sensitive and batch
jobs) running on the same machine, so it is hard to guarantee that all of
them behave well together by relying on kswapd and watermark_scale_factor
alone.

In addition, we know that the page cache access pattern is one-off for
some memcgs, and that those page caches are unlikely to be shared by
anyone else, so why not just drop them when the memcg is offlined?
Reclaiming those cold page caches earlier would also make memcg creation
more efficient in the long run.

> We've been pretty adamant that users shouldn't use drop_caches for
> performance for example, and that the need to do this usually is
> indicative of a problem or suboptimal tuning in the VM subsystem.
>
> How is this different?

IMHO, that depends on the usecases and workloads. As I mentioned above,
if we know that some page caches from some memcgs are referenced one-off
and unlikely shared, why keep them around to increase memory pressure?

Thanks,
Yang
On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
> As I mentioned above, if we know that some page caches from some memcgs
> are referenced one-off and unlikely shared, why keep them around to
> increase memory pressure?

It's just not clear to me that your scenarios are generic enough to
justify adding two interfaces that we have to maintain forever, and
that they couldn't be solved with existing mechanisms.

Please explain:

- Unmapped clean page cache isn't expensive to reclaim, certainly
  cheaper than the IO involved in new application startup. How could
  recycling clean cache be a prohibitive part of workload warmup?

- Why you cannot temporarily raise the kswapd watermarks right before
  an important application starts up (your answer was sorta handwavy)

- Why you cannot use madvise/fadvise when an application whose cache
  you won't reuse exits

- Why you couldn't set memory.high or memory.max to 0 after the
  application quits and before you call rmdir on the cgroup

Adding a permanent kernel interface is a serious measure. I think you
need to make a much better case for it, discuss why other options are
not practical, and show that this will be a generally useful thing for
cgroup users and not just a niche fix for very specific situations.
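The last alternative corresponds roughly to the following sequence on a cgroup v2 mount; the group name and mount point are illustrative, and this is a sketch of the suggestion rather than anything in the series:

    # After the workload in the group has exited, shrink the hard limit
    # to zero so the left-over page cache is reclaimed synchronously,
    # then remove the now-empty group.
    echo 0 > /sys/fs/cgroup/job1/memory.max
    rmdir /sys/fs/cgroup/job1

    # memory.high can be used instead of memory.max for a softer,
    # throttling-based variant of the same idea.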
On 1/9/19 1:23 PM, Johannes Weiner wrote:
> On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
>> As I mentioned above, if we know that some page caches from some memcgs
>> are referenced one-off and unlikely shared, why keep them around to
>> increase memory pressure?
> It's just not clear to me that your scenarios are generic enough to
> justify adding two interfaces that we have to maintain forever, and
> that they couldn't be solved with existing mechanisms.
>
> Please explain:
>
> - Unmapped clean page cache isn't expensive to reclaim, certainly
>   cheaper than the IO involved in new application startup. How could
>   recycling clean cache be a prohibitive part of workload warmup?

It is not about recycling. Those page caches might be referenced by the
memcg just once, and then nobody touches them until memory pressure is
hit; they may well not be accessed again any time soon.

> - Why you cannot temporarily raise the kswapd watermarks right before
>   an important application starts up (your answer was sorta handwavy)

We could, but the kswapd watermark is global. Boosting it may cause
kswapd to reclaim memory from memcgs that we want to keep untouched.
Although v2's low/min protection helps somewhat, reclaim from those
groups is still not prohibited in general, and v1 has no such protection
at all. force_empty or wipe_on_offline can be targeted at specific
memcgs that we know well enough to say it is safe to reclaim their
memory. IMHO, this gives better isolation.

> - Why you cannot use madvise/fadvise when an application whose cache
>   you won't reuse exits

Sure we can. But we can't guarantee that all applications use them
properly.

> - Why you couldn't set memory.high or memory.max to 0 after the
>   application quits and before you call rmdir on the cgroup

I recall I explained this in the review of the first version. Setting
memory.high or memory.max to 0 triggers direct reclaim, which may stall
the offlining of the memcg. But we have "restart the job under the same
name" logic in our usecase (I'm not quite sure why it is done this way).
Basically, a memcg with the exact same name is created right after the
old one is deleted, possibly with different limits or other settings,
and the creation has to wait until the rmdir is done.

> Adding a permanent kernel interface is a serious measure. I think you
> need to make a much better case for it, discuss why other options are
> not practical, and show that this will be a generally useful thing for
> cgroup users and not just a niche fix for very specific situations.

I do understand your concern and the maintenance cost of a permanent
kernel interface. I'm not quite sure whether this is generic enough, but
Michal Hocko did mention "It seems we have several people asking for
something like that already.", so it at least doesn't sound like "a
niche fix for very specific situations".

In my first submission I reused the force_empty interface to keep the
change less intrusive; at least it wasn't a new interface. Since several
people have been asking for something like this already, Michal
suggested a new knob instead of reusing force_empty.

Thanks,
Yang
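For context, the existing cgroup v1 knob that the first version of the series reused works roughly like this; the group name and mount point are illustrative:

    # cgroup v1: writing to memory.force_empty asks the kernel to reclaim
    # as many pages as possible from the (ideally already task-free)
    # group, typically right before removing it.
    echo 0 > /sys/fs/cgroup/memory/job1/memory.force_empty
    rmdir /sys/fs/cgroup/memory/job1

The proposed wipe_on_offline knob would move this reclaim work to the offline (rmdir) path instead of requiring an explicit write beforehand.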
On Wed, Jan 09, 2019 at 02:09:20PM -0800, Yang Shi wrote:
> On 1/9/19 1:23 PM, Johannes Weiner wrote:
> > - Unmapped clean page cache isn't expensive to reclaim, certainly
> >   cheaper than the IO involved in new application startup. How could
> >   recycling clean cache be a prohibitive part of workload warmup?
>
> It is not about recycling. Those page caches might be referenced by the
> memcg just once, and then nobody touches them until memory pressure is
> hit; they may well not be accessed again any time soon.

I meant recycling the page frames, not the cache in them. So the new
workload as it starts up needs to take those pages from the LRU list
instead of just the allocator freelist. While that's obviously not the
same cost, it's not clear why the difference would be prohibitive to
application startup, especially since app startup tends to be dominated
by things like IO to fault in executables etc.

> > - Why you couldn't set memory.high or memory.max to 0 after the
> >   application quits and before you call rmdir on the cgroup
>
> I recall I explained this in the review of the first version. Setting
> memory.high or memory.max to 0 triggers direct reclaim, which may stall
> the offlining of the memcg. But we have "restart the job under the same
> name" logic in our usecase (I'm not quite sure why it is done this way).
> Basically, a memcg with the exact same name is created right after the
> old one is deleted, possibly with different limits or other settings,
> and the creation has to wait until the rmdir is done.

This really needs a fix on your end. We cannot add new cgroup control
files because you cannot handle a delayed release in the cgroupfs
namespace while you're reclaiming associated memory. A simple serial
number would fix this.

Whether others have asked for this knob or not, these patches should
come with a solid case in the cover letter and changelogs that explain
why this ABI is necessary to solve a generic cgroup usecase. But it
sounds to me that setting the limit to 0 once the group is empty would
meet the functional requirement (use fork() if you don't want to wait)
of what you are trying to do.

I don't think the new interface bar is met here.
On 1/9/19 2:51 PM, Johannes Weiner wrote:
> On Wed, Jan 09, 2019 at 02:09:20PM -0800, Yang Shi wrote:
>> It is not about recycling. Those page caches might be referenced by the
>> memcg just once, and then nobody touches them until memory pressure is
>> hit; they may well not be accessed again any time soon.
> I meant recycling the page frames, not the cache in them. So the new
> workload as it starts up needs to take those pages from the LRU list
> instead of just the allocator freelist. While that's obviously not the
> same cost, it's not clear why the difference would be prohibitive to
> application startup, especially since app startup tends to be dominated
> by things like IO to fault in executables etc.

I'm a little bit confused here. Even if those page frames are not
reclaimed by force_empty, they will be reclaimed by kswapd later once
memory pressure hits. Some usecases may prefer to have them recycled
before kswapd kicks them off the LRU, but for other usecases avoiding
the memory pressure matters more than recycling the page frames.

> Whether others have asked for this knob or not, these patches should
> come with a solid case in the cover letter and changelogs that explain
> why this ABI is necessary to solve a generic cgroup usecase. But it
> sounds to me that setting the limit to 0 once the group is empty would
> meet the functional requirement (use fork() if you don't want to wait)
> of what you are trying to do.

Do you mean something like the below:

    echo 0 > cg1/memory.max &
    rmdir cg1 &
    mkdir cg1 &

But the latency is still there. Even though the memcg creation (mkdir)
itself can be made fast by using fork(), the latency would delay the
operations that follow, i.e. attaching tasks (echo PID >
cg1/cgroup.procs). When we calculate the time consumption of a container
deployment, we count from the mkdir until the job is actually launched.
So, without deferring force_empty to the offline kworker, we still
suffer from the latency. Am I missing anything?

Thanks,
Yang

> I don't think the new interface bar is met here.
On Wed, Jan 09, 2019 at 05:47:41PM -0800, Yang Shi wrote:
> I'm a little bit confused here. Even if those page frames are not
> reclaimed by force_empty, they will be reclaimed by kswapd later once
> memory pressure hits. Some usecases may prefer to have them recycled
> before kswapd kicks them off the LRU, but for other usecases avoiding
> the memory pressure matters more than recycling the page frames.

I understand that, but you're not providing data for the "may prefer"
part. You haven't shown that any proactive reclaim actually matters
and is a significant net improvement to a real workload in a real
hardware environment, and that the usecase is generic and widespread
enough to warrant an entirely new kernel interface.

> Do you mean something like the below:
>
>     echo 0 > cg1/memory.max &
>     rmdir cg1 &
>     mkdir cg1 &
>
> But the latency is still there. Even though the memcg creation (mkdir)
> itself can be made fast by using fork(), the latency would delay the
> operations that follow, i.e. attaching tasks (echo PID >
> cg1/cgroup.procs). When we calculate the time consumption of a container
> deployment, we count from the mkdir until the job is actually launched.

I'm saying that the same-name requirement is your problem, not the
kernel's. It's not unreasonable for the kernel to say that as long as
you want to do something with the cgroup, such as forcibly emptying
out the left-over cache, the group name stays in the namespace.

Requiring the exact same cgroup name for another instance of the same
job sounds like a bogus requirement. Surely you can use serial numbers
to denote subsequent invocations of the same job and handle that from
whatever job management software you're using:

    ( echo 0 > job12345-1/memory.max; rmdir job12345-1 ) &
    mkdir job12345-2

See, completely decoupled.
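A slightly fuller sketch of the decoupling being suggested, assuming the job manager chooses the instance suffix; all names and paths are illustrative, not part of the series:

    # Tear down the previous instance in the background: shrink its limit
    # so the left-over page cache is reclaimed, then remove the group.
    (
        echo 0 > /sys/fs/cgroup/job12345-1/memory.max
        rmdir /sys/fs/cgroup/job12345-1
    ) &

    # The next instance starts immediately under a new name; it does not
    # have to wait for the old group's reclaim or rmdir to finish.
    mkdir /sys/fs/cgroup/job12345-2
    echo "$JOB_PID" > /sys/fs/cgroup/job12345-2/cgroup.procs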
Not sure if you guys received my reply yesterday. I sent it twice, but
both attempts bounced back; maybe my company email server has some
problems, so I'm sending this from my personal email.

On Mon, Jan 14, 2019 at 11:01 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Wed, Jan 09, 2019 at 05:47:41PM -0800, Yang Shi wrote:
> > I'm a little bit confused here. Even if those page frames are not
> > reclaimed by force_empty, they will be reclaimed by kswapd later once
> > memory pressure hits. Some usecases may prefer to have them recycled
> > before kswapd kicks them off the LRU, but for other usecases avoiding
> > the memory pressure matters more than recycling the page frames.
>
> I understand that, but you're not providing data for the "may prefer"
> part. You haven't shown that any proactive reclaim actually matters
> and is a significant net improvement to a real workload in a real
> hardware environment, and that the usecase is generic and widespread
> enough to warrant an entirely new kernel interface.

Proactive reclaim prevents offline memcgs from accumulating. In our
production environment, we have seen the number of offline memcgs reach
over 450K (with just a few hundred online memcgs) in some cases. kswapd
is supposed to help remove offline memcgs when memory pressure hits, but
with such a huge number of them, kswapd takes a very long time to
iterate over all of them. Such a huge number of offline memcgs also
brings in latency problems wherever memcgs have to be iterated, e.g.
reading memory.stat, direct reclaim, OOM, etc.

So, we also use force_empty to keep the number of offline memcgs
reasonable. And Fam Zheng from Bytedance noticed that a delayed
force_empty gets things done more effectively. Please see the discussion
here: https://www.spinics.net/lists/cgroups/msg21259.html

Thanks,
Yang
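For reference, on cgroup v2 the accumulation of offline (dying) memcgs described above can be observed directly; this is a v2 facility (the thread partly concerns v1), and the mount point shown is the usual default:

    # Every cgroup v2 directory exposes a cgroup.stat file; the
    # nr_dying_descendants field counts groups that have been rmdir'ed
    # but are still pinned (e.g. by charged page cache) and not yet freed.
    cat /sys/fs/cgroup/cgroup.stat
    grep nr_dying_descendants /sys/fs/cgroup/cgroup.stat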