Message ID: 20200520143712.GA749486@chrisdown.name
State: New, archived
Series: mm, memcg: reclaim more aggressively before high allocator throttling
On Wed 20-05-20 15:37:12, Chris Down wrote: > In Facebook production, we've seen cases where cgroups have been put > into allocator throttling even when they appear to have a lot of slack > file caches which should be trivially reclaimable. > > Looking more closely, the problem is that we only try a single cgroup > reclaim walk for each return to usermode before calculating whether or > not we should throttle. This single attempt doesn't produce enough > pressure to shrink for cgroups with a rapidly growing amount of file > caches prior to entering allocator throttling. > > As an example, we see that threads in an affected cgroup are stuck in > allocator throttling: > > # for i in $(cat cgroup.threads); do > > grep over_high "/proc/$i/stack" > > done > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > ...however, there is no I/O pressure reported by PSI, despite a lot of > slack file pages: > > # cat memory.pressure > some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903 > full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959 > # cat io.pressure > some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391 > full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640 > # grep _file memory.stat > inactive_file 1370939392 > active_file 661635072 > > This patch changes the behaviour to retry reclaim either until the > current task goes below the 10ms grace period, or we are making no > reclaim progress at all. In the latter case, we enter reclaim throttling > as before. Let me try to understand the actual problem. The high memory reclaim has a target which is proportional to the amount of charged memory. For most requests that would be SWAP_CLUSTER_MAX though (resp. N times that where N is the number of memcgs in excess up the hierarchy). I can see to be insufficient if the memcg is already in a large excess but if the reclaim can make a forward progress this should just work fine because each charging context should reclaim at least the contributed amount. Do you have any insight on why this doesn't work in your situation? Especially with such a large inactive file list I would be really surprised if the reclaim was not able to make a forward progress. Now to your patch. I do not like it much to be honest. MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in memory_high_write because the that is an interruptible context so there shouldn't be a good reason to give up after $FOO number of failed attempts. try_charge and memory_max_write are slightly different because we are invoking OOM killer based on the number of failed attempts. Also if the current high reclaim scaling is insufficient then we should be handling that via memcg_nr_pages_over_high rather than effectivelly unbound number of reclaim retries. That being said, the changelog should be more specific about the underlying problem and if the real problem is in the reclaim target then it should be handled by an increased but still fixed size. If the throttling is just too aggressive and puts task into sleep even when a reclaim has been performed then the throttling should be fixed.
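For readers following the mechanism being debated, here is a simplified sketch of the pre-patch flow (paraphrased from the memcg code of that era, not verbatim source): each charge that pushes a cgroup over memory.high adds the charged batch to a per-task counter, and a single reclaim walk plus the throttling decision happen on the way back to userspace:

    /* At charge time (try_charge()), once usage exceeds memory.high: */
    current->memcg_nr_pages_over_high += batch;   /* batch >= SWAP_CLUSTER_MAX */
    set_notify_resume(current);                   /* handle before returning to userspace */

    /* On return to userspace (mem_cgroup_handle_over_high()), before the patch: */
    nr_pages = current->memcg_nr_pages_over_high;
    current->memcg_nr_pages_over_high = 0;
    reclaim_high(memcg, nr_pages, GFP_KERNEL);    /* one reclaim walk, no retry */
    penalty_jiffies = calculate_high_delay(memcg, nr_pages);
    if (penalty_jiffies > HZ / 100)               /* owes more than the 10ms grace period */
            schedule_timeout_killable(penalty_jiffies);   /* allocator throttling */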
On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote: > On Wed 20-05-20 15:37:12, Chris Down wrote: > > In Facebook production, we've seen cases where cgroups have been put > > into allocator throttling even when they appear to have a lot of slack > > file caches which should be trivially reclaimable. > > > > Looking more closely, the problem is that we only try a single cgroup > > reclaim walk for each return to usermode before calculating whether or > > not we should throttle. This single attempt doesn't produce enough > > pressure to shrink for cgroups with a rapidly growing amount of file > > caches prior to entering allocator throttling. > > > > As an example, we see that threads in an affected cgroup are stuck in > > allocator throttling: > > > > # for i in $(cat cgroup.threads); do > > > grep over_high "/proc/$i/stack" > > > done > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > ...however, there is no I/O pressure reported by PSI, despite a lot of > > slack file pages: > > > > # cat memory.pressure > > some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903 > > full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959 > > # cat io.pressure > > some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391 > > full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640 > > # grep _file memory.stat > > inactive_file 1370939392 > > active_file 661635072 > > > > This patch changes the behaviour to retry reclaim either until the > > current task goes below the 10ms grace period, or we are making no > > reclaim progress at all. In the latter case, we enter reclaim throttling > > as before. > > Let me try to understand the actual problem. The high memory reclaim has > a target which is proportional to the amount of charged memory. For most > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where > N is the number of memcgs in excess up the hierarchy). I can see to be > insufficient if the memcg is already in a large excess but if the > reclaim can make a forward progress this should just work fine because > each charging context should reclaim at least the contributed amount. > > Do you have any insight on why this doesn't work in your situation? > Especially with such a large inactive file list I would be really > surprised if the reclaim was not able to make a forward progress. The workload we observed this in was downloading a large file and writing it to disk, which means that a good chunk of that memory was dirty. The first reclaim pass appears to make little progress because it runs into dirty pages. > Now to your patch. I do not like it much to be honest. > MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in > memory_high_write because the that is an interruptible context so there > shouldn't be a good reason to give up after $FOO number of failed > attempts. try_charge and memory_max_write are slightly different because > we are invoking OOM killer based on the number of failed attempts. The same is true for memory.high. We are invoking the userspace OOM killer when memory.high reclaim fails and we put tasks to sleep. The actual number of retries is arbitrary, correct. That's because OOM is arbitrary. It's a sampled state, and this is our sampling period. But it's not that important. The much more important thing is that we continue reclaiming as long as there is forward progress. 
How many times we retry when there is no forward progress is less critical - but it's nice to have the same cutoff for OOM situations everywhere. > Also if the current high reclaim scaling is insufficient then we should > be handling that via memcg_nr_pages_over_high rather than effectivelly > unbound number of reclaim retries. ???
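The shape of the proposed change, as described in the changelog quoted above (a sketch of the described behaviour, not the literal patch): retry the reclaim walk while it makes forward progress and the task still owes more than the 10ms grace period, and only fall back to throttling after a bounded number of no-progress attempts:

    int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;

    retry_reclaim:
            nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
            penalty_jiffies = calculate_high_delay(memcg, nr_pages);
            if (penalty_jiffies <= HZ / 100)
                    goto out;               /* back inside the 10ms grace period */
            if (nr_reclaimed || nr_retries--)
                    goto retry_reclaim;     /* forward progress: reclaim again */
            schedule_timeout_killable(penalty_jiffies);   /* no progress left: throttle */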
On Wed 20-05-20 12:51:31, Johannes Weiner wrote: > On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote: > > On Wed 20-05-20 15:37:12, Chris Down wrote: > > > In Facebook production, we've seen cases where cgroups have been put > > > into allocator throttling even when they appear to have a lot of slack > > > file caches which should be trivially reclaimable. > > > > > > Looking more closely, the problem is that we only try a single cgroup > > > reclaim walk for each return to usermode before calculating whether or > > > not we should throttle. This single attempt doesn't produce enough > > > pressure to shrink for cgroups with a rapidly growing amount of file > > > caches prior to entering allocator throttling. > > > > > > As an example, we see that threads in an affected cgroup are stuck in > > > allocator throttling: > > > > > > # for i in $(cat cgroup.threads); do > > > > grep over_high "/proc/$i/stack" > > > > done > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > > > ...however, there is no I/O pressure reported by PSI, despite a lot of > > > slack file pages: > > > > > > # cat memory.pressure > > > some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903 > > > full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959 > > > # cat io.pressure > > > some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391 > > > full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640 > > > # grep _file memory.stat > > > inactive_file 1370939392 > > > active_file 661635072 > > > > > > This patch changes the behaviour to retry reclaim either until the > > > current task goes below the 10ms grace period, or we are making no > > > reclaim progress at all. In the latter case, we enter reclaim throttling > > > as before. > > > > Let me try to understand the actual problem. The high memory reclaim has > > a target which is proportional to the amount of charged memory. For most > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where > > N is the number of memcgs in excess up the hierarchy). I can see to be > > insufficient if the memcg is already in a large excess but if the > > reclaim can make a forward progress this should just work fine because > > each charging context should reclaim at least the contributed amount. > > > > Do you have any insight on why this doesn't work in your situation? > > Especially with such a large inactive file list I would be really > > surprised if the reclaim was not able to make a forward progress. > > The workload we observed this in was downloading a large file and > writing it to disk, which means that a good chunk of that memory was > dirty. The first reclaim pass appears to make little progress because > it runs into dirty pages. OK, I see but why does the subsequent reclaim attempt makes a forward progress? Is this just because dirty pages are flushed in the mean time? Because if this is the case then the underlying problem seems to be that the reclaim should be throttled on dirty data. > > Now to your patch. I do not like it much to be honest. > > MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in > > memory_high_write because the that is an interruptible context so there > > shouldn't be a good reason to give up after $FOO number of failed > > attempts. try_charge and memory_max_write are slightly different because > > we are invoking OOM killer based on the number of failed attempts. 
> > The same is true for memory.high. We are invoking the userspace OOM > killer when memory.high reclaim fails and we put tasks to sleep. Right but there is no way to indicate that the reclaim has failed when writing to memory.high. > The actual number of retries is arbitrary, correct. That's because OOM > is arbitrary. It's a sampled state, and this is our sampling period. > > But it's not that important. The much more important thing is that we > continue reclaiming as long as there is forward progress. How many > times we retry when there is no forward progress is less critical - > but it's nice to have the same cutoff for OOM situations everywhere. > > > Also if the current high reclaim scaling is insufficient then we should > > be handling that via memcg_nr_pages_over_high rather than effectivelly > > unbound number of reclaim retries. > > ??? I am not sure what you are asking here.
On Wed, May 20, 2020 at 07:04:30PM +0200, Michal Hocko wrote: > On Wed 20-05-20 12:51:31, Johannes Weiner wrote: > > On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote: > > > On Wed 20-05-20 15:37:12, Chris Down wrote: > > > > In Facebook production, we've seen cases where cgroups have been put > > > > into allocator throttling even when they appear to have a lot of slack > > > > file caches which should be trivially reclaimable. > > > > > > > > Looking more closely, the problem is that we only try a single cgroup > > > > reclaim walk for each return to usermode before calculating whether or > > > > not we should throttle. This single attempt doesn't produce enough > > > > pressure to shrink for cgroups with a rapidly growing amount of file > > > > caches prior to entering allocator throttling. > > > > > > > > As an example, we see that threads in an affected cgroup are stuck in > > > > allocator throttling: > > > > > > > > # for i in $(cat cgroup.threads); do > > > > > grep over_high "/proc/$i/stack" > > > > > done > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > > > > > ...however, there is no I/O pressure reported by PSI, despite a lot of > > > > slack file pages: > > > > > > > > # cat memory.pressure > > > > some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903 > > > > full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959 > > > > # cat io.pressure > > > > some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391 > > > > full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640 > > > > # grep _file memory.stat > > > > inactive_file 1370939392 > > > > active_file 661635072 > > > > > > > > This patch changes the behaviour to retry reclaim either until the > > > > current task goes below the 10ms grace period, or we are making no > > > > reclaim progress at all. In the latter case, we enter reclaim throttling > > > > as before. > > > > > > Let me try to understand the actual problem. The high memory reclaim has > > > a target which is proportional to the amount of charged memory. For most > > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where > > > N is the number of memcgs in excess up the hierarchy). I can see to be > > > insufficient if the memcg is already in a large excess but if the > > > reclaim can make a forward progress this should just work fine because > > > each charging context should reclaim at least the contributed amount. > > > > > > Do you have any insight on why this doesn't work in your situation? > > > Especially with such a large inactive file list I would be really > > > surprised if the reclaim was not able to make a forward progress. > > > > The workload we observed this in was downloading a large file and > > writing it to disk, which means that a good chunk of that memory was > > dirty. The first reclaim pass appears to make little progress because > > it runs into dirty pages. > > OK, I see but why does the subsequent reclaim attempt makes a forward > progress? Is this just because dirty pages are flushed in the mean time? > Because if this is the case then the underlying problem seems to be that > the reclaim should be throttled on dirty data. That's what I assume. Chris wanted to do more reclaim tracing. But is this actually important beyond maybe curiosity? We retry every other reclaim invocation on forward progress. 
There is not a single naked call to try_to_free_pages(), and this here is the only exception where we don't loop on try_to_free_mem_cgroup_pages(). And there are very good, widely established reason for that: Under pressure, page reclaim can struggle to satisfy the reclaim goal and may return with less pages reclaimed than asked to. Under concurrency, a parallel allocation can invalidate the reclaim progress made by a thread. When either of these happen, the reclaiming thread should not throw its hands up and give up. It shouldn't invoke the kernel OOM killer, and it shouldn't go to sleep to trigger the userspace OOM killer. Reclaim hasn't failed as long as there is forward progress to be made. This isn't a daring concept, it's standard practice throughout the VM. I don't quite understand what makes this situation different. It's not *that* important which of the many known reasons for reclaim to not succeed on first try has prompted this patch, is it? > > > Now to your patch. I do not like it much to be honest. > > > MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in > > > memory_high_write because the that is an interruptible context so there > > > shouldn't be a good reason to give up after $FOO number of failed > > > attempts. try_charge and memory_max_write are slightly different because > > > we are invoking OOM killer based on the number of failed attempts. > > > > The same is true for memory.high. We are invoking the userspace OOM > > killer when memory.high reclaim fails and we put tasks to sleep. > > Right but there is no way to indicate that the reclaim has failed when > writing to memory.high. I'm less concerned about memory.high-writing than try_charge(). Although IMO it's nice to be consistent and make the same effort as we would everywhere else to meet the limit before returning from the write(). > > The actual number of retries is arbitrary, correct. That's because OOM > > is arbitrary. It's a sampled state, and this is our sampling period. > > > > But it's not that important. The much more important thing is that we > > continue reclaiming as long as there is forward progress. How many > > times we retry when there is no forward progress is less critical - > > but it's nice to have the same cutoff for OOM situations everywhere. > > > > > Also if the current high reclaim scaling is insufficient then we should > > > be handling that via memcg_nr_pages_over_high rather than effectivelly > > > unbound number of reclaim retries. > > > > ??? > > I am not sure what you are asking here. You expressed that some alternate solution B would be preferable, without any detail on why you think that is the case. And it's certainly not obvious or self-explanatory - in particular because Chris's proposal *is* obvious and self-explanatory, given how everybody else is already doing loops around page reclaim.
Michal Hocko writes:
>Let me try to understand the actual problem. The high memory reclaim has
>a target which is proportional to the amount of charged memory. For most
>requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
>N is the number of memcgs in excess up the hierarchy). I can see to be
>insufficient if the memcg is already in a large excess but if the
>reclaim can make a forward progress this should just work fine because
>each charging context should reclaim at least the contributed amount.
>
>Do you have any insight on why this doesn't work in your situation?
>Especially with such a large inactive file list I would be really
>surprised if the reclaim was not able to make a forward progress.

Reclaim can fail for any number of reasons, which is why we have retries
sprinkled all over for it already. It doesn't seem hard to believe that it
might just fail for transient reasons and drive us deeper into the hole as
a result.

In this case, a.) the application is producing tons of dirty pages, and b.)
we have really heavy systemwide I/O contention on the affected machines.
This high load is one of the reasons that direct and kswapd reclaim cannot
keep up, and thus nr_pages can become a number of orders of magnitude
larger than SWAP_CLUSTER_MAX. This is trivially reproducible on these
machines, it's not an edge case.

Putting a trace_printk("%d\n", __LINE__) at non-successful reclaim in
shrink_page_list shows that what's happening is always (and I really mean
always) the "dirty page and you're not kswapd" check, as expected:

    if (PageDirty(page)) {
            /*
             * Only kswapd can writeback filesystem pages
             * to avoid risk of stack overflow. But avoid
             * injecting inefficient single-page IO into
             * flusher writeback as much as possible: only
             * write pages when we've encountered many
             * dirty pages, and when we've already scanned
             * the rest of the LRU for clean pages and see
             * the same dirty pages again (PageReclaim).
             */
            if (page_is_file_lru(page) &&
                (!current_is_kswapd() || !PageReclaim(page) ||
                 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
                    /*
                     * Immediately reclaim when written back.
                     * Similar in principal to deactivate_page()
                     * except we already have the page isolated
                     * and know it's dirty
                     */
                    inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
                    SetPageReclaim(page);
                    goto activate_locked;
            }

>Now to your patch. I do not like it much to be honest.
>MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary and I neither like it in
>memory_high_write because the that is an interruptible context so there
>shouldn't be a good reason to give up after $FOO number of failed
>attempts. try_charge and memory_max_write are slightly different because
>we are invoking OOM killer based on the number of failed attempts.

As Johannes mentioned, the very intent of memory.high is to have it managed
using a userspace OOM killer, which monitors PSI. As such, I'm not sure this
distinction means much.
On Wed 20-05-20 21:26:50, Chris Down wrote: > Michal Hocko writes: > > Let me try to understand the actual problem. The high memory reclaim has > > a target which is proportional to the amount of charged memory. For most > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where > > N is the number of memcgs in excess up the hierarchy). I can see to be > > insufficient if the memcg is already in a large excess but if the > > reclaim can make a forward progress this should just work fine because > > each charging context should reclaim at least the contributed amount. > > > > Do you have any insight on why this doesn't work in your situation? > > Especially with such a large inactive file list I would be really > > surprised if the reclaim was not able to make a forward progress. > > Reclaim can fail for any number of reasons, which is why we have retries > sprinkled all over for it already. It doesn't seem hard to believe that it > might just fail for transient reasons and drive us deeper into the hole as a > result. Reclaim can certainly fail. It is however surprising to see it fail with such a large inactive lru list and reasonably small reclaim target. Having the full LRU of dirty pages sounds a bit unusual, IO throttling for v2 and explicit throttling during the reclaim for v1 should prevent from that. If the reclaim gives up too easily then this should be addressed at the reclaim level. > In this case, a.) the application is producing tons of dirty pages, and b.) > we have really heavy systemwide I/O contention on the affected machines. > This high load is one of the reasons that direct and kswapd reclaim cannot > keep up, and thus nr_pages can become a number of orders of magnitude larger > than SWAP_CLUSTER_MAX. This is trivially reproducible on these machines, > it's not an edge case. Please elaborate some more. memcg_nr_pages_over_high shouldn't really depend on the system wide activity. It should scale with the requested charges. So yes it can get large for something like a large read/write which does a lot of allocations in a single syscall before returning to the userspace. But ok, let's say that the reclaim target is large and then a single reclaim attempt might fail. Then I am wondering why your patch is not really targetting to reclaim memcg_nr_pages_over_high pages and instead push for reclaim down to the high limit. The main problem I see with that approach is that the loop could easily lead to reclaim unfairness when a heavy producer which doesn't leave the kernel (e.g. a large read/write call) can keep a different task doing all the reclaim work. The loop is effectivelly unbound when there is a reclaim progress and so the return to the userspace is by no means proportional to the requested memory/charge.
On Wed 20-05-20 13:51:35, Johannes Weiner wrote: > On Wed, May 20, 2020 at 07:04:30PM +0200, Michal Hocko wrote: > > On Wed 20-05-20 12:51:31, Johannes Weiner wrote: > > > On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote: > > > > On Wed 20-05-20 15:37:12, Chris Down wrote: > > > > > In Facebook production, we've seen cases where cgroups have been put > > > > > into allocator throttling even when they appear to have a lot of slack > > > > > file caches which should be trivially reclaimable. > > > > > > > > > > Looking more closely, the problem is that we only try a single cgroup > > > > > reclaim walk for each return to usermode before calculating whether or > > > > > not we should throttle. This single attempt doesn't produce enough > > > > > pressure to shrink for cgroups with a rapidly growing amount of file > > > > > caches prior to entering allocator throttling. > > > > > > > > > > As an example, we see that threads in an affected cgroup are stuck in > > > > > allocator throttling: > > > > > > > > > > # for i in $(cat cgroup.threads); do > > > > > > grep over_high "/proc/$i/stack" > > > > > > done > > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > > > > > > > ...however, there is no I/O pressure reported by PSI, despite a lot of > > > > > slack file pages: > > > > > > > > > > # cat memory.pressure > > > > > some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903 > > > > > full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959 > > > > > # cat io.pressure > > > > > some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391 > > > > > full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640 > > > > > # grep _file memory.stat > > > > > inactive_file 1370939392 > > > > > active_file 661635072 > > > > > > > > > > This patch changes the behaviour to retry reclaim either until the > > > > > current task goes below the 10ms grace period, or we are making no > > > > > reclaim progress at all. In the latter case, we enter reclaim throttling > > > > > as before. > > > > > > > > Let me try to understand the actual problem. The high memory reclaim has > > > > a target which is proportional to the amount of charged memory. For most > > > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where > > > > N is the number of memcgs in excess up the hierarchy). I can see to be > > > > insufficient if the memcg is already in a large excess but if the > > > > reclaim can make a forward progress this should just work fine because > > > > each charging context should reclaim at least the contributed amount. > > > > > > > > Do you have any insight on why this doesn't work in your situation? > > > > Especially with such a large inactive file list I would be really > > > > surprised if the reclaim was not able to make a forward progress. > > > > > > The workload we observed this in was downloading a large file and > > > writing it to disk, which means that a good chunk of that memory was > > > dirty. The first reclaim pass appears to make little progress because > > > it runs into dirty pages. > > > > OK, I see but why does the subsequent reclaim attempt makes a forward > > progress? Is this just because dirty pages are flushed in the mean time? > > Because if this is the case then the underlying problem seems to be that > > the reclaim should be throttled on dirty data. > > That's what I assume. Chris wanted to do more reclaim tracing. 
But is > this actually important beyond maybe curiosity? Yes, because it might show that there is a deeper problem. Having an extremely large file list full of dirty data and pre-mature failure for the reclaim sounds like a problem that is worth looking into closely. > We retry every other reclaim invocation on forward progress. There is > not a single naked call to try_to_free_pages(), and this here is the > only exception where we don't loop on try_to_free_mem_cgroup_pages(). I am not saying the looping over try_to_free_pages is wrong. I do care about the final reclaim target. That shouldn't be arbitrary. We have established a target which is proportional to the requested amount of memory. And there is a good reason for that. If any task tries to reclaim down to the high limit then this might lead to a large unfairness when heavy producers piggy back on the active reclaimer(s). I wouldn't mind to loop over try_to_free_pages to meet the requested memcg_nr_pages_over_high target. [...] > > > > Also if the current high reclaim scaling is insufficient then we should > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly > > > > unbound number of reclaim retries. > > > > > > ??? > > > > I am not sure what you are asking here. > > You expressed that some alternate solution B would be preferable, > without any detail on why you think that is the case. > > And it's certainly not obvious or self-explanatory - in particular > because Chris's proposal *is* obvious and self-explanatory, given how > everybody else is already doing loops around page reclaim. Sorry, I could have been less cryptic. I hope the above and my response to Chris goes into more details why I do not like this proposal and what is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high target. If the current calculation of the target is unsufficient - e.g. in situations where the high limit excess is very large then this should be reflected in memcg_nr_pages_over_high. Is it more clear?
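To make the disagreement concrete, the two positions differ mainly in the loop's stop condition rather than in whether to loop at all (a sketch only, using the names already mentioned in the thread):

    /* Patch as posted: keep reclaiming until this task's penalty drops
     * below the 10ms grace period, or reclaim stops making progress. */
    while (calculate_high_delay(memcg, nr_pages) > HZ / 100) {
            if (!reclaim_high(memcg, nr_pages, GFP_KERNEL) && !nr_retries--)
                    break;          /* no forward progress: throttle instead */
    }

    /* Alternative being argued for: keep reclaiming only until this task's
     * own contribution (memcg_nr_pages_over_high) has been met. */
    while (target > 0) {
            reclaimed = try_to_free_mem_cgroup_pages(memcg, target,
                                                     GFP_KERNEL, true);
            if (!reclaimed && !nr_retries--)
                    break;
            target -= min(target, reclaimed);
    }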
Michal Hocko writes: >On Wed 20-05-20 21:26:50, Chris Down wrote: >> Michal Hocko writes: >> > Let me try to understand the actual problem. The high memory reclaim has >> > a target which is proportional to the amount of charged memory. For most >> > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where >> > N is the number of memcgs in excess up the hierarchy). I can see to be >> > insufficient if the memcg is already in a large excess but if the >> > reclaim can make a forward progress this should just work fine because >> > each charging context should reclaim at least the contributed amount. >> > >> > Do you have any insight on why this doesn't work in your situation? >> > Especially with such a large inactive file list I would be really >> > surprised if the reclaim was not able to make a forward progress. >> >> Reclaim can fail for any number of reasons, which is why we have retries >> sprinkled all over for it already. It doesn't seem hard to believe that it >> might just fail for transient reasons and drive us deeper into the hole as a >> result. > >Reclaim can certainly fail. It is however surprising to see it fail with >such a large inactive lru list and reasonably small reclaim target. Why do you think the reclaim target is small? In the case of generating tons of dirty pages, current->memcg_nr_pages_over_high can grow to be huge (on the order of several tens of megabytes or more). >Having the full LRU of dirty pages sounds a bit unusual, IO throttling >for v2 and explicit throttling during the reclaim for v1 should prevent >from that. If the reclaim gives up too easily then this should be >addressed at the reclaim level. I'm not sure I agree. Reclaim knows what you asked it to do: reclaim N pages, but what to do about the situation when it fails to satisfy that is a job for the caller. In this case, we are willing to even tolerate a little bit of overage up to the 10ms throttle threshold. In other cases, we want to do other checks first before retrying, because the tradeoffs are different. Putting all of this inside the reclaim logic seems unwieldy. >> In this case, a.) the application is producing tons of dirty pages, and b.) >> we have really heavy systemwide I/O contention on the affected machines. >> This high load is one of the reasons that direct and kswapd reclaim cannot >> keep up, and thus nr_pages can become a number of orders of magnitude larger >> than SWAP_CLUSTER_MAX. This is trivially reproducible on these machines, >> it's not an edge case. > >Please elaborate some more. memcg_nr_pages_over_high shouldn't really >depend on the system wide activity. It should scale with the requested >charges. So yes it can get large for something like a large read/write >which does a lot of allocations in a single syscall before returning to >the userspace. It can also get large if a number of subsequent reclaim attempts are making progress, but not satisfying demand fully, as is happening here. As a facetious example, even if we request N and reclaim can satisfy N-1 each time, eventually those single pages can grow to become a non-trivial size. >But ok, let's say that the reclaim target is large and then a single >reclaim attempt might fail. Then I am wondering why your patch is not >really targetting to reclaim memcg_nr_pages_over_high pages and instead >push for reclaim down to the high limit. > >The main problem I see with that approach is that the loop could easily >lead to reclaim unfairness when a heavy producer which doesn't leave the >kernel (e.g. 
a large read/write call) can keep a different task doing
>all the reclaim work. The loop is effectivelly unbound when there is a
>reclaim progress and so the return to the userspace is by no means
>proportional to the requested memory/charge.

It's not unbound when there is reclaim progress, it stops when we are
within the memory.high throttling grace period. Right after reclaim, we
check if penalty_jiffies is less than 10ms, and abort any further reclaim
or allocator throttling:

    retry_reclaim:
            nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);

            /*
             * memory.high is breached and reclaim is unable to keep up. Throttle
             * allocators proactively to slow down excessive growth.
             */
            penalty_jiffies = calculate_high_delay(memcg, nr_pages);

            /*
             * Don't sleep if the amount of jiffies this memcg owes us is so low
             * that it's not even worth doing, in an attempt to be nice to those who
             * go only a small amount over their memory.high value and maybe haven't
             * been aggressively reclaimed enough yet.
             */
            if (penalty_jiffies <= HZ / 100)
                    goto out;

Regardless, you're pushing for different reclaim semantics for memory.high
than memory.max here, which requires evidence that the current approach
taken for memory.max is wrong or causing issues. And sure, you can say that
that's because in memory.max's case we would have a memcg OOM, but again,
that's not really different from how memory.high is supposed to work: with a
userspace OOM killer monitoring it and producing OOM kills as necessary.
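For context on the throttling figures that come up later in the thread (the 10ms grace period, the up-to-2-second sleep, and throttling "proportional to the overage"), a structural sketch of the penalty side of mem_cgroup_handle_over_high() as I understand the code of that era (not verbatim source):

    penalty_jiffies = calculate_high_delay(memcg, nr_pages);
    /* - grows with how far the cgroup is above memory.high (a quadratic ramp) */
    /* - scaled by nr_pages / MEMCG_CHARGE_BATCH, i.e. this task's own share   */
    /* - clamped to at most 2 * HZ (2 seconds) per return to userspace         */
    if (penalty_jiffies <= HZ / 100)
            goto out;               /* less than 10ms owed: don't bother sleeping */
    schedule_timeout_killable(penalty_jiffies);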
On Thu 21-05-20 12:27:11, Chris Down wrote: > Michal Hocko writes: > > On Wed 20-05-20 21:26:50, Chris Down wrote: > > > Michal Hocko writes: > > > > Let me try to understand the actual problem. The high memory reclaim has > > > > a target which is proportional to the amount of charged memory. For most > > > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where > > > > N is the number of memcgs in excess up the hierarchy). I can see to be > > > > insufficient if the memcg is already in a large excess but if the > > > > reclaim can make a forward progress this should just work fine because > > > > each charging context should reclaim at least the contributed amount. > > > > > > > > Do you have any insight on why this doesn't work in your situation? > > > > Especially with such a large inactive file list I would be really > > > > surprised if the reclaim was not able to make a forward progress. > > > > > > Reclaim can fail for any number of reasons, which is why we have retries > > > sprinkled all over for it already. It doesn't seem hard to believe that it > > > might just fail for transient reasons and drive us deeper into the hole as a > > > result. > > > > Reclaim can certainly fail. It is however surprising to see it fail with > > such a large inactive lru list and reasonably small reclaim target. > > Why do you think the reclaim target is small? In the case of generating tons > of dirty pages, current->memcg_nr_pages_over_high can grow to be huge (on > the order of several tens of megabytes or more). Because from my experience there are not tons of charges inside one syscall usually. Yeah, some syscalls can generate a lot of them but that shouldn't be a majority. > > Having the full LRU of dirty pages sounds a bit unusual, IO throttling > > for v2 and explicit throttling during the reclaim for v1 should prevent > > from that. If the reclaim gives up too easily then this should be > > addressed at the reclaim level. > > I'm not sure I agree. Reclaim knows what you asked it to do: reclaim N > pages, but what to do about the situation when it fails to satisfy that is a > job for the caller. In this case, we are willing to even tolerate a little > bit of overage up to the 10ms throttle threshold. In other cases, we want to > do other checks first before retrying, because the tradeoffs are different. > Putting all of this inside the reclaim logic seems unwieldy. That is not what I meant. We do have some throttling inside the reclaim because failing reclaim too quickly can easily lead to pre mature OOMs. If that doesn't work then we should have a look why. E.g. it is quite unexpected to have large LRU full of dirty pages because this suggests that dirty throttling doesn't work properly. > > The main problem I see with that approach is that the loop could easily > > lead to reclaim unfairness when a heavy producer which doesn't leave the > > kernel (e.g. a large read/write call) can keep a different task doing > > all the reclaim work. The loop is effectivelly unbound when there is a > > reclaim progress and so the return to the userspace is by no means > > proportional to the requested memory/charge. > > It's not unbound when there is reclaim progress, it stops when we are within > the memory.high throttling grace period. Right after reclaim, we check if > penalty_jiffies is less than 10ms, and abort and further reclaim or > allocator throttling: Just imagine that you have parallel producers increasing the high limit excess while somebody reclaims those. 
Sure in practice the loop will be bounded but the reclaimer might perform much more work on behalf of other tasks.
(I'll leave the dirty throttling discussion to Johannes, because I'm not so familiar with that code or its history.) Michal Hocko writes: >> > The main problem I see with that approach is that the loop could easily >> > lead to reclaim unfairness when a heavy producer which doesn't leave the >> > kernel (e.g. a large read/write call) can keep a different task doing >> > all the reclaim work. The loop is effectivelly unbound when there is a >> > reclaim progress and so the return to the userspace is by no means >> > proportional to the requested memory/charge. >> >> It's not unbound when there is reclaim progress, it stops when we are within >> the memory.high throttling grace period. Right after reclaim, we check if >> penalty_jiffies is less than 10ms, and abort and further reclaim or >> allocator throttling: > >Just imagine that you have parallel producers increasing the high limit >excess while somebody reclaims those. Sure in practice the loop will be >bounded but the reclaimer might perform much more work on behalf of >other tasks. A cgroup is a unit and breaking it down into "reclaim fairness" for individual tasks like this seems suspect to me. For example, if one task in a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup is going to be penalised by allocator throttling as a result, even if they aren't "responsible" for that reclaim. So the options here are as follows when a cgroup is over memory.high and a single reclaim isn't enough: 1. Decline further reclaim. Instead, throttle for up to 2 seconds. 2. Keep on reclaiming. Only throttle if we can't get back under memory.high. The outcome of your suggestion to decline further reclaim is case #1, which is significantly more practically "unfair" to that task. Throttling is extremely disruptive to tasks and should be a last resort when we've exhausted all other practical options. It shouldn't be something you get just because you didn't try to reclaim hard enough.
Chris Down writes: >A cgroup is a unit and breaking it down into "reclaim fairness" for >individual tasks like this seems suspect to me. For example, if one >task in a cgroup is leaking unreclaimable memory like crazy, everyone >in that cgroup is going to be penalised by allocator throttling as a >result, even if they aren't "responsible" for that reclaim. s/for that reclaim/for that overage/
On Thu 21-05-20 12:27:11, Chris Down wrote: [...] > Regardless, you're pushing for different reclaim semantics for memory.high > than memory.max here, which requires evidence that the current approach > taken for memory.max is wrong or causing issues. Sorry, I have skipped over this part. Memory high limit reclaim has historically acted as a best effort action to throttle the allocation/charge pace. This would work both if the implementation simply tried to reclaim down to the high limit or if the reclaim is proportional to the memory consumption by a specific consumer. We do the later because it is much easier to establish fairness for. If you want to change that you somehow have to deal with the fairness problem. And yes, we do not guarantee any fairness for the hard limit or direct reclaim in general but that behavior is generally problematic and there should be really strong arguments to move high limit reclaim that direction IMHO.
On Thu 21-05-20 13:23:27, Chris Down wrote: > (I'll leave the dirty throttling discussion to Johannes, because I'm not so > familiar with that code or its history.) > > Michal Hocko writes: > > > > The main problem I see with that approach is that the loop could easily > > > > lead to reclaim unfairness when a heavy producer which doesn't leave the > > > > kernel (e.g. a large read/write call) can keep a different task doing > > > > all the reclaim work. The loop is effectivelly unbound when there is a > > > > reclaim progress and so the return to the userspace is by no means > > > > proportional to the requested memory/charge. > > > > > > It's not unbound when there is reclaim progress, it stops when we are within > > > the memory.high throttling grace period. Right after reclaim, we check if > > > penalty_jiffies is less than 10ms, and abort and further reclaim or > > > allocator throttling: > > > > Just imagine that you have parallel producers increasing the high limit > > excess while somebody reclaims those. Sure in practice the loop will be > > bounded but the reclaimer might perform much more work on behalf of > > other tasks. > > A cgroup is a unit and breaking it down into "reclaim fairness" for > individual tasks like this seems suspect to me. For example, if one task in > a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup > is going to be penalised by allocator throttling as a result, even if they > aren't "responsible" for that reclaim. You are right, but that doesn't mean that it is desirable that some tasks would be throttled unexpectedly too long because of the other's activity. We already have that behavior for the direct reclaim and I have to say I really hate it and had to spend a lot of time debugging latency issues. Our excuse has been that the system is struggling at that time so any quality of service is simply out of picture. I do not think the same argument can be applied to memory.high which doesn't really represent a mark when the memcg is struggling so hard to drop any signs of fairness on the floor. > So the options here are as follows when a cgroup is over memory.high and a > single reclaim isn't enough: > > 1. Decline further reclaim. Instead, throttle for up to 2 seconds. > 2. Keep on reclaiming. Only throttle if we can't get back under memory.high. > > The outcome of your suggestion to decline further reclaim is case #1, which > is significantly more practically "unfair" to that task. Throttling is > extremely disruptive to tasks and should be a last resort when we've > exhausted all other practical options. It shouldn't be something you get > just because you didn't try to reclaim hard enough. I believe I have asked in other email in this thread. Could you explain why enforcint the requested target (memcg_nr_pages_over_high) is insufficient for the problem you are dealing with? Because that would make sense for large targets to me while it would keep relatively reasonable semantic of the throttling - aka proportional to the memory demand rather than the excess.
Michal Hocko writes: >> A cgroup is a unit and breaking it down into "reclaim fairness" for >> individual tasks like this seems suspect to me. For example, if one task in >> a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup >> is going to be penalised by allocator throttling as a result, even if they >> aren't "responsible" for that reclaim. > >You are right, but that doesn't mean that it is desirable that some >tasks would be throttled unexpectedly too long because of the other's activity. Are you really talking about throttling, or reclaim? If throttling, tasks are already throttled proportionally to how much this allocation is contributing to the overage in calculate_high_delay. If you're talking about reclaim, trying to reason about whether the overage is the result of some other task in this cgroup or the task that's allocating right now is something that we already know doesn't work well (eg. global OOM). >> So the options here are as follows when a cgroup is over memory.high and a >> single reclaim isn't enough: >> >> 1. Decline further reclaim. Instead, throttle for up to 2 seconds. >> 2. Keep on reclaiming. Only throttle if we can't get back under memory.high. >> >> The outcome of your suggestion to decline further reclaim is case #1, which >> is significantly more practically "unfair" to that task. Throttling is >> extremely disruptive to tasks and should be a last resort when we've >> exhausted all other practical options. It shouldn't be something you get >> just because you didn't try to reclaim hard enough. > >I believe I have asked in other email in this thread. Could you explain >why enforcint the requested target (memcg_nr_pages_over_high) is >insufficient for the problem you are dealing with? Because that would >make sense for large targets to me while it would keep relatively >reasonable semantic of the throttling - aka proportional to the memory >demand rather than the excess. memcg_nr_pages_over_high is related to the charge size. As such, if you're way over memory.high as a result of transient reclaim failures, but the majority of your charges are small, it's going to hard to make meaningful progress: 1. Most nr_pages will be MEMCG_CHARGE_BATCH, which is not enough to help; 2. Large allocations will only get a single reclaim attempt to succeed. As such, in many cases we're either doomed to successfully reclaim a paltry amount of pages, or fail to reclaim a lot of pages. Asking try_to_free_pages() to deal with those huge allocations is generally not reasonable, regardless of the specifics of why it doesn't work in this case.
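A rough worked example of the arithmetic behind point 1, runnable in userspace (the numbers are purely illustrative, assuming 4KiB pages and a 32-page MEMCG_CHARGE_BATCH): if transient reclaim failures have already let a cgroup drift tens of megabytes over memory.high, a task doing ordinary batch-sized charges only asks reclaim for ~32 pages per return to userspace, so even fully successful reclaim of that target digs the cgroup out very slowly:

    #include <stdio.h>

    int main(void)
    {
            const long page_size = 4096;            /* illustrative 4KiB pages */
            const long charge_batch = 32;           /* MEMCG_CHARGE_BATCH, in pages */
            long overage = 40L * 1024 * 1024 / page_size;   /* 40MiB over memory.high */
            long returns = 0;

            /* Each return to userspace reclaims only this task's own small
             * target, even when that reclaim succeeds completely. */
            while (overage > 0) {
                    overage -= charge_batch;
                    returns++;
            }
            printf("returns to userspace needed: %ld\n", returns);  /* prints 320 */
            return 0;
    }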
Chris Down writes: >>I believe I have asked in other email in this thread. Could you explain >>why enforcint the requested target (memcg_nr_pages_over_high) is >>insufficient for the problem you are dealing with? Because that would >>make sense for large targets to me while it would keep relatively >>reasonable semantic of the throttling - aka proportional to the memory >>demand rather than the excess. > >memcg_nr_pages_over_high is related to the charge size. As such, if >you're way over memory.high as a result of transient reclaim failures, >but the majority of your charges are small, it's going to hard to make >meaningful progress: > >1. Most nr_pages will be MEMCG_CHARGE_BATCH, which is not enough to help; >2. Large allocations will only get a single reclaim attempt to succeed. > >As such, in many cases we're either doomed to successfully reclaim a >paltry amount of pages, or fail to reclaim a lot of pages. Asking >try_to_free_pages() to deal with those huge allocations is generally >not reasonable, regardless of the specifics of why it doesn't work in >this case. Oh, I somehow elided the "enforcing" part of your proposal. Still, there's no guarantee even if large allocations are reclaimed fully that we will end up going back below memory.high, because even a single other large allocation which fails to reclaim can knock us out of whack again.
On Thu 21-05-20 13:57:59, Chris Down wrote: > Michal Hocko writes: > > > A cgroup is a unit and breaking it down into "reclaim fairness" for > > > individual tasks like this seems suspect to me. For example, if one task in > > > a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup > > > is going to be penalised by allocator throttling as a result, even if they > > > aren't "responsible" for that reclaim. > > > > You are right, but that doesn't mean that it is desirable that some > > tasks would be throttled unexpectedly too long because of the other's activity. > > Are you really talking about throttling, or reclaim? If throttling, tasks > are already throttled proportionally to how much this allocation is > contributing to the overage in calculate_high_delay. Reclaim is a part of the throttling mechanism. It is a productive side of it actually. > If you're talking about reclaim, trying to reason about whether the overage > is the result of some other task in this cgroup or the task that's > allocating right now is something that we already know doesn't work well > (eg. global OOM). I am not sure I follow you here.
On Thu 21-05-20 14:05:30, Chris Down wrote: > Chris Down writes: > > > I believe I have asked in other email in this thread. Could you explain > > > why enforcint the requested target (memcg_nr_pages_over_high) is > > > insufficient for the problem you are dealing with? Because that would > > > make sense for large targets to me while it would keep relatively > > > reasonable semantic of the throttling - aka proportional to the memory > > > demand rather than the excess. > > > > memcg_nr_pages_over_high is related to the charge size. As such, if > > you're way over memory.high as a result of transient reclaim failures, > > but the majority of your charges are small, it's going to hard to make > > meaningful progress: > > > > 1. Most nr_pages will be MEMCG_CHARGE_BATCH, which is not enough to help; > > 2. Large allocations will only get a single reclaim attempt to succeed. > > > > As such, in many cases we're either doomed to successfully reclaim a > > paltry amount of pages, or fail to reclaim a lot of pages. Asking > > try_to_free_pages() to deal with those huge allocations is generally not > > reasonable, regardless of the specifics of why it doesn't work in this > > case. > > Oh, I somehow elided the "enforcing" part of your proposal. Still, there's > no guarantee even if large allocations are reclaimed fully that we will end > up going back below memory.high, because even a single other large > allocation which fails to reclaim can knock us out of whack again. Yeah, there is no guarantee and that is fine. Because memory.high is not about guarantee. It is about a best effort and slowing down the allocation pace so that the userspace has time to do something about that. That being said I would be really curious about how enforcing the memcg_nr_pages_over_high target works in your setups where you see the problem. If that doesn't work for some reason and the reclaim should be more pro-active then I would suggest to scale it via memcg_nr_pages_over_high rather than essentially keep it around and ignore it. Preserving at least some form of fairness and predictable behavior is important IMHO but if there is no way to achieve that then there should be a very good explanation for that. I hope that we it is more clear what is our thinking now. I will be FTO for upcoming days trying to get some rest from email so my response time will be longer. Will be back on Thursday.
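For completeness, one hypothetical reading of the "scale it via memcg_nr_pages_over_high" suggestion (no concrete form was posted in the thread, so the placement and the scale factor below are made up purely for illustration): inflate the per-task target at charge time when the cgroup's excess over memory.high is already large, keeping the single-pass semantics while asking that one pass to do more:

    /* Hypothetical sketch only: how large a share of the excess to add, and
     * where to compute it, is exactly the open question in this subthread. */
    excess = usage_pages - high_pages;      /* how far the cgroup already is over high */
    current->memcg_nr_pages_over_high += max(batch, excess / SOME_SCALE_FACTOR);
    set_notify_resume(current);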
Michal Hocko writes: >On Thu 21-05-20 13:57:59, Chris Down wrote: >> Michal Hocko writes: >> > > A cgroup is a unit and breaking it down into "reclaim fairness" for >> > > individual tasks like this seems suspect to me. For example, if one task in >> > > a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup >> > > is going to be penalised by allocator throttling as a result, even if they >> > > aren't "responsible" for that reclaim. >> > >> > You are right, but that doesn't mean that it is desirable that some >> > tasks would be throttled unexpectedly too long because of the other's activity. >> >> Are you really talking about throttling, or reclaim? If throttling, tasks >> are already throttled proportionally to how much this allocation is >> contributing to the overage in calculate_high_delay. > >Reclaim is a part of the throttling mechanism. It is a productive side >of it actually. I guess let's avoid using the term "throttling", since in this context it sounds like we're talking about allocator throttling :-) >> If you're talking about reclaim, trying to reason about whether the overage >> is the result of some other task in this cgroup or the task that's >> allocating right now is something that we already know doesn't work well >> (eg. global OOM). > >I am not sure I follow you here. Let me rephrase: your statement is that it's not desirable "that some task would be throttled unexpectedly too long because of [the activity of another task also within that cgroup]" (let me know if that's not what you meant). But trying to avoid that requires knowing which activity abstractly instigates this entire mess in the first place, which we have nowhere near enough context to determine.
On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote: > On Wed 20-05-20 13:51:35, Johannes Weiner wrote: > > On Wed, May 20, 2020 at 07:04:30PM +0200, Michal Hocko wrote: > > > On Wed 20-05-20 12:51:31, Johannes Weiner wrote: > > > > On Wed, May 20, 2020 at 06:07:56PM +0200, Michal Hocko wrote: > > > > > On Wed 20-05-20 15:37:12, Chris Down wrote: > > > > > > In Facebook production, we've seen cases where cgroups have been put > > > > > > into allocator throttling even when they appear to have a lot of slack > > > > > > file caches which should be trivially reclaimable. > > > > > > > > > > > > Looking more closely, the problem is that we only try a single cgroup > > > > > > reclaim walk for each return to usermode before calculating whether or > > > > > > not we should throttle. This single attempt doesn't produce enough > > > > > > pressure to shrink for cgroups with a rapidly growing amount of file > > > > > > caches prior to entering allocator throttling. > > > > > > > > > > > > As an example, we see that threads in an affected cgroup are stuck in > > > > > > allocator throttling: > > > > > > > > > > > > # for i in $(cat cgroup.threads); do > > > > > > > grep over_high "/proc/$i/stack" > > > > > > > done > > > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > > > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > > > > > > > > > > > ...however, there is no I/O pressure reported by PSI, despite a lot of > > > > > > slack file pages: > > > > > > > > > > > > # cat memory.pressure > > > > > > some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903 > > > > > > full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959 > > > > > > # cat io.pressure > > > > > > some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391 > > > > > > full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640 > > > > > > # grep _file memory.stat > > > > > > inactive_file 1370939392 > > > > > > active_file 661635072 > > > > > > > > > > > > This patch changes the behaviour to retry reclaim either until the > > > > > > current task goes below the 10ms grace period, or we are making no > > > > > > reclaim progress at all. In the latter case, we enter reclaim throttling > > > > > > as before. > > > > > > > > > > Let me try to understand the actual problem. The high memory reclaim has > > > > > a target which is proportional to the amount of charged memory. For most > > > > > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where > > > > > N is the number of memcgs in excess up the hierarchy). I can see to be > > > > > insufficient if the memcg is already in a large excess but if the > > > > > reclaim can make a forward progress this should just work fine because > > > > > each charging context should reclaim at least the contributed amount. > > > > > > > > > > Do you have any insight on why this doesn't work in your situation? > > > > > Especially with such a large inactive file list I would be really > > > > > surprised if the reclaim was not able to make a forward progress. > > > > > > > > The workload we observed this in was downloading a large file and > > > > writing it to disk, which means that a good chunk of that memory was > > > > dirty. The first reclaim pass appears to make little progress because > > > > it runs into dirty pages. > > > > > > OK, I see but why does the subsequent reclaim attempt makes a forward > > > progress? Is this just because dirty pages are flushed in the mean time? 
> > > Because if this is the case then the underlying problem seems to be that > > > the reclaim should be throttled on dirty data. > > > > That's what I assume. Chris wanted to do more reclaim tracing. But is > > this actually important beyond maybe curiosity? > > Yes, because it might show that there is a deeper problem. Having an > extremely large file list full of dirty data and pre-mature failure for > the reclaim sounds like a problem that is worth looking into closely. > > > We retry every other reclaim invocation on forward progress. There is > > not a single naked call to try_to_free_pages(), and this here is the > > only exception where we don't loop on try_to_free_mem_cgroup_pages(). > > I am not saying the looping over try_to_free_pages is wrong. I do care > about the final reclaim target. That shouldn't be arbitrary. We have > established a target which is proportional to the requested amount of > memory. And there is a good reason for that. If any task tries to > reclaim down to the high limit then this might lead to a large > unfairness when heavy producers piggy back on the active reclaimer(s). Why is that different than any other form of reclaim? > I wouldn't mind to loop over try_to_free_pages to meet the requested > memcg_nr_pages_over_high target. Should we do the same for global reclaim? Move reclaim to userspace resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and then have everybody just reclaim exactly what they asked for, and punt interrupts / kthread allocations to a worker/kswapd? > > > > > Also if the current high reclaim scaling is insufficient then we should > > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly > > > > > unbound number of reclaim retries. > > > > > > > > ??? > > > > > > I am not sure what you are asking here. > > > > You expressed that some alternate solution B would be preferable, > > without any detail on why you think that is the case. > > > > And it's certainly not obvious or self-explanatory - in particular > > because Chris's proposal *is* obvious and self-explanatory, given how > > everybody else is already doing loops around page reclaim. > > Sorry, I could have been less cryptic. I hope the above and my response > to Chris goes into more details why I do not like this proposal and what > is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high > target. If the current calculation of the target is unsufficient - e.g. > in situations where the high limit excess is very large then this should > be reflected in memcg_nr_pages_over_high. > > Is it more clear? Well you haven't made a good argument why memory.high is actually different than any other form of reclaim, and why it should be the only implementation of page reclaim that has special-cased handling for the inherent "unfairness" or rather raciness of that operation. You cut these lines from the quote: Under pressure, page reclaim can struggle to satisfy the reclaim goal and may return with less pages reclaimed than asked to. Under concurrency, a parallel allocation can invalidate the reclaim progress made by a thread. Even if we *could* invest more into trying to avoid any unfairness, you haven't made a point why we actually should do that here specifically, yet not everywhere else. (And people have tried to do it for global reclaim[1], but clearly this isn't a meaningful problem in practice.) 
I have a good reason why we shouldn't: because it's special casing memory.high from other forms of reclaim, and that is a maintainability problem. We've recently been discussing ways to make the memory.high implementation stand out less, not make it stand out even more. There is no solid reason it should be different from memory.max reclaim, except that it should sleep instead of invoke OOM at the end. It's already a mess we're trying to get on top of and straighten out, and you're proposing to add more kinks that will make this work harder. I have to admit, I'm baffled by this conversation. I consider this a fairly obvious, idiomatic change, and I cannot relate to the objections or counter-proposals in the slightest. [1] http://lkml.iu.edu/hypermail//linux/kernel/0810.0/0169.html
On Thu 21-05-20 14:41:47, Chris Down wrote: > Michal Hocko writes: > > On Thu 21-05-20 13:57:59, Chris Down wrote: [...] > > > If you're talking about reclaim, trying to reason about whether the overage > > > is the result of some other task in this cgroup or the task that's > > > allocating right now is something that we already know doesn't work well > > > (eg. global OOM). > > > > I am not sure I follow you here. > > Let me rephrase: your statement is that it's not desirable "that some task > would be throttled unexpectedly too long because of [the activity of another > task also within that cgroup]" (let me know if that's not what you meant). > But trying to avoid that requires knowing which activity abstractly > instigates this entire mess in the first place, which we have nowhere near > enough context to determine. Yeah, if we want to be really precise then you are right, nothing like that is really feasible for the reclaim. Reclaiming 1 page might be much more expensive than 100 pages because LRU order doesn't reflect the cost of the reclaim at all. What, I believe, we want is a best effort, really. If the reclaim target is somehow bound to the requested amount of memory then we can at least say that more memory-hungry consumers are reclaiming more. That is something people can wrap their heads around much more easily than a free competition on reclaim with some hard-to-predict losers who do all the work and some lucky ones who just happen to avoid throttling through better timing. Really, think of direct reclaim and how the unfairness sucks there.
Michal Hocko writes: >On Thu 21-05-20 14:41:47, Chris Down wrote: >> Michal Hocko writes: >> > On Thu 21-05-20 13:57:59, Chris Down wrote: >[...] >> > > If you're talking about reclaim, trying to reason about whether the overage >> > > is the result of some other task in this cgroup or the task that's >> > > allocating right now is something that we already know doesn't work well >> > > (eg. global OOM). >> > >> > I am not sure I follow you here. >> >> Let me rephrase: your statement is that it's not desirable "that some task >> would be throttled unexpectedly too long because of [the activity of another >> task also within that cgroup]" (let me know if that's not what you meant). >> But trying to avoid that requires knowing which activity abstractly >> instigates this entire mess in the first place, which we have nowhere near >> enough context to determine. > >Yeah, if we want to be really precise then you are right, nothing like >that is really feasible for the reclaim. Reclaiming 1 page might be much >more expensive than 100 pages because LRU order doesn't reflect the >cost of the reclaim at all. What, I believe, we want is a best effort, >really. If the reclaim target is somehow bound to the requested amount >of memory then we can at least say that more memory hungry consumers are >reclaiming more. Which is something people can wrap their head around >much easier than a free competition on the reclaim with some hard to >predict losers who do all the work and some lucky ones which just happen >to avoid throttling by a better timing. Really think of the direct >reclaim and how the unfairness suck there. I really don't follow this logic. You're talking about reclaim-induced latency, but the alternative isn't freedom from latency, it's scheduler-induced latency from allocator throttling (and probably of a significantly higher magnitude). And again, that's totally justified if you are part of a cgroup which is significantly above its memory.high -- that's the kind of grouping you sign up for when you put multiple tasks in the same cgroup. The premise of being over memory.high is that everyone in the affected cgroup must do their utmost to reclaim where possible, and if they fail to get below it again, we're going to deschedule them. *That's* what's best-effort about it. The losers aren't hard to predict. It's *all* the tasks in this cgroup if they don't each make their utmost attempt to get the cgroup's memory back under control. Doing more reclaim isn't even in the same magnitude of sucking as getting allocator throttled.
On Thu, May 21, 2020 at 09:51:55AM -0400, Johannes Weiner wrote: > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote: > > I wouldn't mind to loop over try_to_free_pages to meet the requested > > memcg_nr_pages_over_high target. > > Should we do the same for global reclaim? Move reclaim to userspace > resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and > then have everybody just reclaim exactly what they asked for, and punt > interrupts / kthread allocations to a worker/kswapd? Oof, typo: I meant limit reclaim by memory.max and memory.limit_in_bytes. Not physical memory reclaim of course. > > > > > > Also if the current high reclaim scaling is insufficient then we should > > > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly > > > > > > unbound number of reclaim retries. > > > > > > > > > > ??? > > > > > > > > I am not sure what you are asking here. > > > > > > You expressed that some alternate solution B would be preferable, > > > without any detail on why you think that is the case. > > > > > > And it's certainly not obvious or self-explanatory - in particular > > > because Chris's proposal *is* obvious and self-explanatory, given how > > > everybody else is already doing loops around page reclaim. > > > > Sorry, I could have been less cryptic. I hope the above and my response > > to Chris goes into more details why I do not like this proposal and what > > is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high > > target. If the current calculation of the target is unsufficient - e.g. > > in situations where the high limit excess is very large then this should > > be reflected in memcg_nr_pages_over_high. > > > > Is it more clear? > > Well you haven't made a good argument why memory.high is actually > different than any other form of reclaim, and why it should be the > only implementation of page reclaim that has special-cased handling > for the inherent "unfairness" or rather raciness of that operation. > > You cut these lines from the quote: > > Under pressure, page reclaim can struggle to satisfy the reclaim > goal and may return with less pages reclaimed than asked to. > > Under concurrency, a parallel allocation can invalidate the reclaim > progress made by a thread. > > Even if we *could* invest more into trying to avoid any unfairness, > you haven't made a point why we actually should do that here > specifically, yet not everywhere else. > > (And people have tried to do it for global reclaim[1], but clearly > this isn't a meaningful problem in practice.) > > I have a good reason why we shouldn't: because it's special casing > memory.high from other forms of reclaim, and that is a maintainability > problem. We've recently been discussing ways to make the memory.high > implementation stand out less, not make it stand out even more. There > is no solid reason it should be different from memory.max reclaim, > except that it should sleep instead of invoke OOM at the end. It's > already a mess we're trying to get on top of and straighten out, and > you're proposing to add more kinks that will make this work harder. > > I have to admit, I'm baffled by this conversation. I consider this a > fairly obvious, idiomatic change, and I cannot relate to the > objections or counter-proposals in the slightest. > > [1] http://lkml.iu.edu/hypermail//linux/kernel/0810.0/0169.html
On Thu 21-05-20 09:51:52, Johannes Weiner wrote: > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote: [...] > > I am not saying the looping over try_to_free_pages is wrong. I do care > > about the final reclaim target. That shouldn't be arbitrary. We have > > established a target which is proportional to the requested amount of > > memory. And there is a good reason for that. If any task tries to > > reclaim down to the high limit then this might lead to a large > > unfairness when heavy producers piggy back on the active reclaimer(s). > > Why is that different than any other form of reclaim? Because the high limit reclaim is a best effort rather than must to either get over reclaim watermarks and continue allocation or meet the hard limit requirement to continue. In an ideal world even the global resp. hard limit reclaim should consider fairness. They don't because that is easier but that sucks. I have been involved in debugging countless of issues where direct reclaim was taking too long because of the unfairness. Users simply see that as bug and I am not surprised. > > I wouldn't mind to loop over try_to_free_pages to meet the requested > > memcg_nr_pages_over_high target. > > Should we do the same for global reclaim? Move reclaim to userspace > resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and > then have everybody just reclaim exactly what they asked for, and punt > interrupts / kthread allocations to a worker/kswapd? This would be quite challenging considering the page allocator wouldn't be able to make a forward progress without doing any reclaim. But maybe you can be creative with watermarks. > > > > > > Also if the current high reclaim scaling is insufficient then we should > > > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly > > > > > > unbound number of reclaim retries. > > > > > > > > > > ??? > > > > > > > > I am not sure what you are asking here. > > > > > > You expressed that some alternate solution B would be preferable, > > > without any detail on why you think that is the case. > > > > > > And it's certainly not obvious or self-explanatory - in particular > > > because Chris's proposal *is* obvious and self-explanatory, given how > > > everybody else is already doing loops around page reclaim. > > > > Sorry, I could have been less cryptic. I hope the above and my response > > to Chris goes into more details why I do not like this proposal and what > > is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high > > target. If the current calculation of the target is unsufficient - e.g. > > in situations where the high limit excess is very large then this should > > be reflected in memcg_nr_pages_over_high. > > > > Is it more clear? > > Well you haven't made a good argument why memory.high is actually > different than any other form of reclaim, and why it should be the > only implementation of page reclaim that has special-cased handling > for the inherent "unfairness" or rather raciness of that operation. > > You cut these lines from the quote: > > Under pressure, page reclaim can struggle to satisfy the reclaim > goal and may return with less pages reclaimed than asked to. > > Under concurrency, a parallel allocation can invalidate the reclaim > progress made by a thread. > > Even if we *could* invest more into trying to avoid any unfairness, > you haven't made a point why we actually should do that here > specifically, yet not everywhere else. 
I have tried to explain my thinking elsewhere in the thread. The bottom line is that the high limit is a way of throttling rather than meeting a specific target. With the current implementation we scale the reclaim activity by the consumer's demand, which is something that is not terribly complex to wrap your head around and reason about, because the objective is to not increase the excess much. It offers some sort of fairness as well. I fully recognize that full fairness is not something we can target, but working reasonably well most of the time sounds good enough for me. > (And people have tried to do it for global reclaim[1], but clearly > this isn't a meaningful problem in practice.) > > I have a good reason why we shouldn't: because it's special casing > memory.high from other forms of reclaim, and that is a maintainability > problem. We've recently been discussing ways to make the memory.high > implementation stand out less, not make it stand out even more. There > is no solid reason it should be different from memory.max reclaim, > except that it should sleep instead of invoke OOM at the end. It's > already a mess we're trying to get on top of and straighten out, and > you're proposing to add more kinks that will make this work harder. I do see your point, of course. But I do not give the code consistency a higher priority than the potential unfairness aspect of the user-visible behavior for something that can do better. Really, the direct reclaim unfairness is painful and hard to explain to users. You can essentially only hand-wave that the system is struggling, so fairness is not really a priority anymore. > I have to admit, I'm baffled by this conversation. I consider this a > fairly obvious, idiomatic change, and I cannot relate to the > objections or counter-proposals in the slightest. I have to admit that I would prefer a much less aggressive tone. We are discussing a topic which is obviously not black and white, and there are different aspects to it. Thanks! > [1] http://lkml.iu.edu/hypermail//linux/kernel/0810.0/0169.html
Michal Hocko writes: >> I have a good reason why we shouldn't: because it's special casing >> memory.high from other forms of reclaim, and that is a maintainability >> problem. We've recently been discussing ways to make the memory.high >> implementation stand out less, not make it stand out even more. There >> is no solid reason it should be different from memory.max reclaim, >> except that it should sleep instead of invoke OOM at the end. It's >> already a mess we're trying to get on top of and straighten out, and >> you're proposing to add more kinks that will make this work harder. > >I do see your point, of course. But I do not give the code consistency >a higher priority than the potential unfairness aspect of the user-visible >behavior for something that can do better. Really, the direct >reclaim unfairness is painful and hard to explain to users. You >can essentially only hand-wave that the system is struggling, so fairness is >not really a priority anymore. It's not hand-waving. When using cgroup features, including memory.high, the unit for consideration is a cgroup, not a task. That we happen to act on individual tasks in this case is just an implementation detail. That one task in that cgroup may be penalised "unfairly" is well within the specification: we set limits as part of a cgroup, we account as part of a cgroup, and we throttle and reclaim as part of a cgroup. We may make some very rudimentary attempts to "be fair" on a per-task basis where that's trivial, but those are just one-off niceties, not a statement of precedent. When exceeding memory.high, the contract is "this cgroup must immediately attempt to shrink". Breaking it down per-task in terms of fairness at that point doesn't make sense: all the tasks in one cgroup are in it together.
On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote: > On Thu 21-05-20 09:51:52, Johannes Weiner wrote: > > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote: > [...] > > > I am not saying the looping over try_to_free_pages is wrong. I do care > > > about the final reclaim target. That shouldn't be arbitrary. We have > > > established a target which is proportional to the requested amount of > > > memory. And there is a good reason for that. If any task tries to > > > reclaim down to the high limit then this might lead to a large > > > unfairness when heavy producers piggy back on the active reclaimer(s). > > > > Why is that different than any other form of reclaim? > > Because the high limit reclaim is a best effort rather than must to > either get over reclaim watermarks and continue allocation or meet the > hard limit requirement to continue. It's not best effort. It's a must-meet or get put to sleep. You are mistaken about what memory.high is. > In an ideal world even the global resp. hard limit reclaim should > consider fairness. They don't because that is easier but that sucks. I > have been involved in debugging countless of issues where direct reclaim > was taking too long because of the unfairness. Users simply see that as > bug and I am not surprised. Then there should be a generic fix to this problem (like the page capturing during reclaim). You're bringing anecdotal evidence that reclaim has a generic problem, which nobody has seriously tried to fix in recent times, and then ask people to hack around it in a patch that only brings the behavior for this specific instance in line with everybody else. I'm sorry, but this IS a black and white issue, and I think you're out of line here. If you think reclaim fairness is a problem, it should be on you to provide concrete data for that and propose changes on how we do reclaim, instead of asking to hack around it in one callsite - thereby introducing inconsistencies to userspace between different limits, as well as inconsistencies and complications for the kernel developers that actually work on this code (take a look at git blame). > > > I wouldn't mind to loop over try_to_free_pages to meet the requested > > > memcg_nr_pages_over_high target. > > > > Should we do the same for global reclaim? Move reclaim to userspace > > resume where there are no GFP_FS, GFP_NOWAIT etc. restrictions and > > then have everybody just reclaim exactly what they asked for, and punt > > interrupts / kthread allocations to a worker/kswapd? > > This would be quite challenging considering the page allocator wouldn't > be able to make a forward progress without doing any reclaim. But maybe > you can be creative with watermarks. I clarified in the follow-up email that I meant limit reclaim. > > > > > > > Also if the current high reclaim scaling is insufficient then we should > > > > > > > be handling that via memcg_nr_pages_over_high rather than effectivelly > > > > > > > unbound number of reclaim retries. > > > > > > > > > > > > ??? > > > > > > > > > > I am not sure what you are asking here. > > > > > > > > You expressed that some alternate solution B would be preferable, > > > > without any detail on why you think that is the case. > > > > > > > > And it's certainly not obvious or self-explanatory - in particular > > > > because Chris's proposal *is* obvious and self-explanatory, given how > > > > everybody else is already doing loops around page reclaim. > > > > > > Sorry, I could have been less cryptic. 
I hope the above and my response > > > to Chris goes into more details why I do not like this proposal and what > > > is the alternative. But let me summarize. I propose to use memcg_nr_pages_over_high > > > target. If the current calculation of the target is unsufficient - e.g. > > > in situations where the high limit excess is very large then this should > > > be reflected in memcg_nr_pages_over_high. > > > > > > Is it more clear? > > > > Well you haven't made a good argument why memory.high is actually > > different than any other form of reclaim, and why it should be the > > only implementation of page reclaim that has special-cased handling > > for the inherent "unfairness" or rather raciness of that operation. > > > > You cut these lines from the quote: > > > > Under pressure, page reclaim can struggle to satisfy the reclaim > > goal and may return with less pages reclaimed than asked to. > > > > Under concurrency, a parallel allocation can invalidate the reclaim > > progress made by a thread. > > > > Even if we *could* invest more into trying to avoid any unfairness, > > you haven't made a point why we actually should do that here > > specifically, yet not everywhere else. > > I have tried to explain my thinking elsewhere in the thread. The bottom > line is that high limit is a way of throttling rather than meeting a > specific target. That's an incorrect assumption. Of course it should meet the specific target that the user specified. > > (And people have tried to do it for global reclaim[1], but clearly > > this isn't a meaningful problem in practice.) > > > > I have a good reason why we shouldn't: because it's special casing > > memory.high from other forms of reclaim, and that is a maintainability > > problem. We've recently been discussing ways to make the memory.high > > implementation stand out less, not make it stand out even more. There > > is no solid reason it should be different from memory.max reclaim, > > except that it should sleep instead of invoke OOM at the end. It's > > already a mess we're trying to get on top of and straighten out, and > > you're proposing to add more kinks that will make this work harder. > > I do see your point of course. But I do not give the code consistency > a higher priority than the potential unfairness aspect of the user > visible behavior for something that can do better. Michal, you have almost no authorship stake in this code base. Would it be possible to defer judgement on maintainability to people who do?
On Thu 21-05-20 12:38:33, Johannes Weiner wrote: > On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote: > > On Thu 21-05-20 09:51:52, Johannes Weiner wrote: > > > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote: > > [...] > > > > I am not saying the looping over try_to_free_pages is wrong. I do care > > > > about the final reclaim target. That shouldn't be arbitrary. We have > > > > established a target which is proportional to the requested amount of > > > > memory. And there is a good reason for that. If any task tries to > > > > reclaim down to the high limit then this might lead to a large > > > > unfairness when heavy producers piggy back on the active reclaimer(s). > > > > > > Why is that different than any other form of reclaim? > > > > Because the high limit reclaim is a best effort rather than must to > > either get over reclaim watermarks and continue allocation or meet the > > hard limit requirement to continue. > > It's not best effort. It's a must-meet or get put to sleep. You are > mistaken about what memory.high is. I do not see anything like that being documented. Let me remind you what the documentation says: memory.high A read-write single value file which exists on non-root cgroups. The default is "max". Memory usage throttle limit. This is the main mechanism to control memory usage of a cgroup. If a cgroup's usage goes over the high boundary, the processes of the cgroup are throttled and put under heavy reclaim pressure. Going over the high limit never invokes the OOM killer and under extreme conditions the limit may be breached. My understanding is that breaching the limit is acceptable if the memory is not reclaimable after placing a heavy reclaim pressure. We can discuss what the heavy reclaim means but the underlying fact is that the keeping the consumption under the limit is a best effort. Please also let me remind you that the best effort implementation has been there since the beginning when the memory.high has been introduced. Now you seem to be convinced that the semantic is _obviously_ different. It is not the first time when the high limit behavior has changed. Mostly based on "what is currently happening in your fleet". And can see why it is reasonable to adopt to a real life usage. That is OK most of the time. But I haven't heard why keeping the existing approach and enforcing the reclaim target is not working properly so far. All I can hear is a generic statement that consistency matters much more than all potential problem it might introduce. Anyway, I do see that you are not really willing to have a non-confrontational discussion so I do not bother to reply to the rest and participate in the further discussion. As usual, let me remind you that I haven't nacked the patch. I do not plan to do that because "this is not black&white" as already said. But if your really want to push this through then let's do it properly at least. memcg->memcg_nr_pages_over_high has only very vague meaning if the reclaim target is the high limit. The changelog should be also explicit about a potentially large stalls so that people debugging such a problem have a clue at least.
On Thu, May 21, 2020 at 07:37:01PM +0200, Michal Hocko wrote: > On Thu 21-05-20 12:38:33, Johannes Weiner wrote: > > On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote: > > > On Thu 21-05-20 09:51:52, Johannes Weiner wrote: > > > > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote: > > > [...] > > > > > I am not saying the looping over try_to_free_pages is wrong. I do care > > > > > about the final reclaim target. That shouldn't be arbitrary. We have > > > > > established a target which is proportional to the requested amount of > > > > > memory. And there is a good reason for that. If any task tries to > > > > > reclaim down to the high limit then this might lead to a large > > > > > unfairness when heavy producers piggy back on the active reclaimer(s). > > > > > > > > Why is that different than any other form of reclaim? > > > > > > Because the high limit reclaim is a best effort rather than must to > > > either get over reclaim watermarks and continue allocation or meet the > > > hard limit requirement to continue. > > > > It's not best effort. It's a must-meet or get put to sleep. You are > > mistaken about what memory.high is. > > I do not see anything like that being documented. Let me remind you what > the documentation says: > memory.high > A read-write single value file which exists on non-root > cgroups. The default is "max". > > Memory usage throttle limit. This is the main mechanism to > control memory usage of a cgroup. If a cgroup's usage goes > over the high boundary, the processes of the cgroup are > throttled and put under heavy reclaim pressure. > > Going over the high limit never invokes the OOM killer and > under extreme conditions the limit may be breached. > > My understanding is that breaching the limit is acceptable if the memory > is not reclaimable after placing a heavy reclaim pressure. We can > discuss what the heavy reclaim means but the underlying fact is that the > keeping the consumption under the limit is a best effort. It says it's the main mechanism to control memory usage, and that "under extreme conditions the limit may be breached". This doesn't sound like "let's try some reclaim and see how it goes" to me. As the person who designed and implemented this feature, it certainly wasn't the intention. > Please also let me remind you that the best effort implementation has > been there since the beginning when the memory.high has been introduced. > Now you seem to be convinced that the semantic is _obviously_ different. > > It is not the first time when the high limit behavior has changed. > Mostly based on "what is currently happening in your fleet". And can see > why it is reasonable to adopt to a real life usage. That is OK most of > the time. But I haven't heard why keeping the existing approach and > enforcing the reclaim target is not working properly so far. All I can > hear is a generic statement that consistency matters much more than all > potential problem it might introduce. The assumption when writing the first implementation was that the full reclaim cycle that we impose on every subsequent allocation was enough to 1) mount a significant effort to push back allocations or 2) if it fails, at least hold up allocations enough to curb further growth. As it turned out when deploying this code at scale, reclaim is not sufficient to achieve #2, because it simply may fail with not that many pages to scan - especially on systems without swap. 
So after observing a violation of the promised behavior, we added the sleeps for situations where reclaim fails to contain the workload as stated. After adding the sleeps, we noticed - again after deploying at scale - that in certain situations reclaim isn't failing but simply returning early, and we go to sleep and get OOM killed on full file LRUs. After analyzing this problem, it's clear that we had an oversight here: all other reclaimers are already familiar with the fact that reclaim may not be able to complete the reclaim target in one call, or that page reclaim is inherently racy and reclaim work can be stolen. We send a simple bug fix: bring this instance of reclaim in line with how everybody else is using the reclaim API, to meet the semantics as they are intendend and documented. And somehow this is controversial, and we're just changing around user promises as we see fit for our particular usecase? I don't even understand how the supposed alternate semantics you read between the lines in the documentation would make for a useful feature: It may fail to contain a group of offending tasks to the configured limit, but it will be fair to those tasks while doing so? > But if your really want to push this through then let's do it > properly at least. memcg->memcg_nr_pages_over_high has only very > vague meaning if the reclaim target is the high limit. task->memcg_nr_pages_over_high is not vague, it's a best-effort mechanism to distribute fairness. It's the current task's share of the cgroup's overage, and it allows us in the majority of situations to distribute reclaim work and sleeps in proportion to how much the task is actually at fault. However, due to the inherent raciness of reclaim, this may sometimes fail. In that situation, it's way better to suffer some unfairness than to give up, go to sleep, and risk OOM intervention - or give up, let the task continue and fail memory containment of the cgroup. Both of these are more important than the concept of "fairness" between tasks that already share a cgroup and memory pool, and likely have a myriad of other resource dependencies between them. > The changelog should be also explicit about a potentially large > stalls so that people debugging such a problem have a clue at least. The large stalls you see from hitting your memory limit? At what point would that be unexpected? All you see is a task inside reclaim while it's trying to allocate and the cgroup is at its limit. The same as you would with memory.max and global reclaim.
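For background on the per-task accounting being described, the flow is roughly: at charge time a task records how many pages it charged while the cgroup was over memory.high, and on its next return to userspace it works off exactly that share before deciding whether to throttle. The following is a simplified, self-contained model in plain userspace C -- field and function names are borrowed from the discussion, and the real kernel code differs in detail:

#include <stdio.h>

struct cgroup { unsigned long usage, high; };
struct task   { unsigned long memcg_nr_pages_over_high; };

/* Pretend reclaim: free up to 'nr' pages from the cgroup. */
static unsigned long reclaim(struct cgroup *cg, unsigned long nr)
{
        unsigned long freed = nr < cg->usage ? nr : cg->usage;

        cg->usage -= freed;
        return freed;
}

/*
 * Charge path: the allocating task is tagged with its own share of the
 * pages charged while the cgroup sits above memory.high.  (The kernel
 * additionally arranges for the over-high handler to run when the task
 * returns to userspace.)
 */
static void charge(struct cgroup *cg, struct task *t, unsigned long nr)
{
        cg->usage += nr;
        if (cg->usage > cg->high)
                t->memcg_nr_pages_over_high += nr;
}

/* Return-to-userspace path: work off the task's own accumulated share. */
static void handle_over_high(struct cgroup *cg, struct task *t)
{
        unsigned long nr = t->memcg_nr_pages_over_high;

        if (!nr)
                return;
        t->memcg_nr_pages_over_high = 0;
        reclaim(cg, nr);
        /* ...then decide whether throttling is still needed, as in the patch. */
}

int main(void)
{
        struct cgroup cg = { .usage = 90, .high = 100 };
        struct task heavy = { 0 }, light = { 0 };

        charge(&cg, &heavy, 40);        /* pushes the group over high   */
        charge(&cg, &light, 2);         /* small allocation, small debt */

        handle_over_high(&cg, &heavy);
        handle_over_high(&cg, &light);
        printf("usage after: %lu\n", cg.usage);
        return 0;
}

In this toy run the heavy allocator does proportionally more of the reclaim work than the light one, which is the "best-effort fairness" being discussed; what the thread is arguing about is what should happen when that first pass does not get the cgroup back under the limit.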
On Thu 21-05-20 14:45:05, Johannes Weiner wrote: > On Thu, May 21, 2020 at 07:37:01PM +0200, Michal Hocko wrote: > > On Thu 21-05-20 12:38:33, Johannes Weiner wrote: > > > On Thu, May 21, 2020 at 04:35:15PM +0200, Michal Hocko wrote: > > > > On Thu 21-05-20 09:51:52, Johannes Weiner wrote: > > > > > On Thu, May 21, 2020 at 09:32:45AM +0200, Michal Hocko wrote: > > > > [...] > > > > > > I am not saying the looping over try_to_free_pages is wrong. I do care > > > > > > about the final reclaim target. That shouldn't be arbitrary. We have > > > > > > established a target which is proportional to the requested amount of > > > > > > memory. And there is a good reason for that. If any task tries to > > > > > > reclaim down to the high limit then this might lead to a large > > > > > > unfairness when heavy producers piggy back on the active reclaimer(s). > > > > > > > > > > Why is that different than any other form of reclaim? > > > > > > > > Because the high limit reclaim is a best effort rather than must to > > > > either get over reclaim watermarks and continue allocation or meet the > > > > hard limit requirement to continue. > > > > > > It's not best effort. It's a must-meet or get put to sleep. You are > > > mistaken about what memory.high is. > > > > I do not see anything like that being documented. Let me remind you what > > the documentation says: > > memory.high > > A read-write single value file which exists on non-root > > cgroups. The default is "max". > > > > Memory usage throttle limit. This is the main mechanism to > > control memory usage of a cgroup. If a cgroup's usage goes > > over the high boundary, the processes of the cgroup are > > throttled and put under heavy reclaim pressure. > > > > Going over the high limit never invokes the OOM killer and > > under extreme conditions the limit may be breached. > > > > My understanding is that breaching the limit is acceptable if the memory > > is not reclaimable after placing a heavy reclaim pressure. We can > > discuss what the heavy reclaim means but the underlying fact is that the > > keeping the consumption under the limit is a best effort. > > It says it's the main mechanism to control memory usage, and that > "under extreme conditions the limit may be breached". This doesn't > sound like "let's try some reclaim and see how it goes" to me. > > As the person who designed and implemented this feature, it certainly > wasn't the intention. > > > Please also let me remind you that the best effort implementation has > > been there since the beginning when the memory.high has been introduced. > > Now you seem to be convinced that the semantic is _obviously_ different. > > > > It is not the first time when the high limit behavior has changed. > > Mostly based on "what is currently happening in your fleet". And can see > > why it is reasonable to adopt to a real life usage. That is OK most of > > the time. But I haven't heard why keeping the existing approach and > > enforcing the reclaim target is not working properly so far. All I can > > hear is a generic statement that consistency matters much more than all > > potential problem it might introduce. > > The assumption when writing the first implementation was that the full > reclaim cycle that we impose on every subsequent allocation was enough > to 1) mount a significant effort to push back allocations or 2) if it > fails, at least hold up allocations enough to curb further growth. 
> > As it turned out when deploying this code at scale, reclaim is not > sufficient to achieve #2, because it simply may fail with not that > many pages to scan - especially on systems without swap. So after > observing a violation of the promised behavior, we added the sleeps > for situations where reclaim fails to contain the workload as stated. > > After adding the sleeps, we noticed - again after deploying at scale - > that in certain situations reclaim isn't failing but simply returning > early, and we go to sleep and get OOM killed on full file LRUs. > > After analyzing this problem, it's clear that we had an oversight > here: all other reclaimers are already familiar with the fact that > reclaim may not be able to complete the reclaim target in one call, or > that page reclaim is inherently racy and reclaim work can be stolen. There is no disagreement here. > We send a simple bug fix: bring this instance of reclaim in line with > how everybody else is using the reclaim API, to meet the semantics as > they are intendend and documented. Here is where we are not on the same page though. Once you have identified that the main problem is that the reclaim fails too early to meet the target then the fix would be to enforce that target. I have asked why this hasn't been done and haven't got any real answer for that. Instead what you call "a simple bug fix" has larger consequences which are not really explained in the changelog and they are also not really trivial to see. If the changelog explicitly stated that the proportional memory reclaim is not sufficient because XYZ and the implementation has been changed to instead meet the high limit target then this would be a completely different story and I believe we could have saved some discussion. > And somehow this is controversial, and we're just changing around user > promises as we see fit for our particular usecase? > > I don't even understand how the supposed alternate semantics you read > between the lines in the documentation would make for a useful > feature: It may fail to contain a group of offending tasks to the > configured limit, but it will be fair to those tasks while doing so? > > > But if your really want to push this through then let's do it > > properly at least. memcg->memcg_nr_pages_over_high has only very > > vague meaning if the reclaim target is the high limit. > > task->memcg_nr_pages_over_high is not vague, it's a best-effort > mechanism to distribute fairness. It's the current task's share of the > cgroup's overage, and it allows us in the majority of situations to > distribute reclaim work and sleeps in proportion to how much the task > is actually at fault. Agreed. But this stops being the case as soon as the reclaim target has been reached and new reclaim attempts are enforced because the memcg is still above the high limit. Because then you have a completely different reclaim target - get down to the limit. This would be especially visible with a large memcg_nr_pages_over_high which could even lead to an over reclaim.
Michal Hocko writes: >> We send a simple bug fix: bring this instance of reclaim in line with >> how everybody else is using the reclaim API, to meet the semantics as >> they are intendend and documented. > >Here is where we are not on the same page though. Once you have identified >that the main problem is that the reclaim fails too early to meet the >target then the fix would be to enforce that target. I have asked why >this hasn't been done and haven't got any real answer for that. Instead >what you call "a simple bug fix" has larger consequences which are not >really explained in the changelog and they are also not really trivial >to see. If the changelog explicitly stated that the proportional memory >reclaim is not sufficient because XYZ and the implementation has been >changed to instead meet the high limit target then this would be a >completely different story and I believe we could have saved some >discussion. I agree that the changelog can be made more clear. Any objection if I send v2 with changelog changes to that effect, then? :-) >> And somehow this is controversial, and we're just changing around user >> promises as we see fit for our particular usecase? >> >> I don't even understand how the supposed alternate semantics you read >> between the lines in the documentation would make for a useful >> feature: It may fail to contain a group of offending tasks to the >> configured limit, but it will be fair to those tasks while doing so? >> >> > But if your really want to push this through then let's do it >> > properly at least. memcg->memcg_nr_pages_over_high has only very >> > vague meaning if the reclaim target is the high limit. >> >> task->memcg_nr_pages_over_high is not vague, it's a best-effort >> mechanism to distribute fairness. It's the current task's share of the >> cgroup's overage, and it allows us in the majority of situations to >> distribute reclaim work and sleeps in proportion to how much the task >> is actually at fault. > >Agreed. But this stops being the case as soon as the reclaim target has >been reached and new reclaim attempts are enforced because the memcg is >still above the high limit. Because then you have a completely different >reclaim target - get down to the limit. This would be especially visible >with a large memcg_nr_pages_over_high which could even lead to an over >reclaim. We actually over reclaim even before this patch -- this patch doesn't bring much new in that regard. Tracing try_to_free_pages for a cgroup at the memory.high threshold shows that before this change, we sometimes even reclaim on the order of twice the number of pages requested. For example, I see cases where we requested 1000 pages to be reclaimed, but end up reclaiming 2000 in a single reclaim attempt.
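For anyone who wants to reproduce that kind of measurement, the amount reclaimed by each memcg reclaim invocation can be observed without patching the kernel, for example via the vmscan tracepoints -- assuming a kernel of this era with tracefs mounted; exact event fields vary by version, and correlating against the requested target additionally needs a probe on the nr_pages argument of try_to_free_mem_cgroup_pages():

 # cd /sys/kernel/debug/tracing
 # echo 1 > events/vmscan/mm_vmscan_memcg_reclaim_begin/enable
 # echo 1 > events/vmscan/mm_vmscan_memcg_reclaim_end/enable
 # grep memcg_reclaim trace_pipe

The end event reports how many pages that particular reclaim pass actually freed, which is what makes the over-reclaim (or under-reclaim) per invocation visible.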
I haven't gone through the whole email-chain, so, I might be asking some repetitive questions. I will go through the email-chain later. On Wed, May 20, 2020 at 7:37 AM Chris Down <chris@chrisdown.name> wrote: > > In Facebook production, we've seen cases where cgroups have been put > into allocator throttling even when they appear to have a lot of slack > file caches which should be trivially reclaimable. > > Looking more closely, the problem is that we only try a single cgroup > reclaim walk for each return to usermode before calculating whether or > not we should throttle. This single attempt doesn't produce enough > pressure to shrink for cgroups with a rapidly growing amount of file > caches prior to entering allocator throttling. In my experience it is usually shrink_slab which requires hammering multiple times to actually reclaim memory. > > As an example, we see that threads in an affected cgroup are stuck in > allocator throttling: > > # for i in $(cat cgroup.threads); do > > grep over_high "/proc/$i/stack" > > done > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > [<0>] mem_cgroup_handle_over_high+0x10b/0x150 > > ...however, there is no I/O pressure reported by PSI, despite a lot of > slack file pages: > > # cat memory.pressure > some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903 > full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959 > # cat io.pressure > some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391 > full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640 > # grep _file memory.stat > inactive_file 1370939392 > active_file 661635072 > > This patch changes the behaviour to retry reclaim either until the > current task goes below the 10ms grace period, or we are making no > reclaim progress at all. In the latter case, we enter reclaim throttling > as before. > > To a user, there's no intuitive reason for the reclaim behaviour to > differ from hitting memory.high as part of a new allocation, as opposed > to hitting memory.high because someone lowered its value. As such this > also brings an added benefit: it unifies the reclaim behaviour between > the two. What was the initial reason to have different behavior in the first place? > > There's precedent for this behaviour: we already do reclaim retries when > writing to memory.{high,max}, in max reclaim, and in the page allocator > itself. > > Signed-off-by: Chris Down <chris@chrisdown.name> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Tejun Heo <tj@kernel.org> > Cc: Michal Hocko <mhocko@kernel.org> > --- > mm/memcontrol.c | 28 +++++++++++++++++++++++----- > 1 file changed, 23 insertions(+), 5 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2df9510b7d64..b040951ccd6b 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -73,6 +73,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys); > > struct mem_cgroup *root_mem_cgroup __read_mostly; > > +/* The number of times we should retry reclaim failures before giving up. */ > #define MEM_CGROUP_RECLAIM_RETRIES 5 > > /* Socket memory accounting disabled? 
*/ > @@ -2228,17 +2229,22 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) > return 0; > } > > -static void reclaim_high(struct mem_cgroup *memcg, > - unsigned int nr_pages, > - gfp_t gfp_mask) > +static unsigned long reclaim_high(struct mem_cgroup *memcg, > + unsigned int nr_pages, > + gfp_t gfp_mask) > { > + unsigned long nr_reclaimed = 0; > + > do { > if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->high)) > continue; > memcg_memory_event(memcg, MEMCG_HIGH); > - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); > + nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, > + gfp_mask, true); > } while ((memcg = parent_mem_cgroup(memcg)) && > !mem_cgroup_is_root(memcg)); > + > + return nr_reclaimed; > } > > static void high_work_func(struct work_struct *work) > @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void) > { > unsigned long penalty_jiffies; > unsigned long pflags; > + unsigned long nr_reclaimed; > unsigned int nr_pages = current->memcg_nr_pages_over_high; Is there any benefit to keep current->memcg_nr_pages_over_high after this change? Why not just use SWAP_CLUSTER_MAX? > + int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; > struct mem_cgroup *memcg; > > if (likely(!nr_pages)) > return; > > memcg = get_mem_cgroup_from_mm(current->mm); > - reclaim_high(memcg, nr_pages, GFP_KERNEL); > current->memcg_nr_pages_over_high = 0; > > +retry_reclaim: > + nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL); > + > /* > * memory.high is breached and reclaim is unable to keep up. Throttle > * allocators proactively to slow down excessive growth. > @@ -2403,6 +2413,14 @@ void mem_cgroup_handle_over_high(void) > if (penalty_jiffies <= HZ / 100) > goto out; > > + /* > + * If reclaim is making forward progress but we're still over > + * memory.high, we want to encourage that rather than doing allocator > + * throttling. > + */ > + if (nr_reclaimed || nr_retries--) > + goto retry_reclaim; > + > /* > * If we exit early, we're guaranteed to die (since > * schedule_timeout_killable sets TASK_KILLABLE). This means we don't > -- > 2.26.2 >
Shakeel Butt writes: >What was the initial reason to have different behavior in the first place? This differing behaviour is simply a mistake; it was never intended to deviate this much from what happens elsewhere. To that extent, this patch is as much a bug fix as it is an improvement. >> static void high_work_func(struct work_struct *work) >> @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void) >> { >> unsigned long penalty_jiffies; >> unsigned long pflags; >> + unsigned long nr_reclaimed; >> unsigned int nr_pages = current->memcg_nr_pages_over_high; > >Is there any benefit to keep current->memcg_nr_pages_over_high after >this change? Why not just use SWAP_CLUSTER_MAX? I don't feel strongly either way, but current->memcg_nr_pages_over_high can be very large for large allocations. That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current - high)` for each loop? I agree that with this design it looks like perhaps we don't need it any more. Johannes, what do you think?
On Thu, May 28, 2020 at 06:31:01PM +0200, Michal Hocko wrote: > On Thu 21-05-20 14:45:05, Johannes Weiner wrote: > > After analyzing this problem, it's clear that we had an oversight > > here: all other reclaimers are already familiar with the fact that > > reclaim may not be able to complete the reclaim target in one call, or > > that page reclaim is inherently racy and reclaim work can be stolen. > > There is no disagreement here. > > > We send a simple bug fix: bring this instance of reclaim in line with > > how everybody else is using the reclaim API, to meet the semantics as > > they are intendend and documented. > > Here is where we are not on the same page though. Once you have identified > that the main problem is that the reclaim fails too early to meet the > target then the fix would be to enforce that target. I have asked why > this hasn't been done and haven't got any real answer for that. Then I encourage you to re-read the thread. I have explained that reclaim invocations can fail to meet the requested target for a variety of reasons, including dirty state or other states that make memory temporarily unreclaimable, race conditions between reclaimers and so forth. I have also pointed out that this is widely acknowledged by the fact that all other reclaimers retry in the exact same manner. If you want to question that VM-wide precedence, please do so in your own patches. As to the question around fairness, I have explained that fairness is a best effort and that if push comes to shove, preventing premature OOM situations or failing cgroup containment and causing system-wide OOMs is more important. > Instead what you call "a simple bug fix" has larger consequences > which are not really explained in the changelog and they are also > not really trivial to see. If the changelog explicitly stated that > the proportional memory reclaim is not sufficient because XYZ and > the implementation has been changed to instead meet the high limit > target then this would be a completely different story and I believe > we could have saved some discussion. The point of memory.high reclaim is to meet the memory.high memory limit. That, too, has been addressed - although it's astounding that it needed to be pointed out. The proportionality is an attempt at fairness that doesn't override the primary purpose. I appreciate your concerns, but your questions have been addressed. And you're not contributing anything of value to the conversation until you familiarize yourself with the purpose of the memory.high interface. Thanks
On Thu, May 28, 2020 at 08:48:31PM +0100, Chris Down wrote: > Shakeel Butt writes: > > What was the initial reason to have different behavior in the first place? > > This differing behaviour is simply a mistake, it was never intended to be > this deviate from what happens elsewhere. To that extent this patch is as > much a bug fix as it is an improvement. Yes, it was an oversight. > > > static void high_work_func(struct work_struct *work) > > > @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void) > > > { > > > unsigned long penalty_jiffies; > > > unsigned long pflags; > > > + unsigned long nr_reclaimed; > > > unsigned int nr_pages = current->memcg_nr_pages_over_high; > > > > Is there any benefit to keep current->memcg_nr_pages_over_high after > > this change? Why not just use SWAP_CLUSTER_MAX? It's there for the same reason why try_to_free_pages() takes a reclaim argument in the first place: we want to make the thread allocating the most also do the most reclaim work. Consider a thread allocating THPs in a loop with another thread allocating regular pages. Remember that all callers loop. They could theoretically all just ask for SWAP_CLUSTER_MAX pages over and over again. The idea is to have fairness in most cases, and avoid allocation failure, premature OOM, and containment failure in the edge cases that are caused by the inherent raciness of page reclaim. > I don't feel strongly either way, but current->memcg_nr_pages_over_high can > be very large for large allocations. > > That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current - > high)` for each loop? I agree that with this design it looks like perhaps we > don't need it any more. > > Johannes, what do you think? How about this: Reclaim memcg_nr_pages_over_high in the first iteration, then switch to SWAP_CLUSTER_MAX in the retries. This acknowledges that while the page allocator and memory.max reclaim every time an allocation is made, memory.high is currently batched and can have larger targets. We want the allocating thread to reclaim at least the batch size, but beyond that only what's necessary to prevent premature OOM or failing containment. Add a comment stating as much. Once we reclaim memory.high synchronously instead of batched, this exceptional handling is no longer needed and can be deleted again. Does that sound reasonable?
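To make the proposed scheme concrete, here is a tiny standalone model of the loop shape being described -- plain userspace C, not the kernel patch. SWAP_CLUSTER_MAX and MEM_CGROUP_RECLAIM_RETRIES mirror the kernel values quoted in this thread, but the reclaim function is a stand-in and the penalty/grace-period check from the earlier patch is omitted for brevity:

#include <stdio.h>

#define SWAP_CLUSTER_MAX           32UL
#define MEM_CGROUP_RECLAIM_RETRIES  5

static unsigned long reclaimable = 100; /* pages we pretend can be freed */

/* Stand-in for the reclaim call: frees up to 'want' pages. */
static unsigned long reclaim(unsigned long want)
{
        unsigned long got = want < reclaimable ? want : reclaimable;

        reclaimable -= got;
        return got;
}

/*
 * Model of the proposal: the first pass works off the task's batched
 * memcg_nr_pages_over_high target; retries fall back to SWAP_CLUSTER_MAX
 * and continue only while reclaim keeps making forward progress, plus a
 * bounded number of attempts when it does not.
 */
static void handle_over_high(unsigned long nr_pages_over_high)
{
        int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
        unsigned long target = nr_pages_over_high;
        unsigned long got;

        do {
                got = reclaim(target);
                printf("asked for %lu, reclaimed %lu\n", target, got);
                target = SWAP_CLUSTER_MAX;   /* retries use the small target */
        } while (got || nr_retries--);       /* progress, or a few last tries */
}

int main(void)
{
        handle_over_high(512);
        return 0;
}

The first pass asks for the full batched target; once reclaim stops making forward progress, the loop gives up after a bounded number of small retries instead of spinning, which is the behaviour the follow-up v2 is meant to encode.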
On Thu, May 28, 2020 at 1:30 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, May 28, 2020 at 08:48:31PM +0100, Chris Down wrote: > > Shakeel Butt writes: > > > What was the initial reason to have different behavior in the first place? > > > > This differing behaviour is simply a mistake, it was never intended to be > > this deviate from what happens elsewhere. To that extent this patch is as > > much a bug fix as it is an improvement. > > Yes, it was an oversight. > > > > > static void high_work_func(struct work_struct *work) > > > > @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void) > > > > { > > > > unsigned long penalty_jiffies; > > > > unsigned long pflags; > > > > + unsigned long nr_reclaimed; > > > > unsigned int nr_pages = current->memcg_nr_pages_over_high; > > > > > > Is there any benefit to keep current->memcg_nr_pages_over_high after > > > this change? Why not just use SWAP_CLUSTER_MAX? > > It's there for the same reason why try_to_free_pages() takes a reclaim > argument in the first place: we want to make the thread allocating the > most also do the most reclaim work. Consider a thread allocating THPs > in a loop with another thread allocating regular pages. > > Remember that all callers loop. They could theoretically all just ask > for SWAP_CLUSTER_MAX pages over and over again. > > The idea is to have fairness in most cases, and avoid allocation > failure, premature OOM, and containment failure in the edge cases that > are caused by the inherent raciness of page reclaim. > Thanks for the explanation. > > I don't feel strongly either way, but current->memcg_nr_pages_over_high can > > be very large for large allocations. > > > > That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current - > > high)` for each loop? I agree that with this design it looks like perhaps we > > don't need it any more. > > > > Johannes, what do you think? > > How about this: > > Reclaim memcg_nr_pages_over_high in the first iteration, then switch > to SWAP_CLUSTER_MAX in the retries. > > This acknowledges that while the page allocator and memory.max reclaim > every time an allocation is made, memory.high is currently batched and > can have larger targets. We want the allocating thread to reclaim at > least the batch size, but beyond that only what's necessary to prevent > premature OOM or failing containment. > > Add a comment stating as much. > > Once we reclaim memory.high synchronously instead of batched, this > exceptional handling is no longer needed and can be deleted again. > > Does that sound reasonable? SGTM. It does not seem controversial to me to let the task do the work to resolve the condition for which it is being throttled.
Johannes Weiner writes:
>> I don't feel strongly either way, but current->memcg_nr_pages_over_high can
>> be very large for large allocations.
>>
>> That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current -
>> high)` for each loop? I agree that with this design it looks like perhaps we
>> don't need it any more.
>>
>> Johannes, what do you think?
>
>How about this:
>
>Reclaim memcg_nr_pages_over_high in the first iteration, then switch
>to SWAP_CLUSTER_MAX in the retries.
>
>This acknowledges that while the page allocator and memory.max reclaim
>every time an allocation is made, memory.high is currently batched and
>can have larger targets. We want the allocating thread to reclaim at
>least the batch size, but beyond that only what's necessary to prevent
>premature OOM or failing containment.
>
>Add a comment stating as much.
>
>Once we reclaim memory.high synchronously instead of batched, this
>exceptional handling is no longer needed and can be deleted again.
>
>Does that sound reasonable?

That sounds good to me, thanks. I'll change that in v2 soonish and update
the changelog.
On Thu 28-05-20 16:29:44, Johannes Weiner wrote:
[...]
> How about this:
>
> Reclaim memcg_nr_pages_over_high in the first iteration, then switch
> to SWAP_CLUSTER_MAX in the retries.

Yes, that makes much more sense to me.
On Thu 28-05-20 17:48:48, Chris Down wrote:
> Michal Hocko writes:
> > > We send a simple bug fix: bring this instance of reclaim in line with
> > > how everybody else is using the reclaim API, to meet the semantics as
> > > they are intendend and documented.
> >
> > Here is where we are not on the same page though. Once you have identified
> > that the main problem is that the reclaim fails too early to meet the
> > target then the fix would be to enforce that target. I have asked why
> > this hasn't been done and haven't got any real answer for that. Instead
> > what you call "a simple bug fix" has larger consequences which are not
> > really explained in the changelog and they are also not really trivial
> > to see. If the changelog explicitly stated that the proportional memory
> > reclaim is not sufficient because XYZ and the implementation has been
> > changed to instead meet the high limit target then this would be a
> > completely different story and I believe we could have saved some
> > discussion.
>
> I agree that the changelog can be made more clear. Any objection if I send
> v2 with changelog changes to that effect, then? :-)

Yes, please. And I would highly appreciate to have the above addressed.
So that we do not have to really scratch heads why a particular design
decision has been made and argue what was the thinking behind.

> > > And somehow this is controversial, and we're just changing around user
> > > promises as we see fit for our particular usecase?
> > >
> > > I don't even understand how the supposed alternate semantics you read
> > > between the lines in the documentation would make for a useful
> > > feature: It may fail to contain a group of offending tasks to the
> > > configured limit, but it will be fair to those tasks while doing so?
> > >
> > > > But if your really want to push this through then let's do it
> > > > properly at least. memcg->memcg_nr_pages_over_high has only very
> > > > vague meaning if the reclaim target is the high limit.
> > >
> > > task->memcg_nr_pages_over_high is not vague, it's a best-effort
> > > mechanism to distribute fairness. It's the current task's share of the
> > > cgroup's overage, and it allows us in the majority of situations to
> > > distribute reclaim work and sleeps in proportion to how much the task
> > > is actually at fault.
> >
> > Agreed. But this stops being the case as soon as the reclaim target has
> > been reached and new reclaim attempts are enforced because the memcg is
> > still above the high limit. Because then you have a completely different
> > reclaim target - get down to the limit. This would be especially visible
> > with a large memcg_nr_pages_over_high which could even lead to an over
> > reclaim.
>
> We actually over reclaim even before this patch -- this patch doesn't bring
> much new in that regard.
>
> Tracing try_to_free_pages for a cgroup at the memory.high threshold shows
> that before this change, we sometimes even reclaim on the order of twice the
> number of pages requested. For example, I see cases where we requested 1000
> pages to be reclaimed, but end up reclaiming 2000 in a single reclaim
> attempt.

This is interesting and worth looking into. I am aware that we can
reclaim potentially much more pages during the icache reclaim and that
there was a heated discussion without any fix merged in the end IIRC.
Do you have any details?
Michal Hocko writes:
>> > > task->memcg_nr_pages_over_high is not vague, it's a best-effort
>> > > mechanism to distribute fairness. It's the current task's share of the
>> > > cgroup's overage, and it allows us in the majority of situations to
>> > > distribute reclaim work and sleeps in proportion to how much the task
>> > > is actually at fault.
>> >
>> > Agreed. But this stops being the case as soon as the reclaim target has
>> > been reached and new reclaim attempts are enforced because the memcg is
>> > still above the high limit. Because then you have a completely different
>> > reclaim target - get down to the limit. This would be especially visible
>> > with a large memcg_nr_pages_over_high which could even lead to an over
>> > reclaim.
>>
>> We actually over reclaim even before this patch -- this patch doesn't bring
>> much new in that regard.
>>
>> Tracing try_to_free_pages for a cgroup at the memory.high threshold shows
>> that before this change, we sometimes even reclaim on the order of twice the
>> number of pages requested. For example, I see cases where we requested 1000
>> pages to be reclaimed, but end up reclaiming 2000 in a single reclaim
>> attempt.
>
>This is interesting and worth looking into. I am aware that we can
>reclaim potentially much more pages during the icache reclaim and that
>there was a heated discussion without any fix merged in the end IIRC.
>Do you have any details?

Sure, we can look into this more, but let's do it separately from this patch
-- I don't see that its merging should be contingent on that discussion :-)
On Fri 29-05-20 11:08:58, Chris Down wrote:
> Michal Hocko writes:
> > > > > task->memcg_nr_pages_over_high is not vague, it's a best-effort
> > > > > mechanism to distribute fairness. It's the current task's share of the
> > > > > cgroup's overage, and it allows us in the majority of situations to
> > > > > distribute reclaim work and sleeps in proportion to how much the task
> > > > > is actually at fault.
> > > >
> > > > Agreed. But this stops being the case as soon as the reclaim target has
> > > > been reached and new reclaim attempts are enforced because the memcg is
> > > > still above the high limit. Because then you have a completely different
> > > > reclaim target - get down to the limit. This would be especially visible
> > > > with a large memcg_nr_pages_over_high which could even lead to an over
> > > > reclaim.
> > >
> > > We actually over reclaim even before this patch -- this patch doesn't bring
> > > much new in that regard.
> > >
> > > Tracing try_to_free_pages for a cgroup at the memory.high threshold shows
> > > that before this change, we sometimes even reclaim on the order of twice the
> > > number of pages requested. For example, I see cases where we requested 1000
> > > pages to be reclaimed, but end up reclaiming 2000 in a single reclaim
> > > attempt.
> >
> > This is interesting and worth looking into. I am aware that we can
> > reclaim potentially much more pages during the icache reclaim and that
> > there was a heated discussion without any fix merged in the end IIRC.
> > Do you have any details?
>
> Sure, we can look into this more, but let's do it separately from this patch
> -- I don't see that its merging should be contingent on that discussion :-)

Yes that is a separate issue.
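For readers following up on the over-reclaim observation above, here is a
minimal stand-alone C sketch (not kernel code) of one way a fixed reclaim
request can be overshot: the running total is only re-checked between whole
scan passes, and a single pass walks several LRU lists, each of which can
yield a large chunk on its own. The list count and per-list yield are
assumptions chosen to reproduce the "requested 1000, reclaimed 2000" figures
quoted in the thread, not a diagnosis of the actual trace.

/*
 * Stand-alone model of target overshoot when progress is only re-checked
 * between whole scan passes over several lists.
 */
#include <stdio.h>

#define NR_LISTS	4	/* e.g. (in)active anon and (in)active file */

static unsigned long scan_pass(unsigned long nr_to_reclaim)
{
	unsigned long freed = 0;
	int lru;

	/* One pass visits every list before the caller re-checks the target. */
	for (lru = 0; lru < NR_LISTS; lru++)
		freed += nr_to_reclaim / 2;	/* illustrative per-list yield */

	return freed;
}

int main(void)
{
	unsigned long nr_to_reclaim = 1000;
	unsigned long nr_reclaimed = 0;

	while (nr_reclaimed < nr_to_reclaim)
		nr_reclaimed += scan_pass(nr_to_reclaim);

	/* Prints: requested 1000, reclaimed 2000 */
	printf("requested %lu, reclaimed %lu\n", nr_to_reclaim, nr_reclaimed);
	return 0;
}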
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2df9510b7d64..b040951ccd6b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,6 +73,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
 
 struct mem_cgroup *root_mem_cgroup __read_mostly;
 
+/* The number of times we should retry reclaim failures before giving up. */
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
 /* Socket memory accounting disabled? */
@@ -2228,17 +2229,22 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
-static void reclaim_high(struct mem_cgroup *memcg,
-			 unsigned int nr_pages,
-			 gfp_t gfp_mask)
+static unsigned long reclaim_high(struct mem_cgroup *memcg,
+				  unsigned int nr_pages,
+				  gfp_t gfp_mask)
 {
+	unsigned long nr_reclaimed = 0;
+
 	do {
 		if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->high))
 			continue;
 		memcg_memory_event(memcg, MEMCG_HIGH);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
+							     gfp_mask, true);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
+
+	return nr_reclaimed;
 }
 
 static void high_work_func(struct work_struct *work)
@@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void)
 {
 	unsigned long penalty_jiffies;
 	unsigned long pflags;
+	unsigned long nr_reclaimed;
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *memcg;
 
 	if (likely(!nr_pages))
 		return;
 
 	memcg = get_mem_cgroup_from_mm(current->mm);
-	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	current->memcg_nr_pages_over_high = 0;
 
+retry_reclaim:
+	nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
+
 	/*
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
@@ -2403,6 +2413,14 @@ void mem_cgroup_handle_over_high(void)
 	if (penalty_jiffies <= HZ / 100)
 		goto out;
 
+	/*
+	 * If reclaim is making forward progress but we're still over
+	 * memory.high, we want to encourage that rather than doing allocator
+	 * throttling.
+	 */
+	if (nr_reclaimed || nr_retries--)
+		goto retry_reclaim;
+
 	/*
 	 * If we exit early, we're guaranteed to die (since
 	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't