Message ID | 20230109205336.3665937-40-surenb@google.com (mailing list archive)
---|---
State | New
Series | Per-VMA locks
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed.

What kind of regressions?

> To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

Please add some data to justify this additional complexity.
On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed.
>
> What kind of regressions?
>
> > To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> Please add some data to justify this additional complexity.

Sorry, I should have done that in the first place. A 4.3% regression was
noticed when running the execl test from the unixbench suite. The spawn
test also showed a 1.6% regression. Profiling revealed that vma freeing
was taking longer due to call_rcu(), which is slow when RCU callback
offloading is enabled.

I asked Paul McKenney and he explained to me that because the callbacks
are offloaded to some other kthread, possibly running on some other CPU,
it is necessary to use explicit locking. Locking on a per-call_rcu()
basis would result in excessive contention during callback flooding. So,
by batching the call_rcu() work we cut that overhead and reduce this lock
contention.

> --
> Michal Hocko
> SUSE Labs
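For readers following the thread, here is a minimal sketch of the batching scheme the commit message describes: instead of one call_rcu() per VMA, freed VMAs are parked on a per-mm list and handed to RCU in groups, with one call_rcu() per group. This is an illustration, not the actual patch; the field and helper names (vma_free_list, vm_free_list, vm_rcu, drain_free_vmas, VM_AREA_FREE_LIST_MAX, vm_area_cachep) follow the snippets quoted later in the thread and are assumed to exist on mm_struct/vm_area_struct.

```c
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

#define VM_AREA_FREE_LIST_MAX	32	/* illustrative threshold */

/* One RCU callback frees a whole group of VMAs linked at drain time. */
static void vm_area_free_rcu_group(struct rcu_head *head)
{
	struct vm_area_struct *first, *vma, *next;

	first = container_of(head, struct vm_area_struct, vm_rcu);
	/* The group is a detached circular list; walk it and free each VMA. */
	vma = list_next_entry(first, vm_free_list);
	while (vma != first) {
		next = list_next_entry(vma, vm_free_list);
		kmem_cache_free(vm_area_cachep, vma);
		vma = next;
	}
	kmem_cache_free(vm_area_cachep, first);	/* free the leader last */
}

static void drain_free_vmas(struct mm_struct *mm)
{
	struct vm_area_struct *first;
	LIST_HEAD(to_free);

	spin_lock(&mm->vma_free_list.lock);
	list_splice_init(&mm->vma_free_list.head, &to_free);
	mm->vma_free_list.size = 0;
	spin_unlock(&mm->vma_free_list.lock);

	if (list_empty(&to_free))
		return;

	first = list_first_entry(&to_free, struct vm_area_struct, vm_free_list);
	list_del(&to_free);	/* drop the on-stack list head */
	call_rcu(&first->vm_rcu, vm_area_free_rcu_group);	/* one call per group */
}

void vm_area_free(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;
	bool drain;

	free_anon_vma_name(vma);

	spin_lock(&mm->vma_free_list.lock);
	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
	mm->vma_free_list.size++;
	drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
	spin_unlock(&mm->vma_free_list.lock);

	if (drain)
		drain_free_vmas(mm);
}
```

The point of the grouping is simply that a flood of N VMA frees turns into roughly N / VM_AREA_FREE_LIST_MAX call_rcu() invocations, reducing the per-call_rcu() overhead Suren describes above.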
On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > call_rcu() can take a long time when callback offloading is enabled. > > > Its use in the vm_area_free can cause regressions in the exit path when > > > multiple VMAs are being freed. > > > > What kind of regressions. > > > > > To minimize that impact, place VMAs into > > > a list and free them in groups using one call_rcu() call per group. > > > > Please add some data to justify this additional complexity. > > Sorry, should have done that in the first place. A 4.3% regression was > noticed when running execl test from unixbench suite. spawn test also > showed 1.6% regression. Profiling revealed that vma freeing was taking > longer due to call_rcu() which is slow when RCU callback offloading is > enabled. Could you be more specific? vma freeing is async with the RCU so how come this has resulted in a regression? Is there any heavy rcu_synchronize in the exec path? That would be an interesting information.
On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > multiple VMAs are being freed. > > > > > > What kind of regressions. > > > > > > > To minimize that impact, place VMAs into > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > Please add some data to justify this additional complexity. > > > > Sorry, should have done that in the first place. A 4.3% regression was > > noticed when running execl test from unixbench suite. spawn test also > > showed 1.6% regression. Profiling revealed that vma freeing was taking > > longer due to call_rcu() which is slow when RCU callback offloading is > > enabled. > > Could you be more specific? vma freeing is async with the RCU so how > come this has resulted in a regression? Is there any heavy > rcu_synchronize in the exec path? That would be an interesting > information. No, there is no heavy rcu_synchronize() or any other additional synchronous load in the exit path. It's the call_rcu() which can block the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of other call_rcu()'s going on in parallel. Note that call_rcu() calls rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling revealed that this function was taking multiple ms (don't recall the actual number, sorry). Paul's explanation implied that this happens due to contention on the locks taken in this function. For more in-depth details I'll have to ask Paul for help :) This code is quite complex and I don't know all the details of RCU implementation. > -- > Michal Hocko > SUSE Labs
On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote: > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > multiple VMAs are being freed. > > > > > > > > What kind of regressions. > > > > > > > > > To minimize that impact, place VMAs into > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > Please add some data to justify this additional complexity. > > > > > > Sorry, should have done that in the first place. A 4.3% regression was > > > noticed when running execl test from unixbench suite. spawn test also > > > showed 1.6% regression. Profiling revealed that vma freeing was taking > > > longer due to call_rcu() which is slow when RCU callback offloading is > > > enabled. > > > > Could you be more specific? vma freeing is async with the RCU so how > > come this has resulted in a regression? Is there any heavy > > rcu_synchronize in the exec path? That would be an interesting > > information. > > No, there is no heavy rcu_synchronize() or any other additional > synchronous load in the exit path. It's the call_rcu() which can block > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of > other call_rcu()'s going on in parallel. Note that call_rcu() calls > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling > revealed that this function was taking multiple ms (don't recall the > actual number, sorry). Paul's explanation implied that this happens > due to contention on the locks taken in this function. For more > in-depth details I'll have to ask Paul for help :) This code is quite > complex and I don't know all the details of RCU implementation. There are a couple of possibilities here. First, if I am remembering correctly, the time between the call_rcu() and invocation of the corresponding callback was taking multiple seconds, but that was because the kernel was built with CONFIG_LAZY_RCU=y in order to save power by batching RCU work over multiple call_rcu() invocations. If this is causing a problem for a given call site, the shiny new call_rcu_hurry() can be used instead. Doing this gets back to the old-school non-laziness, but can of course consume more power. Second, there is a much shorter one-jiffy delay between the call_rcu() and the invocation of the corresponding callback in kernels built with either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose of this delay is to avoid lock contention, and so this delay is incurred only on CPUs that are queuing callbacks at a rate exceeding 16K/second. This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU invoking call_rcu() at least 16 times within a given jiffy will incur the added delay. The reason for this delay is the use of a separate ->nocb_bypass list. As Suren says, this bypass list is used to reduce lock contention on the main ->cblist. 
This is not needed in old-school kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y (including most datacenter kernels) because in that case the callbacks enqueued by call_rcu() are touched only by the corresponding CPU, so that there is no need for locks. Third, if you are instead seeing multiple milliseconds of CPU consumed by call_rcu() in the common case (for example, without the aid of interrupts, NMIs, or SMIs), please do let me know. That sounds to me like a bug. Or have I lost track of some other slow case? Thanx, Paul
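To make Paul's two cases concrete: the per-jiffy bypass limit he describes works out as 16,000 callbacks/second divided by 1,000 jiffies/second, i.e. 16 call_rcu() calls per jiffy on a HZ=1000 system before the extra one-jiffy delay is incurred. And the call_rcu_hurry() API he mentions takes the same arguments as call_rcu() and simply opts a callback out of lazy batching. Below is a hedged sketch of the two enqueue options for a VMA free; the vm_rcu rcu_head, the callback name, and vm_area_cachep are assumptions for illustration, not the patch's actual code.

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Illustrative callback: the deferred cleanup run after a grace period. */
static void vm_area_free_rcu_cb(struct rcu_head *head)
{
	struct vm_area_struct *vma =
		container_of(head, struct vm_area_struct, vm_rcu);

	kmem_cache_free(vm_area_cachep, vma);	/* vm_area_cachep assumed */
}

void vm_area_free(struct vm_area_struct *vma)
{
	/*
	 * Default: the callback may be batched lazily and/or via the
	 * ->nocb_bypass list on offloaded CPUs, trading callback latency
	 * for reduced lock contention.
	 */
	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);

	/*
	 * Alternative on kernels with the lazy-RCU work, when callback
	 * latency matters more than the extra power/contention cost:
	 *
	 *	call_rcu_hurry(&vma->vm_rcu, vm_area_free_rcu_cb);
	 */
}
```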
On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote: > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > multiple VMAs are being freed. > > > > > > > > > > What kind of regressions. > > > > > > > > > > > To minimize that impact, place VMAs into > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > Please add some data to justify this additional complexity. > > > > > > > > Sorry, should have done that in the first place. A 4.3% regression was > > > > noticed when running execl test from unixbench suite. spawn test also > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking > > > > longer due to call_rcu() which is slow when RCU callback offloading is > > > > enabled. > > > > > > Could you be more specific? vma freeing is async with the RCU so how > > > come this has resulted in a regression? Is there any heavy > > > rcu_synchronize in the exec path? That would be an interesting > > > information. > > > > No, there is no heavy rcu_synchronize() or any other additional > > synchronous load in the exit path. It's the call_rcu() which can block > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of > > other call_rcu()'s going on in parallel. Note that call_rcu() calls > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling > > revealed that this function was taking multiple ms (don't recall the > > actual number, sorry). Paul's explanation implied that this happens > > due to contention on the locks taken in this function. For more > > in-depth details I'll have to ask Paul for help :) This code is quite > > complex and I don't know all the details of RCU implementation. > > There are a couple of possibilities here. > > First, if I am remembering correctly, the time between the call_rcu() > and invocation of the corresponding callback was taking multiple seconds, > but that was because the kernel was built with CONFIG_LAZY_RCU=y in > order to save power by batching RCU work over multiple call_rcu() > invocations. If this is causing a problem for a given call site, the > shiny new call_rcu_hurry() can be used instead. Doing this gets back > to the old-school non-laziness, but can of course consume more power. That would not be the case because CONFIG_LAZY_RCU was not an option at the time I was profiling this issue. Laxy RCU would be a great option to replace this patch but unfortunately it's not the default behavior, so I would still have to implement this batching in case lazy RCU is not enabled. > > Second, there is a much shorter one-jiffy delay between the call_rcu() > and the invocation of the corresponding callback in kernels built with > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only > on CPUs mentioned in the rcu_nocbs kernel boot parameters). 
The purpose > of this delay is to avoid lock contention, and so this delay is incurred > only on CPUs that are queuing callbacks at a rate exceeding 16K/second. > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU > invoking call_rcu() at least 16 times within a given jiffy will incur > the added delay. The reason for this delay is the use of a separate > ->nocb_bypass list. As Suren says, this bypass list is used to reduce > lock contention on the main ->cblist. This is not needed in old-school > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y > (including most datacenter kernels) because in that case the callbacks > enqueued by call_rcu() are touched only by the corresponding CPU, so > that there is no need for locks. I believe this is the reason in my profiled case. > > Third, if you are instead seeing multiple milliseconds of CPU consumed by > call_rcu() in the common case (for example, without the aid of interrupts, > NMIs, or SMIs), please do let me know. That sounds to me like a bug. I don't think I've seen such a case. Thanks for clarifications, Paul! > > Or have I lost track of some other slow case? > > Thanx, Paul
On Wed, Jan 18, 2023 at 11:01:08AM -0800, Suren Baghdasaryan wrote: > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote: > > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > > multiple VMAs are being freed. > > > > > > > > > > > > What kind of regressions. > > > > > > > > > > > > > To minimize that impact, place VMAs into > > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > > > Please add some data to justify this additional complexity. > > > > > > > > > > Sorry, should have done that in the first place. A 4.3% regression was > > > > > noticed when running execl test from unixbench suite. spawn test also > > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking > > > > > longer due to call_rcu() which is slow when RCU callback offloading is > > > > > enabled. > > > > > > > > Could you be more specific? vma freeing is async with the RCU so how > > > > come this has resulted in a regression? Is there any heavy > > > > rcu_synchronize in the exec path? That would be an interesting > > > > information. > > > > > > No, there is no heavy rcu_synchronize() or any other additional > > > synchronous load in the exit path. It's the call_rcu() which can block > > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of > > > other call_rcu()'s going on in parallel. Note that call_rcu() calls > > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling > > > revealed that this function was taking multiple ms (don't recall the > > > actual number, sorry). Paul's explanation implied that this happens > > > due to contention on the locks taken in this function. For more > > > in-depth details I'll have to ask Paul for help :) This code is quite > > > complex and I don't know all the details of RCU implementation. > > > > There are a couple of possibilities here. > > > > First, if I am remembering correctly, the time between the call_rcu() > > and invocation of the corresponding callback was taking multiple seconds, > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in > > order to save power by batching RCU work over multiple call_rcu() > > invocations. If this is causing a problem for a given call site, the > > shiny new call_rcu_hurry() can be used instead. Doing this gets back > > to the old-school non-laziness, but can of course consume more power. > > That would not be the case because CONFIG_LAZY_RCU was not an option > at the time I was profiling this issue. > Laxy RCU would be a great option to replace this patch but > unfortunately it's not the default behavior, so I would still have to > implement this batching in case lazy RCU is not enabled. 
> > > Second, there is a much shorter one-jiffy delay between the call_rcu() > > and the invocation of the corresponding callback in kernels built with > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose > > of this delay is to avoid lock contention, and so this delay is incurred > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second. > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU > > invoking call_rcu() at least 16 times within a given jiffy will incur > > the added delay. The reason for this delay is the use of a separate > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce > > lock contention on the main ->cblist. This is not needed in old-school > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y > > (including most datacenter kernels) because in that case the callbacks > > enqueued by call_rcu() are touched only by the corresponding CPU, so > > that there is no need for locks. > > I believe this is the reason in my profiled case. > > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by > > call_rcu() in the common case (for example, without the aid of interrupts, > > NMIs, or SMIs), please do let me know. That sounds to me like a bug. > > I don't think I've seen such a case. Whew!!! ;-) > Thanks for clarifications, Paul! No problem! Thanx, Paul
On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote: > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote: [...] > > There are a couple of possibilities here. > > > > First, if I am remembering correctly, the time between the call_rcu() > > and invocation of the corresponding callback was taking multiple seconds, > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in > > order to save power by batching RCU work over multiple call_rcu() > > invocations. If this is causing a problem for a given call site, the > > shiny new call_rcu_hurry() can be used instead. Doing this gets back > > to the old-school non-laziness, but can of course consume more power. > > That would not be the case because CONFIG_LAZY_RCU was not an option > at the time I was profiling this issue. > Laxy RCU would be a great option to replace this patch but > unfortunately it's not the default behavior, so I would still have to > implement this batching in case lazy RCU is not enabled. > > > > > Second, there is a much shorter one-jiffy delay between the call_rcu() > > and the invocation of the corresponding callback in kernels built with > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose > > of this delay is to avoid lock contention, and so this delay is incurred > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second. > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU > > invoking call_rcu() at least 16 times within a given jiffy will incur > > the added delay. The reason for this delay is the use of a separate > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce > > lock contention on the main ->cblist. This is not needed in old-school > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y > > (including most datacenter kernels) because in that case the callbacks > > enqueued by call_rcu() are touched only by the corresponding CPU, so > > that there is no need for locks. > > I believe this is the reason in my profiled case. > > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by > > call_rcu() in the common case (for example, without the aid of interrupts, > > NMIs, or SMIs), please do let me know. That sounds to me like a bug. > > I don't think I've seen such a case. > Thanks for clarifications, Paul! Thanks for the explanation Paul. I have to say this has caught me as a surprise. There are just not enough details about the benchmark to understand what is going on but I find it rather surprising that call_rcu can induce a higher overhead than the actual kmem_cache_free which is the callback. My naive understanding has been that call_rcu is really fast way to defer the execution to the RCU safe context to do the final cleanup.
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed. To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

After some more clarification I can understand how call_rcu might not be
super happy about thousands of callbacks to be invoked, and I do agree
that this is not really optimal.

On the other hand, I do not like this solution much either.
VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that much
with processes with a huge number of vmas either. There would still be
thousands of callbacks to be scheduled without a good reason.

Instead, are there any other cases than remove_vma that need this
batching? We could easily just link all the vmas into a linked list and
use a single call_rcu instead, no? This would both simplify the
implementation and remove the scaling issue, and we would not have to
argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
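For clarity, a minimal sketch of the alternative Michal is proposing here: the unmap path collects every VMA it removes on a local list and hands the whole list to a single call_rcu() once tear-down is done, with no per-mm free list and no threshold. The callback reuses vm_area_free_rcu_group() from the earlier sketch; the caller shown in the comment is an assumption about where this would hook in, not code from the series.

```c
/*
 * Sketch of "one call_rcu() for the whole batch".  Callers such as
 * exit_mmap() or remove_mt() would link each removed VMA onto a local
 * LIST_HEAD(vmas) and call free_vmas_rcu(&vmas) once at the end.
 */
static void free_vmas_rcu(struct list_head *vmas)
{
	struct vm_area_struct *first;

	if (list_empty(vmas))
		return;

	first = list_first_entry(vmas, struct vm_area_struct, vm_free_list);
	list_del(vmas);		/* detach the local (on-stack) list head */
	call_rcu(&first->vm_rcu, vm_area_free_rcu_group);
}
```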
On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > call_rcu() can take a long time when callback offloading is enabled. > > Its use in the vm_area_free can cause regressions in the exit path when > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > a list and free them in groups using one call_rcu() call per group. > > After some more clarification I can understand how call_rcu might not be > super happy about thousands of callbacks to be invoked and I do agree > that this is not really optimal. > > On the other hand I do not like this solution much either. > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > much with processes with a huge number of vmas either. It would still be > in housands of callbacks to be scheduled without a good reason. > > Instead, are there any other cases than remove_vma that need this > batching? We could easily just link all the vmas into linked list and > use a single call_rcu instead, no? This would both simplify the > implementation, remove the scaling issue as well and we do not have to > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. Yes, I agree the solution is not stellar. I wanted something simple but this is probably too simple. OTOH keeping all dead vm_area_structs on the list without hooking up a shrinker (additional complexity) does not sound too appealing either. WDYT about time domain throttling to limit draining the list to say once per second like this: void vm_area_free(struct vm_area_struct *vma) { struct mm_struct *mm = vma->vm_mm; bool drain; free_anon_vma_name(vma); spin_lock(&mm->vma_free_list.lock); list_add(&vma->vm_free_list, &mm->vma_free_list.head); mm->vma_free_list.size++; - drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX; + drain = jiffies > mm->last_drain_tm + HZ; spin_unlock(&mm->vma_free_list.lock); - if (drain) + if (drain) { drain_free_vmas(mm); + mm->last_drain_tm = jiffies; + } } Ultimately we want to prevent very frequent call_rcu() calls, so throttling in the time domain seems appropriate. That's the simplest way I can think of to address your concern about a quick spike in VMA freeing. It does not place any restriction on the list size and we might have excessive dead vm_area_structs if after a large spike there are no vm_area_free() calls but I don't know if that's a real problem, so not sure we should be addressing it at this time. WDYT? > > -- > Michal Hocko > SUSE Labs
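Because the inline diff above is hard to read once flattened, here is vm_area_free() with the proposed time-domain throttle applied, reconstructed from that diff. The new mm->last_drain_tm field and the locking around it are exactly as in the quoted snippet; the time_after() remark is an editorial aside, not part of the proposal.

```c
/* Reconstruction of the proposed change: throttle drains by time, not count. */
void vm_area_free(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;
	bool drain;

	free_anon_vma_name(vma);

	spin_lock(&mm->vma_free_list.lock);
	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
	mm->vma_free_list.size++;
	/* was: drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX; */
	/* time_after(jiffies, mm->last_drain_tm + HZ) would be the usual
	 * wraparound-safe idiom for this comparison. */
	drain = jiffies > mm->last_drain_tm + HZ;
	spin_unlock(&mm->vma_free_list.lock);

	if (drain) {
		drain_free_vmas(mm);
		mm->last_drain_tm = jiffies;
	}
}
```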
On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote: > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote: > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote: > [...] > > > There are a couple of possibilities here. > > > > > > First, if I am remembering correctly, the time between the call_rcu() > > > and invocation of the corresponding callback was taking multiple seconds, > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in > > > order to save power by batching RCU work over multiple call_rcu() > > > invocations. If this is causing a problem for a given call site, the > > > shiny new call_rcu_hurry() can be used instead. Doing this gets back > > > to the old-school non-laziness, but can of course consume more power. > > > > That would not be the case because CONFIG_LAZY_RCU was not an option > > at the time I was profiling this issue. > > Laxy RCU would be a great option to replace this patch but > > unfortunately it's not the default behavior, so I would still have to > > implement this batching in case lazy RCU is not enabled. > > > > > > > > Second, there is a much shorter one-jiffy delay between the call_rcu() > > > and the invocation of the corresponding callback in kernels built with > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose > > > of this delay is to avoid lock contention, and so this delay is incurred > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second. > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU > > > invoking call_rcu() at least 16 times within a given jiffy will incur > > > the added delay. The reason for this delay is the use of a separate > > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce > > > lock contention on the main ->cblist. This is not needed in old-school > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y > > > (including most datacenter kernels) because in that case the callbacks > > > enqueued by call_rcu() are touched only by the corresponding CPU, so > > > that there is no need for locks. > > > > I believe this is the reason in my profiled case. > > > > > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by > > > call_rcu() in the common case (for example, without the aid of interrupts, > > > NMIs, or SMIs), please do let me know. That sounds to me like a bug. > > > > I don't think I've seen such a case. > > Thanks for clarifications, Paul! > > Thanks for the explanation Paul. I have to say this has caught me as a > surprise. There are just not enough details about the benchmark to > understand what is going on but I find it rather surprising that > call_rcu can induce a higher overhead than the actual kmem_cache_free > which is the callback. My naive understanding has been that call_rcu is > really fast way to defer the execution to the RCU safe context to do the > final cleanup. If I am following along correctly (ha!), then your "induce a higher overhead" should be something like "induce a higher to-kfree() latency". Of course, there already is a higher latency-to-kfree via call_rcu() than via a direct call to kfree(), and callback-offload CPUs that are being flooded with callbacks raise that latency a jiffy or so more in order to avoid lock contention. 
If this becomes a problem, the callback-offloading code can be a bit smarter about avoiding lock contention, but need to see a real problem before I make that change. But if there is a real problem I will of course fix it. Or did I miss a turn in this discussion? Thanx, Paul
On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote: > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > call_rcu() can take a long time when callback offloading is enabled. > > > Its use in the vm_area_free can cause regressions in the exit path when > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > a list and free them in groups using one call_rcu() call per group. > > > > After some more clarification I can understand how call_rcu might not be > > super happy about thousands of callbacks to be invoked and I do agree > > that this is not really optimal. > > > > On the other hand I do not like this solution much either. > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > much with processes with a huge number of vmas either. It would still be > > in housands of callbacks to be scheduled without a good reason. > > > > Instead, are there any other cases than remove_vma that need this > > batching? We could easily just link all the vmas into linked list and > > use a single call_rcu instead, no? This would both simplify the > > implementation, remove the scaling issue as well and we do not have to > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > Yes, I agree the solution is not stellar. I wanted something simple > but this is probably too simple. OTOH keeping all dead vm_area_structs > on the list without hooking up a shrinker (additional complexity) does > not sound too appealing either. WDYT about time domain throttling to > limit draining the list to say once per second like this: > > void vm_area_free(struct vm_area_struct *vma) > { > struct mm_struct *mm = vma->vm_mm; > bool drain; > > free_anon_vma_name(vma); > > spin_lock(&mm->vma_free_list.lock); > list_add(&vma->vm_free_list, &mm->vma_free_list.head); > mm->vma_free_list.size++; > - drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX; > + drain = jiffies > mm->last_drain_tm + HZ; > > spin_unlock(&mm->vma_free_list.lock); > > - if (drain) > + if (drain) { > drain_free_vmas(mm); > + mm->last_drain_tm = jiffies; > + } > } > > Ultimately we want to prevent very frequent call_rcu() calls, so > throttling in the time domain seems appropriate. That's the simplest > way I can think of to address your concern about a quick spike in VMA > freeing. It does not place any restriction on the list size and we > might have excessive dead vm_area_structs if after a large spike there > are no vm_area_free() calls but I don't know if that's a real problem, > so not sure we should be addressing it at this time. WDYT? Just to double-check, we really did try the very frequent call_rcu() invocations and we really did see a problem, correct? Although it is not perfect, call_rcu() is designed to take a fair amount of abuse. So if we didn't see a real problem, the frequent call_rcu() invocations might be a bit simpler. Thanx, Paul
On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote: > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > After some more clarification I can understand how call_rcu might not be > > > super happy about thousands of callbacks to be invoked and I do agree > > > that this is not really optimal. > > > > > > On the other hand I do not like this solution much either. > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > much with processes with a huge number of vmas either. It would still be > > > in housands of callbacks to be scheduled without a good reason. > > > > > > Instead, are there any other cases than remove_vma that need this > > > batching? We could easily just link all the vmas into linked list and > > > use a single call_rcu instead, no? This would both simplify the > > > implementation, remove the scaling issue as well and we do not have to > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > Yes, I agree the solution is not stellar. I wanted something simple > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > on the list without hooking up a shrinker (additional complexity) does > > not sound too appealing either. WDYT about time domain throttling to > > limit draining the list to say once per second like this: > > > > void vm_area_free(struct vm_area_struct *vma) > > { > > struct mm_struct *mm = vma->vm_mm; > > bool drain; > > > > free_anon_vma_name(vma); > > > > spin_lock(&mm->vma_free_list.lock); > > list_add(&vma->vm_free_list, &mm->vma_free_list.head); > > mm->vma_free_list.size++; > > - drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX; > > + drain = jiffies > mm->last_drain_tm + HZ; > > > > spin_unlock(&mm->vma_free_list.lock); > > > > - if (drain) > > + if (drain) { > > drain_free_vmas(mm); > > + mm->last_drain_tm = jiffies; > > + } > > } > > > > Ultimately we want to prevent very frequent call_rcu() calls, so > > throttling in the time domain seems appropriate. That's the simplest > > way I can think of to address your concern about a quick spike in VMA > > freeing. It does not place any restriction on the list size and we > > might have excessive dead vm_area_structs if after a large spike there > > are no vm_area_free() calls but I don't know if that's a real problem, > > so not sure we should be addressing it at this time. WDYT? > > Just to double-check, we really did try the very frequent call_rcu() > invocations and we really did see a problem, correct? Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw regressions when a process exits and all its VMAs get destroyed, causing a flood of call_rcu()'s. > > Although it is not perfect, call_rcu() is designed to take a fair amount > of abuse. So if we didn't see a real problem, the frequent call_rcu() > invocations might be a bit simpler. > > Thanx, Paul
On Thu, Jan 19, 2023 at 11:47:36AM -0800, Suren Baghdasaryan wrote: > On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote: > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > that this is not really optimal. > > > > > > > > On the other hand I do not like this solution much either. > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > much with processes with a huge number of vmas either. It would still be > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > batching? We could easily just link all the vmas into linked list and > > > > use a single call_rcu instead, no? This would both simplify the > > > > implementation, remove the scaling issue as well and we do not have to > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > on the list without hooking up a shrinker (additional complexity) does > > > not sound too appealing either. WDYT about time domain throttling to > > > limit draining the list to say once per second like this: > > > > > > void vm_area_free(struct vm_area_struct *vma) > > > { > > > struct mm_struct *mm = vma->vm_mm; > > > bool drain; > > > > > > free_anon_vma_name(vma); > > > > > > spin_lock(&mm->vma_free_list.lock); > > > list_add(&vma->vm_free_list, &mm->vma_free_list.head); > > > mm->vma_free_list.size++; > > > - drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX; > > > + drain = jiffies > mm->last_drain_tm + HZ; > > > > > > spin_unlock(&mm->vma_free_list.lock); > > > > > > - if (drain) > > > + if (drain) { > > > drain_free_vmas(mm); > > > + mm->last_drain_tm = jiffies; > > > + } > > > } > > > > > > Ultimately we want to prevent very frequent call_rcu() calls, so > > > throttling in the time domain seems appropriate. That's the simplest > > > way I can think of to address your concern about a quick spike in VMA > > > freeing. It does not place any restriction on the list size and we > > > might have excessive dead vm_area_structs if after a large spike there > > > are no vm_area_free() calls but I don't know if that's a real problem, > > > so not sure we should be addressing it at this time. WDYT? > > > > Just to double-check, we really did try the very frequent call_rcu() > > invocations and we really did see a problem, correct? > > Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw > regressions when a process exits and all its VMAs get destroyed, > causing a flood of call_rcu()'s. Thank you for the reminder, real problem needs solution. ;-) Thanx, Paul > > Although it is not perfect, call_rcu() is designed to take a fair amount > > of abuse. 
So if we didn't see a real problem, the frequent call_rcu() > > invocations might be a bit simpler.
On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > call_rcu() can take a long time when callback offloading is enabled. > > > Its use in the vm_area_free can cause regressions in the exit path when > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > a list and free them in groups using one call_rcu() call per group. > > > > After some more clarification I can understand how call_rcu might not be > > super happy about thousands of callbacks to be invoked and I do agree > > that this is not really optimal. > > > > On the other hand I do not like this solution much either. > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > much with processes with a huge number of vmas either. It would still be > > in housands of callbacks to be scheduled without a good reason. > > > > Instead, are there any other cases than remove_vma that need this > > batching? We could easily just link all the vmas into linked list and > > use a single call_rcu instead, no? This would both simplify the > > implementation, remove the scaling issue as well and we do not have to > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > Yes, I agree the solution is not stellar. I wanted something simple > but this is probably too simple. OTOH keeping all dead vm_area_structs > on the list without hooking up a shrinker (additional complexity) does > not sound too appealing either. I suspect you have missed my idea. I do not really want to keep the list around or any shrinker. It is dead simple. Collect all vmas in remove_vma and then call_rcu the whole list at once after the whole list (be it from exit_mmap or remove_mt). See?
On Thu 19-01-23 11:17:07, Paul E. McKenney wrote: > On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote: > > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote: > > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > [...] > > > > There are a couple of possibilities here. > > > > > > > > First, if I am remembering correctly, the time between the call_rcu() > > > > and invocation of the corresponding callback was taking multiple seconds, > > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in > > > > order to save power by batching RCU work over multiple call_rcu() > > > > invocations. If this is causing a problem for a given call site, the > > > > shiny new call_rcu_hurry() can be used instead. Doing this gets back > > > > to the old-school non-laziness, but can of course consume more power. > > > > > > That would not be the case because CONFIG_LAZY_RCU was not an option > > > at the time I was profiling this issue. > > > Laxy RCU would be a great option to replace this patch but > > > unfortunately it's not the default behavior, so I would still have to > > > implement this batching in case lazy RCU is not enabled. > > > > > > > > > > > Second, there is a much shorter one-jiffy delay between the call_rcu() > > > > and the invocation of the corresponding callback in kernels built with > > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full > > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only > > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose > > > > of this delay is to avoid lock contention, and so this delay is incurred > > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second. > > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU > > > > invoking call_rcu() at least 16 times within a given jiffy will incur > > > > the added delay. The reason for this delay is the use of a separate > > > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce > > > > lock contention on the main ->cblist. This is not needed in old-school > > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y > > > > (including most datacenter kernels) because in that case the callbacks > > > > enqueued by call_rcu() are touched only by the corresponding CPU, so > > > > that there is no need for locks. > > > > > > I believe this is the reason in my profiled case. > > > > > > > > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by > > > > call_rcu() in the common case (for example, without the aid of interrupts, > > > > NMIs, or SMIs), please do let me know. That sounds to me like a bug. > > > > > > I don't think I've seen such a case. > > > Thanks for clarifications, Paul! > > > > Thanks for the explanation Paul. I have to say this has caught me as a > > surprise. There are just not enough details about the benchmark to > > understand what is going on but I find it rather surprising that > > call_rcu can induce a higher overhead than the actual kmem_cache_free > > which is the callback. My naive understanding has been that call_rcu is > > really fast way to defer the execution to the RCU safe context to do the > > final cleanup. > > If I am following along correctly (ha!), then your "induce a higher > overhead" should be something like "induce a higher to-kfree() latency". Yes, this is expected. 
> Of course, there already is a higher latency-to-kfree via call_rcu() > than via a direct call to kfree(), and callback-offload CPUs that are > being flooded with callbacks raise that latency a jiffy or so more in > order to avoid lock contention. > > If this becomes a problem, the callback-offloading code can be a bit > smarter about avoiding lock contention, but need to see a real problem > before I make that change. But if there is a real problem I will of > course fix it. I believe that Suren claims that the call_rcu is really visible in the exit_mmap case. Time-to-free actual vmas shouldn't really be material for that path. If that happens much more later on there could be some side effects by an increased memory consumption but that should be marginal. How fast exit_mmap really is should only depend on direct calls from that path. But I guess we need some specific numbers from Suren to be sure what is going on here. Thanks!
On Fri, Jan 20, 2023 at 09:57:05AM +0100, Michal Hocko wrote: > On Thu 19-01-23 11:17:07, Paul E. McKenney wrote: > > On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote: > > > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote: > > > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > [...] > > > > > There are a couple of possibilities here. > > > > > > > > > > First, if I am remembering correctly, the time between the call_rcu() > > > > > and invocation of the corresponding callback was taking multiple seconds, > > > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in > > > > > order to save power by batching RCU work over multiple call_rcu() > > > > > invocations. If this is causing a problem for a given call site, the > > > > > shiny new call_rcu_hurry() can be used instead. Doing this gets back > > > > > to the old-school non-laziness, but can of course consume more power. > > > > > > > > That would not be the case because CONFIG_LAZY_RCU was not an option > > > > at the time I was profiling this issue. > > > > Laxy RCU would be a great option to replace this patch but > > > > unfortunately it's not the default behavior, so I would still have to > > > > implement this batching in case lazy RCU is not enabled. > > > > > > > > > > > > > > Second, there is a much shorter one-jiffy delay between the call_rcu() > > > > > and the invocation of the corresponding callback in kernels built with > > > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full > > > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only > > > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose > > > > > of this delay is to avoid lock contention, and so this delay is incurred > > > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second. > > > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU > > > > > invoking call_rcu() at least 16 times within a given jiffy will incur > > > > > the added delay. The reason for this delay is the use of a separate > > > > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce > > > > > lock contention on the main ->cblist. This is not needed in old-school > > > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y > > > > > (including most datacenter kernels) because in that case the callbacks > > > > > enqueued by call_rcu() are touched only by the corresponding CPU, so > > > > > that there is no need for locks. > > > > > > > > I believe this is the reason in my profiled case. > > > > > > > > > > > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by > > > > > call_rcu() in the common case (for example, without the aid of interrupts, > > > > > NMIs, or SMIs), please do let me know. That sounds to me like a bug. > > > > > > > > I don't think I've seen such a case. > > > > Thanks for clarifications, Paul! > > > > > > Thanks for the explanation Paul. I have to say this has caught me as a > > > surprise. There are just not enough details about the benchmark to > > > understand what is going on but I find it rather surprising that > > > call_rcu can induce a higher overhead than the actual kmem_cache_free > > > which is the callback. My naive understanding has been that call_rcu is > > > really fast way to defer the execution to the RCU safe context to do the > > > final cleanup. 
> > > > If I am following along correctly (ha!), then your "induce a higher > > overhead" should be something like "induce a higher to-kfree() latency". > > Yes, this is expected. > > > Of course, there already is a higher latency-to-kfree via call_rcu() > > than via a direct call to kfree(), and callback-offload CPUs that are > > being flooded with callbacks raise that latency a jiffy or so more in > > order to avoid lock contention. > > > > If this becomes a problem, the callback-offloading code can be a bit > > smarter about avoiding lock contention, but need to see a real problem > > before I make that change. But if there is a real problem I will of > > course fix it. > > I believe that Suren claims that the call_rcu is really visible in the > exit_mmap case. Time-to-free actual vmas shouldn't really be material > for that path. If that happens much more later on there could be some > side effects by an increased memory consumption but that should be > marginal. How fast exit_mmap really is should only depend on direct > calls from that path. > > But I guess we need some specific numbers from Suren to be sure what is > going on here. Actually, Suren did discuss these (perhaps offlist) back in August. I was just being forgetful. :-/ Thanx, Paul
On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > After some more clarification I can understand how call_rcu might not be > > > super happy about thousands of callbacks to be invoked and I do agree > > > that this is not really optimal. > > > > > > On the other hand I do not like this solution much either. > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > much with processes with a huge number of vmas either. It would still be > > > in housands of callbacks to be scheduled without a good reason. > > > > > > Instead, are there any other cases than remove_vma that need this > > > batching? We could easily just link all the vmas into linked list and > > > use a single call_rcu instead, no? This would both simplify the > > > implementation, remove the scaling issue as well and we do not have to > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > Yes, I agree the solution is not stellar. I wanted something simple > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > on the list without hooking up a shrinker (additional complexity) does > > not sound too appealing either. > > I suspect you have missed my idea. I do not really want to keep the list > around or any shrinker. It is dead simple. Collect all vmas in > remove_vma and then call_rcu the whole list at once after the whole list > (be it from exit_mmap or remove_mt). See? Yes, I understood your idea but keeping dead objects until the process exits even when the system is low on memory (no shrinkers attached) seems too wasteful. If we do this I would advocate for attaching a shrinker. > > -- > Michal Hocko > SUSE Labs
On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > that this is not really optimal. > > > > > > > > On the other hand I do not like this solution much either. > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > much with processes with a huge number of vmas either. It would still be > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > batching? We could easily just link all the vmas into linked list and > > > > use a single call_rcu instead, no? This would both simplify the > > > > implementation, remove the scaling issue as well and we do not have to > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > on the list without hooking up a shrinker (additional complexity) does > > > not sound too appealing either. > > > > I suspect you have missed my idea. I do not really want to keep the list > > around or any shrinker. It is dead simple. Collect all vmas in > > remove_vma and then call_rcu the whole list at once after the whole list > > (be it from exit_mmap or remove_mt). See? > > Yes, I understood your idea but keeping dead objects until the process > exits even when the system is low on memory (no shrinkers attached) > seems too wasteful. If we do this I would advocate for attaching a > shrinker. Maybe even simpler, since we are hit with this VMA freeing flood during exit_mmap (when all VMAs are destroyed), we pass a hint to vm_area_free to batch the destruction and all other cases call call_rcu()? I don't think there will be other cases of VMA destruction floods. > > > > > -- > > Michal Hocko > > SUSE Labs
On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > that this is not really optimal. > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > much with processes with a huge number of vmas either. It would still be > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > batching? We could easily just link all the vmas into linked list and > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > on the list without hooking up a shrinker (additional complexity) does > > > > not sound too appealing either. > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > around or any shrinker. It is dead simple. Collect all vmas in > > > remove_vma and then call_rcu the whole list at once after the whole list > > > (be it from exit_mmap or remove_mt). See? > > > > Yes, I understood your idea but keeping dead objects until the process > > exits even when the system is low on memory (no shrinkers attached) > > seems too wasteful. If we do this I would advocate for attaching a > > shrinker. > > Maybe even simpler, since we are hit with this VMA freeing flood > during exit_mmap (when all VMAs are destroyed), we pass a hint to > vm_area_free to batch the destruction and all other cases call > call_rcu()? I don't think there will be other cases of VMA destruction > floods. ... or have two different call_rcu functions; one for munmap() and one for exit. It'd be nice to use kmem_cache_free_bulk().
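To make Matthew's suggestion concrete: kmem_cache_free_bulk() takes the cache, a count, and an array of object pointers, so a batch of doomed VMAs could be freed with one slab operation, either directly on the exit path or from a single RCU callback on the munmap() path. A rough sketch under those assumptions; the helper name, the batch size, and the visibility of vm_area_cachep are illustrative.

```c
#include <linux/slab.h>

#define VMA_FREE_BULK_MAX 64	/* illustrative batch size */

/*
 * Sketch only: free a batch of VMAs with one bulk slab call instead of
 * one kmem_cache_free() per VMA.
 */
static void vm_area_free_bulk(struct vm_area_struct **vmas, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		free_anon_vma_name(vmas[i]);	/* per-VMA cleanup first */

	kmem_cache_free_bulk(vm_area_cachep, nr, (void **)vmas);
}
```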
* Matthew Wilcox <willy@infradead.org> [230120 11:50]: > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > > that this is not really optimal. > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > > much with processes with a huge number of vmas either. It would still be > > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > > batching? We could easily just link all the vmas into linked list and > > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > > on the list without hooking up a shrinker (additional complexity) does > > > > > not sound too appealing either. > > > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > remove_vma and then call_rcu the whole list at once after the whole list > > > > (be it from exit_mmap or remove_mt). See? > > > > > > Yes, I understood your idea but keeping dead objects until the process > > > exits even when the system is low on memory (no shrinkers attached) > > > seems too wasteful. If we do this I would advocate for attaching a > > > shrinker. > > > > Maybe even simpler, since we are hit with this VMA freeing flood > > during exit_mmap (when all VMAs are destroyed), we pass a hint to > > vm_area_free to batch the destruction and all other cases call > > call_rcu()? I don't think there will be other cases of VMA destruction > > floods. > > ... or have two different call_rcu functions; one for munmap() and > one for exit. It'd be nice to use kmem_cache_free_bulk(). Do we even need a call_rcu on exit? At the point of freeing the VMAs we have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock. Once we have obtained the write lock again, I think it's safe to say we can just go ahead and free the VMAs directly.
On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Matthew Wilcox <willy@infradead.org> [230120 11:50]: > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > > > that this is not really optimal. > > > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > > > much with processes with a huge number of vmas either. It would still be > > > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > > > batching? We could easily just link all the vmas into linked list and > > > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > > > on the list without hooking up a shrinker (additional complexity) does > > > > > > not sound too appealing either. > > > > > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > > remove_vma and then call_rcu the whole list at once after the whole list > > > > > (be it from exit_mmap or remove_mt). See? > > > > > > > > Yes, I understood your idea but keeping dead objects until the process > > > > exits even when the system is low on memory (no shrinkers attached) > > > > seems too wasteful. If we do this I would advocate for attaching a > > > > shrinker. > > > > > > Maybe even simpler, since we are hit with this VMA freeing flood > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to > > > vm_area_free to batch the destruction and all other cases call > > > call_rcu()? I don't think there will be other cases of VMA destruction > > > floods. > > > > ... or have two different call_rcu functions; one for munmap() and > > one for exit. It'd be nice to use kmem_cache_free_bulk(). > > Do we even need a call_rcu on exit? At the point of freeing the VMAs we > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock. > Once we have obtained the write lock again, I think it's safe to say we > can just go ahead and free the VMAs directly. 
I think that would still be racy if the page fault handler found that VMA under RCU read protection but did not lock it yet (no locks are held yet). If it's preempted, the VMA can be freed and destroyed from under it without an RCU grace period.
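Roughly, the window Suren describes sits inside lock_vma_under_rcu(), between the lookup and the trylock. A fragment-level sketch using this series' naming (the maple tree walk is an assumption about how the lookup is done):

```c
rcu_read_lock();
vma = mas_walk(&mas);	/* VMA found under RCU, per-VMA lock not taken yet */
/*
 * If the reader is preempted right here and the exit path frees the VMA
 * immediately, without waiting for an RCU grace period, the trylock below
 * dereferences freed memory.
 */
if (vma && !vma_read_trylock(vma))
	vma = NULL;
rcu_read_unlock();
```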
On Fri, Jan 20, 2023 at 04:49:42PM +0000, Matthew Wilcox wrote: > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > > that this is not really optimal. > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > > much with processes with a huge number of vmas either. It would still be > > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > > batching? We could easily just link all the vmas into linked list and > > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > > on the list without hooking up a shrinker (additional complexity) does > > > > > not sound too appealing either. > > > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > remove_vma and then call_rcu the whole list at once after the whole list > > > > (be it from exit_mmap or remove_mt). See? > > > > > > Yes, I understood your idea but keeping dead objects until the process > > > exits even when the system is low on memory (no shrinkers attached) > > > seems too wasteful. If we do this I would advocate for attaching a > > > shrinker. > > > > Maybe even simpler, since we are hit with this VMA freeing flood > > during exit_mmap (when all VMAs are destroyed), we pass a hint to > > vm_area_free to batch the destruction and all other cases call > > call_rcu()? I don't think there will be other cases of VMA destruction > > floods. > > ... or have two different call_rcu functions; one for munmap() and > one for exit. It'd be nice to use kmem_cache_free_bulk(). Good point, kfree_rcu(p, r) where "r" is the name of the rcu_head structure's field, is much more cache-efficient. The penalty is that there is no callback function to do any cleanup. There is just a kfree()/kvfree (bulk version where applicable), nothing else. Thanx, Paul
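A minimal illustration of the two-argument kfree_rcu() form Paul refers to; struct foo is a placeholder, not anything in the series:

```c
struct foo {
	int data;
	struct rcu_head rcu;	/* the second kfree_rcu() argument names this member */
};

static void drop_foo(struct foo *p)
{
	/*
	 * Queues the object to be kfree()d (bulk-freed where possible) after a
	 * grace period; unlike call_rcu() there is no callback in which to run
	 * extra cleanup.
	 */
	kfree_rcu(p, rcu);
}
```

For slab-cache objects like vm_area_struct, the analogous bulk path would be the kmem_cache_free_bulk() idea mentioned above.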
On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote: > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > * Matthew Wilcox <willy@infradead.org> [230120 11:50]: > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > > > > that this is not really optimal. > > > > > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > > > > much with processes with a huge number of vmas either. It would still be > > > > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > > > > batching? We could easily just link all the vmas into linked list and > > > > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > > > > on the list without hooking up a shrinker (additional complexity) does > > > > > > > not sound too appealing either. > > > > > > > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > > > remove_vma and then call_rcu the whole list at once after the whole list > > > > > > (be it from exit_mmap or remove_mt). See? > > > > > > > > > > Yes, I understood your idea but keeping dead objects until the process > > > > > exits even when the system is low on memory (no shrinkers attached) > > > > > seems too wasteful. If we do this I would advocate for attaching a > > > > > shrinker. > > > > > > > > Maybe even simpler, since we are hit with this VMA freeing flood > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to > > > > vm_area_free to batch the destruction and all other cases call > > > > call_rcu()? I don't think there will be other cases of VMA destruction > > > > floods. > > > > > > ... or have two different call_rcu functions; one for munmap() and > > > one for exit. It'd be nice to use kmem_cache_free_bulk(). > > > > Do we even need a call_rcu on exit? 
At the point of freeing the VMAs we > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock. > > Once we have obtained the write lock again, I think it's safe to say we > > can just go ahead and free the VMAs directly. > > I think that would be still racy if the page fault handler found that > VMA under read-RCU protection but did not lock it yet (no locks are > held yet). If it's preempted, the VMA can be freed and destroyed from > under it without RCU grace period. The page fault handler (or whatever other reader -- ptrace, proc, etc) should have a refcount on the mm_struct, so we can't be in this path trying to free VMAs. Right?
On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote: > > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > * Matthew Wilcox <willy@infradead.org> [230120 11:50]: > > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > > > > > that this is not really optimal. > > > > > > > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > > > > > much with processes with a huge number of vmas either. It would still be > > > > > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > > > > > batching? We could easily just link all the vmas into linked list and > > > > > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > > > > > on the list without hooking up a shrinker (additional complexity) does > > > > > > > > not sound too appealing either. > > > > > > > > > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > > > > remove_vma and then call_rcu the whole list at once after the whole list > > > > > > > (be it from exit_mmap or remove_mt). See? > > > > > > > > > > > > Yes, I understood your idea but keeping dead objects until the process > > > > > > exits even when the system is low on memory (no shrinkers attached) > > > > > > seems too wasteful. If we do this I would advocate for attaching a > > > > > > shrinker. > > > > > > > > > > Maybe even simpler, since we are hit with this VMA freeing flood > > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to > > > > > vm_area_free to batch the destruction and all other cases call > > > > > call_rcu()? I don't think there will be other cases of VMA destruction > > > > > floods. > > > > > > > > ... 
or have two different call_rcu functions; one for munmap() and > > > > one for exit. It'd be nice to use kmem_cache_free_bulk(). > > > > > > Do we even need a call_rcu on exit? At the point of freeing the VMAs we > > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock. > > > Once we have obtained the write lock again, I think it's safe to say we > > > can just go ahead and free the VMAs directly. > > > > I think that would be still racy if the page fault handler found that > > VMA under read-RCU protection but did not lock it yet (no locks are > > held yet). If it's preempted, the VMA can be freed and destroyed from > > under it without RCU grace period. > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > should have a refcount on the mm_struct, so we can't be in this path > trying to free VMAs. Right? Hmm. That sounds right. I checked process_mrelease() as well, which operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps VMAs without freeing them, so we are still good. Michal, do you agree this is ok? lock_vma_under_rcu() receives mm as a parameter, so I guess it's implied that the caller should either mmget() it or operate on current->mm, so no need to document this requirement?
On Fri, Jan 20, 2023 at 9:21 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Jan 20, 2023 at 04:49:42PM +0000, Matthew Wilcox wrote: > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > > > that this is not really optimal. > > > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > > > much with processes with a huge number of vmas either. It would still be > > > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > > > batching? We could easily just link all the vmas into linked list and > > > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > > > on the list without hooking up a shrinker (additional complexity) does > > > > > > not sound too appealing either. > > > > > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > > remove_vma and then call_rcu the whole list at once after the whole list > > > > > (be it from exit_mmap or remove_mt). See? > > > > > > > > Yes, I understood your idea but keeping dead objects until the process > > > > exits even when the system is low on memory (no shrinkers attached) > > > > seems too wasteful. If we do this I would advocate for attaching a > > > > shrinker. > > > > > > Maybe even simpler, since we are hit with this VMA freeing flood > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to > > > vm_area_free to batch the destruction and all other cases call > > > call_rcu()? I don't think there will be other cases of VMA destruction > > > floods. > > > > ... or have two different call_rcu functions; one for munmap() and > > one for exit. It'd be nice to use kmem_cache_free_bulk(). > > Good point, kfree_rcu(p, r) where "r" is the name of the rcu_head > structure's field, is much more cache-efficient. > > The penalty is that there is no callback function to do any cleanup. > There is just a kfree()/kvfree (bulk version where applicable), > nothing else. 
If Liam's suggestion works then we won't need anything additional. We will free the vm_area_structs directly on process exit and use call_rcu() in all other cases. Let's see if Michal knows of any case which still needs an RCU grace period during exit_mmap. > > Thanx, Paul
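A rough sketch of the split Suren summarizes here, assuming the series' vm_rcu rcu_head member and its RCU callback; the name of the exit-only helper is made up for illustration:

```c
/* Normal path (munmap() and friends): keep deferring the free with RCU. */
void vm_area_free(struct vm_area_struct *vma)
{
	free_anon_vma_name(vma);
	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
}

/* Exit-only path: no reader can reach the VMA anymore, so free it directly. */
void vm_area_free_unreachable(struct vm_area_struct *vma)
{
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
}
```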
* Suren Baghdasaryan <surenb@google.com> [230120 12:50]: > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote: > > > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > > > * Matthew Wilcox <willy@infradead.org> [230120 11:50]: > > > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > > > > > > that this is not really optimal. > > > > > > > > > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > > > > > > much with processes with a huge number of vmas either. It would still be > > > > > > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > > > > > > batching? We could easily just link all the vmas into linked list and > > > > > > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > > > > > > on the list without hooking up a shrinker (additional complexity) does > > > > > > > > > not sound too appealing either. > > > > > > > > > > > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > > > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > > > > > remove_vma and then call_rcu the whole list at once after the whole list > > > > > > > > (be it from exit_mmap or remove_mt). See? > > > > > > > > > > > > > > Yes, I understood your idea but keeping dead objects until the process > > > > > > > exits even when the system is low on memory (no shrinkers attached) > > > > > > > seems too wasteful. If we do this I would advocate for attaching a > > > > > > > shrinker. 
> > > > > > > > > > > > Maybe even simpler, since we are hit with this VMA freeing flood > > > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to > > > > > > vm_area_free to batch the destruction and all other cases call > > > > > > call_rcu()? I don't think there will be other cases of VMA destruction > > > > > > floods. > > > > > > > > > > ... or have two different call_rcu functions; one for munmap() and > > > > > one for exit. It'd be nice to use kmem_cache_free_bulk(). > > > > > > > > Do we even need a call_rcu on exit? At the point of freeing the VMAs we > > > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock. > > > > Once we have obtained the write lock again, I think it's safe to say we > > > > can just go ahead and free the VMAs directly. > > > > > > I think that would be still racy if the page fault handler found that > > > VMA under read-RCU protection but did not lock it yet (no locks are > > > held yet). If it's preempted, the VMA can be freed and destroyed from > > > under it without RCU grace period. > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > should have a refcount on the mm_struct, so we can't be in this path > > trying to free VMAs. Right? > > Hmm. That sounds right. I checked process_mrelease() as well, which > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > VMAs without freeing them, so we are still good. Michal, do you agree > this is ok? > > lock_vma_under_rcu() receives mm as a parameter, so I guess it's > implied that the caller should either mmget() it or operate on > current->mm, so no need to document this requirement? It is also implied by the vma->vm_mm link. Otherwise any RCU holder of the VMA could have an unsafe pointer. In fact, if this isn't true then we need to change the callers to take the ref count to avoid just this scenario.
On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote: > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: [...] > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > should have a refcount on the mm_struct, so we can't be in this path > > trying to free VMAs. Right? > > Hmm. That sounds right. I checked process_mrelease() as well, which > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > VMAs without freeing them, so we are still good. Michal, do you agree > this is ok? Don't we need RCU protection for the vma lifetime assurance? Jann has already shown how rwsem is not safe wrt unlock and free without RCU.
On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote: > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > that this is not really optimal. > > > > > > > > On the other hand I do not like this solution much either. > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > much with processes with a huge number of vmas either. It would still be > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > batching? We could easily just link all the vmas into linked list and > > > > use a single call_rcu instead, no? This would both simplify the > > > > implementation, remove the scaling issue as well and we do not have to > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > on the list without hooking up a shrinker (additional complexity) does > > > not sound too appealing either. > > > > I suspect you have missed my idea. I do not really want to keep the list > > around or any shrinker. It is dead simple. Collect all vmas in > > remove_vma and then call_rcu the whole list at once after the whole list > > (be it from exit_mmap or remove_mt). See? > > Yes, I understood your idea but keeping dead objects until the process > exits even when the system is low on memory (no shrinkers attached) > seems too wasteful. If we do this I would advocate for attaching a > shrinker. I am still not sure we are on the same page here. No, vmas shouldn't lie around until the process exits. I am really suggesting queuing only for remove_vma paths. You can have a different rcu callback than the one used for trivial single vma removal paths.
On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote: > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > [...] > > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > > should have a refcount on the mm_struct, so we can't be in this path > > > trying to free VMAs. Right? > > > > Hmm. That sounds right. I checked process_mrelease() as well, which > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > > VMAs without freeing them, so we are still good. Michal, do you agree > > this is ok? > > Don't we need RCU procetions for the vma life time assurance? Jann has > already shown how rwsem is not safe wrt to unlock and free without RCU. Jann's case requires a thread freeing the VMA to be blocked on vma write lock waiting for the vma real lock to be released by a page fault handler. However exit_mmap() means mm->mm_users==0, which in turn suggests that there are no racing page fault handlers and no new page fault handlers will appear. Is that a correct assumption? If so, then races with page fault handlers can't happen while in exit_mmap(). Any other path (other than page fault handlers), accesses vma->lock under protection of mmap_lock (for read or write, does not matter). One exception is when we operate on an isolated VMA, then we don't need mmap_lock protection, but exit_mmap() does not deal with isolated VMAs, so out of scope here. exit_mmap() frees vm_area_structs under protection of mmap_lock in write mode, so races with anything other than page fault handler should be safe as they are today. That said, the future possible users of lock_vma_under_rcu() using VMA without mmap_lock protection will have to ensure mm's stability while they are using the obtained VMA. IOW they should elevate mm's refcount and keep it elevated as long as they are using that VMA and not before vma->lock is released. I guess it would be a good idea to document that requirement in lock_vma_under_rcu() comments if we decide to take this route. > > -- > Michal Hocko > SUSE Labs
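The ordering Suren relies on here is the standard mmput() path (simplified from kernel/fork.c): exit_mmap() only runs once the last mm_users reference is gone, so a reader that pins mm_users cannot overlap with it.

```c
void mmput(struct mm_struct *mm)
{
	might_sleep();

	if (atomic_dec_and_test(&mm->mm_users))
		__mmput(mm);	/* this is what ends up calling exit_mmap(mm) */
}
```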
On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote: > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote: > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > > [...] > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > > > should have a refcount on the mm_struct, so we can't be in this path > > > > trying to free VMAs. Right? > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > > > VMAs without freeing them, so we are still good. Michal, do you agree > > > this is ok? > > > > Don't we need RCU procetions for the vma life time assurance? Jann has > > already shown how rwsem is not safe wrt to unlock and free without RCU. > > Jann's case requires a thread freeing the VMA to be blocked on vma > write lock waiting for the vma real lock to be released by a page > fault handler. However exit_mmap() means mm->mm_users==0, which in > turn suggests that there are no racing page fault handlers and no new > page fault handlers will appear. Is that a correct assumption? If so, > then races with page fault handlers can't happen while in exit_mmap(). > Any other path (other than page fault handlers), accesses vma->lock > under protection of mmap_lock (for read or write, does not matter). > One exception is when we operate on an isolated VMA, then we don't > need mmap_lock protection, but exit_mmap() does not deal with isolated > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under > protection of mmap_lock in write mode, so races with anything other > than page fault handler should be safe as they are today. I do not see you talking about #PF (RCU + vma read lock protected) with munmap. It is my understanding that the latter will synchronize over per vma lock (along with mmap_lock exclusive locking). But then we are back to the lifetime guarantees, or do I miss anything.
On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote: > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote: > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > > > [...] > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > > > > should have a refcount on the mm_struct, so we can't be in this path > > > > > trying to free VMAs. Right? > > > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > > > > VMAs without freeing them, so we are still good. Michal, do you agree > > > > this is ok? > > > > > > Don't we need RCU procetions for the vma life time assurance? Jann has > > > already shown how rwsem is not safe wrt to unlock and free without RCU. > > > > Jann's case requires a thread freeing the VMA to be blocked on vma > > write lock waiting for the vma real lock to be released by a page > > fault handler. However exit_mmap() means mm->mm_users==0, which in > > turn suggests that there are no racing page fault handlers and no new > > page fault handlers will appear. Is that a correct assumption? If so, > > then races with page fault handlers can't happen while in exit_mmap(). > > Any other path (other than page fault handlers), accesses vma->lock > > under protection of mmap_lock (for read or write, does not matter). > > One exception is when we operate on an isolated VMA, then we don't > > need mmap_lock protection, but exit_mmap() does not deal with isolated > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under > > protection of mmap_lock in write mode, so races with anything other > > than page fault handler should be safe as they are today. > > I do not see you talking about #PF (RCU + vma read lock protected) with > munmap. It is my understanding that the latter will synchronize over per > vma lock (along with mmap_lock exclusive locking). But then we are back > to the lifetime guarantees, or do I miss anything. munmap() or any VMA-freeing operation other than exit_mmap() will free using call_rcu(), as implemented today. The suggestion is to free VMAs directly, without RCU grace period only when done from exit_mmap(). That' because VMA freeing flood has been seen so far only in the case of exit_mmap() and we assume other cases are not that heavy to cause call_rcu() flood to cause regressions. That assumption might prove false but we can deal with that once we know it needs fixing. > -- > Michal Hocko > SUSE Labs
On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote: > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote: > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote: > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > [...] > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > > > > > should have a refcount on the mm_struct, so we can't be in this path > > > > > > trying to free VMAs. Right? > > > > > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > > > > > VMAs without freeing them, so we are still good. Michal, do you agree > > > > > this is ok? > > > > > > > > Don't we need RCU procetions for the vma life time assurance? Jann has > > > > already shown how rwsem is not safe wrt to unlock and free without RCU. > > > > > > Jann's case requires a thread freeing the VMA to be blocked on vma > > > write lock waiting for the vma real lock to be released by a page > > > fault handler. However exit_mmap() means mm->mm_users==0, which in > > > turn suggests that there are no racing page fault handlers and no new > > > page fault handlers will appear. Is that a correct assumption? If so, > > > then races with page fault handlers can't happen while in exit_mmap(). > > > Any other path (other than page fault handlers), accesses vma->lock > > > under protection of mmap_lock (for read or write, does not matter). > > > One exception is when we operate on an isolated VMA, then we don't > > > need mmap_lock protection, but exit_mmap() does not deal with isolated > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under > > > protection of mmap_lock in write mode, so races with anything other > > > than page fault handler should be safe as they are today. > > > > I do not see you talking about #PF (RCU + vma read lock protected) with > > munmap. It is my understanding that the latter will synchronize over per > > vma lock (along with mmap_lock exclusive locking). But then we are back > > to the lifetime guarantees, or do I miss anything. > > munmap() or any VMA-freeing operation other than exit_mmap() will free > using call_rcu(), as implemented today. The suggestion is to free VMAs > directly, without RCU grace period only when done from exit_mmap(). OK, I have clearly missed that. This makes more sense but it also adds some more complexity and assumptions, making the code harder to maintain in the end. Whoever wants to touch this scheme in the future would have to re-evaluate all of them. So, I would just avoid that special casing if that is feasible. Dealing with the flood of call_rcu during exit_mmap is a trivial thing to deal with as proposed elsewhere (just batch all of them in a single run). This will surely add some more code but at least the locking would be consistent.
On Mon, Jan 23, 2023 at 1:59 AM 'Michal Hocko' via kernel-team <kernel-team@android.com> wrote: > > On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote: > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > After some more clarification I can understand how call_rcu might not be > > > > > super happy about thousands of callbacks to be invoked and I do agree > > > > > that this is not really optimal. > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > much with processes with a huge number of vmas either. It would still be > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > batching? We could easily just link all the vmas into linked list and > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > implementation, remove the scaling issue as well and we do not have to > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1. > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > on the list without hooking up a shrinker (additional complexity) does > > > > not sound too appealing either. > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > around or any shrinker. It is dead simple. Collect all vmas in > > > remove_vma and then call_rcu the whole list at once after the whole list > > > (be it from exit_mmap or remove_mt). See? > > > > Yes, I understood your idea but keeping dead objects until the process > > exits even when the system is low on memory (no shrinkers attached) > > seems too wasteful. If we do this I would advocate for attaching a > > shrinker. > > I am still not sure we are on the same page here. No, vmas shouldn't lay > around un ntil the process exit. I am really suggesting queuing only for > remove_vma paths. You can have a different rcu callback than the one > used for trivial single vma removal paths. Oh, I also missed the remove_mt() part and thought you want to drain the list only in exit_mmap(). I think that's a good option! > > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote: > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote: > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote: > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > [...] > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > > > > > > should have a refcount on the mm_struct, so we can't be in this path > > > > > > > trying to free VMAs. Right? > > > > > > > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree > > > > > > this is ok? > > > > > > > > > > Don't we need RCU procetions for the vma life time assurance? Jann has > > > > > already shown how rwsem is not safe wrt to unlock and free without RCU. > > > > > > > > Jann's case requires a thread freeing the VMA to be blocked on vma > > > > write lock waiting for the vma real lock to be released by a page > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in > > > > turn suggests that there are no racing page fault handlers and no new > > > > page fault handlers will appear. Is that a correct assumption? If so, > > > > then races with page fault handlers can't happen while in exit_mmap(). > > > > Any other path (other than page fault handlers), accesses vma->lock > > > > under protection of mmap_lock (for read or write, does not matter). > > > > One exception is when we operate on an isolated VMA, then we don't > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under > > > > protection of mmap_lock in write mode, so races with anything other > > > > than page fault handler should be safe as they are today. > > > > > > I do not see you talking about #PF (RCU + vma read lock protected) with > > > munmap. It is my understanding that the latter will synchronize over per > > > vma lock (along with mmap_lock exclusive locking). But then we are back > > > to the lifetime guarantees, or do I miss anything. > > > > munmap() or any VMA-freeing operation other than exit_mmap() will free > > using call_rcu(), as implemented today. The suggestion is to free VMAs > > directly, without RCU grace period only when done from exit_mmap(). > > OK, I have clearly missed that. This makes more sense but it also adds > some more complexity and assumptions - a harder to maintain code in the > end. Whoever wants to touch this scheme in future would have to > re-evaluate all of them. So, I would just avoid that special casing if > that is feasible. Ok, I understand your point. > > Dealing with the flood of call_rcu during exit_mmap is a trivial thing > to deal with as proposed elsewhere (just batch all of them in a single > run). This will surely add some more code but at least the locking would > consistent. Yes, batching the vmas into a list and draining it in remove_mt() and exit_mmap() as you suggested makes sense to me and is quite simple. Let's do that if nobody has objections. > -- > Michal Hocko > SUSE Labs
On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: > On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote: > > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote: > > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote: > > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > [...] > > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > > > > > > > should have a refcount on the mm_struct, so we can't be in this path > > > > > > > > trying to free VMAs. Right? > > > > > > > > > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which > > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree > > > > > > > this is ok? > > > > > > > > > > > > Don't we need RCU procetions for the vma life time assurance? Jann has > > > > > > already shown how rwsem is not safe wrt to unlock and free without RCU. > > > > > > > > > > Jann's case requires a thread freeing the VMA to be blocked on vma > > > > > write lock waiting for the vma real lock to be released by a page > > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in > > > > > turn suggests that there are no racing page fault handlers and no new > > > > > page fault handlers will appear. Is that a correct assumption? If so, > > > > > then races with page fault handlers can't happen while in exit_mmap(). > > > > > Any other path (other than page fault handlers), accesses vma->lock > > > > > under protection of mmap_lock (for read or write, does not matter). > > > > > One exception is when we operate on an isolated VMA, then we don't > > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated > > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under > > > > > protection of mmap_lock in write mode, so races with anything other > > > > > than page fault handler should be safe as they are today. > > > > > > > > I do not see you talking about #PF (RCU + vma read lock protected) with > > > > munmap. It is my understanding that the latter will synchronize over per > > > > vma lock (along with mmap_lock exclusive locking). But then we are back > > > > to the lifetime guarantees, or do I miss anything. > > > > > > munmap() or any VMA-freeing operation other than exit_mmap() will free > > > using call_rcu(), as implemented today. The suggestion is to free VMAs > > > directly, without RCU grace period only when done from exit_mmap(). > > > > OK, I have clearly missed that. This makes more sense but it also adds > > some more complexity and assumptions - a harder to maintain code in the > > end. Whoever wants to touch this scheme in future would have to > > re-evaluate all of them. So, I would just avoid that special casing if > > that is feasible. > > Ok, I understand your point. > > > > > Dealing with the flood of call_rcu during exit_mmap is a trivial thing > > to deal with as proposed elsewhere (just batch all of them in a single > > run). This will surely add some more code but at least the locking would > > consistent. 
> > Yes, batching the vmas into a list and draining it in remove_mt() and > exit_mmap() as you suggested makes sense to me and is quite simple. > Let's do that if nobody has objections. I object. We *know* nobody has a reference to any of the VMAs because you have to have a refcount on the mm before you can get a reference to a VMA. If Michal is saying that somebody could do: mmget(mm); vma = find_vma(mm); lock_vma(vma); mmput(mm); vma->a = b; unlock_vma(mm, vma); then that's something we'd catch in review -- you obviously can't use the mm after you've dropped your reference to it. Having all this extra code to solve two problems badly is a very poor choice. We have two distinct problems, each of which has a simple, efficient solution.
On Mon, Jan 23, 2023 at 10:23 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: > > On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote: > > > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote: > > > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote: > > > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > [...] > > > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc) > > > > > > > > > should have a refcount on the mm_struct, so we can't be in this path > > > > > > > > > trying to free VMAs. Right? > > > > > > > > > > > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which > > > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps > > > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree > > > > > > > > this is ok? > > > > > > > > > > > > > > Don't we need RCU procetions for the vma life time assurance? Jann has > > > > > > > already shown how rwsem is not safe wrt to unlock and free without RCU. > > > > > > > > > > > > Jann's case requires a thread freeing the VMA to be blocked on vma > > > > > > write lock waiting for the vma real lock to be released by a page > > > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in > > > > > > turn suggests that there are no racing page fault handlers and no new > > > > > > page fault handlers will appear. Is that a correct assumption? If so, > > > > > > then races with page fault handlers can't happen while in exit_mmap(). > > > > > > Any other path (other than page fault handlers), accesses vma->lock > > > > > > under protection of mmap_lock (for read or write, does not matter). > > > > > > One exception is when we operate on an isolated VMA, then we don't > > > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated > > > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under > > > > > > protection of mmap_lock in write mode, so races with anything other > > > > > > than page fault handler should be safe as they are today. > > > > > > > > > > I do not see you talking about #PF (RCU + vma read lock protected) with > > > > > munmap. It is my understanding that the latter will synchronize over per > > > > > vma lock (along with mmap_lock exclusive locking). But then we are back > > > > > to the lifetime guarantees, or do I miss anything. > > > > > > > > munmap() or any VMA-freeing operation other than exit_mmap() will free > > > > using call_rcu(), as implemented today. The suggestion is to free VMAs > > > > directly, without RCU grace period only when done from exit_mmap(). > > > > > > OK, I have clearly missed that. This makes more sense but it also adds > > > some more complexity and assumptions - a harder to maintain code in the > > > end. Whoever wants to touch this scheme in future would have to > > > re-evaluate all of them. So, I would just avoid that special casing if > > > that is feasible. > > > > Ok, I understand your point. 
> > > > > > > > Dealing with the flood of call_rcu during exit_mmap is a trivial thing > > > to deal with as proposed elsewhere (just batch all of them in a single > > > run). This will surely add some more code but at least the locking would > > > consistent. > > > > Yes, batching the vmas into a list and draining it in remove_mt() and > > exit_mmap() as you suggested makes sense to me and is quite simple. > > Let's do that if nobody has objections. > > I object. We *know* nobody has a reference to any of the VMAs because > you have to have a refcount on the mm before you can get a reference > to a VMA. If Michal is saying that somebody could do: > > mmget(mm); > vma = find_vma(mm); > lock_vma(vma); > mmput(mm); > vma->a = b; > unlock_vma(mm, vma); More precisely, it's: mmget(mm); vma = lock_vma_under_rcu(mm, addr); -> calls vma_read_trylock(vma) mmput(mm); vma->a = b; vma_read_unlock(vma); To be fair, vma_read_unlock() does not take mm as a parameter, so one could have an impression that mm doesn't need to be pinned at the time of its call. > > then that's something we'd catch in review -- you obviously can't use > the mm after you've dropped your reference to it. > > Having all this extra code to solve two problems badly is a very poor > choice. We have two distinct problems, each of which has a simple, > efficient solution. >
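A sketch of the ordering being debated, for a hypothetical user that does not run on current->mm: the mm has to stay pinned until the per-VMA lock is dropped, because the VMA can be freed once the last mm_users reference goes away. Function names follow this series; the wrapper itself is made up:

```c
static void with_locked_vma(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	if (!mmget_not_zero(mm))	/* pin mm_users first */
		return;

	vma = lock_vma_under_rcu(mm, addr);	/* returns the VMA read-locked, or NULL */
	if (vma) {
		/* ... use the vma ... */
		vma_read_unlock(vma);
	}

	mmput(mm);	/* drop the mm only after the vma lock is released */
}
```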
On Mon 23-01-23 18:23:08, Matthew Wilcox wrote: > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: [...] > > Yes, batching the vmas into a list and draining it in remove_mt() and > > exit_mmap() as you suggested makes sense to me and is quite simple. > > Let's do that if nobody has objections. > > I object. We *know* nobody has a reference to any of the VMAs because > you have to have a refcount on the mm before you can get a reference > to a VMA. If Michal is saying that somebody could do: > > mmget(mm); > vma = find_vma(mm); > lock_vma(vma); > mmput(mm); > vma->a = b; > unlock_vma(mm, vma); > > then that's something we'd catch in review -- you obviously can't use > the mm after you've dropped your reference to it. I am not claiming this is possible now. I do not think we want to have something like that in the future either but that is really hard to envision. I am claiming that it is subtle and potentially error prone to have two different ways of mass vma freeing wrt. locking. Also, don't we have a very similar situation during last munmaps?
On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote: > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote: > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: > [...] > > > Yes, batching the vmas into a list and draining it in remove_mt() and > > > exit_mmap() as you suggested makes sense to me and is quite simple. > > > Let's do that if nobody has objections. > > > > I object. We *know* nobody has a reference to any of the VMAs because > > you have to have a refcount on the mm before you can get a reference > > to a VMA. If Michal is saying that somebody could do: > > > > mmget(mm); > > vma = find_vma(mm); > > lock_vma(vma); > > mmput(mm); > > vma->a = b; > > unlock_vma(mm, vma); > > > > then that's something we'd catch in review -- you obviously can't use > > the mm after you've dropped your reference to it. > > I am not claiming this is possible now. I do not think we want to have > something like that in the future either but that is really hard to > envision. I am claiming that it is subtle and potentially error prone to > have two different ways of mass vma freeing wrt. locking. Also, don't we > have a very similar situation during last munmaps? We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that. There are two cases; there's munmap(), which typically frees a single VMA (yes, theoretically, you can free hundreds of VMAs with a single call which spans multiple VMAs, but in practice that doesn't happen), and there's exit_mmap() which happens on exec() and exit(). For the munmap() case, just RCU-free each one individually. For the exit_mmap() case, there's no need to use RCU because nobody should still have a VMA pointer after calling mmdrop() [1] [1] Sorry, the above example should have been mmgrab()/mmdrop(), not mmget()/mmput(); you're not allowed to look at the VMA list with an mmget(), you need to have grabbed.
On Mon, Jan 23, 2023 at 11:31 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote: > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote: > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: > > [...] > > > > Yes, batching the vmas into a list and draining it in remove_mt() and > > > > exit_mmap() as you suggested makes sense to me and is quite simple. > > > > Let's do that if nobody has objections. > > > > > > I object. We *know* nobody has a reference to any of the VMAs because > > > you have to have a refcount on the mm before you can get a reference > > > to a VMA. If Michal is saying that somebody could do: > > > > > > mmget(mm); > > > vma = find_vma(mm); > > > lock_vma(vma); > > > mmput(mm); > > > vma->a = b; > > > unlock_vma(mm, vma); > > > > > > then that's something we'd catch in review -- you obviously can't use > > > the mm after you've dropped your reference to it. > > > > I am not claiming this is possible now. I do not think we want to have > > something like that in the future either but that is really hard to > > envision. I am claiming that it is subtle and potentially error prone to > > have two different ways of mass vma freeing wrt. locking. Also, don't we > > have a very similar situation during last munmaps? > > We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that. > There are two cases; there's munmap(), which typically frees a single > VMA (yes, theoretically, you can free hundreds of VMAs with a single > call which spans multiple VMAs, but in practice that doesn't happen), > and there's exit_mmap() which happens on exec() and exit(). > > For the munmap() case, just RCU-free each one individually. For the > exit_mmap() case, there's no need to use RCU because nobody should still > have a VMA pointer after calling mmdrop() [1] > > [1] Sorry, the above example should have been mmgrab()/mmdrop(), not > mmget()/mmput(); you're not allowed to look at the VMA list with an > mmget(), you need to have grabbed. I think it's clear that this would work with the current code and that the concern is about any future possible misuse. So, it would be preferable to proactively prevent such misuse. vma_write_lock() and vma_write_unlock_mm() both have mmap_assert_write_locked(), so they always happen under mmap_lock protection and therefore do not pose any danger. The only issue we need to be careful with is calling vma_read_trylock()/vma_read_unlock() outside of mmap_lock protection while mm is unstable. I don't think doing mmget/mmput inside these functions is called for but maybe some assertions would prevent future misuse?
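One possible shape for such assertions, as a sketch only (the helper below is hypothetical and not part of the posted series): have vma_read_trylock() and vma_read_unlock() check that the mm still has users, so the misuse pattern above trips a warning instead of silently racing with exit_mmap().

#include <linux/mmdebug.h>
#include <linux/mm_types.h>

/* Hypothetical debug check: the mm must still be pinned by mmget()
 * while its per-VMA locks are taken or released outside mmap_lock. */
static inline void vma_assert_mm_pinned(struct vm_area_struct *vma)
{
	/* mm_users == 0 would mean exit_mmap() may already be running */
	VM_WARN_ON_ONCE(!atomic_read(&vma->vm_mm->mm_users));
}

With CONFIG_DEBUG_VM disabled the check compiles away, so it would not add overhead to the fast path.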
On Mon 23-01-23 19:30:43, Matthew Wilcox wrote: > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote: > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote: > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: > > [...] > > > > Yes, batching the vmas into a list and draining it in remove_mt() and > > > > exit_mmap() as you suggested makes sense to me and is quite simple. > > > > Let's do that if nobody has objections. > > > > > > I object. We *know* nobody has a reference to any of the VMAs because > > > you have to have a refcount on the mm before you can get a reference > > > to a VMA. If Michal is saying that somebody could do: > > > > > > mmget(mm); > > > vma = find_vma(mm); > > > lock_vma(vma); > > > mmput(mm); > > > vma->a = b; > > > unlock_vma(mm, vma); > > > > > > then that's something we'd catch in review -- you obviously can't use > > > the mm after you've dropped your reference to it. > > > > I am not claiming this is possible now. I do not think we want to have > > something like that in the future either but that is really hard to > > envision. I am claiming that it is subtle and potentially error prone to > > have two different ways of mass vma freeing wrt. locking. Also, don't we > > have a very similar situation during last munmaps? > > We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that. > There are two cases; there's munmap(), which typically frees a single > VMA (yes, theoretically, you can free hundreds of VMAs with a single > call which spans multiple VMAs, but in practice that doesn't happen), > and there's exit_mmap() which happens on exec() and exit(). This requires special casing remove_vma for those two different paths (exit_mmap and remove_mt). If you ask me that sounds like a suboptimal code to even not handle potential large munmap which might very well be a rare thing as you say. But haven't we learned that sooner or later we will find out there is somebody that cares afterall? Anyway, this is not something I care about all that much. It is just weird to special case exit_mmap, if you ask me. Up to Suren to decide which way he wants to go. I just really didn't like the initial implementation of batching based on a completely arbitrary batch limit and lazy freeing.
On Mon, Jan 23, 2023 at 12:00 PM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 23-01-23 19:30:43, Matthew Wilcox wrote: > > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote: > > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote: > > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: > > > [...] > > > > > Yes, batching the vmas into a list and draining it in remove_mt() and > > > > > exit_mmap() as you suggested makes sense to me and is quite simple. > > > > > Let's do that if nobody has objections. > > > > > > > > I object. We *know* nobody has a reference to any of the VMAs because > > > > you have to have a refcount on the mm before you can get a reference > > > > to a VMA. If Michal is saying that somebody could do: > > > > > > > > mmget(mm); > > > > vma = find_vma(mm); > > > > lock_vma(vma); > > > > mmput(mm); > > > > vma->a = b; > > > > unlock_vma(mm, vma); > > > > > > > > then that's something we'd catch in review -- you obviously can't use > > > > the mm after you've dropped your reference to it. > > > > > > I am not claiming this is possible now. I do not think we want to have > > > something like that in the future either but that is really hard to > > > envision. I am claiming that it is subtle and potentially error prone to > > > have two different ways of mass vma freeing wrt. locking. Also, don't we > > > have a very similar situation during last munmaps? > > > > We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that. > > There are two cases; there's munmap(), which typically frees a single > > VMA (yes, theoretically, you can free hundreds of VMAs with a single > > call which spans multiple VMAs, but in practice that doesn't happen), > > and there's exit_mmap() which happens on exec() and exit(). > > This requires special casing remove_vma for those two different paths > (exit_mmap and remove_mt). If you ask me that sounds like a suboptimal > code to even not handle potential large munmap which might very well be > a rare thing as you say. But haven't we learned that sooner or later we > will find out there is somebody that cares afterall? Anyway, this is not > something I care about all that much. It is just weird to special case > exit_mmap, if you ask me. Up to Suren to decide which way he wants to > go. I just really didn't like the initial implementation of batching > based on a completely arbitrary batch limit and lazy freeing. I would prefer to go with the simplest sufficient solution. A potential issue with a large munmap might prove to be real but I think we know how to easily fix that with batching if the issue ever materializes (I'll have a fix ready implementing Michal's suggestion). So, I suggest going with Liam's/Matthew's solution and converting to Michal's solution if regression shows up anywhere else. Would that be acceptable? > -- > Michal Hocko > SUSE Labs
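A minimal sketch of that direction, with an assumed name for the exit_mmap()-only helper (the actual patch may structure this differently): the munmap() paths keep the existing per-VMA call_rcu(), while exit_mmap() frees directly because mm_users == 0 guarantees there are no per-VMA lock readers left.

/* munmap() and other removal paths: unchanged, one call_rcu() per VMA */
void vm_area_free(struct vm_area_struct *vma)
{
	free_anon_vma_name(vma);
#ifdef CONFIG_PER_VMA_LOCK
	call_rcu(&vma->vm_rcu, __vm_area_free);
#else
	kmem_cache_free(vm_area_cachep, vma);
#endif
}

/* Assumed exit_mmap()-only variant: mmap_lock is held for write and
 * mm_users == 0, so the VMA can be freed without an RCU grace period. */
void vm_area_free_unreachable(struct vm_area_struct *vma)
{
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
}

With that split, remove_vma() in the exit_mmap() path would call the _unreachable variant while everything else keeps using vm_area_free().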
* Michal Hocko <mhocko@suse.com> [230123 15:00]: > On Mon 23-01-23 19:30:43, Matthew Wilcox wrote: > > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote: > > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote: > > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: > > > [...] > > > > > Yes, batching the vmas into a list and draining it in remove_mt() and > > > > > exit_mmap() as you suggested makes sense to me and is quite simple. > > > > > Let's do that if nobody has objections. > > > > > > > > I object. We *know* nobody has a reference to any of the VMAs because > > > > you have to have a refcount on the mm before you can get a reference > > > > to a VMA. If Michal is saying that somebody could do: > > > > > > > > mmget(mm); > > > > vma = find_vma(mm); > > > > lock_vma(vma); > > > > mmput(mm); > > > > vma->a = b; > > > > unlock_vma(mm, vma); > > > > > > > > then that's something we'd catch in review -- you obviously can't use > > > > the mm after you've dropped your reference to it. > > > > > > I am not claiming this is possible now. I do not think we want to have > > > something like that in the future either but that is really hard to > > > envision. I am claiming that it is subtle and potentially error prone to > > > have two different ways of mass vma freeing wrt. locking. Also, don't we > > > have a very similar situation during last munmaps? > > > > We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that. > > There are two cases; there's munmap(), which typically frees a single > > VMA (yes, theoretically, you can free hundreds of VMAs with a single > > call which spans multiple VMAs, but in practice that doesn't happen), > > and there's exit_mmap() which happens on exec() and exit(). > > This requires special casing remove_vma for those two different paths > (exit_mmap and remove_mt). If you ask me that sounds like a suboptimal > code to even not handle potential large munmap which might very well be > a rare thing as you say. But haven't we learned that sooner or later we > will find out there is somebody that cares afterall? Anyway, this is not > something I care about all that much. It is just weird to special case > exit_mmap, if you ask me. exit_mmap() is already a special case for locking (and statistics). This exists today to optimize the special exit scenario. I don't think it's a question of sub-optimal code but what we can get away without doing in the case of the process exit.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3158f33e268c..50c7a6dd9c7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
 struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
+void drain_free_vmas(struct mm_struct *mm);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fce9113d979c..c0e6c8e4700b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -592,8 +592,18 @@ struct vm_area_struct {
 	/* Information about our backing store: */
 	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units */
-	struct file * vm_file;		/* File we map to (can be NULL). */
-	void * vm_private_data;		/* was vm_pte (shared mem) */
+	union {
+		struct {
+			/* File we map to (can be NULL). */
+			struct file *vm_file;
+
+			/* was vm_pte (shared mem) */
+			void *vm_private_data;
+		};
+#ifdef CONFIG_PER_VMA_LOCK
+		struct list_head vm_free_list;
+#endif
+	};
 
 #ifdef CONFIG_ANON_VMA_NAME
 	/*
@@ -693,6 +703,11 @@ struct mm_struct {
 					 */
 #ifdef CONFIG_PER_VMA_LOCK
 		int mm_lock_seq;
+		struct {
+			struct list_head head;
+			spinlock_t lock;
+			int size;
+		} vma_free_list;
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d9f14e55ecf..97f2b751f88d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,26 +481,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
-static void __vm_area_free(struct rcu_head *head)
+static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
 	/* The vma should either have no lock holders or be write-locked. */
 	vma_assert_no_reader(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
-#endif
+
+static void vma_free_rcu_callback(struct rcu_head *head)
+{
+	struct vm_area_struct *first_vma;
+	struct vm_area_struct *vma, *vma2;
+
+	first_vma = container_of(head, struct vm_area_struct, vm_rcu);
+	list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)
+		__vm_area_free(vma);
+	__vm_area_free(first_vma);
+}
+
+void drain_free_vmas(struct mm_struct *mm)
+{
+	struct vm_area_struct *first_vma;
+	LIST_HEAD(to_destroy);
+
+	spin_lock(&mm->vma_free_list.lock);
+	list_splice_init(&mm->vma_free_list.head, &to_destroy);
+	mm->vma_free_list.size = 0;
+	spin_unlock(&mm->vma_free_list.lock);
+
+	if (list_empty(&to_destroy))
+		return;
+
+	first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
+	/* Remove the head which is allocated on the stack */
+	list_del(&to_destroy);
+
+	call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
+}
+
+#define VM_AREA_FREE_LIST_MAX 32
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	bool drain;
+
+	free_anon_vma_name(vma);
+
+	spin_lock(&mm->vma_free_list.lock);
+	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
+	mm->vma_free_list.size++;
+	drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+	spin_unlock(&mm->vma_free_list.lock);
+
+	if (drain)
+		drain_free_vmas(mm);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+void drain_free_vmas(struct mm_struct *mm) {}
 
 void vm_area_free(struct vm_area_struct *vma)
 {
 	free_anon_vma_name(vma);
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
 	kmem_cache_free(vm_area_cachep, vma);
-#endif
 }
 
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -1150,6 +1199,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	INIT_LIST_HEAD(&mm->mmlist);
 #ifdef CONFIG_PER_VMA_LOCK
 	WRITE_ONCE(mm->mm_lock_seq, 0);
+	INIT_LIST_HEAD(&mm->vma_free_list.head);
+	spin_lock_init(&mm->vma_free_list.lock);
+	mm->vma_free_list.size = 0;
 #endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 33269314e060..b53d23c2d7a3 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -39,6 +39,9 @@ struct mm_struct init_mm = {
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
 	.mm_lock_seq	= 0,
+	.vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
+	.vma_free_list.lock = __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
+	.vma_free_list.size = 0,
 #endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
diff --git a/mm/mmap.c b/mm/mmap.c
index 332af383f7cd..a0d5d3af1d95 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
 	trace_exit_mmap(mm);
 	__mt_destroy(&mm->mm_mt);
 	mmap_write_unlock(mm);
+	drain_free_vmas(mm);
 	vm_unacct_memory(nr_accounted);
 }
 
call_rcu() can take a long time when callback offloading is enabled.
Its use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed. To minimize that impact, place VMAs into
a list and free them in groups using one call_rcu() call per group.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h | 19 +++++++++--
 kernel/fork.c            | 68 +++++++++++++++++++++++++++++++++++-----
 mm/init-mm.c             |  3 ++
 mm/mmap.c                |  1 +
 5 files changed, 82 insertions(+), 10 deletions(-)