Message ID | 20200710015646.2020871-8-npiggin@gmail.com (mailing list archive)
---|---
State | New, archived
Series | mmu context cleanup, lazy tlb cleanup
On Fri, Jul 10, 2020 at 11:56:46AM +1000, Nicholas Piggin wrote:
> On big systems, the mm refcount can become highly contented when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
> 
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.

That's mighty impressive, however:

> +static void shoot_lazy_tlbs(struct mm_struct *mm)
> +{
> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
> +		smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
> +		do_shoot_lazy_tlb(mm);
> +	}
> +}

IIRC you (power) never clear a CPU from that mask, so for other
workloads I can see this resulting in massive amounts of IPIs.

For instance, take as many processes as you have CPUs. For each,
manually walk the task across all CPUs and exit. Again.

Clearly, that's an extreme, but still...
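The pattern Peter describes can be provoked from userspace with something
like the sketch below (illustrative only, not part of the series): a task
that walks itself across every CPU sets its bit in mm_cpumask() on each
one (on architectures that never clear those bits), so its exit would then
IPI all of them under the shootdown scheme.

/* Sketch of the adversarial pattern described above: migrate onto every
 * CPU so this mm's bit gets set everywhere, then exit so __mmdrop() has
 * the whole mask to shoot down. Illustration only. */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	cpu_set_t set;

	for (long cpu = 0; cpu < ncpus; cpu++) {
		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		/* Each arrival sets our bit in this mm's cpumask on
		 * architectures that never clear it. */
		if (sched_setaffinity(0, sizeof(set), &set) == 0)
			sched_yield();
	}
	/* Exiting now frees the mm; with the shootdown scheme this would
	 * IPI every CPU the task ever ran on. */
	return 0;
}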
Excerpts from Peter Zijlstra's message of July 10, 2020 7:35 pm:
> On Fri, Jul 10, 2020 at 11:56:46AM +1000, Nicholas Piggin wrote:
>> On big systems, the mm refcount can become highly contented when doing
>> a lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>> 
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so so instead implement a non-refcounted scheme
>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> any remaining lazy ones.
>> 
>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>> with as many software threads as CPUs (so each switch will go in and
>> out of idle), upstream can achieve a rate of about 1 million context
>> switches per second. After this patch it goes up to 118 million.
> 
> That's mighty impressive, however:

Well, it's the usual case of "find a bouncing line and scale up the
machine size until you achieve your desired improvements" :)

But we are looking at some fundamental scalability issues and seeing if
we can improve a few things.

> 
>> +static void shoot_lazy_tlbs(struct mm_struct *mm)
>> +{
>> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
>> +		smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
>> +		do_shoot_lazy_tlb(mm);
>> +	}
>> +}
> 
> IIRC you (power) never clear a CPU from that mask, so for other
> workloads I can see this resulting in massive amounts of IPIs.
> 
> For instance, take as many processes as you have CPUs. For each,
> manually walk the task across all CPUs and exit. Again.
> 
> Clearly, that's an extreme, but still...

We do have some issues with that, but it does tend to be very
self-limiting: short-lived tasks that can drive lots of exits won't get
to run on a lot of cores.

It's worth keeping an eye on; it may not be too hard to mitigate the
IPIs by doing something dumb like collecting a queue of mms before
killing a batch of them (a rough sketch of that is below).

Thanks,
Nick
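A very rough sketch of that batching idea might look like the following.
This is entirely hypothetical and not part of the patch: the names are
invented, and a real implementation would also have to defer freeing each
mm (and avoid IPIing under a spinlock) until its batch has been flushed.

/* Hypothetical sketch: queue dying mms and retire a whole batch of them
 * with a single IPI round over the union of their cpumasks. */
#include <linux/cpumask.h>
#include <linux/sched/mm.h>
#include <linux/smp.h>
#include <linux/spinlock.h>

#define LAZY_TLB_BATCH	8

static struct mm_struct *lazy_tlb_batch[LAZY_TLB_BATCH];
static int lazy_tlb_batch_nr;
static struct cpumask lazy_tlb_batch_mask;
static DEFINE_SPINLOCK(lazy_tlb_batch_lock);

/* IPI handler: unlazy this CPU from every mm in the current batch. */
static void do_shoot_lazy_tlb_batch(void *unused)
{
	int i;

	for (i = 0; i < lazy_tlb_batch_nr; i++) {
		struct mm_struct *mm = lazy_tlb_batch[i];

		if (current->active_mm == mm) {
			BUG_ON(current->mm);
			switch_mm(mm, &init_mm, current);
			current->active_mm = &init_mm;
		}
	}
}

/* Queue a dying mm; one IPI round covers the whole batch. */
static void shoot_lazy_tlbs_batched(struct mm_struct *mm)
{
	spin_lock(&lazy_tlb_batch_lock);
	lazy_tlb_batch[lazy_tlb_batch_nr++] = mm;
	cpumask_or(&lazy_tlb_batch_mask, &lazy_tlb_batch_mask, mm_cpumask(mm));
	if (lazy_tlb_batch_nr == LAZY_TLB_BATCH) {
		/* on_each_cpu_mask() also runs the handler locally. */
		on_each_cpu_mask(&lazy_tlb_batch_mask,
				 do_shoot_lazy_tlb_batch, NULL, 1);
		lazy_tlb_batch_nr = 0;
		cpumask_clear(&lazy_tlb_batch_mask);
	}
	spin_unlock(&lazy_tlb_batch_lock);
}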
On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> On big systems, the mm refcount can become highly contented when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
>
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
>
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.
>

I read the patch a couple of times, and I have a suggestion that could
be nonsense. You are, effectively, using mm_cpumask() as a sort of
refcount. You're saying "hey, this mm has no more references, but it
still has nonempty mm_cpumask(), so let's send an IPI and shoot down
those references too." I'm wondering whether you actually need the
IPI. What if, instead, you actually treated mm_cpumask as a refcount
for real? Roughly, in __mmdrop(), you would only free the page tables
if mm_cpumask() is empty. And, in the code that removes a CPU from
mm_cpumask(), you would check if mm_users == 0 and, if so, check if
you just removed the last bit from mm_cpumask and potentially free
the mm.

Getting the locking right here could be a bit tricky -- you need to
avoid two CPUs simultaneously exiting lazy TLB and thinking they
should free the mm, and you also need to avoid an mm with mm_users
hitting zero concurrently with the last remote CPU using it lazily
exiting lazy TLB. Perhaps this could be resolved by having mm_count
== 1 mean "mm_cpumask() might contain bits and, if so, it owns the
mm" and mm_count == 0 meaning "now it's dead", and using some careful
cmpxchg or dec_return to make sure that only one CPU frees it.

Or maybe you'd need a lock or RCU for this, but the idea would be to
only ever take the lock after mm_users goes to zero.

--Andy
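To make the shape of that proposal concrete, a rough sketch might look
like the code below. This is only an illustration, not working code: the
helper name is invented, it glosses over exactly the races described
above, and it assumes the architecture clears CPUs out of mm_cpumask()
when they stop using an mm lazily, which powerpc currently does not do.

/* Sketch of the cpumask-as-refcount idea: mm_count == 1 means "any
 * remaining mm_cpumask() bits collectively own the mm"; whoever wins
 * the 1 -> 0 transition does the real free. */
#include <linux/cpumask.h>
#include <linux/sched/mm.h>

static void exit_lazy_tlb_refcounted(struct mm_struct *mm, int cpu)
{
	cpumask_clear_cpu(cpu, mm_cpumask(mm));
	smp_mb__after_atomic();	/* clear our bit before re-checking users */

	if (atomic_read(&mm->mm_users) != 0)
		return;		/* still has real users; nothing to do */

	if (!cpumask_empty(mm_cpumask(mm)))
		return;		/* another CPU may still be using it lazily */

	/* Only one CPU can win the 1 -> 0 transition and free the mm. */
	if (atomic_cmpxchg(&mm->mm_count, 1, 0) == 1)
		__mmdrop(mm);
}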
Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> On big systems, the mm refcount can become highly contented when doing
>> a lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>>
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so so instead implement a non-refcounted scheme
>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> any remaining lazy ones.
>>
>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>> with as many software threads as CPUs (so each switch will go in and
>> out of idle), upstream can achieve a rate of about 1 million context
>> switches per second. After this patch it goes up to 118 million.
>>
> 
> I read the patch a couple of times, and I have a suggestion that could
> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
> refcount.  You're saying "hey, this mm has no more references, but it
> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
> those references too."  I'm wondering whether you actually need the
> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
> for real?  Roughly, in __mmdrop(), you would only free the page tables
> if mm_cpumask() is empty.  And, in the code that removes a CPU from
> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
> you just removed the last bit from mm_cpumask and potentially free the
> mm.
> 
> Getting the locking right here could be a bit tricky -- you need to
> avoid two CPUs simultaneously exiting lazy TLB and thinking they
> should free the mm, and you also need to avoid an mm with mm_users
> hitting zero concurrently with the last remote CPU using it lazily
> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the
> mm" and mm_count == 0 meaning "now it's dead" and using some careful
> cmpxchg or dec_return to make sure that only one CPU frees it.
> 
> Or maybe you'd need a lock or RCU for this, but the idea would be to
> only ever take the lock after mm_users goes to zero.

I don't think it's nonsense, it could be a good way to avoid IPIs.

I haven't seen much problem here that made me too concerned about IPIs
yet, so I think the simple patch may be good enough to start with for
powerpc. I'm looking at avoiding/reducing the IPIs by combining the
unlazying with the exit TLB flush without doing anything fancy with
ref counting, but we'll see.

Thanks,
Nick
> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>> >>> On big systems, the mm refcount can become highly contented when doing >>> a lot of context switching with threaded applications (particularly >>> switching between the idle thread and an application thread). >>> >>> Abandoning lazy tlb slows switching down quite a bit in the important >>> user->idle->user cases, so so instead implement a non-refcounted scheme >>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>> any remaining lazy ones. >>> >>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>> with as many software threads as CPUs (so each switch will go in and >>> out of idle), upstream can achieve a rate of about 1 million context >>> switches per second. After this patch it goes up to 118 million. >>> >> >> I read the patch a couple of times, and I have a suggestion that could >> be nonsense. You are, effectively, using mm_cpumask() as a sort of >> refcount. You're saying "hey, this mm has no more references, but it >> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >> those references too." I'm wondering whether you actually need the >> IPI. What if, instead, you actually treated mm_cpumask as a refcount >> for real? Roughly, in __mmdrop(), you would only free the page tables >> if mm_cpumask() is empty. And, in the code that removes a CPU from >> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >> you just removed the last bit from mm_cpumask and potentially free the >> mm. >> >> Getting the locking right here could be a bit tricky -- you need to >> avoid two CPUs simultaneously exiting lazy TLB and thinking they >> should free the mm, and you also need to avoid an mm with mm_users >> hitting zero concurrently with the last remote CPU using it lazily >> exiting lazy TLB. Perhaps this could be resolved by having mm_count >> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >> mm" and mm_count == 0 meaning "now it's dead" and using some careful >> cmpxchg or dec_return to make sure that only one CPU frees it. >> >> Or maybe you'd need a lock or RCU for this, but the idea would be to >> only ever take the lock after mm_users goes to zero. > > I don't think it's nonsense, it could be a good way to avoid IPIs. > > I haven't seen much problem here that made me too concerned about IPIs > yet, so I think the simple patch may be good enough to start with > for powerpc. I'm looking at avoiding/reducing the IPIs by combining the > unlazying with the exit TLB flush without doing anything fancy with > ref counting, but we'll see. I would be cautious with benchmarking here. I would expect that the nasty cases may affect power consumption more than performance — the specific issue is IPIs hitting idle cores, and the main effects are to slow down exit() a bit but also to kick the idle core out of idle. Although, if the idle core is in a deep sleep, that IPI could be *very* slow. So I think it’s worth at least giving this a try. > > Thanks, > Nick
Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am: > >> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>> >>>> On big systems, the mm refcount can become highly contented when doing >>>> a lot of context switching with threaded applications (particularly >>>> switching between the idle thread and an application thread). >>>> >>>> Abandoning lazy tlb slows switching down quite a bit in the important >>>> user->idle->user cases, so so instead implement a non-refcounted scheme >>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>>> any remaining lazy ones. >>>> >>>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>>> with as many software threads as CPUs (so each switch will go in and >>>> out of idle), upstream can achieve a rate of about 1 million context >>>> switches per second. After this patch it goes up to 118 million. >>>> >>> >>> I read the patch a couple of times, and I have a suggestion that could >>> be nonsense. You are, effectively, using mm_cpumask() as a sort of >>> refcount. You're saying "hey, this mm has no more references, but it >>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >>> those references too." I'm wondering whether you actually need the >>> IPI. What if, instead, you actually treated mm_cpumask as a refcount >>> for real? Roughly, in __mmdrop(), you would only free the page tables >>> if mm_cpumask() is empty. And, in the code that removes a CPU from >>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >>> you just removed the last bit from mm_cpumask and potentially free the >>> mm. >>> >>> Getting the locking right here could be a bit tricky -- you need to >>> avoid two CPUs simultaneously exiting lazy TLB and thinking they >>> should free the mm, and you also need to avoid an mm with mm_users >>> hitting zero concurrently with the last remote CPU using it lazily >>> exiting lazy TLB. Perhaps this could be resolved by having mm_count >>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >>> mm" and mm_count == 0 meaning "now it's dead" and using some careful >>> cmpxchg or dec_return to make sure that only one CPU frees it. >>> >>> Or maybe you'd need a lock or RCU for this, but the idea would be to >>> only ever take the lock after mm_users goes to zero. >> >> I don't think it's nonsense, it could be a good way to avoid IPIs. >> >> I haven't seen much problem here that made me too concerned about IPIs >> yet, so I think the simple patch may be good enough to start with >> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the >> unlazying with the exit TLB flush without doing anything fancy with >> ref counting, but we'll see. > > I would be cautious with benchmarking here. I would expect that the > nasty cases may affect power consumption more than performance — the > specific issue is IPIs hitting idle cores, and the main effects are to > slow down exit() a bit but also to kick the idle core out of idle. > Although, if the idle core is in a deep sleep, that IPI could be > *very* slow. It will tend to be self-limiting to some degree (deeper idle cores would tend to have less chance of IPI) but we have bigger issues on powerpc with that, like broadcast IPIs to the mm cpumask for THP management. 
Power hasn't really shown up as an issue but powerpc CPUs may have
their own requirements and issues there, shall we say.

> So I think it’s worth at least giving this a try.

To be clear it's not a complete solution itself. The problem is of
course that mm cpumask gives you false negatives, so the bits won't
always clean up after themselves as CPUs switch away from their lazy
tlb mms.

I would suspect it _may_ help with garbage collecting some remainders
nicely after exit, but only with somewhat of a different accounting
system than powerpc uses -- we tie mm_cpumask to TLB valids, so it can
become spread over CPUs that aren't (and may never have been) using
that mm as a lazy mm. I don't know that the self-culling trick would
help a great deal within that scheme. So powerpc needs a bit more work
on that side of things too, hence looking at doing more of this in the
final TLB shootdown.

There are actually a lot of other things we can do as well to reduce
IPIs: batching being a simple hammer, some kind of quiescing, testing
the remote CPU to check what active mm it is using (a sketch of that
follows below), doing the un-lazy at certain defined points, etc., so
I'm actually not worried about IPIs suddenly popping up and rendering
the whole concept unworkable. At some point (unless we go with
something pretty complex like an SRCU type thing, or adding extra
locking, e.g. to use_mm()), at least sometimes an IPI will be required,
so I think it's reasonable to start here and introduce complexity more
slowly if it's justified.

Thanks,
Nick
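One of those mitigations -- checking what mm a remote CPU is actually
using before IPIing it -- could look roughly like the sketch below. This
is hypothetical: the per-CPU cpu_lazy_mm variable does not exist and
would have to be published by the context switch code with suitable
ordering; the point is only that a stale read errs toward one
unnecessary IPI rather than a missed shootdown. It reuses
do_shoot_lazy_tlb() from the patch.

/* Hypothetical filter: only IPI CPUs whose published lazy mm still
 * matches the dying mm. Falls back to the patch's behaviour if the
 * scratch cpumask cannot be allocated. */
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(struct mm_struct *, cpu_lazy_mm);

static void shoot_lazy_tlbs_filtered(struct mm_struct *mm)
{
	cpumask_var_t targets;
	int cpu;

	if (!zalloc_cpumask_var(&targets, GFP_KERNEL))
		goto fallback;

	for_each_cpu(cpu, mm_cpumask(mm)) {
		/* Racy read: a CPU that moved on just costs an extra IPI. */
		if (READ_ONCE(per_cpu(cpu_lazy_mm, cpu)) == mm)
			cpumask_set_cpu(cpu, targets);
	}
	on_each_cpu_mask(targets, do_shoot_lazy_tlb, (void *)mm, 1);
	free_cpumask_var(targets);
	return;

fallback:
	smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
	do_shoot_lazy_tlb(mm);
}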
Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm: > Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am: >> >>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >>> >>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>>> >>>>> On big systems, the mm refcount can become highly contented when doing >>>>> a lot of context switching with threaded applications (particularly >>>>> switching between the idle thread and an application thread). >>>>> >>>>> Abandoning lazy tlb slows switching down quite a bit in the important >>>>> user->idle->user cases, so so instead implement a non-refcounted scheme >>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>>>> any remaining lazy ones. >>>>> >>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>>>> with as many software threads as CPUs (so each switch will go in and >>>>> out of idle), upstream can achieve a rate of about 1 million context >>>>> switches per second. After this patch it goes up to 118 million. >>>>> >>>> >>>> I read the patch a couple of times, and I have a suggestion that could >>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of >>>> refcount. You're saying "hey, this mm has no more references, but it >>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >>>> those references too." I'm wondering whether you actually need the >>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount >>>> for real? Roughly, in __mmdrop(), you would only free the page tables >>>> if mm_cpumask() is empty. And, in the code that removes a CPU from >>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >>>> you just removed the last bit from mm_cpumask and potentially free the >>>> mm. >>>> >>>> Getting the locking right here could be a bit tricky -- you need to >>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they >>>> should free the mm, and you also need to avoid an mm with mm_users >>>> hitting zero concurrently with the last remote CPU using it lazily >>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count >>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful >>>> cmpxchg or dec_return to make sure that only one CPU frees it. >>>> >>>> Or maybe you'd need a lock or RCU for this, but the idea would be to >>>> only ever take the lock after mm_users goes to zero. >>> >>> I don't think it's nonsense, it could be a good way to avoid IPIs. >>> >>> I haven't seen much problem here that made me too concerned about IPIs >>> yet, so I think the simple patch may be good enough to start with >>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the >>> unlazying with the exit TLB flush without doing anything fancy with >>> ref counting, but we'll see. >> >> I would be cautious with benchmarking here. I would expect that the >> nasty cases may affect power consumption more than performance — the >> specific issue is IPIs hitting idle cores, and the main effects are to >> slow down exit() a bit but also to kick the idle core out of idle. >> Although, if the idle core is in a deep sleep, that IPI could be >> *very* slow. 
> > It will tend to be self-limiting to some degree (deeper idle cores > would tend to have less chance of IPI) but we have bigger issues on > powerpc with that, like broadcast IPIs to the mm cpumask for THP > management. Power hasn't really shown up as an issue but powerpc > CPUs may have their own requirements and issues there, shall we say. > >> So I think it’s worth at least giving this a try. > > To be clear it's not a complete solution itself. The problem is of > course that mm cpumask gives you false negatives, so the bits > won't always clean up after themselves as CPUs switch away from their > lazy tlb mms. ^^ False positives: CPU is in the mm_cpumask, but is not using the mm as a lazy tlb. So there can be bits left and never freed. If you closed the false positives, you're back to a shared mm cache line on lazy mm context switches. Thanks, Nick
> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm: >> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am: >>> >>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >>>> >>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>>>> >>>>>> On big systems, the mm refcount can become highly contented when doing >>>>>> a lot of context switching with threaded applications (particularly >>>>>> switching between the idle thread and an application thread). >>>>>> >>>>>> Abandoning lazy tlb slows switching down quite a bit in the important >>>>>> user->idle->user cases, so so instead implement a non-refcounted scheme >>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>>>>> any remaining lazy ones. >>>>>> >>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>>>>> with as many software threads as CPUs (so each switch will go in and >>>>>> out of idle), upstream can achieve a rate of about 1 million context >>>>>> switches per second. After this patch it goes up to 118 million. >>>>>> >>>>> >>>>> I read the patch a couple of times, and I have a suggestion that could >>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of >>>>> refcount. You're saying "hey, this mm has no more references, but it >>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >>>>> those references too." I'm wondering whether you actually need the >>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount >>>>> for real? Roughly, in __mmdrop(), you would only free the page tables >>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from >>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >>>>> you just removed the last bit from mm_cpumask and potentially free the >>>>> mm. >>>>> >>>>> Getting the locking right here could be a bit tricky -- you need to >>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they >>>>> should free the mm, and you also need to avoid an mm with mm_users >>>>> hitting zero concurrently with the last remote CPU using it lazily >>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count >>>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful >>>>> cmpxchg or dec_return to make sure that only one CPU frees it. >>>>> >>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to >>>>> only ever take the lock after mm_users goes to zero. >>>> >>>> I don't think it's nonsense, it could be a good way to avoid IPIs. >>>> >>>> I haven't seen much problem here that made me too concerned about IPIs >>>> yet, so I think the simple patch may be good enough to start with >>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the >>>> unlazying with the exit TLB flush without doing anything fancy with >>>> ref counting, but we'll see. >>> >>> I would be cautious with benchmarking here. I would expect that the >>> nasty cases may affect power consumption more than performance — the >>> specific issue is IPIs hitting idle cores, and the main effects are to >>> slow down exit() a bit but also to kick the idle core out of idle. 
>>> Although, if the idle core is in a deep sleep, that IPI could be
>>> *very* slow.
>>
>> It will tend to be self-limiting to some degree (deeper idle cores
>> would tend to have less chance of IPI) but we have bigger issues on
>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>> management. Power hasn't really shown up as an issue but powerpc
>> CPUs may have their own requirements and issues there, shall we say.
>>
>>> So I think it’s worth at least giving this a try.
>>
>> To be clear it's not a complete solution itself. The problem is of
>> course that mm cpumask gives you false negatives, so the bits
>> won't always clean up after themselves as CPUs switch away from their
>> lazy tlb mms.
>
> ^^
>
> False positives: CPU is in the mm_cpumask, but is not using the mm
> as a lazy tlb. So there can be bits left and never freed.
>
> If you closed the false positives, you're back to a shared mm cache
> line on lazy mm context switches.

x86 has this exact problem. At least no more than 64*8 CPUs share the
cache line :)

Can you share your benchmark?

>
> Thanks,
> Nick
On Tue, Jul 14, 2020 at 05:46:05AM -0700, Andy Lutomirski wrote:
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :)
I've seen patches for a 'sparse' bitmap to solve related problems.
It's basically the same code, except it multiplies everything (size,
bit-nr) by a constant to reduce the number of active bits per line.
This sadly doesn't take topology into account, but reducing contention
is still good, of course.
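For illustration, the numbers work out like this: a 64-byte cache line
holds 64*8 = 512 bits, so up to 512 CPUs' bits of a plain cpumask land on
one line. A "sparse" bitmap in the spirit described above (a sketch, not
an existing kernel facility) just scales the bit number by a constant so
fewer CPUs share each line, at the cost of proportionally more storage:

/* Sketch of a spread-out cpumask: with a spread of 8, only 512/8 = 64
 * CPUs share each 64-byte cache line, using 8x the storage. */
#include <linux/bitops.h>
#include <linux/threads.h>

#define SPARSE_SPREAD	8

struct sparse_cpumask {
	unsigned long bits[BITS_TO_LONGS(NR_CPUS * SPARSE_SPREAD)];
};

static inline void sparse_cpumask_set_cpu(int cpu, struct sparse_cpumask *m)
{
	set_bit(cpu * SPARSE_SPREAD, m->bits);
}

static inline bool sparse_cpumask_test_cpu(int cpu,
					   const struct sparse_cpumask *m)
{
	return test_bit(cpu * SPARSE_SPREAD, m->bits);
}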
Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm: > > >> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm: >>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am: >>>> >>>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >>>>> >>>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>>>>> >>>>>>> On big systems, the mm refcount can become highly contented when doing >>>>>>> a lot of context switching with threaded applications (particularly >>>>>>> switching between the idle thread and an application thread). >>>>>>> >>>>>>> Abandoning lazy tlb slows switching down quite a bit in the important >>>>>>> user->idle->user cases, so so instead implement a non-refcounted scheme >>>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>>>>>> any remaining lazy ones. >>>>>>> >>>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>>>>>> with as many software threads as CPUs (so each switch will go in and >>>>>>> out of idle), upstream can achieve a rate of about 1 million context >>>>>>> switches per second. After this patch it goes up to 118 million. >>>>>>> >>>>>> >>>>>> I read the patch a couple of times, and I have a suggestion that could >>>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of >>>>>> refcount. You're saying "hey, this mm has no more references, but it >>>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >>>>>> those references too." I'm wondering whether you actually need the >>>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount >>>>>> for real? Roughly, in __mmdrop(), you would only free the page tables >>>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from >>>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >>>>>> you just removed the last bit from mm_cpumask and potentially free the >>>>>> mm. >>>>>> >>>>>> Getting the locking right here could be a bit tricky -- you need to >>>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they >>>>>> should free the mm, and you also need to avoid an mm with mm_users >>>>>> hitting zero concurrently with the last remote CPU using it lazily >>>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count >>>>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >>>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful >>>>>> cmpxchg or dec_return to make sure that only one CPU frees it. >>>>>> >>>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to >>>>>> only ever take the lock after mm_users goes to zero. >>>>> >>>>> I don't think it's nonsense, it could be a good way to avoid IPIs. >>>>> >>>>> I haven't seen much problem here that made me too concerned about IPIs >>>>> yet, so I think the simple patch may be good enough to start with >>>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the >>>>> unlazying with the exit TLB flush without doing anything fancy with >>>>> ref counting, but we'll see. >>>> >>>> I would be cautious with benchmarking here. 
I would expect that the >>>> nasty cases may affect power consumption more than performance — the >>>> specific issue is IPIs hitting idle cores, and the main effects are to >>>> slow down exit() a bit but also to kick the idle core out of idle. >>>> Although, if the idle core is in a deep sleep, that IPI could be >>>> *very* slow. >>> >>> It will tend to be self-limiting to some degree (deeper idle cores >>> would tend to have less chance of IPI) but we have bigger issues on >>> powerpc with that, like broadcast IPIs to the mm cpumask for THP >>> management. Power hasn't really shown up as an issue but powerpc >>> CPUs may have their own requirements and issues there, shall we say. >>> >>>> So I think it’s worth at least giving this a try. >>> >>> To be clear it's not a complete solution itself. The problem is of >>> course that mm cpumask gives you false negatives, so the bits >>> won't always clean up after themselves as CPUs switch away from their >>> lazy tlb mms. >> >> ^^ >> >> False positives: CPU is in the mm_cpumask, but is not using the mm >> as a lazy tlb. So there can be bits left and never freed. >> >> If you closed the false positives, you're back to a shared mm cache >> line on lazy mm context switches. > > x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :) > > Can your share your benchmark? It's just context_switch1_threads from will-it-scale, running with 1/2 the number of CPUs, and core affinity with an SMT processor (so each thread from the switching pairs gets spread to their own CPU and so you get the task->idle->task switching. It's really just about the worst case, so I wouldn't say it's something to panic about. Thanks, Nick
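For readers without the will-it-scale harness handy, the shape of that
workload is roughly the single-pair model below (a sketch, not the actual
benchmark): two threads pinned to different CPUs ping-pong a byte over a
pair of pipes, so each CPU goes idle while waiting and every wakeup is a
user->idle->user switch. The real test runs one such pair per pair of
CPUs and reports switches per second.

/* Minimal model of a context_switch1_threads-style pair: each blocking
 * read sends the pinned CPU through idle before the partner wakes it. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static int p1[2], p2[2];

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *partner(void *arg)
{
	char c;

	pin_to_cpu(1);			/* second CPU of the pair */
	for (;;) {
		if (read(p1[0], &c, 1) != 1)
			break;
		if (write(p2[1], &c, 1) != 1)
			break;
	}
	return NULL;
}

int main(void)
{
	unsigned long i, iters = 1000000;
	pthread_t tid;
	char c = 0;

	if (pipe(p1) || pipe(p2))
		return 1;
	pthread_create(&tid, NULL, partner, NULL);
	pin_to_cpu(0);			/* first CPU of the pair */

	for (i = 0; i < iters; i++) {
		if (write(p1[1], &c, 1) != 1)	/* wake the partner... */
			break;
		if (read(p2[0], &c, 1) != 1)	/* ...then block: this CPU idles */
			break;
	}
	printf("%lu round trips (each is two context switches per CPU)\n", i);
	return 0;
}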
Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm: > > >> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm: >>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am: >>>> >>>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >>>>> >>>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>>>>> >>>>>>> On big systems, the mm refcount can become highly contented when doing >>>>>>> a lot of context switching with threaded applications (particularly >>>>>>> switching between the idle thread and an application thread). >>>>>>> >>>>>>> Abandoning lazy tlb slows switching down quite a bit in the important >>>>>>> user->idle->user cases, so so instead implement a non-refcounted scheme >>>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>>>>>> any remaining lazy ones. >>>>>>> >>>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>>>>>> with as many software threads as CPUs (so each switch will go in and >>>>>>> out of idle), upstream can achieve a rate of about 1 million context >>>>>>> switches per second. After this patch it goes up to 118 million. >>>>>>> >>>>>> >>>>>> I read the patch a couple of times, and I have a suggestion that could >>>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of >>>>>> refcount. You're saying "hey, this mm has no more references, but it >>>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >>>>>> those references too." I'm wondering whether you actually need the >>>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount >>>>>> for real? Roughly, in __mmdrop(), you would only free the page tables >>>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from >>>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >>>>>> you just removed the last bit from mm_cpumask and potentially free the >>>>>> mm. >>>>>> >>>>>> Getting the locking right here could be a bit tricky -- you need to >>>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they >>>>>> should free the mm, and you also need to avoid an mm with mm_users >>>>>> hitting zero concurrently with the last remote CPU using it lazily >>>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count >>>>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >>>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful >>>>>> cmpxchg or dec_return to make sure that only one CPU frees it. >>>>>> >>>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to >>>>>> only ever take the lock after mm_users goes to zero. >>>>> >>>>> I don't think it's nonsense, it could be a good way to avoid IPIs. >>>>> >>>>> I haven't seen much problem here that made me too concerned about IPIs >>>>> yet, so I think the simple patch may be good enough to start with >>>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the >>>>> unlazying with the exit TLB flush without doing anything fancy with >>>>> ref counting, but we'll see. >>>> >>>> I would be cautious with benchmarking here. 
I would expect that the >>>> nasty cases may affect power consumption more than performance — the >>>> specific issue is IPIs hitting idle cores, and the main effects are to >>>> slow down exit() a bit but also to kick the idle core out of idle. >>>> Although, if the idle core is in a deep sleep, that IPI could be >>>> *very* slow. >>> >>> It will tend to be self-limiting to some degree (deeper idle cores >>> would tend to have less chance of IPI) but we have bigger issues on >>> powerpc with that, like broadcast IPIs to the mm cpumask for THP >>> management. Power hasn't really shown up as an issue but powerpc >>> CPUs may have their own requirements and issues there, shall we say. >>> >>>> So I think it’s worth at least giving this a try. >>> >>> To be clear it's not a complete solution itself. The problem is of >>> course that mm cpumask gives you false negatives, so the bits >>> won't always clean up after themselves as CPUs switch away from their >>> lazy tlb mms. >> >> ^^ >> >> False positives: CPU is in the mm_cpumask, but is not using the mm >> as a lazy tlb. So there can be bits left and never freed. >> >> If you closed the false positives, you're back to a shared mm cache >> line on lazy mm context switches. > > x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :) > > Can your share your benchmark? Just testing the IPI rates (on a smaller 176 CPU system), on a kernel compile, it causes about 300 shootdown interrupts (not 300 broadcasts but total interrupts). And very short lived fork;exec;exit things like typical scripting commands doesn't typically generate any. So yeah the really high exit rate things self-limit pretty well. I documented the concern and added a few of the possible ways to further reduce IPIs in the comments though. Thanks, Nick
diff --git a/arch/Kconfig b/arch/Kconfig
index 2daf8fe6146a..edf69437a971 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,22 @@ config MMU_LAZY_TLB
 	help
 	  Enable "lazy TLB" mmu context switching for kernel threads.
 
+config MMU_LAZY_TLB_REFCOUNT
+	def_bool y
+	depends on MMU_LAZY_TLB
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
+	depends on MMU_LAZY_TLB
+	help
+	  Instead of refcounting the "lazy tlb" mm struct, which can cause
+	  contention with multi-threaded apps on large multiprocessor systems,
+	  this option causes __mmdrop to IPI all CPUs in the mm_cpumask and
+	  switch to init_mm if they were using the to-be-freed mm as the lazy
+	  tlb. Architectures which do not track all possible lazy tlb CPUs in
+	  mm_cpumask can not use this (without modification).
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 920c4e3ca4ef..24ac85c868db 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -225,6 +225,7 @@ config PPC
 	select HAVE_PERF_USER_STACK_DUMP
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_PAGE_SIZE
+	select MMU_LAZY_TLB_SHOOTDOWN
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
 	select HAVE_SYSCALL_TRACEPOINTS
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2c2b20e2ccc7..1067af8039bd 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -53,19 +53,19 @@ void mmdrop(struct mm_struct *mm);
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
 		mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
 		mmdrop(mm);
 }
 
 static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm)
 {
-	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
 		mmdrop(mm); /* This depends on mmdrop providing a full smp_mb() */
 	else
 		smp_mb();
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..da0fba9e6079 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -685,6 +685,40 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		BUG_ON(current->mm);
+		switch_mm(mm, &init_mm, current);
+		current->active_mm = &init_mm;
+	}
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	BUG_ON(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+		do_shoot_lazy_tlb(mm);
+	}
+}
+
+static void check_lazy_tlbs(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		smp_call_function(do_check_lazy_tlb, (void *)mm, 1);
+		do_check_lazy_tlb(mm);
+	}
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -695,6 +729,11 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
 	WARN_ON_ONCE(mm == current->active_mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	shoot_lazy_tlbs(mm);
+	check_lazy_tlbs(mm);
+
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_subscriptions_destroy(mm);
On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig             | 16 ++++++++++++++++
 arch/powerpc/Kconfig     |  1 +
 include/linux/sched/mm.h |  6 +++---
 kernel/fork.c            | 39 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 59 insertions(+), 3 deletions(-)
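For context on where the contention in the first paragraph comes from,
the scheduler's lazy tlb handling looks roughly like the sketch below (a
simplified paraphrase of the context switch path using the helpers this
series touches, not a verbatim copy of kernel/sched/core.c): every
user->idle switch grabs a reference on the departing user mm, and every
idle->user switch drops it again, so all CPUs running threads of one
process keep bouncing that mm's mm_count cache line.

/* Simplified paraphrase of the lazy tlb part of a context switch.
 * Returns the mm the caller should mmdrop_lazy_tlb(), if any. */
#include <linux/sched.h>
#include <linux/sched/mm.h>

static struct mm_struct *lazy_tlb_switch(struct task_struct *prev,
					 struct task_struct *next)
{
	struct mm_struct *oldmm = prev->active_mm;

	if (!next->mm) {
		/* idle/kernel thread: borrow the previous mm lazily */
		next->active_mm = oldmm;
		mmgrab_lazy_tlb(oldmm);		/* atomic inc on a shared line */
		return NULL;
	}

	switch_mm(oldmm, next->mm, next);
	if (!prev->mm) {
		/* leaving a lazy user: the borrowed reference goes away */
		prev->active_mm = NULL;
		return oldmm;			/* atomic dec on the same line */
	}
	return NULL;
}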