Message ID | 20190329150934.17694-2-jgross@suse.com
State      | Superseded
Series     | xen: add core scheduling support
Hi,

On 3/29/19 3:08 PM, Juergen Gross wrote:
> cpu_disable_scheduler() is being called from __cpu_disable() today.
> There is no need to execute it on the cpu just being disabled, so use
> the CPU_DEAD case of the cpu notifier chain. Moving the call out of
> stop_machine() context is fine, as we just need to hold the domain RCU
> lock and need the scheduler percpu data to be still allocated.
>
> Add another hook for CPU_DOWN_PREPARE to bail out early in case
> cpu_disable_scheduler() would fail. This will avoid crashes in rare
> cases for cpu hotplug or suspend.
>
> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
> incarnation.

It is not obvious why the smp_mb() is superfluous. Can you please
provide more details on why it is not necessary?

Cheers,
On 01/04/2019 11:21, Julien Grall wrote:
> On 3/29/19 3:08 PM, Juergen Gross wrote:
>> [...]
>> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
>> incarnation.
>
> It is not obvious why the smp_mb() is superfluous. Can you please
> provide more details on why it is not necessary?

cpumask_clear_cpu() should already have the needed semantics, no?
It is based on clear_bit() which is defined to be atomic.

Juergen
Hi,

On 4/1/19 10:40 AM, Juergen Gross wrote:
> On 01/04/2019 11:21, Julien Grall wrote:
>> [...]
>> It is not obvious why the smp_mb() is superfluous. Can you please
>> provide more details on why it is not necessary?
>
> cpumask_clear_cpu() should already have the needed semantics, no?
> It is based on clear_bit() which is defined to be atomic.

Atomicity does not mean the store/load cannot be re-ordered by the CPU;
you would need a barrier to prevent re-ordering. cpumask_clear_cpu()
and clear_bit() do not contain any barrier, so surrounding stores/loads
can be re-ordered.

I see we have a similar smp_mb() barrier in __cpu_die(). Sadly, there is
no documentation in the code explaining why the barrier is there, and
the logs don't help either.

The barrier here will ensure that the loads/stores related to disabling
the CPU are seen before any load/store happening after the return,
although I am not sure why that is necessary.

Stefano, do you remember the rationale?

Cheers,
On 01/04/2019 12:29, Julien Grall wrote:
> On 4/1/19 10:40 AM, Juergen Gross wrote:
>> [...]
>> cpumask_clear_cpu() should already have the needed semantics, no?
>> It is based on clear_bit() which is defined to be atomic.
>
> Atomicity does not mean the store/load cannot be re-ordered by the CPU;
> you would need a barrier to prevent re-ordering.
>
> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
> surrounding stores/loads can be re-ordered.

Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
there suggests the sequence of setting the blocked bit and doing the
test is important for avoiding a race...

Juergen
Hi Juergen,

On 4/1/19 11:37 AM, Juergen Gross wrote:
> [...]
> Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
> there suggests the sequence of setting the blocked bit and doing the
> test is important for avoiding a race...

Hmmm... looking at the other usages (such as in do_poll), on non-x86
platforms there is a smp_mb() between the set_bit(...) and checking the
event, with a similar comment above it.

I don't know the scheduler code well enough to say why the barrier is
needed. But for consistency, it seems to me the smp_mb() would be
required in vcpu_block() as well.

Also, it is quite interesting that the barrier is not present for x86.
If I understand correctly the comment on top of set_bit/clear_bit, they
could be re-ordered as well. So we seem to be relying on the underlying
implementation of set_bit/clear_bit. Wouldn't it make sense to try to
unify the semantics? Maybe by introducing a new helper?

Cheers,
On 01/04/2019 15:21, Julien Grall wrote:
> [...]
> Also, it is quite interesting that the barrier is not present for x86.
> If I understand correctly the comment on top of set_bit/clear_bit, they
> could be re-ordered as well. So we seem to be relying on the underlying
> implementation of set_bit/clear_bit.

On x86 reads and writes can't be reordered with locked operations (SDM
Vol 3 8.2.2). So the barrier is really not needed AFAIU.

include/asm-x86/bitops.h:

 * clear_bit() is atomic and may not be reordered.

> Wouldn't it make sense to try to unify the semantics? Maybe by
> introducing a new helper?

Or adding the barrier on ARM for the atomic operations?

Juergen
Hi,

On 4/1/19 2:33 PM, Juergen Gross wrote:
> [...]
> On x86 reads and writes can't be reordered with locked operations (SDM
> Vol 3 8.2.2). So the barrier is really not needed AFAIU.
>
> include/asm-x86/bitops.h:
>
>  * clear_bit() is atomic and may not be reordered.

I interpreted the "may not" as: you should not rely on the re-ordering
not happening. In places where re-ordering should not happen (e.g.
test_and_set_bit) we use the wording "cannot".

>> Wouldn't it make sense to try to unify the semantics? Maybe by
>> introducing a new helper?
>
> Or adding the barrier on ARM for the atomic operations?

On which basis? Why should we impact every user for fixing a bug in
the scheduler?

Cheers,
On 01/04/2019 16:01, Julien Grall wrote:
> [...]
> I interpreted the "may not" as: you should not rely on the re-ordering
> not happening. In places where re-ordering should not happen (e.g.
> test_and_set_bit) we use the wording "cannot".

The SDM is very clear here:

"Reads or writes cannot be reordered with I/O instructions, locked
instructions, or serializing instructions."

>>> Wouldn't it make sense to try to unify the semantics? Maybe by
>>> introducing a new helper?
>>
>> Or adding the barrier on ARM for the atomic operations?
>
> On which basis? Why should we impact every user for fixing a bug in
> the scheduler?

I'm assuming there are more places like this either in common code or
code copied verbatim from arch/x86 to arch/arm with that problem.

So I take it you'd rather let me add that smp_mb() in __cpu_disable()
again.

Juergen
Hi,

On 4/1/19 3:23 PM, Juergen Gross wrote:
> [...]
> The SDM is very clear here:
>
> "Reads or writes cannot be reordered with I/O instructions, locked
> instructions, or serializing instructions."

That is what the specification says, not the intended semantics of the
helper. A helper may have more relaxed semantics to accommodate other
architectures, and I believe this is the case here: the documented
semantics are more relaxed than the x86 implementation, so that
architectures with a more relaxed memory ordering do not have to impose
a barrier.

> [...]
> I'm assuming there are more places like this either in common code or
> code copied verbatim from arch/x86 to arch/arm with that problem.

Adding it in the *_set helpers is just the poor man's fix. If we do
that, it is going to stick around for a long time and impact
performance. Instead we should fix the scheduler code (and hopefully
only that) where the ordering is necessary.

> So I take it you'd rather let me add that smp_mb() in __cpu_disable()
> again.

Removing/adding barriers should be accompanied by a proper
justification in the commit message. Additionally, a new barrier should
have a comment explaining what it is for.

In this case, I don't know what the correct answer is. It feels to me
we should keep the barrier until we have a better understanding of this
code. But then it raises the question whether a barrier would also be
necessary after calling cpu_disable_scheduler().

Cheers,
On 01/04/2019 17:15, Julien Grall wrote:
> [...]
> Adding it in the *_set helpers is just the poor man's fix. If we do
> that, it is going to stick around for a long time and impact
> performance.
>
> Instead we should fix the scheduler code (and hopefully only that)
> where the ordering is necessary.

I believe that should be a patch on its own. Are you doing that?

>> So I take it you'd rather let me add that smp_mb() in __cpu_disable()
>> again.
>
> Removing/adding barriers should be accompanied by a proper
> justification in the commit message. Additionally, a new barrier
> should have a comment explaining what it is for.
>
> In this case, I don't know what the correct answer is. It feels to me
> we should keep the barrier until we have a better understanding of
> this code. But

Okay.

> then it raises the question whether a barrier would also be necessary
> after calling cpu_disable_scheduler().

That one is quite easy: all paths of cpu_disable_scheduler() end with
an unlock operation, so the barrier is already there.

Juergen
Hi,

On 4/1/19 5:00 PM, Juergen Gross wrote:
> [...]
>> Instead we should fix the scheduler code (and hopefully only that)
>> where the ordering is necessary.
>
> I believe that should be a patch on its own. Are you doing that?

I will try to have a look tomorrow.

> [...]
>> then it raises the question whether a barrier would also be necessary
>> after calling cpu_disable_scheduler().
>
> That one is quite easy: all paths of cpu_disable_scheduler() end with
> an unlock operation, so the barrier is already there.

Oh, nothing to worry about then :). Thank you for looking at it.

Cheers,
On Mon, 1 Apr 2019, Julien Grall wrote:
> Hi,
>
> On 4/1/19 10:40 AM, Juergen Gross wrote:
> > On 01/04/2019 11:21, Julien Grall wrote:
> > > Hi,
> > >
> > > On 3/29/19 3:08 PM, Juergen Gross wrote:
> > > > cpu_disable_scheduler() is being called from __cpu_disable() today.
> > > > There is no need to execute it on the cpu just being disabled, so use
> > > > the CPU_DEAD case of the cpu notifier chain. Moving the call out of
> > > > stop_machine() context is fine, as we just need to hold the domain RCU
> > > > lock and need the scheduler percpu data to be still allocated.
> > > >
> > > > Add another hook for CPU_DOWN_PREPARE to bail out early in case
> > > > cpu_disable_scheduler() would fail. This will avoid crashes in rare
> > > > cases for cpu hotplug or suspend.
> > > >
> > > > While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
> > > > incarnation.
> > >
> > > This is not obvious why the smp_mb() is superfluous. Can you please
> > > provide more details on why this is not necessary?
> >
> > cpumask_clear_cpu() should already have the needed semantics, no?
> > It is based on clear_bit() which is defined to be atomic.
>
> atomicity does not mean the store/load cannot be re-ordered by the CPU. You
> would need a barrier to prevent re-ordering.
>
> cpumask_clear_cpu() and clear_bit() does not contain any barrier, so
> store/load can be re-ordered.
>
> I see we have similar smp_mb() barrier in __cpu_die(). Sadly, there are no
> documentation in the code why the barrier is here. The logs don't help either.
>
> The barrier here will ensure that the load/store related to disabling the CPU
> are seen before any load/store happening after the return. Although, I am not
> sure why this is necessary.
>
> Stefano, Do you remember the rationale?

/me doing some archeology

I am pretty sure it was meant to accompany the cpumask_clear_cpu call.
I think we should keep it in __cpu_disable right after
cpumask_clear_cpu.
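The ordering Stefano suggests keeping (clear the CPU's bit in cpu_online_map, then a full barrier) can likewise be modelled with C11 atomics. This is only a sketch with invented names, not Xen code: `cpu_mark_offline` is hypothetical, `atomic_fetch_and` stands in for the atomic clear_bit underlying cpumask_clear_cpu, and the seq_cst fence stands in for smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the __cpu_disable() tail: CPUs 0-7 online. */
static atomic_ulong cpu_online_map = 0xffUL;

static void cpu_mark_offline(unsigned int cpu)
{
    /* cpumask_clear_cpu(): atomic clear of the CPU's bit, but with no
     * ordering guarantee against surrounding accesses on Arm. */
    atomic_fetch_and_explicit(&cpu_online_map, ~(1UL << cpu),
                              memory_order_relaxed);

    /* smp_mb(): make the updated online map globally visible before any
     * load/store performed after this function returns. */
    atomic_thread_fence(memory_order_seq_cst);
}
```

Placed this way, the barrier belongs to the map update itself, which matches the suggestion to keep it immediately after cpumask_clear_cpu rather than attributing it to the removed cpu_disable_scheduler() call.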
diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index 25cd44549c..0728a9b505 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -386,10 +386,6 @@ void __cpu_disable(void)
     /* It's now safe to remove this processor from the online map */
     cpumask_clear_cpu(cpu, &cpu_online_map);
 
-    if ( cpu_disable_scheduler(cpu) )
-        BUG();
-    smp_mb();
-
     /* Return to caller; eventually the IPI mechanism will unwind and the
      * scheduler will drop to the idle loop, which will call stop_cpu(). */
 }
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 7d1226d7bc..b7a0a4a419 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1221,9 +1221,6 @@ void __cpu_disable(void)
     cpumask_clear_cpu(cpu, &cpu_online_map);
     fixup_irqs(&cpu_online_map, 1);
     fixup_eoi();
-
-    if ( cpu_disable_scheduler(cpu) )
-        BUG();
 }
 
 void __cpu_die(unsigned int cpu)
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 60755a631e..5d2bbd5198 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -773,8 +773,9 @@ void restore_vcpu_affinity(struct domain *d)
 }
 
 /*
- * This function is used by cpu_hotplug code from stop_machine context
+ * This function is used by cpu_hotplug code via cpu notifier chain
  * and from cpupools to switch schedulers on a cpu.
+ * Caller must get domlist_read_lock.
  */
 int cpu_disable_scheduler(unsigned int cpu)
 {
@@ -789,12 +790,6 @@ int cpu_disable_scheduler(unsigned int cpu)
     if ( c == NULL )
         return ret;
 
-    /*
-     * We'd need the domain RCU lock, but:
-     *  - when we are called from cpupool code, it's acquired there already;
-     *  - when we are called for CPU teardown, we're in stop-machine context,
-     *    so that's not be a problem.
-     */
     for_each_domain_in_cpupool ( d, c )
     {
         for_each_vcpu ( d, v )
@@ -893,6 +888,30 @@ int cpu_disable_scheduler(unsigned int cpu)
     return ret;
 }
 
+static int cpu_disable_scheduler_check(unsigned int cpu)
+{
+    struct domain *d;
+    struct vcpu *v;
+    struct cpupool *c;
+
+    c = per_cpu(cpupool, cpu);
+    if ( c == NULL )
+        return 0;
+
+    for_each_domain_in_cpupool ( d, c )
+    {
+        for_each_vcpu ( d, v )
+        {
+            if ( v->affinity_broken )
+                return -EADDRINUSE;
+            if ( system_state != SYS_STATE_suspend && v->processor == cpu )
+                return -EAGAIN;
+        }
+    }
+
+    return 0;
+}
+
 /*
  * In general, this must be called with the scheduler lock held, because the
  * adjust_affinity hook may want to modify the vCPU state. However, when the
@@ -1737,7 +1756,16 @@ static int cpu_schedule_callback(
     case CPU_UP_PREPARE:
         rc = cpu_schedule_up(cpu);
         break;
+    case CPU_DOWN_PREPARE:
+        rcu_read_lock(&domlist_read_lock);
+        rc = cpu_disable_scheduler_check(cpu);
+        rcu_read_unlock(&domlist_read_lock);
+        break;
     case CPU_DEAD:
+        rcu_read_lock(&domlist_read_lock);
+        rc = cpu_disable_scheduler(cpu);
+        BUG_ON(rc);
+        rcu_read_unlock(&domlist_read_lock);
         SCHED_OP(sched, deinit_pdata, sd->sched_priv, cpu);
         /* Fallthrough */
     case CPU_UP_CANCELED:
cpu_disable_scheduler() is being called from __cpu_disable() today.
There is no need to execute it on the cpu just being disabled, so use
the CPU_DEAD case of the cpu notifier chain. Moving the call out of
stop_machine() context is fine, as we just need to hold the domain RCU
lock and need the scheduler percpu data to be still allocated.

Add another hook for CPU_DOWN_PREPARE to bail out early in case
cpu_disable_scheduler() would fail. This will avoid crashes in rare
cases for cpu hotplug or suspend.

While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
incarnation.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- add CPU_DOWN_PREPARE hook
- BUG() in case of cpu_disable_scheduler() failing in CPU_DEAD
  (Jan Beulich)
- modify ARM __cpu_disable(), too (Andrew Cooper)
---
 xen/arch/arm/smpboot.c |  4 ----
 xen/arch/x86/smpboot.c |  3 ---
 xen/common/schedule.c  | 42 +++++++++++++++++++++++++++++++++++-------
 3 files changed, 35 insertions(+), 14 deletions(-)