Message ID | 4FFA4269.5050808@jp.fujitsu.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: > Hi Srivatsa, > > Thank you for your reviewing. > > 2012/07/06 18:51, Srivatsa S. Bhat wrote: >> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: >>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu. >> >> Ouch! >> >>> But in this case, it should return error number since some process may run on >>> the cpu. If the cpu has a running process and the cpu is turned the power off, >>> the system cannot work well. >>> >>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> >>> >>> --- >>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ >>> 1 file changed, 12 insertions(+), 6 deletions(-) >>> >>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>> =================================================================== >>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-06-25 04:53:04.000000000 +0900 >>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 21:02:58.711285382 +0900 >>> @@ -610,7 +610,7 @@ err_free_pr: >>> static int acpi_processor_remove(struct acpi_device *device, int type) >>> { >>> struct acpi_processor *pr = NULL; >>> - >>> + int ret; >>> >>> if (!device || !acpi_driver_data(device)) >>> return -EINVAL; >>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct >>> goto free; >>> >>> if (type == ACPI_BUS_REMOVAL_EJECT) { >>> - if (acpi_processor_handle_eject(pr)) >>> - return -EINVAL; >>> + ret = acpi_processor_handle_eject(pr); >>> + if (ret) >>> + return ret; >>> } >>> >>> acpi_processor_power_exit(pr, device); >>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd >>> >>> static int acpi_processor_handle_eject(struct acpi_processor *pr) >>> { >>> - if (cpu_online(pr->id)) >>> - cpu_down(pr->id); >>> + int ret; >>> + >>> + if (cpu_online(pr->id)) { >>> + ret = cpu_down(pr->id); >>> + if (ret) >>> + return ret; >>> + } >>> >> >> Strictly speaking, this is not thorough enough. What prevents someone >> from onlining that same cpu again, at this point? >> So, IMHO, you need to wrap the contents of this function inside a >> get_online_cpus()/put_online_cpus() block, to prevent anyone else >> from messing with CPU hotplug at the same time. > > If I understand your comment by mistake, please let me know. > If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block > as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and > cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. > You are right. Sorry, I overlooked that. > + get_online_cpus() > + if (cpu_online(pr->id)) { > + ret = cpu_down(pr->id); > + if (ret) > + return ret; > + } > + put_online_cpus() > > I think following patch can prevent it correctly. How about the patch? > > --- > drivers/acpi/processor_driver.c | 12 ++++++++++++ > kernel/cpu.c | 8 +++++--- > 2 files changed, 17 insertions(+), 3 deletions(-) > > Index: linux-3.5-rc4/drivers/acpi/processor_driver.c > =================================================================== > --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 09:59:01.280211202 +0900 > +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 11:05:34.559859236 +0900 > @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s > { > int ret; > > +retry: > if (cpu_online(pr->id)) { > ret = cpu_down(pr->id); > if (ret) > return ret; > } > > + get_online_cpus(); > + /* > + * Someone might online the cpu again at this point. So we check that > + * cpu has been onlined or not. If cpu is online, we try to offline > + * the cpu again. > + */ > + if (cpu_online(pr->id)) { How about this: if (unlikely(cpu_online(pr->id)) { since the probability of this happening is quite small... > + put_online_cpus(); > + goto retry; > + } > arch_unregister_cpu(pr->id); > acpi_unmap_lsapic(pr->id); > + put_online_cpus(); > return ret; > } This retry logic doesn't look elegant, but I don't see any better method :-( > #else > Index: linux-3.5-rc4/kernel/cpu.c > =================================================================== > --- linux-3.5-rc4.orig/kernel/cpu.c 2012-07-09 09:59:01.280211202 +0900 > +++ linux-3.5-rc4/kernel/cpu.c 2012-07-09 09:59:02.903190965 +0900 > @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in > unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; > struct task_struct *idle; > > - if (cpu_online(cpu) || !cpu_present(cpu)) > - return -EINVAL; > - > cpu_hotplug_begin(); > > + if (cpu_online(cpu) || !cpu_present(cpu)) { > + ret = -EINVAL; > + goto out; > + } > + Firstly, why is this change needed? Secondly, if the change is indeed an improvement, then why is it in _this_ patch? IMHO, in that case it should be part of a separate patch. Coming back to my first point, I don't see why this hunk is needed. We already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before checking the status of the cpu (online or present). And all hotplug operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that lock. Isn't that enough? Or am I missing something? > idle = idle_thread_get(cpu); > if (IS_ERR(idle)) { > ret = PTR_ERR(idle); > Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote: > On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: > > Hi Srivatsa, > > > > Thank you for your reviewing. > > > > 2012/07/06 18:51, Srivatsa S. Bhat wrote: > >> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: > >>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu. > >> > >> Ouch! > >> > >>> But in this case, it should return error number since some process may run on > >>> the cpu. If the cpu has a running process and the cpu is turned the power off, > >>> the system cannot work well. > >>> > >>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> > >>> > >>> --- > >>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ > >>> 1 file changed, 12 insertions(+), 6 deletions(-) > >>> > >>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c > >>> =================================================================== > >>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-06-25 04:53:04.000000000 +0900 > >>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 21:02:58.711285382 +0900 > >>> @@ -610,7 +610,7 @@ err_free_pr: > >>> static int acpi_processor_remove(struct acpi_device *device, int type) > >>> { > >>> struct acpi_processor *pr = NULL; > >>> - > >>> + int ret; > >>> > >>> if (!device || !acpi_driver_data(device)) > >>> return -EINVAL; > >>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct > >>> goto free; > >>> > >>> if (type == ACPI_BUS_REMOVAL_EJECT) { > >>> - if (acpi_processor_handle_eject(pr)) > >>> - return -EINVAL; > >>> + ret = acpi_processor_handle_eject(pr); > >>> + if (ret) > >>> + return ret; > >>> } > >>> > >>> acpi_processor_power_exit(pr, device); > >>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd > >>> > >>> static int acpi_processor_handle_eject(struct acpi_processor *pr) > >>> { > >>> - if (cpu_online(pr->id)) > >>> - cpu_down(pr->id); > >>> + int ret; > >>> + > >>> + if (cpu_online(pr->id)) { > >>> + ret = cpu_down(pr->id); > >>> + if (ret) > >>> + return ret; > >>> + } > >>> > >> > >> Strictly speaking, this is not thorough enough. What prevents someone > >> from onlining that same cpu again, at this point? > >> So, IMHO, you need to wrap the contents of this function inside a > >> get_online_cpus()/put_online_cpus() block, to prevent anyone else > >> from messing with CPU hotplug at the same time. > > > > If I understand your comment by mistake, please let me know. > > If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block > > as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and > > cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. > > > > You are right. Sorry, I overlooked that. > > > + get_online_cpus() > > + if (cpu_online(pr->id)) { > > + ret = cpu_down(pr->id); > > + if (ret) > > + return ret; > > + } > > + put_online_cpus() > > > > I think following patch can prevent it correctly. How about the patch? > > > > --- > > drivers/acpi/processor_driver.c | 12 ++++++++++++ > > kernel/cpu.c | 8 +++++--- > > 2 files changed, 17 insertions(+), 3 deletions(-) > > > > Index: linux-3.5-rc4/drivers/acpi/processor_driver.c > > =================================================================== > > --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 09:59:01.280211202 +0900 > > +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 11:05:34.559859236 +0900 > > @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s > > { > > int ret; > > > > +retry: > > if (cpu_online(pr->id)) { > > ret = cpu_down(pr->id); > > if (ret) > > return ret; > > } > > > > + get_online_cpus(); > > + /* > > + * Someone might online the cpu again at this point. So we check that > > + * cpu has been onlined or not. If cpu is online, we try to offline > > + * the cpu again. > > + */ > > + if (cpu_online(pr->id)) { > > How about this: > if (unlikely(cpu_online(pr->id)) { > since the probability of this happening is quite small... > > > + put_online_cpus(); > > + goto retry; > > + } > > arch_unregister_cpu(pr->id); > > acpi_unmap_lsapic(pr->id); > > + put_online_cpus(); > > return ret; > > } > > This retry logic doesn't look elegant, but I don't see any better method :-( Another possible option is to fail the request instead of retrying it. It would be quite challenging to allow on-lining and off-lining operations to run concurrently. In fact, even if we close this window, there is yet another window right after the new put_online_cpus() call. This CPU may become online before calling _EJ0 in the case of hot-remove. This goes beyond the scope of this patch, but IMHO, we should serialize in the request level. That is, a new on-lining request should not be allowed to proceed until the current request is complete. This scheme only allows a single operation at a time per OS instance, but I do not think it is a big issue. Serializing in the request level is also important when supporting container hot-remove, which can remove multiple children objects under a parent container object. For instance, a Node hot-remove may also remove multiple processors underneath of it. This kind of the operations has to make sure that all children objects are remained off-line until it ejects the parent object. Thanks, -Toshi -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Srivatsa, 2012/07/09 20:25, Srivatsa S. Bhat wrote: > On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: >> Hi Srivatsa, >> >> Thank you for your reviewing. >> >> 2012/07/06 18:51, Srivatsa S. Bhat wrote: >>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: >>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu. >>> >>> Ouch! >>> >>>> But in this case, it should return error number since some process may run on >>>> the cpu. If the cpu has a running process and the cpu is turned the power off, >>>> the system cannot work well. >>>> >>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> >>>> >>>> --- >>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ >>>> 1 file changed, 12 insertions(+), 6 deletions(-) >>>> >>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>> =================================================================== >>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-06-25 04:53:04.000000000 +0900 >>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 21:02:58.711285382 +0900 >>>> @@ -610,7 +610,7 @@ err_free_pr: >>>> static int acpi_processor_remove(struct acpi_device *device, int type) >>>> { >>>> struct acpi_processor *pr = NULL; >>>> - >>>> + int ret; >>>> >>>> if (!device || !acpi_driver_data(device)) >>>> return -EINVAL; >>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct >>>> goto free; >>>> >>>> if (type == ACPI_BUS_REMOVAL_EJECT) { >>>> - if (acpi_processor_handle_eject(pr)) >>>> - return -EINVAL; >>>> + ret = acpi_processor_handle_eject(pr); >>>> + if (ret) >>>> + return ret; >>>> } >>>> >>>> acpi_processor_power_exit(pr, device); >>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd >>>> >>>> static int acpi_processor_handle_eject(struct acpi_processor *pr) >>>> { >>>> - if (cpu_online(pr->id)) >>>> - cpu_down(pr->id); >>>> + int ret; >>>> + >>>> + if (cpu_online(pr->id)) { >>>> + ret = cpu_down(pr->id); >>>> + if (ret) >>>> + return ret; >>>> + } >>>> >>> >>> Strictly speaking, this is not thorough enough. What prevents someone >>> from onlining that same cpu again, at this point? >>> So, IMHO, you need to wrap the contents of this function inside a >>> get_online_cpus()/put_online_cpus() block, to prevent anyone else >>> from messing with CPU hotplug at the same time. >> >> If I understand your comment by mistake, please let me know. >> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block >> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and >> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. >> > > You are right. Sorry, I overlooked that. > >> + get_online_cpus() >> + if (cpu_online(pr->id)) { >> + ret = cpu_down(pr->id); >> + if (ret) >> + return ret; >> + } >> + put_online_cpus() >> >> I think following patch can prevent it correctly. How about the patch? >> >> --- >> drivers/acpi/processor_driver.c | 12 ++++++++++++ >> kernel/cpu.c | 8 +++++--- >> 2 files changed, 17 insertions(+), 3 deletions(-) >> >> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >> =================================================================== >> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 09:59:01.280211202 +0900 >> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 11:05:34.559859236 +0900 >> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s >> { >> int ret; >> >> +retry: >> if (cpu_online(pr->id)) { >> ret = cpu_down(pr->id); >> if (ret) >> return ret; >> } >> >> + get_online_cpus(); >> + /* >> + * Someone might online the cpu again at this point. So we check that >> + * cpu has been onlined or not. If cpu is online, we try to offline >> + * the cpu again. >> + */ >> + if (cpu_online(pr->id)) { > > How about this: > if (unlikely(cpu_online(pr->id)) { > since the probability of this happening is quite small... Thanks. I'll update it. >> + put_online_cpus(); >> + goto retry; >> + } >> arch_unregister_cpu(pr->id); >> acpi_unmap_lsapic(pr->id); >> + put_online_cpus(); >> return ret; >> } > > This retry logic doesn't look elegant, but I don't see any better method :-( > >> #else >> Index: linux-3.5-rc4/kernel/cpu.c >> =================================================================== >> --- linux-3.5-rc4.orig/kernel/cpu.c 2012-07-09 09:59:01.280211202 +0900 >> +++ linux-3.5-rc4/kernel/cpu.c 2012-07-09 09:59:02.903190965 +0900 >> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in >> unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; >> struct task_struct *idle; >> >> - if (cpu_online(cpu) || !cpu_present(cpu)) >> - return -EINVAL; >> - >> cpu_hotplug_begin(); >> >> + if (cpu_online(cpu) || !cpu_present(cpu)) { >> + ret = -EINVAL; >> + goto out; >> + } >> + > > Firstly, why is this change needed? I cared the race of hot-remove cpu and _cpu_up(). If I do not change it, there is the following race. hot-remove cpu | _cpu_up() ------------------------------------- ------------------------------------ call acpi_processor_handle_eject() | call cpu_down() | call get_online_cpus() | | call cpu_hotplug_begin() and stop here call arch_unregister_cpu() | call acpi_unmap_lsapic() | call put_online_cpus() | | start and continue _cpu_up() return acpi_processor_remove() | continue hot-remove the cpu | So _cpu_up() can continue to itself. And hot-remove cpu can also continue itself. If I change it, I think the race disappears as below: hot-remove cpu | _cpu_up() ----------------------------------------------------------------------- call acpi_processor_handle_eject() | call cpu_down() | call get_online_cpus() | | call cpu_hotplug_begin() and stop here call arch_unregister_cpu() | call acpi_unmap_lsapic() | cpu's cpu_present is set | to false by set_cpu_present()| call put_online_cpus() | | start _cpu_up() | check cpu_present() and return -EINVAL return acpi_processor_remove() | continue hot-remove the cpu | Thus I think the change is necessary. Thanks, Yasuaki Ishimatsu > Secondly, if the change is indeed an improvement, then why is it > in _this_ patch? IMHO, in that case it should be part of a separate patch. > > Coming back to my first point, I don't see why this hunk is needed. We > already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before > checking the status of the cpu (online or present). And all hotplug > operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that > lock. Isn't that enough? Or am I missing something? > >> idle = idle_thread_get(cpu); >> if (IS_ERR(idle)) { >> ret = PTR_ERR(idle); >> > > Regards, > Srivatsa S. Bhat > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Toshi, 2012/07/10 6:15, Toshi Kani wrote: > On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote: >> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: >>> Hi Srivatsa, >>> >>> Thank you for your reviewing. >>> >>> 2012/07/06 18:51, Srivatsa S. Bhat wrote: >>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: >>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu. >>>> >>>> Ouch! >>>> >>>>> But in this case, it should return error number since some process may run on >>>>> the cpu. If the cpu has a running process and the cpu is turned the power off, >>>>> the system cannot work well. >>>>> >>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> >>>>> >>>>> --- >>>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ >>>>> 1 file changed, 12 insertions(+), 6 deletions(-) >>>>> >>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>>> =================================================================== >>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-06-25 04:53:04.000000000 +0900 >>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 21:02:58.711285382 +0900 >>>>> @@ -610,7 +610,7 @@ err_free_pr: >>>>> static int acpi_processor_remove(struct acpi_device *device, int type) >>>>> { >>>>> struct acpi_processor *pr = NULL; >>>>> - >>>>> + int ret; >>>>> >>>>> if (!device || !acpi_driver_data(device)) >>>>> return -EINVAL; >>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct >>>>> goto free; >>>>> >>>>> if (type == ACPI_BUS_REMOVAL_EJECT) { >>>>> - if (acpi_processor_handle_eject(pr)) >>>>> - return -EINVAL; >>>>> + ret = acpi_processor_handle_eject(pr); >>>>> + if (ret) >>>>> + return ret; >>>>> } >>>>> >>>>> acpi_processor_power_exit(pr, device); >>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd >>>>> >>>>> static int acpi_processor_handle_eject(struct acpi_processor *pr) >>>>> { >>>>> - if (cpu_online(pr->id)) >>>>> - cpu_down(pr->id); >>>>> + int ret; >>>>> + >>>>> + if (cpu_online(pr->id)) { >>>>> + ret = cpu_down(pr->id); >>>>> + if (ret) >>>>> + return ret; >>>>> + } >>>>> >>>> >>>> Strictly speaking, this is not thorough enough. What prevents someone >>>> from onlining that same cpu again, at this point? >>>> So, IMHO, you need to wrap the contents of this function inside a >>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else >>>> from messing with CPU hotplug at the same time. >>> >>> If I understand your comment by mistake, please let me know. >>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block >>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and >>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. >>> >> >> You are right. Sorry, I overlooked that. >> >>> + get_online_cpus() >>> + if (cpu_online(pr->id)) { >>> + ret = cpu_down(pr->id); >>> + if (ret) >>> + return ret; >>> + } >>> + put_online_cpus() >>> >>> I think following patch can prevent it correctly. How about the patch? >>> >>> --- >>> drivers/acpi/processor_driver.c | 12 ++++++++++++ >>> kernel/cpu.c | 8 +++++--- >>> 2 files changed, 17 insertions(+), 3 deletions(-) >>> >>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>> =================================================================== >>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 09:59:01.280211202 +0900 >>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 11:05:34.559859236 +0900 >>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s >>> { >>> int ret; >>> >>> +retry: >>> if (cpu_online(pr->id)) { >>> ret = cpu_down(pr->id); >>> if (ret) >>> return ret; >>> } >>> >>> + get_online_cpus(); >>> + /* >>> + * Someone might online the cpu again at this point. So we check that >>> + * cpu has been onlined or not. If cpu is online, we try to offline >>> + * the cpu again. >>> + */ >>> + if (cpu_online(pr->id)) { >> >> How about this: >> if (unlikely(cpu_online(pr->id)) { >> since the probability of this happening is quite small... >> >>> + put_online_cpus(); >>> + goto retry; >>> + } >>> arch_unregister_cpu(pr->id); >>> acpi_unmap_lsapic(pr->id); >>> + put_online_cpus(); >>> return ret; >>> } >> >> This retry logic doesn't look elegant, but I don't see any better method :-( > > Another possible option is to fail the request instead of retrying it. Good idea!! I'll update it. > > It would be quite challenging to allow on-lining and off-lining > operations to run concurrently. In fact, even if we close this window, > there is yet another window right after the new put_online_cpus() call. I think if we close the window, another window does not open. acpi_unmap_lsapic() sets cpu_present mask to false before new put_online_cpus() is called. So even if _cpu_up() is called, the function returns -EINAVL by following added code. @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; struct task_struct *idle; - if (cpu_online(cpu) || !cpu_present(cpu)) - return -EINVAL; - cpu_hotplug_begin(); + if (cpu_online(cpu) || !cpu_present(cpu)) { + ret = -EINVAL; + goto out; + } + Thanks, Yasuaki Ishimatsu > This CPU may become online before calling _EJ0 in the case of > hot-remove. > > This goes beyond the scope of this patch, but IMHO, we should serialize > in the request level. That is, a new on-lining request should not be > allowed to proceed until the current request is complete. This scheme > only allows a single operation at a time per OS instance, but I do not > think it is a big issue. > > Serializing in the request level is also important when supporting > container hot-remove, which can remove multiple children objects under a > parent container object. For instance, a Node hot-remove may also > remove multiple processors underneath of it. This kind of the > operations has to make sure that all children objects are remained > off-line until it ejects the parent object. > > Thanks, > -Toshi > > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Srivatsa, 2012/07/10 9:13, Yasuaki Ishimatsu wrote: > Hi Srivatsa, > > 2012/07/09 20:25, Srivatsa S. Bhat wrote: >> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: >>> Hi Srivatsa, >>> >>> Thank you for your reviewing. >>> >>> 2012/07/06 18:51, Srivatsa S. Bhat wrote: >>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: >>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu. >>>> >>>> Ouch! >>>> >>>>> But in this case, it should return error number since some process may run on >>>>> the cpu. If the cpu has a running process and the cpu is turned the power off, >>>>> the system cannot work well. >>>>> >>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> >>>>> >>>>> --- >>>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ >>>>> 1 file changed, 12 insertions(+), 6 deletions(-) >>>>> >>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>>> =================================================================== >>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-06-25 04:53:04.000000000 +0900 >>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 21:02:58.711285382 +0900 >>>>> @@ -610,7 +610,7 @@ err_free_pr: >>>>> static int acpi_processor_remove(struct acpi_device *device, int type) >>>>> { >>>>> struct acpi_processor *pr = NULL; >>>>> - >>>>> + int ret; >>>>> >>>>> if (!device || !acpi_driver_data(device)) >>>>> return -EINVAL; >>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct >>>>> goto free; >>>>> >>>>> if (type == ACPI_BUS_REMOVAL_EJECT) { >>>>> - if (acpi_processor_handle_eject(pr)) >>>>> - return -EINVAL; >>>>> + ret = acpi_processor_handle_eject(pr); >>>>> + if (ret) >>>>> + return ret; >>>>> } >>>>> >>>>> acpi_processor_power_exit(pr, device); >>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd >>>>> >>>>> static int acpi_processor_handle_eject(struct acpi_processor *pr) >>>>> { >>>>> - if (cpu_online(pr->id)) >>>>> - cpu_down(pr->id); >>>>> + int ret; >>>>> + >>>>> + if (cpu_online(pr->id)) { >>>>> + ret = cpu_down(pr->id); >>>>> + if (ret) >>>>> + return ret; >>>>> + } >>>>> >>>> >>>> Strictly speaking, this is not thorough enough. What prevents someone >>>> from onlining that same cpu again, at this point? >>>> So, IMHO, you need to wrap the contents of this function inside a >>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else >>>> from messing with CPU hotplug at the same time. >>> >>> If I understand your comment by mistake, please let me know. >>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block >>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and >>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. >>> >> >> You are right. Sorry, I overlooked that. >> >>> + get_online_cpus() >>> + if (cpu_online(pr->id)) { >>> + ret = cpu_down(pr->id); >>> + if (ret) >>> + return ret; >>> + } >>> + put_online_cpus() >>> >>> I think following patch can prevent it correctly. How about the patch? >>> >>> --- >>> drivers/acpi/processor_driver.c | 12 ++++++++++++ >>> kernel/cpu.c | 8 +++++--- >>> 2 files changed, 17 insertions(+), 3 deletions(-) >>> >>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>> =================================================================== >>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 09:59:01.280211202 +0900 >>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 11:05:34.559859236 +0900 >>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s >>> { >>> int ret; >>> >>> +retry: >>> if (cpu_online(pr->id)) { >>> ret = cpu_down(pr->id); >>> if (ret) >>> return ret; >>> } >>> >>> + get_online_cpus(); >>> + /* >>> + * Someone might online the cpu again at this point. So we check that >>> + * cpu has been onlined or not. If cpu is online, we try to offline >>> + * the cpu again. >>> + */ >>> + if (cpu_online(pr->id)) { >> >> How about this: >> if (unlikely(cpu_online(pr->id)) { >> since the probability of this happening is quite small... > > Thanks. I'll update it. > >>> + put_online_cpus(); >>> + goto retry; >>> + } >>> arch_unregister_cpu(pr->id); >>> acpi_unmap_lsapic(pr->id); >>> + put_online_cpus(); >>> return ret; >>> } >> >> This retry logic doesn't look elegant, but I don't see any better method :-( >> >>> #else >>> Index: linux-3.5-rc4/kernel/cpu.c >>> =================================================================== >>> --- linux-3.5-rc4.orig/kernel/cpu.c 2012-07-09 09:59:01.280211202 +0900 >>> +++ linux-3.5-rc4/kernel/cpu.c 2012-07-09 09:59:02.903190965 +0900 >>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in >>> unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; >>> struct task_struct *idle; >>> >>> - if (cpu_online(cpu) || !cpu_present(cpu)) >>> - return -EINVAL; >>> - >>> cpu_hotplug_begin(); >>> >>> + if (cpu_online(cpu) || !cpu_present(cpu)) { >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + >> >> Firstly, why is this change needed? > > I cared the race of hot-remove cpu and _cpu_up(). If I do not change it, > there is the following race. > > hot-remove cpu | _cpu_up() > ------------------------------------- ------------------------------------ > call acpi_processor_handle_eject() | > call cpu_down() | > call get_online_cpus() | > | call cpu_hotplug_begin() and stop here > call arch_unregister_cpu() | > call acpi_unmap_lsapic() | > call put_online_cpus() | > | start and continue _cpu_up() > return acpi_processor_remove() | > continue hot-remove the cpu | > > So _cpu_up() can continue to itself. And hot-remove cpu can also continue > itself. If I change it, I think the race disappears as below: > > hot-remove cpu | _cpu_up() > ----------------------------------------------------------------------- > call acpi_processor_handle_eject() | > call cpu_down() | > call get_online_cpus() | > | call cpu_hotplug_begin() and stop here > call arch_unregister_cpu() | > call acpi_unmap_lsapic() | > cpu's cpu_present is set | > to false by set_cpu_present()| > call put_online_cpus() | > | start _cpu_up() > | check cpu_present() and return -EINVAL > return acpi_processor_remove() | > continue hot-remove the cpu | > > Thus I think the change is necessary. > > Thanks, > Yasuaki Ishimatsu > >> Secondly, if the change is indeed an improvement, then why is it >> in _this_ patch? IMHO, in that case it should be part of a separate patch. I forget to answer the question. As I answered in the above your first question, the fix is related to acpi_processor_handle_eject(). So the fix should be in the patch. Thanks, Yasuaki Ishimatsu >> >> Coming back to my first point, I don't see why this hunk is needed. We >> already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before >> checking the status of the cpu (online or present). And all hotplug >> operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that >> lock. Isn't that enough? Or am I missing something? >> >>> idle = idle_thread_get(cpu); >>> if (IS_ERR(idle)) { >>> ret = PTR_ERR(idle); >>> >> >> Regards, >> Srivatsa S. Bhat >> > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-acpi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/10/2012 05:43 AM, Yasuaki Ishimatsu wrote: > Hi Srivatsa, > > 2012/07/09 20:25, Srivatsa S. Bhat wrote: >> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: >>> Hi Srivatsa, >>> >>> Thank you for your reviewing. >>> >>> 2012/07/06 18:51, Srivatsa S. Bhat wrote: >>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: >>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu. >>>> >>>> Ouch! >>>> >>>>> But in this case, it should return error number since some process may run on >>>>> the cpu. If the cpu has a running process and the cpu is turned the power off, >>>>> the system cannot work well. >>>>> >>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> >>>>> >>>>> --- >>>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ >>>>> 1 file changed, 12 insertions(+), 6 deletions(-) >>>>> >>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>>> =================================================================== >>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-06-25 04:53:04.000000000 +0900 >>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 21:02:58.711285382 +0900 >>>>> @@ -610,7 +610,7 @@ err_free_pr: >>>>> static int acpi_processor_remove(struct acpi_device *device, int type) >>>>> { >>>>> struct acpi_processor *pr = NULL; >>>>> - >>>>> + int ret; >>>>> >>>>> if (!device || !acpi_driver_data(device)) >>>>> return -EINVAL; >>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct >>>>> goto free; >>>>> >>>>> if (type == ACPI_BUS_REMOVAL_EJECT) { >>>>> - if (acpi_processor_handle_eject(pr)) >>>>> - return -EINVAL; >>>>> + ret = acpi_processor_handle_eject(pr); >>>>> + if (ret) >>>>> + return ret; >>>>> } >>>>> >>>>> acpi_processor_power_exit(pr, device); >>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd >>>>> >>>>> static int acpi_processor_handle_eject(struct acpi_processor *pr) >>>>> { >>>>> - if (cpu_online(pr->id)) >>>>> - cpu_down(pr->id); >>>>> + int ret; >>>>> + >>>>> + if (cpu_online(pr->id)) { >>>>> + ret = cpu_down(pr->id); >>>>> + if (ret) >>>>> + return ret; >>>>> + } >>>>> >>>> >>>> Strictly speaking, this is not thorough enough. What prevents someone >>>> from onlining that same cpu again, at this point? >>>> So, IMHO, you need to wrap the contents of this function inside a >>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else >>>> from messing with CPU hotplug at the same time. >>> >>> If I understand your comment by mistake, please let me know. >>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block >>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and >>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. >>> >> >> You are right. Sorry, I overlooked that. >> >>> + get_online_cpus() >>> + if (cpu_online(pr->id)) { >>> + ret = cpu_down(pr->id); >>> + if (ret) >>> + return ret; >>> + } >>> + put_online_cpus() >>> >>> I think following patch can prevent it correctly. How about the patch? >>> >>> --- >>> drivers/acpi/processor_driver.c | 12 ++++++++++++ >>> kernel/cpu.c | 8 +++++--- >>> 2 files changed, 17 insertions(+), 3 deletions(-) >>> >>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>> =================================================================== >>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 09:59:01.280211202 +0900 >>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 11:05:34.559859236 +0900 >>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s >>> { >>> int ret; >>> >>> +retry: >>> if (cpu_online(pr->id)) { >>> ret = cpu_down(pr->id); >>> if (ret) >>> return ret; >>> } >>> >>> + get_online_cpus(); >>> + /* >>> + * Someone might online the cpu again at this point. So we check that >>> + * cpu has been onlined or not. If cpu is online, we try to offline >>> + * the cpu again. >>> + */ >>> + if (cpu_online(pr->id)) { >> >> How about this: >> if (unlikely(cpu_online(pr->id)) { >> since the probability of this happening is quite small... > > Thanks. I'll update it. > >>> + put_online_cpus(); >>> + goto retry; >>> + } >>> arch_unregister_cpu(pr->id); >>> acpi_unmap_lsapic(pr->id); >>> + put_online_cpus(); >>> return ret; >>> } >> >> This retry logic doesn't look elegant, but I don't see any better method :-( >> >>> #else >>> Index: linux-3.5-rc4/kernel/cpu.c >>> =================================================================== >>> --- linux-3.5-rc4.orig/kernel/cpu.c 2012-07-09 09:59:01.280211202 +0900 >>> +++ linux-3.5-rc4/kernel/cpu.c 2012-07-09 09:59:02.903190965 +0900 >>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in >>> unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; >>> struct task_struct *idle; >>> >>> - if (cpu_online(cpu) || !cpu_present(cpu)) >>> - return -EINVAL; >>> - >>> cpu_hotplug_begin(); >>> >>> + if (cpu_online(cpu) || !cpu_present(cpu)) { >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + >> >> Firstly, why is this change needed? > > I cared the race of hot-remove cpu and _cpu_up(). If I do not change it, > there is the following race. > > hot-remove cpu | _cpu_up() > ------------------------------------- ------------------------------------ > call acpi_processor_handle_eject() | > call cpu_down() | > call get_online_cpus() | > | call cpu_hotplug_begin() and stop here > call arch_unregister_cpu() | > call acpi_unmap_lsapic() | > call put_online_cpus() | > | start and continue _cpu_up() > return acpi_processor_remove() | > continue hot-remove the cpu | > > So _cpu_up() can continue to itself. And hot-remove cpu can also continue > itself. If I change it, I think the race disappears as below: > > hot-remove cpu | _cpu_up() > ----------------------------------------------------------------------- > call acpi_processor_handle_eject() | > call cpu_down() | > call get_online_cpus() | > | call cpu_hotplug_begin() and stop here > call arch_unregister_cpu() | > call acpi_unmap_lsapic() | > cpu's cpu_present is set | > to false by set_cpu_present()| > call put_online_cpus() | > | start _cpu_up() > | check cpu_present() and return -EINVAL > return acpi_processor_remove() | > continue hot-remove the cpu | > > Thus I think the change is necessary. > Thanks for the detailed explanation. I had missed this race condition. Now I see why all the changes in your patch are needed. Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/10/2012 10:44 AM, Yasuaki Ishimatsu wrote: > Hi Srivatsa, > > 2012/07/10 9:13, Yasuaki Ishimatsu wrote: >> Hi Srivatsa, >> >> 2012/07/09 20:25, Srivatsa S. Bhat wrote: >>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: >>>> Hi Srivatsa, >>>> >>>> Thank you for your reviewing. >>>> >>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote: >>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: >>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu. >>>>> >>>>> Ouch! >>>>> >>>>>> But in this case, it should return error number since some process may run on >>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off, >>>>>> the system cannot work well. >>>>>> >>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> >>>>>> >>>>>> --- >>>>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ >>>>>> 1 file changed, 12 insertions(+), 6 deletions(-) >>>>>> >>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>>>> =================================================================== >>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-06-25 04:53:04.000000000 +0900 >>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 21:02:58.711285382 +0900 >>>>>> @@ -610,7 +610,7 @@ err_free_pr: >>>>>> static int acpi_processor_remove(struct acpi_device *device, int type) >>>>>> { >>>>>> struct acpi_processor *pr = NULL; >>>>>> - >>>>>> + int ret; >>>>>> >>>>>> if (!device || !acpi_driver_data(device)) >>>>>> return -EINVAL; >>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct >>>>>> goto free; >>>>>> >>>>>> if (type == ACPI_BUS_REMOVAL_EJECT) { >>>>>> - if (acpi_processor_handle_eject(pr)) >>>>>> - return -EINVAL; >>>>>> + ret = acpi_processor_handle_eject(pr); >>>>>> + if (ret) >>>>>> + return ret; >>>>>> } >>>>>> >>>>>> acpi_processor_power_exit(pr, device); >>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd >>>>>> >>>>>> static int acpi_processor_handle_eject(struct acpi_processor *pr) >>>>>> { >>>>>> - if (cpu_online(pr->id)) >>>>>> - cpu_down(pr->id); >>>>>> + int ret; >>>>>> + >>>>>> + if (cpu_online(pr->id)) { >>>>>> + ret = cpu_down(pr->id); >>>>>> + if (ret) >>>>>> + return ret; >>>>>> + } >>>>>> >>>>> >>>>> Strictly speaking, this is not thorough enough. What prevents someone >>>>> from onlining that same cpu again, at this point? >>>>> So, IMHO, you need to wrap the contents of this function inside a >>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else >>>>> from messing with CPU hotplug at the same time. >>>> >>>> If I understand your comment by mistake, please let me know. >>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block >>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and >>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. >>>> >>> >>> You are right. Sorry, I overlooked that. >>> >>>> + get_online_cpus() >>>> + if (cpu_online(pr->id)) { >>>> + ret = cpu_down(pr->id); >>>> + if (ret) >>>> + return ret; >>>> + } >>>> + put_online_cpus() >>>> >>>> I think following patch can prevent it correctly. How about the patch? >>>> >>>> --- >>>> drivers/acpi/processor_driver.c | 12 ++++++++++++ >>>> kernel/cpu.c | 8 +++++--- >>>> 2 files changed, 17 insertions(+), 3 deletions(-) >>>> >>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>> =================================================================== >>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 09:59:01.280211202 +0900 >>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 11:05:34.559859236 +0900 >>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s >>>> { >>>> int ret; >>>> >>>> +retry: >>>> if (cpu_online(pr->id)) { >>>> ret = cpu_down(pr->id); >>>> if (ret) >>>> return ret; >>>> } >>>> >>>> + get_online_cpus(); >>>> + /* >>>> + * Someone might online the cpu again at this point. So we check that >>>> + * cpu has been onlined or not. If cpu is online, we try to offline >>>> + * the cpu again. >>>> + */ >>>> + if (cpu_online(pr->id)) { >>> >>> How about this: >>> if (unlikely(cpu_online(pr->id)) { >>> since the probability of this happening is quite small... >> >> Thanks. I'll update it. >> >>>> + put_online_cpus(); >>>> + goto retry; >>>> + } >>>> arch_unregister_cpu(pr->id); >>>> acpi_unmap_lsapic(pr->id); >>>> + put_online_cpus(); >>>> return ret; >>>> } >>> >>> This retry logic doesn't look elegant, but I don't see any better method :-( >>> >>>> #else >>>> Index: linux-3.5-rc4/kernel/cpu.c >>>> =================================================================== >>>> --- linux-3.5-rc4.orig/kernel/cpu.c 2012-07-09 09:59:01.280211202 +0900 >>>> +++ linux-3.5-rc4/kernel/cpu.c 2012-07-09 09:59:02.903190965 +0900 >>>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in >>>> unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; >>>> struct task_struct *idle; >>>> >>>> - if (cpu_online(cpu) || !cpu_present(cpu)) >>>> - return -EINVAL; >>>> - >>>> cpu_hotplug_begin(); >>>> >>>> + if (cpu_online(cpu) || !cpu_present(cpu)) { >>>> + ret = -EINVAL; >>>> + goto out; >>>> + } >>>> + >>> >>> Firstly, why is this change needed? >> >> I cared the race of hot-remove cpu and _cpu_up(). If I do not change it, >> there is the following race. >> >> hot-remove cpu | _cpu_up() >> ------------------------------------- ------------------------------------ >> call acpi_processor_handle_eject() | >> call cpu_down() | >> call get_online_cpus() | >> | call cpu_hotplug_begin() and stop here >> call arch_unregister_cpu() | >> call acpi_unmap_lsapic() | >> call put_online_cpus() | >> | start and continue _cpu_up() >> return acpi_processor_remove() | >> continue hot-remove the cpu | >> >> So _cpu_up() can continue to itself. And hot-remove cpu can also continue >> itself. If I change it, I think the race disappears as below: >> >> hot-remove cpu | _cpu_up() >> ----------------------------------------------------------------------- >> call acpi_processor_handle_eject() | >> call cpu_down() | >> call get_online_cpus() | >> | call cpu_hotplug_begin() and stop here >> call arch_unregister_cpu() | >> call acpi_unmap_lsapic() | >> cpu's cpu_present is set | >> to false by set_cpu_present()| >> call put_online_cpus() | >> | start _cpu_up() >> | check cpu_present() and return -EINVAL >> return acpi_processor_remove() | >> continue hot-remove the cpu | >> >> Thus I think the change is necessary. >> >> Thanks, >> Yasuaki Ishimatsu >> >>> Secondly, if the change is indeed an improvement, then why is it >>> in _this_ patch? IMHO, in that case it should be part of a separate patch. > > I forget to answer the question. > As I answered in the above your first question, the fix is related to > acpi_processor_handle_eject(). So the fix should be in the patch. > Yep, got it now. Thanks! Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote: > Hi Toshi, > > 2012/07/10 6:15, Toshi Kani wrote: >> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote: >>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: >>>> Hi Srivatsa, >>>> >>>> Thank you for your reviewing. >>>> >>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote: >>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: >>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to >>>>>> remove the cpu. >>>>> >>>>> Ouch! >>>>> >>>>>> But in this case, it should return error number since some process >>>>>> may run on >>>>>> the cpu. If the cpu has a running process and the cpu is turned >>>>>> the power off, >>>>>> the system cannot work well. >>>>>> >>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> >>>>>> >>>>>> --- >>>>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ >>>>>> 1 file changed, 12 insertions(+), 6 deletions(-) >>>>>> >>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>>>> =================================================================== >>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c >>>>>> 2012-06-25 04:53:04.000000000 +0900 >>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 >>>>>> 21:02:58.711285382 +0900 >>>>>> @@ -610,7 +610,7 @@ err_free_pr: >>>>>> static int acpi_processor_remove(struct acpi_device *device, >>>>>> int type) >>>>>> { >>>>>> struct acpi_processor *pr = NULL; >>>>>> - >>>>>> + int ret; >>>>>> >>>>>> if (!device || !acpi_driver_data(device)) >>>>>> return -EINVAL; >>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct >>>>>> goto free; >>>>>> >>>>>> if (type == ACPI_BUS_REMOVAL_EJECT) { >>>>>> - if (acpi_processor_handle_eject(pr)) >>>>>> - return -EINVAL; >>>>>> + ret = acpi_processor_handle_eject(pr); >>>>>> + if (ret) >>>>>> + return ret; >>>>>> } >>>>>> >>>>>> acpi_processor_power_exit(pr, device); >>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd >>>>>> >>>>>> static int acpi_processor_handle_eject(struct acpi_processor *pr) >>>>>> { >>>>>> - if (cpu_online(pr->id)) >>>>>> - cpu_down(pr->id); >>>>>> + int ret; >>>>>> + >>>>>> + if (cpu_online(pr->id)) { >>>>>> + ret = cpu_down(pr->id); >>>>>> + if (ret) >>>>>> + return ret; >>>>>> + } >>>>>> >>>>> >>>>> Strictly speaking, this is not thorough enough. What prevents someone >>>>> from onlining that same cpu again, at this point? >>>>> So, IMHO, you need to wrap the contents of this function inside a >>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else >>>>> from messing with CPU hotplug at the same time. >>>> >>>> If I understand your comment by mistake, please let me know. >>>> If the contents is wrapped a inside >>>> get_online_cpus()/put_online_cpus() block >>>> as below, cpu_down() will stop since cpu_down() calls >>>> cpu_hotplug_begin() and >>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. >>>> >>> >>> You are right. Sorry, I overlooked that. >>> >>>> + get_online_cpus() >>>> + if (cpu_online(pr->id)) { >>>> + ret = cpu_down(pr->id); >>>> + if (ret) >>>> + return ret; >>>> + } >>>> + put_online_cpus() >>>> >>>> I think following patch can prevent it correctly. How about the patch? >>>> >>>> --- >>>> drivers/acpi/processor_driver.c | 12 ++++++++++++ >>>> kernel/cpu.c | 8 +++++--- >>>> 2 files changed, 17 insertions(+), 3 deletions(-) >>>> >>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>> =================================================================== >>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 >>>> 09:59:01.280211202 +0900 >>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 >>>> 11:05:34.559859236 +0900 >>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s >>>> { >>>> int ret; >>>> >>>> +retry: >>>> if (cpu_online(pr->id)) { >>>> ret = cpu_down(pr->id); >>>> if (ret) >>>> return ret; >>>> } >>>> >>>> + get_online_cpus(); >>>> + /* >>>> + * Someone might online the cpu again at this point. So we >>>> check that >>>> + * cpu has been onlined or not. If cpu is online, we try to >>>> offline >>>> + * the cpu again. >>>> + */ >>>> + if (cpu_online(pr->id)) { >>> >>> How about this: >>> if (unlikely(cpu_online(pr->id)) { >>> since the probability of this happening is quite small... >>> >>>> + put_online_cpus(); >>>> + goto retry; >>>> + } >>>> arch_unregister_cpu(pr->id); >>>> acpi_unmap_lsapic(pr->id); >>>> + put_online_cpus(); >>>> return ret; >>>> } >>> >>> This retry logic doesn't look elegant, but I don't see any better >>> method :-( >> >> Another possible option is to fail the request instead of retrying it. > > Good idea!! I'll update it. > >> >> It would be quite challenging to allow on-lining and off-lining >> operations to run concurrently. In fact, even if we close this window, >> there is yet another window right after the new put_online_cpus() call. > > I think if we close the window, another window does not open. > acpi_unmap_lsapic() sets cpu_present mask to false before new > put_online_cpus() > is called. So even if _cpu_up() is called, the function returns -EINAVL by > following added code. > > @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in > unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; > struct task_struct *idle; > > - if (cpu_online(cpu) || !cpu_present(cpu)) > - return -EINVAL; > - > cpu_hotplug_begin(); > > + if (cpu_online(cpu) || !cpu_present(cpu)) { > + ret = -EINVAL; > + goto out; > + } > + > Right. Yasuaki's patch will ensure that there are no more race conditions because it does the cpu_present() check after taking the cpu_hotplug.lock. So I think it is safe and still doable from the kernel's perspective. But the question is, "should we do it?" I think Toshi's suggestion of failing the hot-remove request (if we find that the cpu has been onlined again by some other task) sounds like a good idea for another reason: cpu hotplug is not initiated by the OS by itself; its requested by the user; so if something is onlining the cpu back again, the user better take a second look and decide what exactly he wants to do with that cpu - whether keep it online or hot-remove it out. Trying to online as well as hot-remove the same cpu simultaneously sounds like a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case (ie., failing that request) would give a warning to the user and a chance to reflect upon what exactly he wants to do with the cpu. So, IMHO, we should protect against the race condition (between cpu_up and hot-remove) but choose to fail the hot-remove request, and add a comment saying why we chose to fail the request, even though we could have gone ahead and completed it. Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Srivatsa, 2012/07/10 16:57, Srivatsa S. Bhat wrote: > On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote: >> Hi Toshi, >> >> 2012/07/10 6:15, Toshi Kani wrote: >>> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote: >>>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: >>>>> Hi Srivatsa, >>>>> >>>>> Thank you for your reviewing. >>>>> >>>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote: >>>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: >>>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to >>>>>>> remove the cpu. >>>>>> >>>>>> Ouch! >>>>>> >>>>>>> But in this case, it should return error number since some process >>>>>>> may run on >>>>>>> the cpu. If the cpu has a running process and the cpu is turned >>>>>>> the power off, >>>>>>> the system cannot work well. >>>>>>> >>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> >>>>>>> >>>>>>> --- >>>>>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ >>>>>>> 1 file changed, 12 insertions(+), 6 deletions(-) >>>>>>> >>>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>>>>> =================================================================== >>>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c >>>>>>> 2012-06-25 04:53:04.000000000 +0900 >>>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 >>>>>>> 21:02:58.711285382 +0900 >>>>>>> @@ -610,7 +610,7 @@ err_free_pr: >>>>>>> static int acpi_processor_remove(struct acpi_device *device, >>>>>>> int type) >>>>>>> { >>>>>>> struct acpi_processor *pr = NULL; >>>>>>> - >>>>>>> + int ret; >>>>>>> >>>>>>> if (!device || !acpi_driver_data(device)) >>>>>>> return -EINVAL; >>>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct >>>>>>> goto free; >>>>>>> >>>>>>> if (type == ACPI_BUS_REMOVAL_EJECT) { >>>>>>> - if (acpi_processor_handle_eject(pr)) >>>>>>> - return -EINVAL; >>>>>>> + ret = acpi_processor_handle_eject(pr); >>>>>>> + if (ret) >>>>>>> + return ret; >>>>>>> } >>>>>>> >>>>>>> acpi_processor_power_exit(pr, device); >>>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd >>>>>>> >>>>>>> static int acpi_processor_handle_eject(struct acpi_processor *pr) >>>>>>> { >>>>>>> - if (cpu_online(pr->id)) >>>>>>> - cpu_down(pr->id); >>>>>>> + int ret; >>>>>>> + >>>>>>> + if (cpu_online(pr->id)) { >>>>>>> + ret = cpu_down(pr->id); >>>>>>> + if (ret) >>>>>>> + return ret; >>>>>>> + } >>>>>>> >>>>>> >>>>>> Strictly speaking, this is not thorough enough. What prevents someone >>>>>> from onlining that same cpu again, at this point? >>>>>> So, IMHO, you need to wrap the contents of this function inside a >>>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else >>>>>> from messing with CPU hotplug at the same time. >>>>> >>>>> If I understand your comment by mistake, please let me know. >>>>> If the contents is wrapped a inside >>>>> get_online_cpus()/put_online_cpus() block >>>>> as below, cpu_down() will stop since cpu_down() calls >>>>> cpu_hotplug_begin() and >>>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. >>>>> >>>> >>>> You are right. Sorry, I overlooked that. >>>> >>>>> + get_online_cpus() >>>>> + if (cpu_online(pr->id)) { >>>>> + ret = cpu_down(pr->id); >>>>> + if (ret) >>>>> + return ret; >>>>> + } >>>>> + put_online_cpus() >>>>> >>>>> I think following patch can prevent it correctly. How about the patch? >>>>> >>>>> --- >>>>> drivers/acpi/processor_driver.c | 12 ++++++++++++ >>>>> kernel/cpu.c | 8 +++++--- >>>>> 2 files changed, 17 insertions(+), 3 deletions(-) >>>>> >>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c >>>>> =================================================================== >>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 >>>>> 09:59:01.280211202 +0900 >>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 >>>>> 11:05:34.559859236 +0900 >>>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s >>>>> { >>>>> int ret; >>>>> >>>>> +retry: >>>>> if (cpu_online(pr->id)) { >>>>> ret = cpu_down(pr->id); >>>>> if (ret) >>>>> return ret; >>>>> } >>>>> >>>>> + get_online_cpus(); >>>>> + /* >>>>> + * Someone might online the cpu again at this point. So we >>>>> check that >>>>> + * cpu has been onlined or not. If cpu is online, we try to >>>>> offline >>>>> + * the cpu again. >>>>> + */ >>>>> + if (cpu_online(pr->id)) { >>>> >>>> How about this: >>>> if (unlikely(cpu_online(pr->id)) { >>>> since the probability of this happening is quite small... >>>> >>>>> + put_online_cpus(); >>>>> + goto retry; >>>>> + } >>>>> arch_unregister_cpu(pr->id); >>>>> acpi_unmap_lsapic(pr->id); >>>>> + put_online_cpus(); >>>>> return ret; >>>>> } >>>> >>>> This retry logic doesn't look elegant, but I don't see any better >>>> method :-( >>> >>> Another possible option is to fail the request instead of retrying it. >> >> Good idea!! I'll update it. >> >>> >>> It would be quite challenging to allow on-lining and off-lining >>> operations to run concurrently. In fact, even if we close this window, >>> there is yet another window right after the new put_online_cpus() call. >> >> I think if we close the window, another window does not open. >> acpi_unmap_lsapic() sets cpu_present mask to false before new >> put_online_cpus() >> is called. So even if _cpu_up() is called, the function returns -EINAVL by >> following added code. >> >> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in >> unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; >> struct task_struct *idle; >> >> - if (cpu_online(cpu) || !cpu_present(cpu)) >> - return -EINVAL; >> - >> cpu_hotplug_begin(); >> >> + if (cpu_online(cpu) || !cpu_present(cpu)) { >> + ret = -EINVAL; >> + goto out; >> + } >> + >> > > Right. Yasuaki's patch will ensure that there are no more race conditions > because it does the cpu_present() check after taking the cpu_hotplug.lock. > So I think it is safe and still doable from the kernel's perspective. > > But the question is, "should we do it?" I think Toshi's suggestion of failing > the hot-remove request (if we find that the cpu has been onlined again by some > other task) sounds like a good idea for another reason: cpu hotplug is not > initiated by the OS by itself; its requested by the user; so if something is > onlining the cpu back again, the user better take a second look and decide > what exactly he wants to do with that cpu - whether keep it online or > hot-remove it out. I think so too. Failing the hot-remove request is good idea. > > Trying to online as well as hot-remove the same cpu simultaneously sounds like > a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case > (ie., failing that request) would give a warning to the user and a chance to > reflect upon what exactly he wants to do with the cpu. > > So, IMHO, we should protect against the race condition (between cpu_up and > hot-remove) but choose to fail the hot-remove request, and add a comment saying > why we chose to fail the request, even though we could have gone ahead and > completed it. I have already sent 2nd version of the patch. But the warning message is not included in the patch. So I will add the warning message into 3rd version. Thanks, Yasuaki Ishimatsu > Regards, > Srivatsa S. Bhat > > -- > To unsubscribe from this list: send the line "unsubscribe linux-acpi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2012-07-10 at 13:27 +0530, Srivatsa S. Bhat wrote: > On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote: > > Hi Toshi, > > > > 2012/07/10 6:15, Toshi Kani wrote: > >> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote: > >>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote: > >>>> Hi Srivatsa, > >>>> > >>>> Thank you for your reviewing. > >>>> > >>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote: > >>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote: > >>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to > >>>>>> remove the cpu. > >>>>> > >>>>> Ouch! > >>>>> > >>>>>> But in this case, it should return error number since some process > >>>>>> may run on > >>>>>> the cpu. If the cpu has a running process and the cpu is turned > >>>>>> the power off, > >>>>>> the system cannot work well. > >>>>>> > >>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> > >>>>>> > >>>>>> --- > >>>>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------ > >>>>>> 1 file changed, 12 insertions(+), 6 deletions(-) > >>>>>> > >>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c > >>>>>> =================================================================== > >>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c > >>>>>> 2012-06-25 04:53:04.000000000 +0900 > >>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05 > >>>>>> 21:02:58.711285382 +0900 > >>>>>> @@ -610,7 +610,7 @@ err_free_pr: > >>>>>> static int acpi_processor_remove(struct acpi_device *device, > >>>>>> int type) > >>>>>> { > >>>>>> struct acpi_processor *pr = NULL; > >>>>>> - > >>>>>> + int ret; > >>>>>> > >>>>>> if (!device || !acpi_driver_data(device)) > >>>>>> return -EINVAL; > >>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct > >>>>>> goto free; > >>>>>> > >>>>>> if (type == ACPI_BUS_REMOVAL_EJECT) { > >>>>>> - if (acpi_processor_handle_eject(pr)) > >>>>>> - return -EINVAL; > >>>>>> + ret = acpi_processor_handle_eject(pr); > >>>>>> + if (ret) > >>>>>> + return ret; > >>>>>> } > >>>>>> > >>>>>> acpi_processor_power_exit(pr, device); > >>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd > >>>>>> > >>>>>> static int acpi_processor_handle_eject(struct acpi_processor *pr) > >>>>>> { > >>>>>> - if (cpu_online(pr->id)) > >>>>>> - cpu_down(pr->id); > >>>>>> + int ret; > >>>>>> + > >>>>>> + if (cpu_online(pr->id)) { > >>>>>> + ret = cpu_down(pr->id); > >>>>>> + if (ret) > >>>>>> + return ret; > >>>>>> + } > >>>>>> > >>>>> > >>>>> Strictly speaking, this is not thorough enough. What prevents someone > >>>>> from onlining that same cpu again, at this point? > >>>>> So, IMHO, you need to wrap the contents of this function inside a > >>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else > >>>>> from messing with CPU hotplug at the same time. > >>>> > >>>> If I understand your comment by mistake, please let me know. > >>>> If the contents is wrapped a inside > >>>> get_online_cpus()/put_online_cpus() block > >>>> as below, cpu_down() will stop since cpu_down() calls > >>>> cpu_hotplug_begin() and > >>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0. > >>>> > >>> > >>> You are right. Sorry, I overlooked that. > >>> > >>>> + get_online_cpus() > >>>> + if (cpu_online(pr->id)) { > >>>> + ret = cpu_down(pr->id); > >>>> + if (ret) > >>>> + return ret; > >>>> + } > >>>> + put_online_cpus() > >>>> > >>>> I think following patch can prevent it correctly. How about the patch? > >>>> > >>>> --- > >>>> drivers/acpi/processor_driver.c | 12 ++++++++++++ > >>>> kernel/cpu.c | 8 +++++--- > >>>> 2 files changed, 17 insertions(+), 3 deletions(-) > >>>> > >>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c > >>>> =================================================================== > >>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 > >>>> 09:59:01.280211202 +0900 > >>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 > >>>> 11:05:34.559859236 +0900 > >>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s > >>>> { > >>>> int ret; > >>>> > >>>> +retry: > >>>> if (cpu_online(pr->id)) { > >>>> ret = cpu_down(pr->id); > >>>> if (ret) > >>>> return ret; > >>>> } > >>>> > >>>> + get_online_cpus(); > >>>> + /* > >>>> + * Someone might online the cpu again at this point. So we > >>>> check that > >>>> + * cpu has been onlined or not. If cpu is online, we try to > >>>> offline > >>>> + * the cpu again. > >>>> + */ > >>>> + if (cpu_online(pr->id)) { > >>> > >>> How about this: > >>> if (unlikely(cpu_online(pr->id)) { > >>> since the probability of this happening is quite small... > >>> > >>>> + put_online_cpus(); > >>>> + goto retry; > >>>> + } > >>>> arch_unregister_cpu(pr->id); > >>>> acpi_unmap_lsapic(pr->id); > >>>> + put_online_cpus(); > >>>> return ret; > >>>> } > >>> > >>> This retry logic doesn't look elegant, but I don't see any better > >>> method :-( > >> > >> Another possible option is to fail the request instead of retrying it. > > > > Good idea!! I'll update it. Sounds good. Thanks Yasuaki for updating it. > >> It would be quite challenging to allow on-lining and off-lining > >> operations to run concurrently. In fact, even if we close this window, > >> there is yet another window right after the new put_online_cpus() call. > > > > I think if we close the window, another window does not open. > > acpi_unmap_lsapic() sets cpu_present mask to false before new > > put_online_cpus() > > is called. So even if _cpu_up() is called, the function returns -EINAVL by > > following added code. > > > > @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in > > unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; > > struct task_struct *idle; > > > > - if (cpu_online(cpu) || !cpu_present(cpu)) > > - return -EINVAL; > > - > > cpu_hotplug_begin(); > > > > + if (cpu_online(cpu) || !cpu_present(cpu)) { > > + ret = -EINVAL; > > + goto out; > > + } > > + > > > > Right. Yasuaki's patch will ensure that there are no more race conditions > because it does the cpu_present() check after taking the cpu_hotplug.lock. > So I think it is safe and still doable from the kernel's perspective. > > But the question is, "should we do it?" I think Toshi's suggestion of failing > the hot-remove request (if we find that the cpu has been onlined again by some > other task) sounds like a good idea for another reason: cpu hotplug is not > initiated by the OS by itself; its requested by the user; so if something is > onlining the cpu back again, the user better take a second look and decide > what exactly he wants to do with that cpu - whether keep it online or > hot-remove it out. > > Trying to online as well as hot-remove the same cpu simultaneously sounds like > a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case > (ie., failing that request) would give a warning to the user and a chance to > reflect upon what exactly he wants to do with the cpu. > > So, IMHO, we should protect against the race condition (between cpu_up and > hot-remove) but choose to fail the hot-remove request, and add a comment saying > why we chose to fail the request, even though we could have gone ahead and > completed it. Agreed. Thanks for the nice summary, Srivatsa! Thanks, -Toshi > Regards, > Srivatsa S. Bhat > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: linux-3.5-rc4/drivers/acpi/processor_driver.c =================================================================== --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09 09:59:01.280211202 +0900 +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09 11:05:34.559859236 +0900 @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s { int ret; +retry: if (cpu_online(pr->id)) { ret = cpu_down(pr->id); if (ret) return ret; } + get_online_cpus(); + /* + * Someone might online the cpu again at this point. So we check that + * cpu has been onlined or not. If cpu is online, we try to offline + * the cpu again. + */ + if (cpu_online(pr->id)) { + put_online_cpus(); + goto retry; + } arch_unregister_cpu(pr->id); acpi_unmap_lsapic(pr->id); + put_online_cpus(); return ret; } #else Index: linux-3.5-rc4/kernel/cpu.c =================================================================== --- linux-3.5-rc4.orig/kernel/cpu.c 2012-07-09 09:59:01.280211202 +0900 +++ linux-3.5-rc4/kernel/cpu.c 2012-07-09 09:59:02.903190965 +0900 @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; struct task_struct *idle; - if (cpu_online(cpu) || !cpu_present(cpu)) - return -EINVAL; - cpu_hotplug_begin(); + if (cpu_online(cpu) || !cpu_present(cpu)) { + ret = -EINVAL; + goto out; + } + idle = idle_thread_get(cpu); if (IS_ERR(idle)) { ret = PTR_ERR(idle);