diff mbox

[1/2] acpi : cpu hot-remove returns error number when cpu_down() fails

Message ID 4FFA4269.5050808@jp.fujitsu.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yasuaki Ishimatsu July 9, 2012, 2:31 a.m. UTC
Hi Srivatsa,

Thank you for your reviewing.

2012/07/06 18:51, Srivatsa S. Bhat wrote:
> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
> 
> Ouch!
> 
>> But in this case, it should return error number since some process may run on
>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>> the system cannot work well.
>>
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>
>> ---
>>   drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>   1 file changed, 12 insertions(+), 6 deletions(-)
>>
>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>> @@ -610,7 +610,7 @@ err_free_pr:
>>   static int acpi_processor_remove(struct acpi_device *device, int type)
>>   {
>>   	struct acpi_processor *pr = NULL;
>> -
>> +	int ret;
>>
>>   	if (!device || !acpi_driver_data(device))
>>   		return -EINVAL;
>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>   		goto free;
>>
>>   	if (type == ACPI_BUS_REMOVAL_EJECT) {
>> -		if (acpi_processor_handle_eject(pr))
>> -			return -EINVAL;
>> +		ret = acpi_processor_handle_eject(pr);
>> +		if (ret)
>> +			return ret;
>>   	}
>>
>>   	acpi_processor_power_exit(pr, device);
>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>
>>   static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>   {
>> -	if (cpu_online(pr->id))
>> -		cpu_down(pr->id);
>> +	int ret;
>> +
>> +	if (cpu_online(pr->id)) {
>> +		ret = cpu_down(pr->id);
>> +		if (ret)
>> +			return ret;
>> +	}
>>
> 
> Strictly speaking, this is not thorough enough. What prevents someone
> from onlining that same cpu again, at this point?
> So, IMHO, you need to wrap the contents of this function inside a
> get_online_cpus()/put_online_cpus() block, to prevent anyone else
> from messing with CPU hotplug at the same time.

If I understand your comment by mistake, please let me know.
If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.

+	get_online_cpus()
+	if (cpu_online(pr->id)) {
+		ret = cpu_down(pr->id);
+		if (ret)
+			return ret;
+	}
+	put_online_cpus()

I think following patch can prevent it correctly. How about the patch?

---
 drivers/acpi/processor_driver.c |   12 ++++++++++++
 kernel/cpu.c                    |    8 +++++---
 2 files changed, 17 insertions(+), 3 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Srivatsa S. Bhat July 9, 2012, 11:25 a.m. UTC | #1
On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> Thank you for your reviewing.
> 
> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>
>> Ouch!
>>
>>> But in this case, it should return error number since some process may run on
>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>> the system cannot work well.
>>>
>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>
>>> ---
>>>   drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>   1 file changed, 12 insertions(+), 6 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>   static int acpi_processor_remove(struct acpi_device *device, int type)
>>>   {
>>>   	struct acpi_processor *pr = NULL;
>>> -
>>> +	int ret;
>>>
>>>   	if (!device || !acpi_driver_data(device))
>>>   		return -EINVAL;
>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>   		goto free;
>>>
>>>   	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>> -		if (acpi_processor_handle_eject(pr))
>>> -			return -EINVAL;
>>> +		ret = acpi_processor_handle_eject(pr);
>>> +		if (ret)
>>> +			return ret;
>>>   	}
>>>
>>>   	acpi_processor_power_exit(pr, device);
>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>
>>>   static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>   {
>>> -	if (cpu_online(pr->id))
>>> -		cpu_down(pr->id);
>>> +	int ret;
>>> +
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>>
>>
>> Strictly speaking, this is not thorough enough. What prevents someone
>> from onlining that same cpu again, at this point?
>> So, IMHO, you need to wrap the contents of this function inside a
>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>> from messing with CPU hotplug at the same time.
> 
> If I understand your comment by mistake, please let me know.
> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
> 

You are right. Sorry, I overlooked that.

> +	get_online_cpus()
> +	if (cpu_online(pr->id)) {
> +		ret = cpu_down(pr->id);
> +		if (ret)
> +			return ret;
> +	}
> +	put_online_cpus()
> 
> I think following patch can prevent it correctly. How about the patch?
>
> ---
>  drivers/acpi/processor_driver.c |   12 ++++++++++++
>  kernel/cpu.c                    |    8 +++++---
>  2 files changed, 17 insertions(+), 3 deletions(-)
> 
> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> ===================================================================
> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>  {
>  	int ret;
> 
> +retry:
>  	if (cpu_online(pr->id)) {
>  		ret = cpu_down(pr->id);
>  		if (ret)
>  			return ret;
>  	}
> 
> +	get_online_cpus();
> +	/*
> +	 * Someone might online the cpu again at this point. So we check that
> +	 * cpu has been onlined or not. If cpu is online, we try to offline
> +	 * the cpu again.
> +	 */
> +	if (cpu_online(pr->id)) {

How about this:
	if (unlikely(cpu_online(pr->id)) {
since the probability of this happening is quite small...

> +		put_online_cpus();
> +		goto retry;
> +	}
>  	arch_unregister_cpu(pr->id);
>  	acpi_unmap_lsapic(pr->id);
> +	put_online_cpus();
>  	return ret;
>  }

This retry logic doesn't look elegant, but I don't see any better method :-(

>  #else
> Index: linux-3.5-rc4/kernel/cpu.c
> ===================================================================
> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>  	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>  	struct task_struct *idle;
> 
> -	if (cpu_online(cpu) || !cpu_present(cpu))
> -		return -EINVAL;
> -
>  	cpu_hotplug_begin();
> 
> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +

Firstly, why is this change needed?
Secondly, if the change is indeed an improvement, then why is it
in _this_ patch? IMHO, in that case it should be part of a separate patch.

Coming back to my first point, I don't see why this hunk is needed. We
already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before
checking the status of the cpu (online or present). And all hotplug
operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that
lock. Isn't that enough? Or am I missing something?

>  	idle = idle_thread_get(cpu);
>  	if (IS_ERR(idle)) {
>  		ret = PTR_ERR(idle);
> 
 
Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshi Kani July 9, 2012, 9:15 p.m. UTC | #2
On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
> > Hi Srivatsa,
> > 
> > Thank you for your reviewing.
> > 
> > 2012/07/06 18:51, Srivatsa S. Bhat wrote:
> >> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
> >>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
> >>
> >> Ouch!
> >>
> >>> But in this case, it should return error number since some process may run on
> >>> the cpu. If the cpu has a running process and the cpu is turned the power off,
> >>> the system cannot work well.
> >>>
> >>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> >>>
> >>> ---
> >>>   drivers/acpi/processor_driver.c |   18 ++++++++++++------
> >>>   1 file changed, 12 insertions(+), 6 deletions(-)
> >>>
> >>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> >>> ===================================================================
> >>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
> >>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
> >>> @@ -610,7 +610,7 @@ err_free_pr:
> >>>   static int acpi_processor_remove(struct acpi_device *device, int type)
> >>>   {
> >>>   	struct acpi_processor *pr = NULL;
> >>> -
> >>> +	int ret;
> >>>
> >>>   	if (!device || !acpi_driver_data(device))
> >>>   		return -EINVAL;
> >>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
> >>>   		goto free;
> >>>
> >>>   	if (type == ACPI_BUS_REMOVAL_EJECT) {
> >>> -		if (acpi_processor_handle_eject(pr))
> >>> -			return -EINVAL;
> >>> +		ret = acpi_processor_handle_eject(pr);
> >>> +		if (ret)
> >>> +			return ret;
> >>>   	}
> >>>
> >>>   	acpi_processor_power_exit(pr, device);
> >>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
> >>>
> >>>   static int acpi_processor_handle_eject(struct acpi_processor *pr)
> >>>   {
> >>> -	if (cpu_online(pr->id))
> >>> -		cpu_down(pr->id);
> >>> +	int ret;
> >>> +
> >>> +	if (cpu_online(pr->id)) {
> >>> +		ret = cpu_down(pr->id);
> >>> +		if (ret)
> >>> +			return ret;
> >>> +	}
> >>>
> >>
> >> Strictly speaking, this is not thorough enough. What prevents someone
> >> from onlining that same cpu again, at this point?
> >> So, IMHO, you need to wrap the contents of this function inside a
> >> get_online_cpus()/put_online_cpus() block, to prevent anyone else
> >> from messing with CPU hotplug at the same time.
> > 
> > If I understand your comment by mistake, please let me know.
> > If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
> > as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
> > cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
> > 
> 
> You are right. Sorry, I overlooked that.
> 
> > +	get_online_cpus()
> > +	if (cpu_online(pr->id)) {
> > +		ret = cpu_down(pr->id);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +	put_online_cpus()
> > 
> > I think following patch can prevent it correctly. How about the patch?
> >
> > ---
> >  drivers/acpi/processor_driver.c |   12 ++++++++++++
> >  kernel/cpu.c                    |    8 +++++---
> >  2 files changed, 17 insertions(+), 3 deletions(-)
> > 
> > Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> > ===================================================================
> > --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
> > +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
> > @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
> >  {
> >  	int ret;
> > 
> > +retry:
> >  	if (cpu_online(pr->id)) {
> >  		ret = cpu_down(pr->id);
> >  		if (ret)
> >  			return ret;
> >  	}
> > 
> > +	get_online_cpus();
> > +	/*
> > +	 * Someone might online the cpu again at this point. So we check that
> > +	 * cpu has been onlined or not. If cpu is online, we try to offline
> > +	 * the cpu again.
> > +	 */
> > +	if (cpu_online(pr->id)) {
> 
> How about this:
> 	if (unlikely(cpu_online(pr->id)) {
> since the probability of this happening is quite small...
> 
> > +		put_online_cpus();
> > +		goto retry;
> > +	}
> >  	arch_unregister_cpu(pr->id);
> >  	acpi_unmap_lsapic(pr->id);
> > +	put_online_cpus();
> >  	return ret;
> >  }
> 
> This retry logic doesn't look elegant, but I don't see any better method :-(

Another possible option is to fail the request instead of retrying it.

It would be quite challenging to allow on-lining and off-lining
operations to run concurrently.  In fact, even if we close this window,
there is yet another window right after the new put_online_cpus() call.
This CPU may become online before calling _EJ0 in the case of
hot-remove.

This goes beyond the scope of this patch, but IMHO, we should serialize
in the request level.  That is, a new on-lining request should not be
allowed to proceed until the current request is complete.  This scheme
only allows a single operation at a time per OS instance, but I do not
think it is a big issue.

Serializing in the request level is also important when supporting
container hot-remove, which can remove multiple children objects under a
parent container object.  For instance, a Node hot-remove may also
remove multiple processors underneath of it.  This kind of the
operations has to make sure that all children objects are remained
off-line until it ejects the parent object.  

Thanks,
-Toshi


--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yasuaki Ishimatsu July 10, 2012, 12:13 a.m. UTC | #3
Hi Srivatsa,

2012/07/09 20:25, Srivatsa S. Bhat wrote:
> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>> Hi Srivatsa,
>>
>> Thank you for your reviewing.
>>
>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>
>>> Ouch!
>>>
>>>> But in this case, it should return error number since some process may run on
>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>> the system cannot work well.
>>>>
>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>
>>>> ---
>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>
>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>    static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>    {
>>>>    	struct acpi_processor *pr = NULL;
>>>> -
>>>> +	int ret;
>>>>
>>>>    	if (!device || !acpi_driver_data(device))
>>>>    		return -EINVAL;
>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>    		goto free;
>>>>
>>>>    	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>> -		if (acpi_processor_handle_eject(pr))
>>>> -			return -EINVAL;
>>>> +		ret = acpi_processor_handle_eject(pr);
>>>> +		if (ret)
>>>> +			return ret;
>>>>    	}
>>>>
>>>>    	acpi_processor_power_exit(pr, device);
>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>
>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>    {
>>>> -	if (cpu_online(pr->id))
>>>> -		cpu_down(pr->id);
>>>> +	int ret;
>>>> +
>>>> +	if (cpu_online(pr->id)) {
>>>> +		ret = cpu_down(pr->id);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +	}
>>>>
>>>
>>> Strictly speaking, this is not thorough enough. What prevents someone
>>> from onlining that same cpu again, at this point?
>>> So, IMHO, you need to wrap the contents of this function inside a
>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>> from messing with CPU hotplug at the same time.
>>
>> If I understand your comment by mistake, please let me know.
>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>
> 
> You are right. Sorry, I overlooked that.
> 
>> +	get_online_cpus()
>> +	if (cpu_online(pr->id)) {
>> +		ret = cpu_down(pr->id);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	put_online_cpus()
>>
>> I think following patch can prevent it correctly. How about the patch?
>>
>> ---
>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>   kernel/cpu.c                    |    8 +++++---
>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>
>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>   {
>>   	int ret;
>>
>> +retry:
>>   	if (cpu_online(pr->id)) {
>>   		ret = cpu_down(pr->id);
>>   		if (ret)
>>   			return ret;
>>   	}
>>
>> +	get_online_cpus();
>> +	/*
>> +	 * Someone might online the cpu again at this point. So we check that
>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>> +	 * the cpu again.
>> +	 */
>> +	if (cpu_online(pr->id)) {
> 
> How about this:
> 	if (unlikely(cpu_online(pr->id)) {
> since the probability of this happening is quite small...

Thanks. I'll update it.

>> +		put_online_cpus();
>> +		goto retry;
>> +	}
>>   	arch_unregister_cpu(pr->id);
>>   	acpi_unmap_lsapic(pr->id);
>> +	put_online_cpus();
>>   	return ret;
>>   }
> 
> This retry logic doesn't look elegant, but I don't see any better method :-(
> 
>>   #else
>> Index: linux-3.5-rc4/kernel/cpu.c
>> ===================================================================
>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>   	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>   	struct task_struct *idle;
>>
>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>> -		return -EINVAL;
>> -
>>   	cpu_hotplug_begin();
>>
>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
> 
> Firstly, why is this change needed?

I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
there is the following race.

hot-remove cpu                         |  _cpu_up()
------------------------------------- ------------------------------------
call acpi_processor_handle_eject()     |
     call cpu_down()                   |
     call get_online_cpus()            |
                                       | call cpu_hotplug_begin() and stop here
     call arch_unregister_cpu()        |
     call acpi_unmap_lsapic()          |
     call put_online_cpus()            |
                                       | start and continue _cpu_up()
     return acpi_processor_remove()    |
continue hot-remove the cpu            |

So _cpu_up() can continue to itself. And hot-remove cpu can also continue
itself. If I change it, I think the race disappears as below:

hot-remove cpu                         | _cpu_up()
-----------------------------------------------------------------------
call acpi_processor_handle_eject()     |
     call cpu_down()                   |
     call get_online_cpus()            |
                                       | call cpu_hotplug_begin() and stop here
     call arch_unregister_cpu()        |
     call acpi_unmap_lsapic()          |
          cpu's cpu_present is set     |
	  to false by set_cpu_present()|
     call put_online_cpus()            |
                                       | start _cpu_up()
				       | check cpu_present() and return -EINVAL
     return acpi_processor_remove()    |
continue hot-remove the cpu            |

Thus I think the change is necessary.

Thanks,
Yasuaki Ishimatsu

> Secondly, if the change is indeed an improvement, then why is it
> in _this_ patch? IMHO, in that case it should be part of a separate patch.
> 
> Coming back to my first point, I don't see why this hunk is needed. We
> already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before
> checking the status of the cpu (online or present). And all hotplug
> operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that
> lock. Isn't that enough? Or am I missing something?
> 
>>   	idle = idle_thread_get(cpu);
>>   	if (IS_ERR(idle)) {
>>   		ret = PTR_ERR(idle);
>>
>   
> Regards,
> Srivatsa S. Bhat
> 



--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yasuaki Ishimatsu July 10, 2012, 4:57 a.m. UTC | #4
Hi Toshi,

2012/07/10 6:15, Toshi Kani wrote:
> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>> Hi Srivatsa,
>>>
>>> Thank you for your reviewing.
>>>
>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>
>>>> Ouch!
>>>>
>>>>> But in this case, it should return error number since some process may run on
>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>> the system cannot work well.
>>>>>
>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>
>>>>> ---
>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>    static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>    {
>>>>>    	struct acpi_processor *pr = NULL;
>>>>> -
>>>>> +	int ret;
>>>>>
>>>>>    	if (!device || !acpi_driver_data(device))
>>>>>    		return -EINVAL;
>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>    		goto free;
>>>>>
>>>>>    	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>> -			return -EINVAL;
>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>>    	}
>>>>>
>>>>>    	acpi_processor_power_exit(pr, device);
>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>
>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>    {
>>>>> -	if (cpu_online(pr->id))
>>>>> -		cpu_down(pr->id);
>>>>> +	int ret;
>>>>> +
>>>>> +	if (cpu_online(pr->id)) {
>>>>> +		ret = cpu_down(pr->id);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +	}
>>>>>
>>>>
>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>> from onlining that same cpu again, at this point?
>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>> from messing with CPU hotplug at the same time.
>>>
>>> If I understand your comment by mistake, please let me know.
>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>
>>
>> You are right. Sorry, I overlooked that.
>>
>>> +	get_online_cpus()
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	put_online_cpus()
>>>
>>> I think following patch can prevent it correctly. How about the patch?
>>>
>>> ---
>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>   kernel/cpu.c                    |    8 +++++---
>>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>   {
>>>   	int ret;
>>>
>>> +retry:
>>>   	if (cpu_online(pr->id)) {
>>>   		ret = cpu_down(pr->id);
>>>   		if (ret)
>>>   			return ret;
>>>   	}
>>>
>>> +	get_online_cpus();
>>> +	/*
>>> +	 * Someone might online the cpu again at this point. So we check that
>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>> +	 * the cpu again.
>>> +	 */
>>> +	if (cpu_online(pr->id)) {
>>
>> How about this:
>> 	if (unlikely(cpu_online(pr->id)) {
>> since the probability of this happening is quite small...
>>
>>> +		put_online_cpus();
>>> +		goto retry;
>>> +	}
>>>   	arch_unregister_cpu(pr->id);
>>>   	acpi_unmap_lsapic(pr->id);
>>> +	put_online_cpus();
>>>   	return ret;
>>>   }
>>
>> This retry logic doesn't look elegant, but I don't see any better method :-(
>
> Another possible option is to fail the request instead of retrying it.

Good idea!! I'll update it.

>
> It would be quite challenging to allow on-lining and off-lining
> operations to run concurrently.  In fact, even if we close this window,
> there is yet another window right after the new put_online_cpus() call.

I think if we close the window, another window does not open.
acpi_unmap_lsapic() sets cpu_present mask to false before new put_online_cpus()
is called. So even if _cpu_up() is called, the function returns -EINAVL by
following added code.

@@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
  	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
  	struct task_struct *idle;

-	if (cpu_online(cpu) || !cpu_present(cpu))
-		return -EINVAL;
-
  	cpu_hotplug_begin();

+	if (cpu_online(cpu) || !cpu_present(cpu)) {
+		ret = -EINVAL;
+		goto out;
+	}
+

Thanks,
Yasuaki Ishimatsu

> This CPU may become online before calling _EJ0 in the case of
> hot-remove.
>
> This goes beyond the scope of this patch, but IMHO, we should serialize
> in the request level.  That is, a new on-lining request should not be
> allowed to proceed until the current request is complete.  This scheme
> only allows a single operation at a time per OS instance, but I do not
> think it is a big issue.
>
> Serializing in the request level is also important when supporting
> container hot-remove, which can remove multiple children objects under a
> parent container object.  For instance, a Node hot-remove may also
> remove multiple processors underneath of it.  This kind of the
> operations has to make sure that all children objects are remained
> off-line until it ejects the parent object.
>
> Thanks,
> -Toshi
>
>



--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yasuaki Ishimatsu July 10, 2012, 5:14 a.m. UTC | #5
Hi Srivatsa,

2012/07/10 9:13, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> 2012/07/09 20:25, Srivatsa S. Bhat wrote:
>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>> Hi Srivatsa,
>>>
>>> Thank you for your reviewing.
>>>
>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>
>>>> Ouch!
>>>>
>>>>> But in this case, it should return error number since some process may run on
>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>> the system cannot work well.
>>>>>
>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>
>>>>> ---
>>>>>     drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>     1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>     static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>     {
>>>>>     	struct acpi_processor *pr = NULL;
>>>>> -
>>>>> +	int ret;
>>>>>
>>>>>     	if (!device || !acpi_driver_data(device))
>>>>>     		return -EINVAL;
>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>     		goto free;
>>>>>
>>>>>     	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>> -			return -EINVAL;
>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>>     	}
>>>>>
>>>>>     	acpi_processor_power_exit(pr, device);
>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>
>>>>>     static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>     {
>>>>> -	if (cpu_online(pr->id))
>>>>> -		cpu_down(pr->id);
>>>>> +	int ret;
>>>>> +
>>>>> +	if (cpu_online(pr->id)) {
>>>>> +		ret = cpu_down(pr->id);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +	}
>>>>>
>>>>
>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>> from onlining that same cpu again, at this point?
>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>> from messing with CPU hotplug at the same time.
>>>
>>> If I understand your comment by mistake, please let me know.
>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>
>>
>> You are right. Sorry, I overlooked that.
>>
>>> +	get_online_cpus()
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	put_online_cpus()
>>>
>>> I think following patch can prevent it correctly. How about the patch?
>>>
>>> ---
>>>    drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>    kernel/cpu.c                    |    8 +++++---
>>>    2 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>    {
>>>    	int ret;
>>>
>>> +retry:
>>>    	if (cpu_online(pr->id)) {
>>>    		ret = cpu_down(pr->id);
>>>    		if (ret)
>>>    			return ret;
>>>    	}
>>>
>>> +	get_online_cpus();
>>> +	/*
>>> +	 * Someone might online the cpu again at this point. So we check that
>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>> +	 * the cpu again.
>>> +	 */
>>> +	if (cpu_online(pr->id)) {
>>
>> How about this:
>> 	if (unlikely(cpu_online(pr->id)) {
>> since the probability of this happening is quite small...
> 
> Thanks. I'll update it.
> 
>>> +		put_online_cpus();
>>> +		goto retry;
>>> +	}
>>>    	arch_unregister_cpu(pr->id);
>>>    	acpi_unmap_lsapic(pr->id);
>>> +	put_online_cpus();
>>>    	return ret;
>>>    }
>>
>> This retry logic doesn't look elegant, but I don't see any better method :-(
>>
>>>    #else
>>> Index: linux-3.5-rc4/kernel/cpu.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>>    	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>>    	struct task_struct *idle;
>>>
>>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>>> -		return -EINVAL;
>>> -
>>>    	cpu_hotplug_begin();
>>>
>>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>>> +		ret = -EINVAL;
>>> +		goto out;
>>> +	}
>>> +
>>
>> Firstly, why is this change needed?
> 
> I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
> there is the following race.
> 
> hot-remove cpu                         |  _cpu_up()
> ------------------------------------- ------------------------------------
> call acpi_processor_handle_eject()     |
>       call cpu_down()                   |
>       call get_online_cpus()            |
>                                         | call cpu_hotplug_begin() and stop here
>       call arch_unregister_cpu()        |
>       call acpi_unmap_lsapic()          |
>       call put_online_cpus()            |
>                                         | start and continue _cpu_up()
>       return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> So _cpu_up() can continue to itself. And hot-remove cpu can also continue
> itself. If I change it, I think the race disappears as below:
> 
> hot-remove cpu                         | _cpu_up()
> -----------------------------------------------------------------------
> call acpi_processor_handle_eject()     |
>       call cpu_down()                   |
>       call get_online_cpus()            |
>                                         | call cpu_hotplug_begin() and stop here
>       call arch_unregister_cpu()        |
>       call acpi_unmap_lsapic()          |
>            cpu's cpu_present is set     |
> 	  to false by set_cpu_present()|
>       call put_online_cpus()            |
>                                         | start _cpu_up()
> 				       | check cpu_present() and return -EINVAL
>       return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> Thus I think the change is necessary.
> 
> Thanks,
> Yasuaki Ishimatsu
> 
>> Secondly, if the change is indeed an improvement, then why is it
>> in _this_ patch? IMHO, in that case it should be part of a separate patch.

I forget to answer the question.
As I answered in the above your first question, the fix is related to
acpi_processor_handle_eject(). So the fix should be in the patch.

Thanks,
Yasuaki Ishimatsu

>>
>> Coming back to my first point, I don't see why this hunk is needed. We
>> already take the cpu_add_remove_lock (cpu_maps_update_begin/end) before
>> checking the status of the cpu (online or present). And all hotplug
>> operations (cpu_up/cpu_down/disable|enable_nonboot_cpus) go through that
>> lock. Isn't that enough? Or am I missing something?
>>
>>>    	idle = idle_thread_get(cpu);
>>>    	if (IS_ERR(idle)) {
>>>    		ret = PTR_ERR(idle);
>>>
>>    
>> Regards,
>> Srivatsa S. Bhat
>>
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat July 10, 2012, 6:51 a.m. UTC | #6
On 07/10/2012 05:43 AM, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> 2012/07/09 20:25, Srivatsa S. Bhat wrote:
>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>> Hi Srivatsa,
>>>
>>> Thank you for your reviewing.
>>>
>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>
>>>> Ouch!
>>>>
>>>>> But in this case, it should return error number since some process may run on
>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>> the system cannot work well.
>>>>>
>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>
>>>>> ---
>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>    static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>    {
>>>>>    	struct acpi_processor *pr = NULL;
>>>>> -
>>>>> +	int ret;
>>>>>
>>>>>    	if (!device || !acpi_driver_data(device))
>>>>>    		return -EINVAL;
>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>    		goto free;
>>>>>
>>>>>    	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>> -			return -EINVAL;
>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>>    	}
>>>>>
>>>>>    	acpi_processor_power_exit(pr, device);
>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>
>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>    {
>>>>> -	if (cpu_online(pr->id))
>>>>> -		cpu_down(pr->id);
>>>>> +	int ret;
>>>>> +
>>>>> +	if (cpu_online(pr->id)) {
>>>>> +		ret = cpu_down(pr->id);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +	}
>>>>>
>>>>
>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>> from onlining that same cpu again, at this point?
>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>> from messing with CPU hotplug at the same time.
>>>
>>> If I understand your comment by mistake, please let me know.
>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>
>>
>> You are right. Sorry, I overlooked that.
>>
>>> +	get_online_cpus()
>>> +	if (cpu_online(pr->id)) {
>>> +		ret = cpu_down(pr->id);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	put_online_cpus()
>>>
>>> I think following patch can prevent it correctly. How about the patch?
>>>
>>> ---
>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>   kernel/cpu.c                    |    8 +++++---
>>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>   {
>>>   	int ret;
>>>
>>> +retry:
>>>   	if (cpu_online(pr->id)) {
>>>   		ret = cpu_down(pr->id);
>>>   		if (ret)
>>>   			return ret;
>>>   	}
>>>
>>> +	get_online_cpus();
>>> +	/*
>>> +	 * Someone might online the cpu again at this point. So we check that
>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>> +	 * the cpu again.
>>> +	 */
>>> +	if (cpu_online(pr->id)) {
>>
>> How about this:
>> 	if (unlikely(cpu_online(pr->id)) {
>> since the probability of this happening is quite small...
> 
> Thanks. I'll update it.
> 
>>> +		put_online_cpus();
>>> +		goto retry;
>>> +	}
>>>   	arch_unregister_cpu(pr->id);
>>>   	acpi_unmap_lsapic(pr->id);
>>> +	put_online_cpus();
>>>   	return ret;
>>>   }
>>
>> This retry logic doesn't look elegant, but I don't see any better method :-(
>>
>>>   #else
>>> Index: linux-3.5-rc4/kernel/cpu.c
>>> ===================================================================
>>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>>   	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>>   	struct task_struct *idle;
>>>
>>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>>> -		return -EINVAL;
>>> -
>>>   	cpu_hotplug_begin();
>>>
>>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>>> +		ret = -EINVAL;
>>> +		goto out;
>>> +	}
>>> +
>>
>> Firstly, why is this change needed?
> 
> I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
> there is the following race.
> 
> hot-remove cpu                         |  _cpu_up()
> ------------------------------------- ------------------------------------
> call acpi_processor_handle_eject()     |
>      call cpu_down()                   |
>      call get_online_cpus()            |
>                                        | call cpu_hotplug_begin() and stop here
>      call arch_unregister_cpu()        |
>      call acpi_unmap_lsapic()          |
>      call put_online_cpus()            |
>                                        | start and continue _cpu_up()
>      return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> So _cpu_up() can continue to itself. And hot-remove cpu can also continue
> itself. If I change it, I think the race disappears as below:
> 
> hot-remove cpu                         | _cpu_up()
> -----------------------------------------------------------------------
> call acpi_processor_handle_eject()     |
>      call cpu_down()                   |
>      call get_online_cpus()            |
>                                        | call cpu_hotplug_begin() and stop here
>      call arch_unregister_cpu()        |
>      call acpi_unmap_lsapic()          |
>           cpu's cpu_present is set     |
> 	  to false by set_cpu_present()|
>      call put_online_cpus()            |
>                                        | start _cpu_up()
> 				       | check cpu_present() and return -EINVAL
>      return acpi_processor_remove()    |
> continue hot-remove the cpu            |
> 
> Thus I think the change is necessary.
> 

Thanks for the detailed explanation. I had missed this race condition.
Now I see why all the changes in your patch are needed. 

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat July 10, 2012, 6:52 a.m. UTC | #7
On 07/10/2012 10:44 AM, Yasuaki Ishimatsu wrote:
> Hi Srivatsa,
> 
> 2012/07/10 9:13, Yasuaki Ishimatsu wrote:
>> Hi Srivatsa,
>>
>> 2012/07/09 20:25, Srivatsa S. Bhat wrote:
>>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>>> Hi Srivatsa,
>>>>
>>>> Thank you for your reviewing.
>>>>
>>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to remove the cpu.
>>>>>
>>>>> Ouch!
>>>>>
>>>>>> But in this case, it should return error number since some process may run on
>>>>>> the cpu. If the cpu has a running process and the cpu is turned the power off,
>>>>>> the system cannot work well.
>>>>>>
>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>
>>>>>> ---
>>>>>>     drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>>     1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-06-25 04:53:04.000000000 +0900
>>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-05 21:02:58.711285382 +0900
>>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>>     static int acpi_processor_remove(struct acpi_device *device, int type)
>>>>>>     {
>>>>>>     	struct acpi_processor *pr = NULL;
>>>>>> -
>>>>>> +	int ret;
>>>>>>
>>>>>>     	if (!device || !acpi_driver_data(device))
>>>>>>     		return -EINVAL;
>>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>>     		goto free;
>>>>>>
>>>>>>     	if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>>> -		if (acpi_processor_handle_eject(pr))
>>>>>> -			return -EINVAL;
>>>>>> +		ret = acpi_processor_handle_eject(pr);
>>>>>> +		if (ret)
>>>>>> +			return ret;
>>>>>>     	}
>>>>>>
>>>>>>     	acpi_processor_power_exit(pr, device);
>>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>>
>>>>>>     static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>>     {
>>>>>> -	if (cpu_online(pr->id))
>>>>>> -		cpu_down(pr->id);
>>>>>> +	int ret;
>>>>>> +
>>>>>> +	if (cpu_online(pr->id)) {
>>>>>> +		ret = cpu_down(pr->id);
>>>>>> +		if (ret)
>>>>>> +			return ret;
>>>>>> +	}
>>>>>>
>>>>>
>>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>>> from onlining that same cpu again, at this point?
>>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>>> from messing with CPU hotplug at the same time.
>>>>
>>>> If I understand your comment by mistake, please let me know.
>>>> If the contents is wrapped a inside get_online_cpus()/put_online_cpus() block
>>>> as below, cpu_down() will stop since cpu_down() calls cpu_hotplug_begin() and
>>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>>
>>>
>>> You are right. Sorry, I overlooked that.
>>>
>>>> +	get_online_cpus()
>>>> +	if (cpu_online(pr->id)) {
>>>> +		ret = cpu_down(pr->id);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +	}
>>>> +	put_online_cpus()
>>>>
>>>> I think following patch can prevent it correctly. How about the patch?
>>>>
>>>> ---
>>>>    drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>>    kernel/cpu.c                    |    8 +++++---
>>>>    2 files changed, 17 insertions(+), 3 deletions(-)
>>>>
>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
>>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>>    {
>>>>    	int ret;
>>>>
>>>> +retry:
>>>>    	if (cpu_online(pr->id)) {
>>>>    		ret = cpu_down(pr->id);
>>>>    		if (ret)
>>>>    			return ret;
>>>>    	}
>>>>
>>>> +	get_online_cpus();
>>>> +	/*
>>>> +	 * Someone might online the cpu again at this point. So we check that
>>>> +	 * cpu has been onlined or not. If cpu is online, we try to offline
>>>> +	 * the cpu again.
>>>> +	 */
>>>> +	if (cpu_online(pr->id)) {
>>>
>>> How about this:
>>> 	if (unlikely(cpu_online(pr->id)) {
>>> since the probability of this happening is quite small...
>>
>> Thanks. I'll update it.
>>
>>>> +		put_online_cpus();
>>>> +		goto retry;
>>>> +	}
>>>>    	arch_unregister_cpu(pr->id);
>>>>    	acpi_unmap_lsapic(pr->id);
>>>> +	put_online_cpus();
>>>>    	return ret;
>>>>    }
>>>
>>> This retry logic doesn't look elegant, but I don't see any better method :-(
>>>
>>>>    #else
>>>> Index: linux-3.5-rc4/kernel/cpu.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
>>>> +++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
>>>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>>>    	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>>>    	struct task_struct *idle;
>>>>
>>>> -	if (cpu_online(cpu) || !cpu_present(cpu))
>>>> -		return -EINVAL;
>>>> -
>>>>    	cpu_hotplug_begin();
>>>>
>>>> +	if (cpu_online(cpu) || !cpu_present(cpu)) {
>>>> +		ret = -EINVAL;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>
>>> Firstly, why is this change needed?
>>
>> I cared the race of hot-remove cpu and _cpu_up(). If I do not change it,
>> there is the following race.
>>
>> hot-remove cpu                         |  _cpu_up()
>> ------------------------------------- ------------------------------------
>> call acpi_processor_handle_eject()     |
>>       call cpu_down()                   |
>>       call get_online_cpus()            |
>>                                         | call cpu_hotplug_begin() and stop here
>>       call arch_unregister_cpu()        |
>>       call acpi_unmap_lsapic()          |
>>       call put_online_cpus()            |
>>                                         | start and continue _cpu_up()
>>       return acpi_processor_remove()    |
>> continue hot-remove the cpu            |
>>
>> So _cpu_up() can continue to itself. And hot-remove cpu can also continue
>> itself. If I change it, I think the race disappears as below:
>>
>> hot-remove cpu                         | _cpu_up()
>> -----------------------------------------------------------------------
>> call acpi_processor_handle_eject()     |
>>       call cpu_down()                   |
>>       call get_online_cpus()            |
>>                                         | call cpu_hotplug_begin() and stop here
>>       call arch_unregister_cpu()        |
>>       call acpi_unmap_lsapic()          |
>>            cpu's cpu_present is set     |
>> 	  to false by set_cpu_present()|
>>       call put_online_cpus()            |
>>                                         | start _cpu_up()
>> 				       | check cpu_present() and return -EINVAL
>>       return acpi_processor_remove()    |
>> continue hot-remove the cpu            |
>>
>> Thus I think the change is necessary.
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>>> Secondly, if the change is indeed an improvement, then why is it
>>> in _this_ patch? IMHO, in that case it should be part of a separate patch.
> 
> I forget to answer the question.
> As I answered in the above your first question, the fix is related to
> acpi_processor_handle_eject(). So the fix should be in the patch.
>

Yep, got it now. Thanks!

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat July 10, 2012, 7:57 a.m. UTC | #8
On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
> Hi Toshi,
> 
> 2012/07/10 6:15, Toshi Kani wrote:
>> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
>>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>>> Hi Srivatsa,
>>>>
>>>> Thank you for your reviewing.
>>>>
>>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to
>>>>>> remove the cpu.
>>>>>
>>>>> Ouch!
>>>>>
>>>>>> But in this case, it should return error number since some process
>>>>>> may run on
>>>>>> the cpu. If the cpu has a running process and the cpu is turned
>>>>>> the power off,
>>>>>> the system cannot work well.
>>>>>>
>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>
>>>>>> ---
>>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>>> ===================================================================
>>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c   
>>>>>> 2012-06-25 04:53:04.000000000 +0900
>>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-05
>>>>>> 21:02:58.711285382 +0900
>>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>>    static int acpi_processor_remove(struct acpi_device *device,
>>>>>> int type)
>>>>>>    {
>>>>>>        struct acpi_processor *pr = NULL;
>>>>>> -
>>>>>> +    int ret;
>>>>>>
>>>>>>        if (!device || !acpi_driver_data(device))
>>>>>>            return -EINVAL;
>>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>>            goto free;
>>>>>>
>>>>>>        if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>>> -        if (acpi_processor_handle_eject(pr))
>>>>>> -            return -EINVAL;
>>>>>> +        ret = acpi_processor_handle_eject(pr);
>>>>>> +        if (ret)
>>>>>> +            return ret;
>>>>>>        }
>>>>>>
>>>>>>        acpi_processor_power_exit(pr, device);
>>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>>
>>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>>    {
>>>>>> -    if (cpu_online(pr->id))
>>>>>> -        cpu_down(pr->id);
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    if (cpu_online(pr->id)) {
>>>>>> +        ret = cpu_down(pr->id);
>>>>>> +        if (ret)
>>>>>> +            return ret;
>>>>>> +    }
>>>>>>
>>>>>
>>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>>> from onlining that same cpu again, at this point?
>>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>>> from messing with CPU hotplug at the same time.
>>>>
>>>> If I understand your comment by mistake, please let me know.
>>>> If the contents is wrapped a inside
>>>> get_online_cpus()/put_online_cpus() block
>>>> as below, cpu_down() will stop since cpu_down() calls
>>>> cpu_hotplug_begin() and
>>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>>
>>>
>>> You are right. Sorry, I overlooked that.
>>>
>>>> +    get_online_cpus()
>>>> +    if (cpu_online(pr->id)) {
>>>> +        ret = cpu_down(pr->id);
>>>> +        if (ret)
>>>> +            return ret;
>>>> +    }
>>>> +    put_online_cpus()
>>>>
>>>> I think following patch can prevent it correctly. How about the patch?
>>>>
>>>> ---
>>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>>   kernel/cpu.c                    |    8 +++++---
>>>>   2 files changed, 17 insertions(+), 3 deletions(-)
>>>>
>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>> ===================================================================
>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c    2012-07-09
>>>> 09:59:01.280211202 +0900
>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-09
>>>> 11:05:34.559859236 +0900
>>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>>   {
>>>>       int ret;
>>>>
>>>> +retry:
>>>>       if (cpu_online(pr->id)) {
>>>>           ret = cpu_down(pr->id);
>>>>           if (ret)
>>>>               return ret;
>>>>       }
>>>>
>>>> +    get_online_cpus();
>>>> +    /*
>>>> +     * Someone might online the cpu again at this point. So we
>>>> check that
>>>> +     * cpu has been onlined or not. If cpu is online, we try to
>>>> offline
>>>> +     * the cpu again.
>>>> +     */
>>>> +    if (cpu_online(pr->id)) {
>>>
>>> How about this:
>>>     if (unlikely(cpu_online(pr->id)) {
>>> since the probability of this happening is quite small...
>>>
>>>> +        put_online_cpus();
>>>> +        goto retry;
>>>> +    }
>>>>       arch_unregister_cpu(pr->id);
>>>>       acpi_unmap_lsapic(pr->id);
>>>> +    put_online_cpus();
>>>>       return ret;
>>>>   }
>>>
>>> This retry logic doesn't look elegant, but I don't see any better
>>> method :-(
>>
>> Another possible option is to fail the request instead of retrying it.
> 
> Good idea!! I'll update it.
> 
>>
>> It would be quite challenging to allow on-lining and off-lining
>> operations to run concurrently.  In fact, even if we close this window,
>> there is yet another window right after the new put_online_cpus() call.
> 
> I think if we close the window, another window does not open.
> acpi_unmap_lsapic() sets cpu_present mask to false before new
> put_online_cpus()
> is called. So even if _cpu_up() is called, the function returns -EINAVL by
> following added code.
> 
> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>      unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>      struct task_struct *idle;
> 
> -    if (cpu_online(cpu) || !cpu_present(cpu))
> -        return -EINVAL;
> -
>      cpu_hotplug_begin();
> 
> +    if (cpu_online(cpu) || !cpu_present(cpu)) {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +
> 

Right. Yasuaki's patch will ensure that there are no more race conditions
because it does the cpu_present() check after taking the cpu_hotplug.lock.
So I think it is safe and still doable from the kernel's perspective.

But the question is, "should we do it?" I think Toshi's suggestion of failing
the hot-remove request (if we find that the cpu has been onlined again by some
other task) sounds like a good idea for another reason: cpu hotplug is not
initiated by the OS by itself; its requested by the user; so if something is
onlining the cpu back again, the user better take a second look and decide
what exactly he wants to do with that cpu - whether keep it online or
hot-remove it out.

Trying to online as well as hot-remove the same cpu simultaneously sounds like
a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
(ie., failing that request) would give a warning to the user and a chance to
reflect upon what exactly he wants to do with the cpu.

So, IMHO, we should protect against the race condition (between cpu_up and
hot-remove) but choose to fail the hot-remove request, and add a comment saying
why we chose to fail the request, even though we could have gone ahead and
completed it.

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yasuaki Ishimatsu July 10, 2012, 8:23 a.m. UTC | #9
Hi Srivatsa,

2012/07/10 16:57, Srivatsa S. Bhat wrote:
> On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
>> Hi Toshi,
>>
>> 2012/07/10 6:15, Toshi Kani wrote:
>>> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
>>>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
>>>>> Hi Srivatsa,
>>>>>
>>>>> Thank you for your reviewing.
>>>>>
>>>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
>>>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
>>>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to
>>>>>>> remove the cpu.
>>>>>>
>>>>>> Ouch!
>>>>>>
>>>>>>> But in this case, it should return error number since some process
>>>>>>> may run on
>>>>>>> the cpu. If the cpu has a running process and the cpu is turned
>>>>>>> the power off,
>>>>>>> the system cannot work well.
>>>>>>>
>>>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>>>>>
>>>>>>> ---
>>>>>>>     drivers/acpi/processor_driver.c |   18 ++++++++++++------
>>>>>>>     1 file changed, 12 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>>>> ===================================================================
>>>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c
>>>>>>> 2012-06-25 04:53:04.000000000 +0900
>>>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-05
>>>>>>> 21:02:58.711285382 +0900
>>>>>>> @@ -610,7 +610,7 @@ err_free_pr:
>>>>>>>     static int acpi_processor_remove(struct acpi_device *device,
>>>>>>> int type)
>>>>>>>     {
>>>>>>>         struct acpi_processor *pr = NULL;
>>>>>>> -
>>>>>>> +    int ret;
>>>>>>>
>>>>>>>         if (!device || !acpi_driver_data(device))
>>>>>>>             return -EINVAL;
>>>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
>>>>>>>             goto free;
>>>>>>>
>>>>>>>         if (type == ACPI_BUS_REMOVAL_EJECT) {
>>>>>>> -        if (acpi_processor_handle_eject(pr))
>>>>>>> -            return -EINVAL;
>>>>>>> +        ret = acpi_processor_handle_eject(pr);
>>>>>>> +        if (ret)
>>>>>>> +            return ret;
>>>>>>>         }
>>>>>>>
>>>>>>>         acpi_processor_power_exit(pr, device);
>>>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
>>>>>>>
>>>>>>>     static int acpi_processor_handle_eject(struct acpi_processor *pr)
>>>>>>>     {
>>>>>>> -    if (cpu_online(pr->id))
>>>>>>> -        cpu_down(pr->id);
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    if (cpu_online(pr->id)) {
>>>>>>> +        ret = cpu_down(pr->id);
>>>>>>> +        if (ret)
>>>>>>> +            return ret;
>>>>>>> +    }
>>>>>>>
>>>>>>
>>>>>> Strictly speaking, this is not thorough enough. What prevents someone
>>>>>> from onlining that same cpu again, at this point?
>>>>>> So, IMHO, you need to wrap the contents of this function inside a
>>>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
>>>>>> from messing with CPU hotplug at the same time.
>>>>>
>>>>> If I understand your comment by mistake, please let me know.
>>>>> If the contents is wrapped a inside
>>>>> get_online_cpus()/put_online_cpus() block
>>>>> as below, cpu_down() will stop since cpu_down() calls
>>>>> cpu_hotplug_begin() and
>>>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
>>>>>
>>>>
>>>> You are right. Sorry, I overlooked that.
>>>>
>>>>> +    get_online_cpus()
>>>>> +    if (cpu_online(pr->id)) {
>>>>> +        ret = cpu_down(pr->id);
>>>>> +        if (ret)
>>>>> +            return ret;
>>>>> +    }
>>>>> +    put_online_cpus()
>>>>>
>>>>> I think following patch can prevent it correctly. How about the patch?
>>>>>
>>>>> ---
>>>>>    drivers/acpi/processor_driver.c |   12 ++++++++++++
>>>>>    kernel/cpu.c                    |    8 +++++---
>>>>>    2 files changed, 17 insertions(+), 3 deletions(-)
>>>>>
>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
>>>>> ===================================================================
>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c    2012-07-09
>>>>> 09:59:01.280211202 +0900
>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-09
>>>>> 11:05:34.559859236 +0900
>>>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
>>>>>    {
>>>>>        int ret;
>>>>>
>>>>> +retry:
>>>>>        if (cpu_online(pr->id)) {
>>>>>            ret = cpu_down(pr->id);
>>>>>            if (ret)
>>>>>                return ret;
>>>>>        }
>>>>>
>>>>> +    get_online_cpus();
>>>>> +    /*
>>>>> +     * Someone might online the cpu again at this point. So we
>>>>> check that
>>>>> +     * cpu has been onlined or not. If cpu is online, we try to
>>>>> offline
>>>>> +     * the cpu again.
>>>>> +     */
>>>>> +    if (cpu_online(pr->id)) {
>>>>
>>>> How about this:
>>>>      if (unlikely(cpu_online(pr->id)) {
>>>> since the probability of this happening is quite small...
>>>>
>>>>> +        put_online_cpus();
>>>>> +        goto retry;
>>>>> +    }
>>>>>        arch_unregister_cpu(pr->id);
>>>>>        acpi_unmap_lsapic(pr->id);
>>>>> +    put_online_cpus();
>>>>>        return ret;
>>>>>    }
>>>>
>>>> This retry logic doesn't look elegant, but I don't see any better
>>>> method :-(
>>>
>>> Another possible option is to fail the request instead of retrying it.
>>
>> Good idea!! I'll update it.
>>
>>>
>>> It would be quite challenging to allow on-lining and off-lining
>>> operations to run concurrently.  In fact, even if we close this window,
>>> there is yet another window right after the new put_online_cpus() call.
>>
>> I think if we close the window, another window does not open.
>> acpi_unmap_lsapic() sets cpu_present mask to false before new
>> put_online_cpus()
>> is called. So even if _cpu_up() is called, the function returns -EINAVL by
>> following added code.
>>
>> @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
>>       unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
>>       struct task_struct *idle;
>>
>> -    if (cpu_online(cpu) || !cpu_present(cpu))
>> -        return -EINVAL;
>> -
>>       cpu_hotplug_begin();
>>
>> +    if (cpu_online(cpu) || !cpu_present(cpu)) {
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +
>>
>
> Right. Yasuaki's patch will ensure that there are no more race conditions
> because it does the cpu_present() check after taking the cpu_hotplug.lock.
> So I think it is safe and still doable from the kernel's perspective.
>
> But the question is, "should we do it?" I think Toshi's suggestion of failing
> the hot-remove request (if we find that the cpu has been onlined again by some
> other task) sounds like a good idea for another reason: cpu hotplug is not
> initiated by the OS by itself; its requested by the user; so if something is
> onlining the cpu back again, the user better take a second look and decide
> what exactly he wants to do with that cpu - whether keep it online or
> hot-remove it out.

I think so too. Failing the hot-remove request is good idea.

>
> Trying to online as well as hot-remove the same cpu simultaneously sounds like
> a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
> (ie., failing that request) would give a warning to the user and a chance to
> reflect upon what exactly he wants to do with the cpu.
>
> So, IMHO, we should protect against the race condition (between cpu_up and
> hot-remove) but choose to fail the hot-remove request, and add a comment saying
> why we chose to fail the request, even though we could have gone ahead and
> completed it.

I have already sent 2nd version of the patch. But the warning message is
not included in the patch. So I will add the warning message into 3rd
version.

Thanks,
Yasuaki Ishimatsu

> Regards,
> Srivatsa S. Bhat
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshi Kani July 10, 2012, 4:32 p.m. UTC | #10
On Tue, 2012-07-10 at 13:27 +0530, Srivatsa S. Bhat wrote:
> On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
> > Hi Toshi,
> > 
> > 2012/07/10 6:15, Toshi Kani wrote:
> >> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
> >>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
> >>>> Hi Srivatsa,
> >>>>
> >>>> Thank you for your reviewing.
> >>>>
> >>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
> >>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
> >>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to
> >>>>>> remove the cpu.
> >>>>>
> >>>>> Ouch!
> >>>>>
> >>>>>> But in this case, it should return error number since some process
> >>>>>> may run on
> >>>>>> the cpu. If the cpu has a running process and the cpu is turned
> >>>>>> the power off,
> >>>>>> the system cannot work well.
> >>>>>>
> >>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> >>>>>>
> >>>>>> ---
> >>>>>>    drivers/acpi/processor_driver.c |   18 ++++++++++++------
> >>>>>>    1 file changed, 12 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> >>>>>> ===================================================================
> >>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c   
> >>>>>> 2012-06-25 04:53:04.000000000 +0900
> >>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-05
> >>>>>> 21:02:58.711285382 +0900
> >>>>>> @@ -610,7 +610,7 @@ err_free_pr:
> >>>>>>    static int acpi_processor_remove(struct acpi_device *device,
> >>>>>> int type)
> >>>>>>    {
> >>>>>>        struct acpi_processor *pr = NULL;
> >>>>>> -
> >>>>>> +    int ret;
> >>>>>>
> >>>>>>        if (!device || !acpi_driver_data(device))
> >>>>>>            return -EINVAL;
> >>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
> >>>>>>            goto free;
> >>>>>>
> >>>>>>        if (type == ACPI_BUS_REMOVAL_EJECT) {
> >>>>>> -        if (acpi_processor_handle_eject(pr))
> >>>>>> -            return -EINVAL;
> >>>>>> +        ret = acpi_processor_handle_eject(pr);
> >>>>>> +        if (ret)
> >>>>>> +            return ret;
> >>>>>>        }
> >>>>>>
> >>>>>>        acpi_processor_power_exit(pr, device);
> >>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
> >>>>>>
> >>>>>>    static int acpi_processor_handle_eject(struct acpi_processor *pr)
> >>>>>>    {
> >>>>>> -    if (cpu_online(pr->id))
> >>>>>> -        cpu_down(pr->id);
> >>>>>> +    int ret;
> >>>>>> +
> >>>>>> +    if (cpu_online(pr->id)) {
> >>>>>> +        ret = cpu_down(pr->id);
> >>>>>> +        if (ret)
> >>>>>> +            return ret;
> >>>>>> +    }
> >>>>>>
> >>>>>
> >>>>> Strictly speaking, this is not thorough enough. What prevents someone
> >>>>> from onlining that same cpu again, at this point?
> >>>>> So, IMHO, you need to wrap the contents of this function inside a
> >>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
> >>>>> from messing with CPU hotplug at the same time.
> >>>>
> >>>> If I understand your comment by mistake, please let me know.
> >>>> If the contents is wrapped a inside
> >>>> get_online_cpus()/put_online_cpus() block
> >>>> as below, cpu_down() will stop since cpu_down() calls
> >>>> cpu_hotplug_begin() and
> >>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
> >>>>
> >>>
> >>> You are right. Sorry, I overlooked that.
> >>>
> >>>> +    get_online_cpus()
> >>>> +    if (cpu_online(pr->id)) {
> >>>> +        ret = cpu_down(pr->id);
> >>>> +        if (ret)
> >>>> +            return ret;
> >>>> +    }
> >>>> +    put_online_cpus()
> >>>>
> >>>> I think following patch can prevent it correctly. How about the patch?
> >>>>
> >>>> ---
> >>>>   drivers/acpi/processor_driver.c |   12 ++++++++++++
> >>>>   kernel/cpu.c                    |    8 +++++---
> >>>>   2 files changed, 17 insertions(+), 3 deletions(-)
> >>>>
> >>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> >>>> ===================================================================
> >>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c    2012-07-09
> >>>> 09:59:01.280211202 +0900
> >>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c    2012-07-09
> >>>> 11:05:34.559859236 +0900
> >>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
> >>>>   {
> >>>>       int ret;
> >>>>
> >>>> +retry:
> >>>>       if (cpu_online(pr->id)) {
> >>>>           ret = cpu_down(pr->id);
> >>>>           if (ret)
> >>>>               return ret;
> >>>>       }
> >>>>
> >>>> +    get_online_cpus();
> >>>> +    /*
> >>>> +     * Someone might online the cpu again at this point. So we
> >>>> check that
> >>>> +     * cpu has been onlined or not. If cpu is online, we try to
> >>>> offline
> >>>> +     * the cpu again.
> >>>> +     */
> >>>> +    if (cpu_online(pr->id)) {
> >>>
> >>> How about this:
> >>>     if (unlikely(cpu_online(pr->id)) {
> >>> since the probability of this happening is quite small...
> >>>
> >>>> +        put_online_cpus();
> >>>> +        goto retry;
> >>>> +    }
> >>>>       arch_unregister_cpu(pr->id);
> >>>>       acpi_unmap_lsapic(pr->id);
> >>>> +    put_online_cpus();
> >>>>       return ret;
> >>>>   }
> >>>
> >>> This retry logic doesn't look elegant, but I don't see any better
> >>> method :-(
> >>
> >> Another possible option is to fail the request instead of retrying it.
> > 
> > Good idea!! I'll update it.

Sounds good.  Thanks Yasuaki for updating it.


> >> It would be quite challenging to allow on-lining and off-lining
> >> operations to run concurrently.  In fact, even if we close this window,
> >> there is yet another window right after the new put_online_cpus() call.
> > 
> > I think if we close the window, another window does not open.
> > acpi_unmap_lsapic() sets cpu_present mask to false before new
> > put_online_cpus()
> > is called. So even if _cpu_up() is called, the function returns -EINAVL by
> > following added code.
> > 
> > @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
> >      unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
> >      struct task_struct *idle;
> > 
> > -    if (cpu_online(cpu) || !cpu_present(cpu))
> > -        return -EINVAL;
> > -
> >      cpu_hotplug_begin();
> > 
> > +    if (cpu_online(cpu) || !cpu_present(cpu)) {
> > +        ret = -EINVAL;
> > +        goto out;
> > +    }
> > +
> > 
> 
> Right. Yasuaki's patch will ensure that there are no more race conditions
> because it does the cpu_present() check after taking the cpu_hotplug.lock.
> So I think it is safe and still doable from the kernel's perspective.
> 
> But the question is, "should we do it?" I think Toshi's suggestion of failing
> the hot-remove request (if we find that the cpu has been onlined again by some
> other task) sounds like a good idea for another reason: cpu hotplug is not
> initiated by the OS by itself; its requested by the user; so if something is
> onlining the cpu back again, the user better take a second look and decide
> what exactly he wants to do with that cpu - whether keep it online or
> hot-remove it out.
> 
> Trying to online as well as hot-remove the same cpu simultaneously sounds like
> a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
> (ie., failing that request) would give a warning to the user and a chance to
> reflect upon what exactly he wants to do with the cpu.
> 
> So, IMHO, we should protect against the race condition (between cpu_up and
> hot-remove) but choose to fail the hot-remove request, and add a comment saying
> why we chose to fail the request, even though we could have gone ahead and
> completed it.

Agreed.  Thanks for the nice summary, Srivatsa!

Thanks,
-Toshi



> Regards,
> Srivatsa S. Bhat
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c	2012-07-09 09:59:01.280211202 +0900
+++ linux-3.5-rc4/drivers/acpi/processor_driver.c	2012-07-09 11:05:34.559859236 +0900
@@ -844,14 +844,26 @@  static int acpi_processor_handle_eject(s
 {
 	int ret;

+retry:
 	if (cpu_online(pr->id)) {
 		ret = cpu_down(pr->id);
 		if (ret)
 			return ret;
 	}

+	get_online_cpus();
+	/*
+	 * Someone might online the cpu again at this point. So we check that
+	 * cpu has been onlined or not. If cpu is online, we try to offline
+	 * the cpu again.
+	 */
+	if (cpu_online(pr->id)) {
+		put_online_cpus();
+		goto retry;
+	}
 	arch_unregister_cpu(pr->id);
 	acpi_unmap_lsapic(pr->id);
+	put_online_cpus();
 	return ret;
 }
 #else
Index: linux-3.5-rc4/kernel/cpu.c
===================================================================
--- linux-3.5-rc4.orig/kernel/cpu.c	2012-07-09 09:59:01.280211202 +0900
+++ linux-3.5-rc4/kernel/cpu.c	2012-07-09 09:59:02.903190965 +0900
@@ -343,11 +343,13 @@  static int __cpuinit _cpu_up(unsigned in
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct task_struct *idle;

-	if (cpu_online(cpu) || !cpu_present(cpu))
-		return -EINVAL;
-
 	cpu_hotplug_begin();

+	if (cpu_online(cpu) || !cpu_present(cpu)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	idle = idle_thread_get(cpu);
 	if (IS_ERR(idle)) {
 		ret = PTR_ERR(idle);