
[V4,2/3] tick/cpuidle: Initialize hrtimer mode of broadcast

Message ID 20140207080632.17187.80532.stgit@preeti.in.ibm.com (mailing list archive)
State Not Applicable, archived

Commit Message

preeti Feb. 7, 2014, 8:06 a.m. UTC
From: Thomas Gleixner <tglx@linutronix.de>

On some architectures, the local timers stop in certain deep CPU idle
states and an external clock device is used to wake up these CPUs. The
kernel supports such wakeups through the tick broadcast framework, which
uses the external clock device as the wakeup source.

However, not all architectures provide such an external clock device.
This patch adds support to the broadcast framework for waking up CPUs in
deep idle states on such systems by queuing an hrtimer on one of the
CPUs, which then takes care of waking up the CPUs in deep idle.

This patchset introduces a pseudo clock device which the archs can
register as the tick_broadcast_device in the absence of a real external
clock device. Once registered, the broadcast framework works as-is for
these architectures, as long as the archs handle the BROADCAST_ENTER
notification failing for one of the CPUs. That CPU becomes the standby
CPU which handles the wakeup of the CPUs in deep idle, and it *must not
enter deep idle states* itself.

The CPU with the earliest wakeup is chosen as the standby CPU. The
standby CPU therefore moves around dynamically, and so does the hrtimer,
which is queued to fire at the next earliest wakeup time. This is
consistent with the case where an external clock device is present: the
smp affinity of that clock device is set to the CPU with the earliest
wakeup. The patchset also handles hotplug of the standby CPU by moving
the hrtimer onto the CPU that handles the CPU_DEAD notification.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
[Added Changelog and code to handle reprogramming of hrtimer]
---

 include/linux/clockchips.h           |    9 +++
 kernel/time/Makefile                 |    2 -
 kernel/time/tick-broadcast-hrtimer.c |  105 ++++++++++++++++++++++++++++++++++
 kernel/time/tick-broadcast.c         |   54 +++++++++++++++++
 4 files changed, 166 insertions(+), 4 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c
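
For reference, an arch without an external broadcast clock device would
use this roughly as follows (a minimal sketch; the init hook below is a
hypothetical call site, not part of this patch):

/*
 * Sketch: registering the hrtimer based pseudo broadcast device from
 * arch time initialization. Only tick_setup_hrtimer_broadcast() is
 * introduced by this patch; the surrounding function is a hypothetical
 * example.
 */
#include <linux/clockchips.h>

static void __init arch_setup_broadcast(void)
{
	/*
	 * Register the pseudo device. It carries rating 0, so a real
	 * external broadcast device, if one is ever registered, will
	 * take over as the tick_broadcast_device.
	 */
	tick_setup_hrtimer_broadcast();
}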



Comments

Daniel Lezcano Feb. 11, 2014, 10:16 a.m. UTC | #1
On 02/07/2014 09:06 AM, Preeti U Murthy wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> On some architectures, the local timers stop in certain deep CPU idle
> states and an external clock device is used to wake up these CPUs. The
> kernel supports such wakeups through the tick broadcast framework, which
> uses the external clock device as the wakeup source.
>
> However, not all architectures provide such an external clock device.
> This patch adds support to the broadcast framework for waking up CPUs in
> deep idle states on such systems by queuing an hrtimer on one of the
> CPUs, which then takes care of waking up the CPUs in deep idle.
>
> This patchset introduces a pseudo clock device which the archs can
> register as the tick_broadcast_device in the absence of a real external
> clock device. Once registered, the broadcast framework works as-is for
> these architectures, as long as the archs handle the BROADCAST_ENTER
> notification failing for one of the CPUs. That CPU becomes the standby
> CPU which handles the wakeup of the CPUs in deep idle, and it *must not
> enter deep idle states* itself.
>
> The CPU with the earliest wakeup is chosen as the standby CPU. The
> standby CPU therefore moves around dynamically, and so does the hrtimer,
> which is queued to fire at the next earliest wakeup time. This is
> consistent with the case where an external clock device is present: the
> smp affinity of that clock device is set to the CPU with the earliest
> wakeup.

Hi Preeti,

jumping a bit late in the thread...

Setting the smp affinity on the earliest timer should be handled
automatically with the CLOCK_EVT_FEAT_DYNIRQ flag. Did you look at using
this flag?

Another comment is on the overall approach. We enter the cpuidle
framework with a specific state to go to, and it is the tick framework
telling us we mustn't go to this state. IMO the logic is wrong; the
decision not to enter this state should be moved somewhere else.

Why don't you create a cpuidle driver with the shallow idle states
assigned to one cpu (let's say cpu0) and another one with all the deeper
idle states for the rest of the cpus? Using the multiple cpuidle driver
support makes it possible. The timer won't be moving around and a cpu
will be dedicated to act as the broadcast timer.

Wouldn't that make sense and be less intrusive than the patchset you
proposed?


Thomas Gleixner Feb. 11, 2014, 3:58 p.m. UTC | #2
On Tue, 11 Feb 2014, Daniel Lezcano wrote:
> On 02/07/2014 09:06 AM, Preeti U Murthy wrote:
> Setting the smp affinity on the earliest timer should be handled automatically
> with the CLOCK_EVT_FEAT_DYNIRQ flag. Did you look at using this flag?

How should this flag help? Not at all, because the hrtimer-based
broadcast device cannot assign affinities.

> Another comment is on the overall approach. We enter the cpuidle
> framework with a specific state to go to, and it is the tick framework
> telling us we mustn't go to this state. IMO the logic is wrong; the
> decision not to enter this state should be moved somewhere else.
> 
> Why don't you create a cpuidle driver with the shallow idle states assigned to
> one cpu (let's say cpu0) and another one with all the deeper idle states for the
> rest of the cpus? Using the multiple cpuidle driver support makes it
> possible. The timer won't be moving around and a cpu will be dedicated to act
> as the broadcast timer.
> 
> Wouldn't that make sense and be less intrusive than the patchset you proposed?

How do you arm the broadcast timer on CPU0 from CPU1? You can't!

You cannot access the cpu local timer on a different cpu. So you would
have to send an IPI over to CPU0 so that it can reevaluate and
schedule the broadcast. That's even more backwards than telling the
cpuidle code that the CPU is not in a state to go deep.

Thanks,

	tglx


preeti Feb. 11, 2014, 4:09 p.m. UTC | #3
Hi Daniel,

Thank you very much for the review.

On 02/11/2014 03:46 PM, Daniel Lezcano wrote:
> On 02/07/2014 09:06 AM, Preeti U Murthy wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>>
>> On some architectures, the local timers stop in certain deep CPU idle
>> states and an external clock device is used to wake up these CPUs.
>> The kernel supports such wakeups through the tick broadcast framework,
>> which uses the external clock device as the wakeup source.
>>
>> However, not all architectures provide such an external clock device.
>> This patch adds support to the broadcast framework for waking up CPUs
>> in deep idle states on such systems by queuing an hrtimer on one of
>> the CPUs, which then takes care of waking up the CPUs in deep idle.
>>
>> This patchset introduces a pseudo clock device which the archs can
>> register as the tick_broadcast_device in the absence of a real
>> external clock device. Once registered, the broadcast framework works
>> as-is for these architectures, as long as the archs handle the
>> BROADCAST_ENTER notification failing for one of the CPUs. That CPU
>> becomes the standby CPU which handles the wakeup of the CPUs in deep
>> idle, and it *must not enter deep idle states* itself.
>>
>> The CPU with the earliest wakeup is chosen as the standby CPU. The
>> standby CPU therefore moves around dynamically, and so does the
>> hrtimer, which is queued to fire at the next earliest wakeup time.
>> This is consistent with the case where an external clock device is
>> present: the smp affinity of that clock device is set to the CPU with
>> the earliest wakeup.
> 
> Hi Preeti,
> 
> jumping a bit late in the thread...
> 
> Setting the smp affinity on the earliest timer should be handled
> automatically with the CLOCK_EVT_FEAT_DYNIRQ flag. Did you look at using
> this flag?

This patch is not setting the smp affinity of the pseudo clock device at
all. That is not required, for the simple reason that the device does
not really exist.

I mentioned this point because we assign the CPU with the earliest
wakeup as the standby CPU. I compared this logic to the one used by the
tick broadcast framework to set the smp affinity of the external clock
device on archs which have one.

If these archs have the CLOCK_EVT_FEAT_DYNIRQ flag set for the external
clock device, the tick broadcast framework sets the smp affinity of this
device to the CPU with the earliest wakeup. We are using the same logic
in this patchset to assign the standby CPU.

> 
> Another comment is on the overall approach. We enter the cpuidle
> framework with a specific state to go to, and it is the tick framework
> telling us we mustn't go to this state. IMO the logic is wrong; the
> decision not to enter this state should be moved somewhere else.

It's not the tick framework which tells us that we cannot enter a deep
idle state; it's the *tick broadcast* framework specifically. The tick
broadcast framework was introduced with the primary intention of
handling the wakeup of CPUs in deep idle states when their local timers
become non-functional. Therefore there is cooperation between the tick
broadcast framework and cpuidle, and this has always been the case.

That is why, just before cpus go into deep idle, they call into the
broadcast framework. Until now it was assumed that the tick broadcast
framework would find no problem with the cpus entering deep idle, so
cpuidle would simply assume that all is well and go ahead and enter the
deep idle state. But today there is a scenario where there could be
problems if all cpus enter deep idle states, so the tick broadcast
framework now notifies the cpuidle framework to hold back one cpu. This
is just a simple extension of the current interaction between cpuidle
and the tick broadcast framework.
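
Concretely, with this series the arch idle entry path can look roughly
like the sketch below. Only the BROADCAST_ENTER notification failing
with -EBUSY for the standby CPU is what this patchset provides (together
with patch 1/3, which makes clockevents_notify() return an error code);
the helper names and the power7_nap() call are illustrative assumptions.

static int arch_enter_deep_idle(int cpu)
{
	/* Can now fail with -EBUSY if this cpu owns the broadcast hrtimer. */
	if (clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu)) {
		/* Standby cpu: fall back to a shallow state instead. */
		return arch_enter_shallow_idle(cpu);
	}

	power7_nap();	/* deep idle: the local timer stops */

	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
	return 0;
}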

> 
> Why don't you create a cpuidle driver with the shallow idle states
> assigned to a cpu (let's say cpu0) and another one with all the deeper
> idle states for the rest of the cpus? Using the multiple cpuidle driver
> support makes it possible. The timer won't be moving around and a cpu
> will be dedicated to act as the broadcast timer.
> 

Having a dedicated standby cpu for broadcasting has some issues which
were pointed out when I posted the initial versions of this patchset:
https://lkml.org/lkml/2013/7/27/14

1. This could create a power/thermal imbalance on the chip, since the
standby cpu alone can never enter a deep idle state.

2. If it is cpu0 it is fine; otherwise, with the logic that you suggest,
hot-plugging out the dedicated standby cpu would mean moving the work of
broadcasting to another cpu and modifying the cpuidle state table for
it. Even with cpu0, if support to hotplug it out is enabled (maybe it is
already), we face the same issue, and this gets very messy.

> Wouldn't that make sense and be less intrusive than the patchset you proposed?

Actually, this patchset brings in a solution that is as unintrusive as
possible. It makes the problem nearly invisible, except for a failed
return from a call into the broadcast framework. It simply asks the
archs which do not have an external clock device to register a pseudo
device with the broadcast framework, and from then on everything just
falls into place to enable deep idle states for such archs.

Thanks

Regards
Preeti U Murthy
Daniel Lezcano Feb. 11, 2014, 10:01 p.m. UTC | #4
On 02/11/2014 04:58 PM, Thomas Gleixner wrote:
> On Tue, 11 Feb 2014, Daniel Lezcano wrote:
>> On 02/07/2014 09:06 AM, Preeti U Murthy wrote:
>> Setting the smp affinity on the earliest timer should be handled automatically
>> with the CLOCK_EVT_FEAT_DYNIRQ flag. Did you look at using this flag?
>
> How should this flag help? Not at all, because the hrtimer-based
> broadcast device cannot assign affinities.
>
>> Another comment is on the overall approach. We enter the cpuidle framework
>> with a specific state to go to, and it is the tick framework telling us we
>> mustn't go to this state. IMO the logic is wrong; the decision not to enter
>> this state should be moved somewhere else.
>>
>> Why don't you create a cpuidle driver with the shallow idle states assigned to
>> one cpu (let's say cpu0) and another one with all the deeper idle states for the
>> rest of the cpus? Using the multiple cpuidle driver support makes it
>> possible. The timer won't be moving around and a cpu will be dedicated to act
>> as the broadcast timer.
>>
>> Wouldn't that make sense and be less intrusive than the patchset you proposed?
>
> How do you arm the broadcast timer on CPU0 from CPU1? You can't!
>
> You cannot access the cpu local timer on a different cpu. So you would
> have to send an IPI over to CPU0 so that it can reevaluate and
> schedule the broadcast. That's even more backwards than telling the
> cpuidle code that the CPU is not in a state to go deep.

Indeed :)

Thanks for the clarification.

   -- Daniel
Daniel Lezcano Feb. 11, 2014, 10:05 p.m. UTC | #5
On 02/11/2014 05:09 PM, Preeti U Murthy wrote:

Hi Preeti,

thanks for the detailed explanation. I understand better now why you did
it this way. Furthermore, as Thomas pointed out to me, what I missed is
that one cpu can't set up a local timer for another cpu, so what I was
talking about does not make sense.

   -- Daniel

Patch

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index e0c5a6c..dbe9e14 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -62,6 +62,11 @@  enum clock_event_mode {
 #define CLOCK_EVT_FEAT_DYNIRQ		0x000020
 #define CLOCK_EVT_FEAT_PERCPU		0x000040
 
+/*
+ * Clockevent device is based on a hrtimer for broadcast
+ */
+#define CLOCK_EVT_FEAT_HRTIMER		0x000080
+
 /**
  * struct clock_event_device - clock event device descriptor
  * @event_handler:	Assigned by the framework to be called by the low
@@ -83,6 +88,7 @@  enum clock_event_mode {
  * @name:		ptr to clock event name
  * @rating:		variable to rate clock event devices
  * @irq:		IRQ number (only for non CPU local devices)
+ * @bound_on:		Bound on CPU
  * @cpumask:		cpumask to indicate for which CPUs this device works
  * @list:		list head for the management code
  * @owner:		module reference
@@ -113,6 +119,7 @@  struct clock_event_device {
 	const char		*name;
 	int			rating;
 	int			irq;
+	int			bound_on;
 	const struct cpumask	*cpumask;
 	struct list_head	list;
 	struct module		*owner;
@@ -180,9 +187,11 @@  extern int tick_receive_broadcast(void);
 #endif
 
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
+extern void tick_setup_hrtimer_broadcast(void);
 extern int tick_check_broadcast_expired(void);
 #else
 static inline int tick_check_broadcast_expired(void) { return 0; }
+static inline void tick_setup_hrtimer_broadcast(void) { }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 9250130..06151ef 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -3,7 +3,7 @@  obj-y += timeconv.o posix-clock.o alarmtimer.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
-obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o
+obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
new file mode 100644
index 0000000..af1e119
--- /dev/null
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -0,0 +1,105 @@ 
+/*
+ * linux/kernel/time/tick-broadcast-hrtimer.c
+ * This file emulates a local clock event device
+ * via a pseudo clock device.
+ */
+#include <linux/cpu.h>
+#include <linux/err.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+#include <linux/percpu.h>
+#include <linux/profile.h>
+#include <linux/clockchips.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/module.h>
+
+#include "tick-internal.h"
+
+static struct hrtimer bctimer;
+
+static void bc_set_mode(enum clock_event_mode mode,
+			struct clock_event_device *bc)
+{
+	switch (mode) {
+	case CLOCK_EVT_MODE_SHUTDOWN:
+		/*
+		 * Note, we cannot cancel the timer here as we might
+		 * run into the following live lock scenario:
+		 *
+		 * cpu 0		cpu1
+		 * lock(broadcast_lock);
+		 *			hrtimer_interrupt()
+		 *			bc_handler()
+		 *			   tick_handle_oneshot_broadcast();
+		 *			    lock(broadcast_lock);
+		 * hrtimer_cancel()
+		 *  wait_for_callback()
+		 */
+		hrtimer_try_to_cancel(&bctimer);
+		break;
+	default:
+		break;
+	}
+}
+
+/*
+ * This is called from the guts of the broadcast code when the cpu
+ * which is about to enter idle has the earliest broadcast timer event.
+ */
+static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
+{
+	/*
+	 * We try to cancel the timer first. If the callback is in
+	 * flight on some other cpu then we let it handle it. If we
+	 * were able to cancel the timer nothing can rearm it as we
+	 * own broadcast_lock.
+	 *
+	 * However, we can also be called from the event handler of
+	 * ce_broadcast_hrtimer itself when it expires. We cannot
+	 * therefore restart the timer since it is in flight on the
+	 * same CPU. But for the same reason we can reset it.
+	 */
+	if (hrtimer_try_to_cancel(&bctimer) >= 0) {
+		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED);
+		/* Bind the "device" to the cpu */
+		bc->bound_on = smp_processor_id();
+	} else if (bc->bound_on == smp_processor_id()) {
+		hrtimer_set_expires(&bctimer, expires);
+	}
+	return 0;
+}
+
+static struct clock_event_device ce_broadcast_hrtimer = {
+	.set_mode		= bc_set_mode,
+	.set_next_ktime		= bc_set_next,
+	.features		= CLOCK_EVT_FEAT_ONESHOT |
+				  CLOCK_EVT_FEAT_KTIME |
+				  CLOCK_EVT_FEAT_HRTIMER,
+	.rating			= 0,
+	.bound_on		= -1,
+	.min_delta_ns		= 1,
+	.max_delta_ns		= KTIME_MAX,
+	.min_delta_ticks	= 1,
+	.max_delta_ticks	= KTIME_MAX,
+	.mult			= 1,
+	.shift			= 0,
+	.cpumask		= cpu_all_mask,
+};
+
+static enum hrtimer_restart bc_handler(struct hrtimer *t)
+{
+	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
+
+	if (ce_broadcast_hrtimer.next_event.tv64 == KTIME_MAX)
+		return HRTIMER_NORESTART;
+
+	return HRTIMER_RESTART;
+}
+
+void tick_setup_hrtimer_broadcast(void)
+{
+	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	bctimer.function = bc_handler;
+	clockevents_register_device(&ce_broadcast_hrtimer);
+}
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index ddf2ac2..2f013c3 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -630,6 +630,42 @@  again:
 	raw_spin_unlock(&tick_broadcast_lock);
 }
 
+static int broadcast_needs_cpu(struct clock_event_device *bc, int cpu)
+{
+	if (!(bc->features & CLOCK_EVT_FEAT_HRTIMER))
+		return 0;
+	if (bc->next_event.tv64 == KTIME_MAX)
+		return 0;
+	return bc->bound_on == cpu ? -EBUSY : 0;
+}
+
+static void broadcast_shutdown_local(struct clock_event_device *bc,
+				     struct clock_event_device *dev)
+{
+	/*
+	 * For hrtimer based broadcasting we cannot shutdown the cpu
+	 * local device if our own event is the first one to expire or
+	 * if we own the broadcast timer.
+	 */
+	if (bc->features & CLOCK_EVT_FEAT_HRTIMER) {
+		if (broadcast_needs_cpu(bc, smp_processor_id()))
+			return;
+		if (dev->next_event.tv64 < bc->next_event.tv64)
+			return;
+	}
+	clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
+}
+
+static void broadcast_move_bc(int deadcpu)
+{
+	struct clock_event_device *bc = tick_broadcast_device.evtdev;
+
+	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
+		return;
+	/* This moves the broadcast assignment to this cpu */
+	clockevents_program_event(bc, bc->next_event, 1);
+}
+
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
@@ -648,7 +684,7 @@  int tick_broadcast_oneshot_control(unsigned long reason)
 	 * states
 	 */
 	if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
-		return;
+		return 0;
 
 	/*
 	 * We are called with preemtion disabled from the depth of the
@@ -659,7 +695,7 @@  int tick_broadcast_oneshot_control(unsigned long reason)
 	dev = td->evtdev;
 
 	if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
-		return;
+		return 0;
 
 	bc = tick_broadcast_device.evtdev;
 
@@ -667,7 +703,7 @@  int tick_broadcast_oneshot_control(unsigned long reason)
 	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
 		if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			WARN_ON_ONCE(cpumask_test_cpu(cpu, tick_broadcast_pending_mask));
-			clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
+			broadcast_shutdown_local(bc, dev);
 			/*
 			 * We only reprogram the broadcast timer if we
 			 * did not mark ourself in the force mask and
@@ -680,6 +716,16 @@  int tick_broadcast_oneshot_control(unsigned long reason)
 			    dev->next_event.tv64 < bc->next_event.tv64)
 				tick_broadcast_set_event(bc, cpu, dev->next_event, 1);
 		}
+		/*
+		 * If the current CPU owns the hrtimer broadcast
+		 * mechanism, it cannot go deep idle and we remove the
+		 * CPU from the broadcast mask. We don't have to go
+		 * through the EXIT path as the local timer is not
+		 * shutdown.
+		 */
+		ret = broadcast_needs_cpu(bc, cpu);
+		if (ret)
+			cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
 	} else {
 		if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
@@ -853,6 +899,8 @@  void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
 	cpumask_clear_cpu(cpu, tick_broadcast_pending_mask);
 	cpumask_clear_cpu(cpu, tick_broadcast_force_mask);
 
+	broadcast_move_bc(cpu);
+
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }