Message ID: f127e6d3f3ef64ea286b7b8f485e5d0c7f3faa6f.147018b3518.git.dalias@libc.org (mailing list archive)
State: New, archived
Hello Rich, On 08-04-16, Rich Felker wrote: > At the hardware level, the J-Core PIT is integrated with the interrupt > controller, but it is represented as its own device and has an > independent programming interface. It provides a 12-bit countdown > timer, which is not presently used, and a periodic timer. The interval > length for the latter is programmable via a 32-bit throttle register > whose units are determined by a bus-period register. The periodic > timer is used to implement both periodic and oneshot clock event > modes; in oneshot mode the interrupt handler simply disables the timer > as soon as it fires. > > Despite its device tree node representing an interrupt for the PIT, > the actual irq generated is programmable, not hard-wired. The driver > is responsible for programming the PIT to generate the hardware irq > number that the DT assigns to it. > > On SMP configurations, J-Core provides cpu-local instances of the PIT; > no broadcast timer is needed. This driver supports the creation of the > necessary per-cpu clock_event_device instances. The code has been > tested and works on SMP, but will not be usable without additional > J-Core SMP-support patches and appropriate hardware capable of running > SMP. > > A nanosecond-resolution clocksource is provided using the J-Core "RTC" > registers, which give a 64-bit seconds count and 32-bit nanoseconds. > The driver converts these to a 64-bit nanoseconds count. > > Signed-off-by: Rich Felker <dalias@libc.org> > ... > ... > ... 
> + > +static int jcore_pit_set_state_shutdown(struct clock_event_device *ced) > +{ > + struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced); > + > + return jcore_pit_disable(pit); > +} > + > +static int jcore_pit_set_state_oneshot(struct clock_event_device *ced) > +{ > + struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced); > + > + return jcore_pit_disable(pit); > +} Maybe to use only jcore_pit_set_state_shutdown() for both shutdown/oneshot states as it is implemented for some of others clocksource drivers? all the more so as described in your commit message, they do the same: > in oneshot mode the interrupt handler simply disables the timer > as soon as it fires right? > +static int __init jcore_pit_init(struct device_node *node) > +{ > + int err; > + unsigned pit_irq, cpu; > + unsigned long hwirq; > + u32 irqprio, enable_val; > + > + jcore_pit_base = of_iomap(node, 0); > + if (!jcore_pit_base) { > + pr_err("Error: Cannot map base address for J-Core PIT\n"); > + return -ENXIO; > + } > + > + pit_irq = irq_of_parse_and_map(node, 0); > + if (!pit_irq) { > + pr_err("Error: J-Core PIT has no IRQ\n"); > + return -ENXIO; > + } > + > + pr_info("Initializing J-Core PIT at %p IRQ %d\n", > + jcore_pit_base, pit_irq); > + > + jcore_cs.name = "jcore_pit_cs"; > + jcore_cs.rating = 400; > + jcore_cs.read = jcore_clocksource_read; > + jcore_cs.mult = 1; > + jcore_cs.shift = 0; > + jcore_cs.mask = CLOCKSOURCE_MASK(32); > + jcore_cs.flags = CLOCK_SOURCE_IS_CONTINUOUS; > + > + err = clocksource_register_hz(&jcore_cs, NSEC_PER_SEC); > + if (err) { > + pr_err("Error registering clocksource device: %d\n", err); > + return err; > + } > + > + sched_clock_register(jcore_sched_clock_read, 32, NSEC_PER_SEC); > + > + jcore_pit_percpu = alloc_percpu(struct jcore_pit); > + if (!jcore_pit_percpu) { > + pr_err("Failed to allocate memory for clock event device\n"); > + return -ENOMEM; > + } > + > + err = request_irq(pit_irq, jcore_timer_interrupt, > + IRQF_TIMER | 
IRQF_PERCPU, > + "jcore_pit", jcore_pit_percpu); > + if (err) { > + pr_err("pit irq request failed: %d\n", err); > + return err; > + } free_percpu() missed in a case when request_irq() failed. -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 04, 2016 at 02:24:51PM +0600, Alexnader Kuleshov wrote: > Hello Rich, > > On 08-04-16, Rich Felker wrote: > > At the hardware level, the J-Core PIT is integrated with the interrupt > > controller, but it is represented as its own device and has an > > independent programming interface. It provides a 12-bit countdown > > timer, which is not presently used, and a periodic timer. The interval > > length for the latter is programmable via a 32-bit throttle register > > whose units are determined by a bus-period register. The periodic > > timer is used to implement both periodic and oneshot clock event > > modes; in oneshot mode the interrupt handler simply disables the timer > > as soon as it fires. > > > > Despite its device tree node representing an interrupt for the PIT, > > the actual irq generated is programmable, not hard-wired. The driver > > is responsible for programming the PIT to generate the hardware irq > > number that the DT assigns to it. > > > > On SMP configurations, J-Core provides cpu-local instances of the PIT; > > no broadcast timer is needed. This driver supports the creation of the > > necessary per-cpu clock_event_device instances. The code has been > > tested and works on SMP, but will not be usable without additional > > J-Core SMP-support patches and appropriate hardware capable of running > > SMP. > > > > A nanosecond-resolution clocksource is provided using the J-Core "RTC" > > registers, which give a 64-bit seconds count and 32-bit nanoseconds. > > The driver converts these to a 64-bit nanoseconds count. > > > > Signed-off-by: Rich Felker <dalias@libc.org> > > ... > > ... > > ... 
> > + > > +static int jcore_pit_set_state_shutdown(struct clock_event_device *ced) > > +{ > > + struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced); > > + > > + return jcore_pit_disable(pit); > > +} > > + > > +static int jcore_pit_set_state_oneshot(struct clock_event_device *ced) > > +{ > > + struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced); > > + > > + return jcore_pit_disable(pit); > > +} > > Maybe to use only jcore_pit_set_state_shutdown() for both shutdown/oneshot > states as it is implemented for some of others clocksource drivers? > > all the more so as described in your commit message, they do the same: > > > in oneshot mode the interrupt handler simply disables the timer > > as soon as it fires > > right? I separated these out semantically due to another request earlier in this patch's life cycle. At this point I feel like issues like this are really a bikeshed, and rather than changing trivial details back and forth over and over I'd like to see it go upstream so that I don't have to keep rebasing it on infrastructure that's changing underneath it. 
> > +static int __init jcore_pit_init(struct device_node *node) > > +{ > > + int err; > > + unsigned pit_irq, cpu; > > + unsigned long hwirq; > > + u32 irqprio, enable_val; > > + > > + jcore_pit_base = of_iomap(node, 0); > > + if (!jcore_pit_base) { > > + pr_err("Error: Cannot map base address for J-Core PIT\n"); > > + return -ENXIO; > > + } > > + > > + pit_irq = irq_of_parse_and_map(node, 0); > > + if (!pit_irq) { > > + pr_err("Error: J-Core PIT has no IRQ\n"); > > + return -ENXIO; > > + } > > + > > + pr_info("Initializing J-Core PIT at %p IRQ %d\n", > > + jcore_pit_base, pit_irq); > > + > > + jcore_cs.name = "jcore_pit_cs"; > > + jcore_cs.rating = 400; > > + jcore_cs.read = jcore_clocksource_read; > > + jcore_cs.mult = 1; > > + jcore_cs.shift = 0; > > + jcore_cs.mask = CLOCKSOURCE_MASK(32); > > + jcore_cs.flags = CLOCK_SOURCE_IS_CONTINUOUS; > > + > > + err = clocksource_register_hz(&jcore_cs, NSEC_PER_SEC); > > + if (err) { > > + pr_err("Error registering clocksource device: %d\n", err); > > + return err; > > + } > > + > > + sched_clock_register(jcore_sched_clock_read, 32, NSEC_PER_SEC); > > + > > + jcore_pit_percpu = alloc_percpu(struct jcore_pit); > > + if (!jcore_pit_percpu) { > > + pr_err("Failed to allocate memory for clock event device\n"); > > + return -ENOMEM; > > + } > > + > > + err = request_irq(pit_irq, jcore_timer_interrupt, > > + IRQF_TIMER | IRQF_PERCPU, > > + "jcore_pit", jcore_pit_percpu); > > + if (err) { > > + pr_err("pit irq request failed: %d\n", err); > > + return err; > > + } > > free_percpu() missed in a case when request_irq() failed. Shall I submit a new version with this change? I don't think it's particularly useful since you're not going to have a working system if the timer fails to initialize anyway, but I can do it if desired. Rich
On 08/04/2016 06:30 AM, Rich Felker wrote: > At the hardware level, the J-Core PIT is integrated with the interrupt > controller, but it is represented as its own device and has an > independent programming interface. It provides a 12-bit countdown > timer, which is not presently used, and a periodic timer. The interval > length for the latter is programmable via a 32-bit throttle register > whose units are determined by a bus-period register. The periodic > timer is used to implement both periodic and oneshot clock event > modes; in oneshot mode the interrupt handler simply disables the timer > as soon as it fires. > > Despite its device tree node representing an interrupt for the PIT, > the actual irq generated is programmable, not hard-wired. The driver > is responsible for programming the PIT to generate the hardware irq > number that the DT assigns to it. > > On SMP configurations, J-Core provides cpu-local instances of the PIT; > no broadcast timer is needed. This driver supports the creation of the > necessary per-cpu clock_event_device instances. The code has been > tested and works on SMP, but will not be usable without additional > J-Core SMP-support patches and appropriate hardware capable of running > SMP. > > A nanosecond-resolution clocksource is provided using the J-Core "RTC" > registers, which give a 64-bit seconds count and 32-bit nanoseconds. > The driver converts these to a 64-bit nanoseconds count. 
> > Signed-off-by: Rich Felker <dalias@libc.org> > --- > drivers/clocksource/Kconfig | 9 ++ > drivers/clocksource/Makefile | 1 + > drivers/clocksource/jcore-pit.c | 242 ++++++++++++++++++++++++++++++++++++++++ > include/linux/cpuhotplug.h | 1 + > 4 files changed, 253 insertions(+) > create mode 100644 drivers/clocksource/jcore-pit.c > > diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig > index 5677886..3210ca5 100644 > --- a/drivers/clocksource/Kconfig > +++ b/drivers/clocksource/Kconfig > @@ -407,6 +407,15 @@ config SYS_SUPPORTS_SH_TMU > config SYS_SUPPORTS_EM_STI > bool > > +config CLKSRC_JCORE_PIT > + bool "J-Core PIT timer driver" > + depends on OF && (SUPERH || COMPILE_TEST) Even if this is correct, for the sake of consistency, it is preferable to change it to: bool "J-Core PIT timer driver" if COMPILE_TEST depends on SUPERH select CLKSRC_OF > + depends on GENERIC_CLOCKEVENTS > + depends on HAS_IOMEM > + help > + This enables build of clocksource and clockevent driver for > + the integrated PIT in the J-Core synthesizable, open source SoC. 
> + > config SH_TIMER_CMT > bool "Renesas CMT timer driver" if COMPILE_TEST > depends on GENERIC_CLOCKEVENTS > diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile > index fd9d6df..cf87f40 100644 > --- a/drivers/clocksource/Makefile > +++ b/drivers/clocksource/Makefile > @@ -5,6 +5,7 @@ obj-$(CONFIG_ATMEL_TCB_CLKSRC) += tcb_clksrc.o > obj-$(CONFIG_X86_PM_TIMER) += acpi_pm.o > obj-$(CONFIG_SCx200HR_TIMER) += scx200_hrt.o > obj-$(CONFIG_CS5535_CLOCK_EVENT_SRC) += cs5535-clockevt.o > +obj-$(CONFIG_CLKSRC_JCORE_PIT) += jcore-pit.o > obj-$(CONFIG_SH_TIMER_CMT) += sh_cmt.o > obj-$(CONFIG_SH_TIMER_MTU2) += sh_mtu2.o > obj-$(CONFIG_SH_TIMER_TMU) += sh_tmu.o > diff --git a/drivers/clocksource/jcore-pit.c b/drivers/clocksource/jcore-pit.c > new file mode 100644 > index 0000000..23dee50 > --- /dev/null > +++ b/drivers/clocksource/jcore-pit.c > @@ -0,0 +1,242 @@ > +/* > + * J-Core SoC PIT/clocksource driver > + * > + * Copyright (C) 2015-2016 Smart Energy Instruments, Inc. > + * > + * This file is subject to the terms and conditions of the GNU General Public > + * License. See the file "COPYING" in the main directory of this archive > + * for more details. 
> + */ > + > +#include <linux/kernel.h> > +#include <linux/slab.h> > +#include <linux/interrupt.h> > +#include <linux/clockchips.h> > +#include <linux/clocksource.h> > +#include <linux/sched_clock.h> > +#include <linux/cpu.h> > +#include <linux/cpuhotplug.h> > +#include <linux/of_address.h> > +#include <linux/of_irq.h> > + > +#define PIT_IRQ_SHIFT 12 > +#define PIT_PRIO_SHIFT 20 > +#define PIT_ENABLE_SHIFT 26 > +#define PIT_IRQ_MASK 0x3f > +#define PIT_PRIO_MASK 0xf > + > +#define REG_PITEN 0x00 > +#define REG_THROT 0x10 > +#define REG_COUNT 0x14 > +#define REG_BUSPD 0x18 > +#define REG_SECHI 0x20 > +#define REG_SECLO 0x24 > +#define REG_NSEC 0x28 > + > +struct jcore_pit { > + struct clock_event_device ced; > + __iomem void *base; > + unsigned long periodic_delta; > + unsigned cpu; > + u32 enable_val; > +}; > + > +static __iomem void *jcore_pit_base; > +static struct clocksource jcore_cs; > +struct jcore_pit __percpu *jcore_pit_percpu; > + > +static notrace u64 jcore_sched_clock_read(void) > +{ > + u32 seclo, nsec, seclo0; > + __iomem void *base = jcore_pit_base; > + > + seclo = __raw_readl(base + REG_SECLO); > + do { > + seclo0 = seclo; > + nsec = __raw_readl(base + REG_NSEC); > + seclo = __raw_readl(base + REG_SECLO); > + } while (seclo0 != seclo); > + > + return seclo * NSEC_PER_SEC + nsec; > +} > + > +static cycle_t jcore_clocksource_read(struct clocksource *cs) > +{ > + return jcore_sched_clock_read(); > +} > + > +static int jcore_pit_disable(struct jcore_pit *pit) > +{ > + __raw_writel(0, pit->base + REG_PITEN); > + return 0; > +} > + > +static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit) > +{ > + jcore_pit_disable(pit); > + __raw_writel(delta, pit->base + REG_THROT); > + __raw_writel(pit->enable_val, pit->base + REG_PITEN); > + return 0; Why do you need to use __raw_writel ? 
s/__raw_writel/writel/ > +} > + > +static int jcore_pit_set_state_shutdown(struct clock_event_device *ced) > +{ > + struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced); > + > + return jcore_pit_disable(pit); > +} > + > +static int jcore_pit_set_state_oneshot(struct clock_event_device *ced) > +{ > + struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced); > + > + return jcore_pit_disable(pit); > +} > + > +static int jcore_pit_set_state_periodic(struct clock_event_device *ced) > +{ > + struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced); > + > + return jcore_pit_set(pit->periodic_delta, pit); > +} > + > +static int jcore_pit_set_next_event(unsigned long delta, > + struct clock_event_device *ced) > +{ > + struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced); > + > + return jcore_pit_set(delta, pit); > +} > + > +static int jcore_pit_local_init(unsigned cpu) > +{ > + struct jcore_pit *pit = this_cpu_ptr(jcore_pit_percpu); > + unsigned buspd, freq; > + > + pr_info("Local J-Core PIT init on cpu %u\n", pit->cpu); > + > + buspd = __raw_readl(pit->base + REG_BUSPD); s/__raw_readl/readl/ ? > + freq = DIV_ROUND_CLOSEST(NSEC_PER_SEC, buspd); > + pit->periodic_delta = DIV_ROUND_CLOSEST(NSEC_PER_SEC, HZ*buspd); ---> HZ * buspd > + > + clockevents_config_and_register(&pit->ced, freq, 1, ULONG_MAX); > + > + return 0; > +} > + > +static int jcore_pit_local_shutdown(unsigned cpu) > +{ > + return 0; > +} This function is useless I think. AFAIU, cpuhp_setup_state can have a NULL function for the cpu teardown. 
> + > +static irqreturn_t jcore_timer_interrupt(int irq, void *dev_id) > +{ > + struct jcore_pit *pit = this_cpu_ptr(dev_id); > + > + if (clockevent_state_oneshot(&pit->ced)) > + jcore_pit_disable(pit); > + > + pit->ced.event_handler(&pit->ced); > + > + return IRQ_HANDLED; > +} > + > +static int __init jcore_pit_init(struct device_node *node) > +{ > + int err; > + unsigned pit_irq, cpu; > + unsigned long hwirq; > + u32 irqprio, enable_val; > + > + jcore_pit_base = of_iomap(node, 0); > + if (!jcore_pit_base) { > + pr_err("Error: Cannot map base address for J-Core PIT\n"); > + return -ENXIO; > + } > + > + pit_irq = irq_of_parse_and_map(node, 0); > + if (!pit_irq) { > + pr_err("Error: J-Core PIT has no IRQ\n"); > + return -ENXIO; > + } > + > + pr_info("Initializing J-Core PIT at %p IRQ %d\n", > + jcore_pit_base, pit_irq); > + > + jcore_cs.name = "jcore_pit_cs"; > + jcore_cs.rating = 400; > + jcore_cs.read = jcore_clocksource_read; > + jcore_cs.mult = 1; > + jcore_cs.shift = 0; > + jcore_cs.mask = CLOCKSOURCE_MASK(32); > + jcore_cs.flags = CLOCK_SOURCE_IS_CONTINUOUS; > + > + err = clocksource_register_hz(&jcore_cs, NSEC_PER_SEC); > + if (err) { > + pr_err("Error registering clocksource device: %d\n", err); > + return err; > + } May be you can consider by replacing the above by: clocksource_mmio_init(jcore_pit_base, "jcore_pit_cs", NSEC_PER_SEC, 32, jcore_clocksource_read); > + > + sched_clock_register(jcore_sched_clock_read, 32, NSEC_PER_SEC); > + > + jcore_pit_percpu = alloc_percpu(struct jcore_pit); > + if (!jcore_pit_percpu) { > + pr_err("Failed to allocate memory for clock event device\n"); > + return -ENOMEM; > + } > + > + err = request_irq(pit_irq, jcore_timer_interrupt, > + IRQF_TIMER | IRQF_PERCPU, > + "jcore_pit", jcore_pit_percpu); > + if (err) { > + pr_err("pit irq request failed: %d\n", err); > + return err; > + } That is my major concern. 
Regarding the description of this timer, it appears there is one timer per cpu but we use request_irq instead of request_percpu_irq. IIUC, there is a problem with the interrupt controller where the per irq line are not working correctly. Is that correct ? Regarding Marc Zyngier comments about the irq controller driver being almost empty, I'm wondering if something in the irq controller driver which shouldn't be added before submitting this timer driver with SMP support (eg. irq domain ?). > + /* > + * The J-Core PIT is not hard-wired to a particular IRQ, but > + * integrated with the interrupt controller such that the IRQ it > + * generates is programmable. The programming interface has a > + * legacy field which was an interrupt priority for AIC1, but > + * which is OR'd onto bits 2-5 of the generated IRQ number when > + * used with J-Core AIC2, so set it to match these bits. > + */ > + hwirq = irq_get_irq_data(pit_irq)->hwirq; irq_hw_number_t hwirq; hwirq = irqd_to_hwirq(irq_get_irq_data(pit_irq)); > + irqprio = (hwirq >> 2) & PIT_PRIO_MASK; > + enable_val = (1U << PIT_ENABLE_SHIFT) > + | (hwirq << PIT_IRQ_SHIFT) > + | (irqprio << PIT_PRIO_SHIFT); I'm missing the connection between the description above and the enable value computed here. Can you elaborate ? 
> + > + for_each_present_cpu(cpu) { > + struct jcore_pit *pit = per_cpu_ptr(jcore_pit_percpu, cpu); > + > + pit->base = of_iomap(node, cpu); > + if (!pit->base) { > + pr_err("Unable to map PIT for cpu %u\n", cpu); > + continue; > + } > + > + pit->ced.name = "jcore_pit"; > + pit->ced.features = CLOCK_EVT_FEAT_PERIODIC > + | CLOCK_EVT_FEAT_ONESHOT > + | CLOCK_EVT_FEAT_PERCPU; > + pit->ced.cpumask = cpumask_of(cpu); > + pit->ced.rating = 400; > + pit->ced.irq = pit_irq; > + pit->ced.set_state_shutdown = jcore_pit_set_state_shutdown; > + pit->ced.set_state_periodic = jcore_pit_set_state_periodic; > + pit->ced.set_state_oneshot = jcore_pit_set_state_oneshot; > + pit->ced.set_next_event = jcore_pit_set_next_event; > + > + pit->cpu = cpu; > + pit->enable_val = enable_val; > + } > + > + cpuhp_setup_state(CPUHP_AP_JCORE_TIMER_STARTING, > + "AP_JCORE_TIMER_STARTING", > + jcore_pit_local_init, jcore_pit_local_shutdown); > + > + return 0; > +} > + > +CLOCKSOURCE_OF_DECLARE(jcore_pit, "jcore,pit", jcore_pit_init); > diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h > index 242bf53..e95ed7d 100644 > --- a/include/linux/cpuhotplug.h > +++ b/include/linux/cpuhotplug.h > @@ -50,6 +50,7 @@ enum cpuhp_state { > CPUHP_AP_ARM_ARCH_TIMER_STARTING, > CPUHP_AP_ARM_GLOBAL_TIMER_STARTING, > CPUHP_AP_DUMMY_TIMER_STARTING, > + CPUHP_AP_JCORE_TIMER_STARTING, > CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING, > CPUHP_AP_ARM_TWD_STARTING, > CPUHP_AP_METAG_TIMER_STARTING, >
On Wed, Aug 24, 2016 at 06:42:05PM +0200, Daniel Lezcano wrote: > > diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig > > index 5677886..3210ca5 100644 > > --- a/drivers/clocksource/Kconfig > > +++ b/drivers/clocksource/Kconfig > > @@ -407,6 +407,15 @@ config SYS_SUPPORTS_SH_TMU > > config SYS_SUPPORTS_EM_STI > > bool > > > > +config CLKSRC_JCORE_PIT > > + bool "J-Core PIT timer driver" > > + depends on OF && (SUPERH || COMPILE_TEST) > > Even if this is correct, for the sake of consistency, it is preferable > to change it to: > > bool "J-Core PIT timer driver" if COMPILE_TEST > depends on SUPERH > select CLKSRC_OF Is this functionally equivalent? If so that's non-obvious to me. What about the dropped OF dependency? The intent is that the option should always be available for SUPERH targets using OF, otherwise only available with COMPILE_TEST. > > +static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit) > > +{ > > + jcore_pit_disable(pit); > > + __raw_writel(delta, pit->base + REG_THROT); > > + __raw_writel(pit->enable_val, pit->base + REG_PITEN); > > + return 0; > > Why do you need to use __raw_writel ? > > s/__raw_writel/writel/ I actually tried multiple times to find good resources on policy for which form to prefer, but didn't have much luck. My understanding is that __raw_writel/__raw_readl always performs a native-endian load/store, whereas writel/readl behavior depends on cpu endianness and whether the arch declares that "pci bus" (that was the term I found in the source, not relevant here) io is endian-swapped or not. Can you give me a better explanation of why we might prefer one form or the other? > > + freq = DIV_ROUND_CLOSEST(NSEC_PER_SEC, buspd); > > + pit->periodic_delta = DIV_ROUND_CLOSEST(NSEC_PER_SEC, HZ*buspd); > > ---> HZ * buspd OK. 
> > + clockevents_config_and_register(&pit->ced, freq, 1, ULONG_MAX); > > + > > + return 0; > > +} > > + > > +static int jcore_pit_local_shutdown(unsigned cpu) > > +{ > > + return 0; > > +} > > This function is useless I think. AFAIU, cpuhp_setup_state can have a > NULL function for the cpu teardown. OK, I wasn't sure if that was permitted. > > + jcore_cs.name = "jcore_pit_cs"; > > + jcore_cs.rating = 400; > > + jcore_cs.read = jcore_clocksource_read; > > + jcore_cs.mult = 1; > > + jcore_cs.shift = 0; > > + jcore_cs.mask = CLOCKSOURCE_MASK(32); > > + jcore_cs.flags = CLOCK_SOURCE_IS_CONTINUOUS; > > + > > + err = clocksource_register_hz(&jcore_cs, NSEC_PER_SEC); > > + if (err) { > > + pr_err("Error registering clocksource device: %d\n", err); > > + return err; > > + } > > May be you can consider by replacing the above by: > > clocksource_mmio_init(jcore_pit_base, "jcore_pit_cs", > NSEC_PER_SEC, 32, > jcore_clocksource_read); I think you're missing the rating argument. Otherwise it should work, but is there a good reason to prefer it? The code is slightly simpler; I'm not sure if using kzalloc vs static storage is better or worse. > > + sched_clock_register(jcore_sched_clock_read, 32, NSEC_PER_SEC); > > + > > + jcore_pit_percpu = alloc_percpu(struct jcore_pit); > > + if (!jcore_pit_percpu) { > > + pr_err("Failed to allocate memory for clock event device\n"); > > + return -ENOMEM; > > + } > > + > > + err = request_irq(pit_irq, jcore_timer_interrupt, > > + IRQF_TIMER | IRQF_PERCPU, > > + "jcore_pit", jcore_pit_percpu); > > + if (err) { > > + pr_err("pit irq request failed: %d\n", err); > > + return err; > > + } > > That is my major concern. Regarding the description of this timer, it > appears there is one timer per cpu but we use request_irq instead of > request_percpu_irq. > > IIUC, there is a problem with the interrupt controller where the per irq > line are not working correctly. Is that correct ? I don't think that's a correct characterization. 
Rather the percpu infrastructure just means something completely different from what you would expect it to mean. It has nothing to do with the hardware but rather with kernel-internal choice of whether to do percpu devid mapping inside the irq infrastructure, and the choice at the irq-requester side of whether to do this is required to match the irqchip driver's choice. I explained this better in another email which I could dig up if necessary, but the essence is that request_percpu_irq is a misnamed and unusably broken API. > Regarding Marc Zyngier comments about the irq controller driver being > almost empty, I'm wondering if something in the irq controller driver > which shouldn't be added before submitting this timer driver with SMP > support (eg. irq domain ?). I don't think so. At most I could make the driver hard-code the percpu devid model for certain irqs, but that _does not reflect_ anything about the hardware. Rather it just reflects bad kernel internals. It would also require writing a new percpu devid version of handle_simple_irq. In the other thread where I discussed this, I suggested irq framework changes that would clean this all up and make the percpu devid stuff work the way someone trying to use the API would expect, but such changes are not necessary for the J-Core drivers (or other existing drivers) to work. > > + /* > > + * The J-Core PIT is not hard-wired to a particular IRQ, but > > + * integrated with the interrupt controller such that the IRQ it > > + * generates is programmable. The programming interface has a > > + * legacy field which was an interrupt priority for AIC1, but > > + * which is OR'd onto bits 2-5 of the generated IRQ number when > > + * used with J-Core AIC2, so set it to match these bits. > > + */ > > + hwirq = irq_get_irq_data(pit_irq)->hwirq; > > irq_hw_number_t hwirq; > hwirq = irqd_to_hwirq(irq_get_irq_data(pit_irq)); OK. 
> > + irqprio = (hwirq >> 2) & PIT_PRIO_MASK; > > + enable_val = (1U << PIT_ENABLE_SHIFT) > > + | (hwirq << PIT_IRQ_SHIFT) > > + | (irqprio << PIT_PRIO_SHIFT); > > > I'm missing the connection between the description above and the enable value computed here. Can you elaborate ? The irqprio field is filled in using a value that matches bits 2 and up of hwirq; this is what the comment says and what the code does. Can you elaborate on what you don't understand? Rich
On Wed, 24 Aug 2016 13:40:01 -0400 Rich Felker <dalias@libc.org> wrote: [...] > > IIUC, there is a problem with the interrupt controller where the per irq > > line are not working correctly. Is that correct ? > > I don't think that's a correct characterization. Rather the percpu > infrastructure just means something completely different from what you > would expect it to mean. It has nothing to do with the hardware but > rather with kernel-internal choice of whether to do percpu devid > mapping inside the irq infrastructure, and the choice at the > irq-requester side of whether to do this is required to match the > irqchip driver's choice. I explained this better in another email > which I could dig up if necessary, but the essence is that > request_percpu_irq is a misnamed and unusably broken API. Or just one that simply doesn't fit your needs, because other architectures have different semantics than the ones you take for granted? > > > Regarding Marc Zyngier comments about the irq controller driver being > > almost empty, I'm wondering if something in the irq controller driver > > which shouldn't be added before submitting this timer driver with SMP > > support (eg. irq domain ?). > > I don't think so. At most I could make the driver hard-code the percpu > devid model for certain irqs, but that _does not reflect_ anything > about the hardware. Rather it just reflects bad kernel internals. It I'd appreciate it if instead of ranting about how broken the kernel is, you'd submit a patch fixing it, since you seem to have spotted something that we haven't in several years of using that code on a couple of ARM-related platforms. Thanks, M.
On Wed, Aug 24, 2016 at 08:01:52PM +0100, Marc Zyngier wrote: > On Wed, 24 Aug 2016 13:40:01 -0400 > Rich Felker <dalias@libc.org> wrote: > > [...] > > > > IIUC, there is a problem with the interrupt controller where > the per irq > > > line are not working correctly. Is that correct ? > > > > I don't think that's a correct characterization. Rather the percpu > > infrastructure just means something completely different from what you > > would expect it to mean. It has nothing to do with the hardware but > > rather with kernel-internal choice of whether to do percpu devid > > mapping inside the irq infrastructure, and the choice at the > > irq-requester side of whether to do this is required to match the > > irqchip driver's choice. I explained this better in another email > > which I could dig up if necessary, but the essence is that > > request_percpu_irq is a misnamed and unusably broken API. For reference, here's the thread I was referring to: https://lkml.org/lkml/2016/7/15/585 > Or just one that simply doesn't fit your needs, because other > architectures have different semantics than the ones you take for > granted? I don't think so. The choice of whether to have the irq layer or the driver's irq handler be responsible for handling a percpu pointer has nothing to do with the hardware. Perhaps the intent was that the irqchip driver always knows whether a given hwirq[-range] is associated with per-cpu events or global events for which it doesn't matter what cpu they're delivered on. In this case, the situations where you may want percpu dev_id mapping line up with some property of the hardware. However that need not be the case, and it's not when the choice of irq is programmable. > > > Regarding Marc Zyngier comments about the irq controller driver being > > > almost empty, I'm wondering if something in the irq controller driver > > > which shouldn't be added before submitting this timer driver with SMP > > > support (eg. irq domain ?). 
> > > > I don't think so. At most I could make the driver hard-code the percpu > > devid model for certain irqs, but that _does not reflect_ anything > > about the hardware. Rather it just reflects bad kernel internals. It > > I'd appreciate it if instead of ranting about how broken the kernel is, > you'd submit a patch fixing it, since you seem to have spotted > something that we haven't in several years of using that code on a > couple of ARM-related platforms. I didn't intend for this to be a rant. I'm not demanding that it be changed; I'm only objecting to being asked to make the driver use a framework that it doesn't need and that can't model what needs to be done. But I'm happy to discuss whether you would be open to such a change, and if so, to write and submit a patch. The ideas for what it would involve are in the linked email, quoted here: "... This is because the irq controller driver must, at irqdomain mapping time, decide whether to register the handler as handle_percpu_devid_irq (which interprets dev_id as a __percpu pointer and remaps it for the local cpu before invoking the driver's handler) or one of the other handlers that does not perform any percpu remapping. The right way for this to work would be for handle_irq_event_percpu to be responsible for the remapping, but do it conditionally on whether the irq was requested via request_irq or request_percpu_irq." Do you disagree with this assessment? Rich
On Wednesday, August 24, 2016 1:40:01 PM CEST Rich Felker wrote:
> On Wed, Aug 24, 2016 at 06:42:05PM +0200, Daniel Lezcano wrote:
> > > diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
> > > index 5677886..3210ca5 100644
> > > --- a/drivers/clocksource/Kconfig
> > > +++ b/drivers/clocksource/Kconfig
> > > @@ -407,6 +407,15 @@ config SYS_SUPPORTS_SH_TMU
> > >  config SYS_SUPPORTS_EM_STI
> > >  	bool
> > >
> > > +config CLKSRC_JCORE_PIT
> > > +	bool "J-Core PIT timer driver"
> > > +	depends on OF && (SUPERH || COMPILE_TEST)
> >
> > Even if this is correct, for the sake of consistency, it is preferable
> > to change it to:
> >
> > 	bool "J-Core PIT timer driver" if COMPILE_TEST
> > 	depends on SUPERH
> > 	select CLKSRC_OF
>
> Is this functionally equivalent? If so that's non-obvious to me. What
> about the dropped OF dependency? The intent is that the option should
> always be available for SUPERH targets using OF, otherwise only
> available with COMPILE_TEST.

I think your version is correct here, Daniel made a mistake with the
SUPERH dependency. Another way to write it would be

config CLKSRC_JCORE_PIT
	bool "J-Core PIT timer driver" if COMPILE_TEST && !JCORE
	depends on OF
	default JCORE

to make it impossible to disable when JCORE is set, but user-visible
on all other architectures, SUPERH or otherwise.

> > > +static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit)
> > > +{
> > > +	jcore_pit_disable(pit);
> > > +	__raw_writel(delta, pit->base + REG_THROT);
> > > +	__raw_writel(pit->enable_val, pit->base + REG_PITEN);
> > > +	return 0;
> >
> > Why do you need to use __raw_writel ?
> >
> > s/__raw_writel/writel/
>
> I actually tried multiple times to find good resources on policy for
> which form to prefer, but didn't have much luck. My understanding is
> that __raw_writel/__raw_readl always performs a native-endian
> load/store, whereas writel/readl behavior depends on cpu endianness
> and whether the arch declares that "pci bus" (that was the term I
> found in the source, not relevant here) io is endian-swapped or not.
> Can you give me a better explanation of why we might prefer one form
> or the other?

In general, a portable driver should use readl/writel because they
are defined to have the behavior that x86 provides, which is what
most developers understand best. Compared to __raw_readl/__raw_writel,
it guarantees:

- correct endianness (in most cases, exceptions see below)
- does not leak out of spinlocks
- accesses are word-sized; the compiler cannot split them into byte
  accesses, or merge subsequent accesses into larger words
- ordering against dma_map_*/dma_unmap_*/dma_sync_* is maintained

The __raw_* accessors are used as building blocks for
readl/outl/ioread32/ioread32_be/readl_relaxed/... and they can
usually be used safely on device RAM such as framebuffers but not
much else in portable code. Some architectures also use the __raw_*
accessors for MMIO registers, but that code is generally not portable.

Endianness is always more complicated than one thinks, and it's
handled differently across architectures, sometimes because of
hardware differences, sometimes for historic reasons that differ only
in the Linux implementation.

A large majority of devices are fixed to little-endian MMIO registers
(as recommended by PCI), so that is a good default.

Exceptions to this are:

* Some architectures (e.g. PowerPC) typically run big-endian code,
  and hardware designers decided to make the MMIO registers the same.
  In this case, you can generally use ioread32_be/iowrite32_be.

* Some devices have configurable endianness with a device specific
  register. Sometimes those registers are set to match the CPU
  endianness, but for Linux it tends to be better to just set them to
  little-endian and use readl/writel.

* Some drivers are configurable at HW synthesis time, and we have a
  mix of devices. If you have this, you usually need to support both
  ways in the driver with a helper function that uses ioread32 or
  ioread32_be depending on the configuration. This can be done using
  either compile-time or runtime configuration.

* Some SoCs have boot-time configurable endianness and enforce that
  MMIO registers and the CPU always use the same way. We are
  inconsistent in handling those; I'd recommend creating a header
  file for that SoC that defines vendor specific helper functions.

* Some (usually very old) SoCs have configurable endianness, but
  don't actually change the behavior of the SoC or the MMIO devices,
  but only flip the address lines on the memory interface. This is
  similar to the previous one, but has additional problems with FIFO
  registers. You generally cannot use portable drivers here.

Arnd
On Wed, Aug 24, 2016 at 10:01:08PM +0200, Arnd Bergmann wrote:
> On Wednesday, August 24, 2016 1:40:01 PM CEST Rich Felker wrote:
> > On Wed, Aug 24, 2016 at 06:42:05PM +0200, Daniel Lezcano wrote:
> > > > diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
> > > > index 5677886..3210ca5 100644
> > > > --- a/drivers/clocksource/Kconfig
> > > > +++ b/drivers/clocksource/Kconfig
> > > > @@ -407,6 +407,15 @@ config SYS_SUPPORTS_SH_TMU
> > > >  config SYS_SUPPORTS_EM_STI
> > > >  	bool
> > > >
> > > > +config CLKSRC_JCORE_PIT
> > > > +	bool "J-Core PIT timer driver"
> > > > +	depends on OF && (SUPERH || COMPILE_TEST)
> > >
> > > Even if this is correct, for the sake of consistency, it is preferable
> > > to change it to:
> > >
> > > 	bool "J-Core PIT timer driver" if COMPILE_TEST
> > > 	depends on SUPERH
> > > 	select CLKSRC_OF
> >
> > Is this functionally equivalent? If so that's non-obvious to me. What
> > about the dropped OF dependency? The intent is that the option should
> > always be available for SUPERH targets using OF, otherwise only
> > available with COMPILE_TEST.
>
> I think your version is correct here, Daniel made a mistake with the
> SUPERH dependency. Another way to write it would be
>
> config CLKSRC_JCORE_PIT
> 	bool "J-Core PIT timer driver" if COMPILE_TEST && !JCORE
> 	depends on OF
> 	default JCORE
>
> to make it impossible to disable when JCORE is set, but user-visible
> on all other architectures, SUPERH or otherwise.

There's no JCORE config item and in theory you don't even need to
pick CPU_J2 if you don't want SMP; CPU_SH2 should work. In practice
the legacy hw support needs to be converted over to device tree
still, but the goal has been to _avoid_ introducing soc/board/model
specific config stuff and move the arch to completely DT-based.

> > > > +static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit)
> > > > +{
> > > > +	jcore_pit_disable(pit);
> > > > +	__raw_writel(delta, pit->base + REG_THROT);
> > > > +	__raw_writel(pit->enable_val, pit->base + REG_PITEN);
> > > > +	return 0;
> > >
> > > Why do you need to use __raw_writel ?
> > >
> > > s/__raw_writel/writel/
> >
> > I actually tried multiple times to find good resources on policy for
> > which form to prefer, but didn't have much luck. My understanding is
> > that __raw_writel/__raw_readl always performs a native-endian
> > load/store, whereas writel/readl behavior depends on cpu endianness
> > and whether the arch declares that "pci bus" (that was the term I
> > found in the source, not relevant here) io is endian-swapped or not.
> > Can you give me a better explanation of why we might prefer one form
> > or the other?
>
> In general, a portable driver should use readl/writel because they
> are defined to have the behavior that x86 provides, which is what
> most developers understand best. Compared to __raw_readl/__raw_writel,
> it guarantees
>
> - correct endianness (in most cases, exceptions see below)
> - does not leak out of spinlocks

Not even normal non-volatile accesses can leak across spinlocks;
they're memory+compiler barriers.

> - accesses are word-sized, the compiler cannot split them into
>   byte accesses, or merge subsequent accesses into larger words

Based on my reading of the code that's also true for the raw ones. At
least GCC interprets volatile as disallowing split/combined
loads/stores when the hardware supports the nominal load/store size,
and this seems to agree with the intent of the C standard that
volatile be usable for memory-mapped devices.

> - ordering against dma_map_*/dma_unmap_*/dma_sync_* is maintained

Aren't they barriers too?

> The __raw_* accessors are used as building blocks for
> readl/outl/ioread32/ioread32_be/readl_relaxed/... and they can
> usually be used safely on device RAM such as framebuffers but
> not much else in portable code. Some architectures also use the
> __raw_* accessors for MMIO registers, but that code is generally
> not portable.

My concern is that, depending on some arcane defines, readl/writel
could start doing byte-reversal of values since the CPU is
big-endian. arch/sh seems to have them defined in such a way that
this doesn't happen, but I was hoping to avoid depending on that. If
you think it's safe to assume they do the right thing, though, I'm
okay with it. If anything breaks when we try a little-endian version
of the CPU/SoC later I can go back and fix it.

> Endianness is always more complicated than one thinks, and it's
> handled differently across architectures, sometimes because of
> hardware differences, sometimes for historic reasons that differ
> only in the Linux implementation.
>
> A large majority of devices are fixed to little-endian MMIO
> registers (as recommended by PCI), so that is a good default.
>
> Exceptions to this are:
>
> * Some architectures (e.g. PowerPC) typically run big-endian code,
>   and hardware designers decided to make the MMIO registers the
>   same. In this case, you can generally use ioread32_be/iowrite32_be.

This is something like our case, but the mmio registers are accessed
as 32-bit values and the endianness does not come into play. Sub-word
accesses just don't work at all for most of the mmio devices. This
will remain the same if/when we add little endian builds (there's
work on this on the jcore list).

Rich
On Wed, Aug 24, 2016 at 04:52:26PM -0400, Rich Felker wrote:
> On Wed, Aug 24, 2016 at 10:01:08PM +0200, Arnd Bergmann wrote:
> > On Wednesday, August 24, 2016 1:40:01 PM CEST Rich Felker wrote:
> > > I actually tried multiple times to find good resources on policy for
> > > which form to prefer, but didn't have much luck. My understanding is
> > > that __raw_writel/__raw_readl always performs a native-endian
> > > load/store, whereas writel/readl behavior depends on cpu endianness
> > > and whether the arch declares that "pci bus" (that was the term I
> > > found in the source, not relevant here) io is endian-swapped or not.
> > > Can you give me a better explanation of why we might prefer one form
> > > or the other?
> >
> > In general, a portable driver should use readl/writel because they
> > are defined to have the behavior that x86 provides, which is what
> > most developers understand best. Compared to __raw_readl/__raw_writel,
> > it guarantees
> >
> > - correct endianness (in most cases, exceptions see below)
> > - does not leak out of spinlocks
>
> Not even normal non-volatile accesses can leak across spinlocks;
> they're memory+compiler barriers.

Note that Arnd was talking in general, about cross-architectural
guarantees.

Across architectures, not all barriers are equivalent. On some
architectures, the barriers used by spinlocks don't necessarily order
accesses across disparate memory types (i.e. DRAM and IO), or might
not guarantee that the ordering is consistent to observers other than
(coherent) CPUs. Hence a portable driver should care about this
distinction.

If they're equivalent for you, you may as well use the portably
stronger form, as it's less likely to surprise others.

> > - accesses are word-sized, the compiler cannot split them into
> >   byte accesses, or merge subsequent accesses into larger words
>
> Based on my reading of the code that's also true for the raw ones. At
> least GCC interprets volatile as disallowing split/combined
> loads/stores when the hardware supports the nominal load/store size,
> and this seems to agree with the intent of the C standard that
> volatile be usable for memory-mapped devices.
>
> > - ordering against dma_map_*/dma_unmap_*/dma_sync_* is maintained
>
> Aren't they barriers too?

As above, not all barriers are equivalent. Your architecture may
provide stronger (or weaker) guarantees than others.

> > The __raw_* accessors are used as building blocks for
> > readl/outl/ioread32/ioread32_be/readl_relaxed/... and they can
> > usually be used safely on device RAM such as framebuffers but
> > not much else in portable code. Some architectures also use the
> > __raw_* accessors for MMIO registers, but that code is generally
> > not portable.
>
> My concern is that, depending on some arcane defines, readl/writel
> could start doing byte-reversal of values since the CPU is big-endian.
> arch/sh seems to have them defined in such a way that this doesn't
> happen, but I was hoping to avoid depending on that. If you think it's
> safe to assume they do the right thing, though, I'm okay with it. If
> anything breaks when we try a little-endian version of the CPU/SoC
> later I can go back and fix it.

I guess this really depends on the expectations you have w.r.t. the
endianness of devices vs the endianness of CPUs. If you expect the
two to always be matched, please add a comment in the driver to that
effect.

As Arnd and others mentioned, the vastly common case is that the
endianness of CPUs and devices is not fixed to each other, and
devices are typically little-endian.

> > Endianness is always more complicated than one thinks, and it's
> > handled differently across architectures, sometimes because of
> > hardware differences, sometimes for historic reasons that differ
> > only in the Linux implementation.
> >
> > A large majority of devices are fixed to little-endian MMIO
> > registers (as recommended by PCI), so that is a good default.
> >
> > Exceptions to this are:
> >
> > * Some architectures (e.g. PowerPC) typically run big-endian code,
> >   and hardware designers decided to make the MMIO registers the
> >   same. In this case, you can generally use ioread32_be/iowrite32_be.
>
> This is something like our case, but the mmio registers are accessed
> as 32-bit values and the endianness does not come into play.

This really depends on your endianness model, bus, etc. The above
sounds like BE32 rather than BE8, assuming I've understood correctly,
which is an unusual case nowadays.

Thanks,
Mark.
On Wed, Aug 24, 2016 at 10:22:13PM +0100, Mark Rutland wrote:
> On Wed, Aug 24, 2016 at 04:52:26PM -0400, Rich Felker wrote:
> > On Wed, Aug 24, 2016 at 10:01:08PM +0200, Arnd Bergmann wrote:
> > > On Wednesday, August 24, 2016 1:40:01 PM CEST Rich Felker wrote:
> > > > I actually tried multiple times to find good resources on policy for
> > > > which form to prefer, but didn't have much luck. My understanding is
> > > > that __raw_writel/__raw_readl always performs a native-endian
> > > > load/store, whereas writel/readl behavior depends on cpu endianness
> > > > and whether the arch declares that "pci bus" (that was the term I
> > > > found in the source, not relevant here) io is endian-swapped or not.
> > > > Can you give me a better explanation of why we might prefer one form
> > > > or the other?
> > >
> > > In general, a portable driver should use readl/writel because they
> > > are defined to have the behavior that x86 provides, which is what
> > > most developers understand best. Compared to __raw_readl/__raw_writel,
> > > it guarantees
> > >
> > > - correct endianness (in most cases, exceptions see below)
> > > - does not leak out of spinlocks
> >
> > Not even normal non-volatile accesses can leak across spinlocks;
> > they're memory+compiler barriers.
>
> Note that Arnd was talking in general, about cross-architectural
> guarantees.
>
> Across architectures, not all barriers are equivalent. On some
> architectures, the barriers used by spinlocks don't necessarily order
> accesses across disparate memory types (i.e. DRAM and IO), or might
> not guarantee that the ordering is consistent to observers other than
> (coherent) CPUs. Hence a portable driver should care about this
> distinction.
>
> If they're equivalent for you, you may as well use the portably
> stronger form, as it's less likely to surprise others.

*nod* Understood.

> > > - accesses are word-sized, the compiler cannot split them into
> > >   byte accesses, or merge subsequent accesses into larger words
> >
> > Based on my reading of the code that's also true for the raw ones. At
> > least GCC interprets volatile as disallowing split/combined
> > loads/stores when the hardware supports the nominal load/store size,
> > and this seems to agree with the intent of the C standard that
> > volatile be usable for memory-mapped devices.
> >
> > > - ordering against dma_map_*/dma_unmap_*/dma_sync_* is maintained
> >
> > Aren't they barriers too?
>
> As above, not all barriers are equivalent. Your architecture may
> provide stronger (or weaker) guarantees than others.

OK.

> > > The __raw_* accessors are used as building blocks for
> > > readl/outl/ioread32/ioread32_be/readl_relaxed/... and they can
> > > usually be used safely on device RAM such as framebuffers but
> > > not much else in portable code. Some architectures also use the
> > > __raw_* accessors for MMIO registers, but that code is generally
> > > not portable.
> >
> > My concern is that, depending on some arcane defines, readl/writel
> > could start doing byte-reversal of values since the CPU is big-endian.
> > arch/sh seems to have them defined in such a way that this doesn't
> > happen, but I was hoping to avoid depending on that. If you think it's
> > safe to assume they do the right thing, though, I'm okay with it. If
> > anything breaks when we try a little-endian version of the CPU/SoC
> > later I can go back and fix it.
>
> I guess this really depends on the expectations you have w.r.t. the
> endianness of devices vs the endianness of CPUs. If you expect the
> two to always be matched, please add a comment in the driver to that
> effect.
>
> As Arnd and others mentioned, the vastly common case is that the
> endianness of CPUs and devices is not fixed to each other, and
> devices are typically little-endian.

Since it's not used in other designs, I can't say anything meaningful
about how it hypothetically might be, but the intent on all current
and currently-planned systems is that 32-bit loads/stores be
performed using values that are meaningful as values on the cpu, i.e.
with no byte swapping regardless of cpu endianness. The current
hardware is designed as big-endian, and the proposed patches for
supporting little endian just involve flipping the low bits of the
address for 16- and 8-bit accesses (not relevant to mmio registers
that are only accessible in 32-bit units).

> > > Endianness is always more complicated than one thinks, and it's
> > > handled differently across architectures, sometimes because of
> > > hardware differences, sometimes for historic reasons that differ
> > > only in the Linux implementation.
> > >
> > > A large majority of devices are fixed to little-endian MMIO
> > > registers (as recommended by PCI), so that is a good default.
> > >
> > > Exceptions to this are:
> > >
> > > * Some architectures (e.g. PowerPC) typically run big-endian code,
> > >   and hardware designers decided to make the MMIO registers the
> > >   same. In this case, you can generally use ioread32_be/iowrite32_be.
> >
> > This is something like our case, but the mmio registers are accessed
> > as 32-bit values and the endianness does not come into play.
>
> This really depends on your endianness model, bus, etc. The above
> sounds like BE32 rather than BE8, assuming I've understood correctly,
> which is an unusual case nowadays.

I'm not familiar with those classifications, but from what I can
tell, BE32 describes it correctly. I'll see if I can get someone to
verify this. Is there a reason it's not widely used anymore? Perhaps
something related to supporting misaligned word-sized loads/stores?

Rich
On Wednesday, August 24, 2016 5:44:44 PM CEST Rich Felker wrote:
> On Wed, Aug 24, 2016 at 10:22:13PM +0100, Mark Rutland wrote:
> > On Wed, Aug 24, 2016 at 04:52:26PM -0400, Rich Felker wrote:
> > As Arnd and others mentioned, the vastly common case is that the
> > endianness of CPUs and devices is not fixed to each other, and
> > devices are typically little-endian.
>
> Since it's not used in other designs, I can't say anything meaningful
> about how it hypothetically might be, but the intent on all current
> and currently-planned systems is that 32-bit loads/stores be performed
> using values that are meaningful as values on the cpu, i.e. with no
> byte swapping regardless of cpu endianness. The current hardware is
> designed as big-endian, and the proposed patches for supporting little
> endian just involve flipping the low bits of the address for 16- and
> 8-bit accesses (not relevant to mmio registers that are only
> accessible in 32-bit units).
>
> > > > Exceptions to this are:
> > > >
> > > > * Some architectures (e.g. PowerPC) typically run big-endian code,
> > > >   and hardware designers decided to make the MMIO registers the
> > > >   same. In this case, you can generally use ioread32_be/iowrite32_be.
> > >
> > > This is something like our case, but the mmio registers are accessed
> > > as 32-bit values and the endianness does not come into play.
> >
> > This really depends on your endianness model, bus, etc. The above
> > sounds like BE32 rather than BE8, assuming I've understood correctly,
> > which is an unusual case nowadays.

Correct, which is the last case I had in my list, not the first one:

> > > * Some (usually very old) SoCs have configurable endianness, but
> > >   don't actually change the behavior of the SoC or the MMIO devices,
> > >   but only flip the address lines on the memory interface.
> > >   This is similar to the previous one, but has additional problems
> > >   with FIFO registers. You generally cannot use portable drivers
> > >   here.

With your description above, that would probably be called LE32,
given that the native ordering is big-endian, and the little-endian
mode is only approximated using address magic. ARM BE32 mode (e.g. on
IXP4xx) did it the other way round.

> I'm not familiar with those classifications, but from what I can tell,
> BE32 describes it correctly. I'll see if I can get someone to verify
> this. Is there a reason it's not widely used anymore? Perhaps
> something related to supporting misaligned word-sized loads/stores?

The main problem I see is that you can't easily use MMIO registers
that are not 32-bit wide -- you end up having to flip not just the
register contents but also the lower bits of the address in order to
reach the right contents of the register. If you can guarantee that
all registers are 32-bit wide, you can create a header file with your
own MMIO accessors that wrap around readl/writel or
ioread32_be/iowrite32_be depending on the kernel endianness, and
don't provide readb/writeb or readw/writew wrappers.

However, what everyone else does these days is to flip the register
values in load/store instructions when the opposite endianness is
selected, or to have two sets of load/store instructions and leave it
up to the compiler to pick which one to use (probably not an option
for you, since you use an existing instruction set). If you do that,
all portable device drivers should work fine on your architecture.

Arnd
Hi,

On Wed, Aug 24, 2016 at 03:20:09PM -0400, Rich Felker wrote:
> On Wed, Aug 24, 2016 at 08:01:52PM +0100, Marc Zyngier wrote:
> > On Wed, 24 Aug 2016 13:40:01 -0400
> > Rich Felker <dalias@libc.org> wrote:
> >
> > [...]
> >
> > > > IIUC, there is a problem with the interrupt controller where the per irq
> > > > line are not working correctly. Is that correct ?
> > >
> > > I don't think that's a correct characterization. Rather the percpu
> > > infrastructure just means something completely different from what you
> > > would expect it to mean. It has nothing to do with the hardware but
> > > rather with kernel-internal choice of whether to do percpu devid
> > > mapping inside the irq infrastructure, and the choice at the
> > > irq-requester side of whether to do this is required to match the
> > > irqchip driver's choice. I explained this better in another email
> > > which I could dig up if necessary, but the essence is that
> > > request_percpu_irq is a misnamed and unusably broken API.
>
> For reference, here's the thread I was referring to:
>
> https://lkml.org/lkml/2016/7/15/585

That reply seems to imply that the percpu_irq functionality is only
there to shift a percpu pointer, and that any IRQ can be treated
either as a 'regular' interrupt or a percpu one. That's not really
the case, and I'm worried that this implies that the way your
hardware and driver behaves violates (implicit) assumptions made by
core code. Please see below for more details and questions.

The percpu IRQ infrastructure was added to handle things like SPI
(Shared Peripheral Interrupt) vs PPI (Private Peripheral Interrupt)
on ARM GICs (Generic Interrupt Controllers), where the two classes of
interrupt are distinct, and the latter is very much a per-cpu
resource in hardware, while the former is not. In the GIC, particular
IRQ IDs are reserved for each class, making them possible to
distinguish easily.

Consider an ARM system with N CPUs, a GIC, some CPU-local timers
(wired as PPIs), and some other peripherals (wired as SPIs).

For each SPI, it is as if there is a single wire into the GIC. An SPI
can be programmatically configured to be routed to any arbitrary CPU
(or several, if one really wants that), but it's fundamentally a
single interrupt, and there is a single GIC state machine for it
(even if it's routed to multiple CPUs simultaneously). Any state for
this class of interrupt can/must be protected by some global lock or
other synchronisation mechanism, as two CPUs can't take the same SPI
simultaneously, though the affinity might be changed at any point.

For each PPI, it is as if there are N wires into the GIC, each
attached to a CPU-affine device. It's effectively a group of N
interrupts for a particular purpose, sharing the same ID. Each wire
is routed to a particular CPU (which cannot be changed), and there is
an independent GIC state machine per-cpu (accessed through registers
which are banked per-cpu). Any state for this class of interrupt
can/must be protected by some per-cpu lock or other synchronisation
mechanism, as the instances of the interrupt can be taken on multiple
CPUs simultaneously.

The above is roughly what Linux understands today as the distinction
between a 'regular' and percpu interrupt. As above, you cannot freely
treat either as the other -- it is a hardware property. If you took
the same regular interrupt on different CPUs simultaneously, the core
IRQ system is liable to be confused, as it does not expect this to
happen -- e.g. if for some reason we had some global lock or
accounting data structure.

Note that you could have some percpu resource using N SPIs. That
would not be a percpu irq, even if you only want to handle each
interrupt on a particular CPU.

With contrast to the above, can you explain how your interrupts
behave? Is the routing to a CPU or CPUs fixed? Are there independent
state machines per-cpu for your timer interrupt in the interrupt
controller hardware?

> > Or just one that simply doesn't fit your needs, because other
> > architectures have different semantics than the ones you take for
> > granted?
>
> I don't think so. The choice of whether to have the irq layer or the
> driver's irq handler be responsible for handling a percpu pointer has
> nothing to do with the hardware.

As above, percpu interrupts are not just about shifting pointers.
Using the correct class is critical to continued correct operation.

> Perhaps the intent was that the irqchip driver always knows whether a
> given hwirq[-range] is associated with per-cpu events or global events
> for which it doesn't matter what cpu they're delivered on.

So far, the assumption has been that the irqchip (driver) knows
whether a particular hwirq is regular or percpu.

> In this case, the situations where you may want percpu dev_id mapping
> line up with some property of the hardware. However that need not be
> the case, and it's not when the choice of irq is programmable.

I guess this depends on the behaviour of your HW. Does the timer
interrupt behave like the GIC PPI I explained above? I assume others
behave like the GIC SPI? What happens if you route one of those to
multiple CPUs?

If the distinction largely matches SPI/PPI, other than the lack of a
fixed ID, there are ways we can solve this, e.g. annotating percpu
interrupts in the DT with a flag, and deferring their initialisation
to the first request.

> > > > Regarding Marc Zyngier comments about the irq controller driver being
> > > > almost empty, I'm wondering if something in the irq controller driver
> > > > which shouldn't be added before submitting this timer driver with SMP
> > > > support (eg. irq domain ?).
> > >
> > > I don't think so. At most I could make the driver hard-code the percpu
> > > devid model for certain irqs, but that _does not reflect_ anything
> > > about the hardware. Rather it just reflects bad kernel internals. It
> >
> > I'd appreciate it if instead of ranting about how broken the kernel is,
> > you'd submit a patch fixing it, since you seem to have spotted
> > something that we haven't in several years of using that code on a
> > couple of ARM-related platforms.
>
> I didn't intend for this to be a rant. I'm not demanding that it be
> changed; I'm only objecting to being asked to make the driver use a
> framework that it doesn't need and that can't model what needs to be
> done. But I'm happy to discuss whether you would be open to such a
> change, and if so, to write and submit a patch. The ideas for what it
> would involve are in the linked email, quoted here:
>
> "... This is because the irq controller driver must, at irqdomain
> mapping time, decide whether to register the handler as
> handle_percpu_devid_irq (which interprets dev_id as a __percpu
> pointer and remaps it for the local cpu before invoking the
> driver's handler) or one of the other handlers that does not
> perform any percpu remapping.
>
> The right way for this to work would be for
> handle_irq_event_percpu to be responsible for the remapping, but
> do it conditionally on whether the irq was requested via
> request_irq or request_percpu_irq."

As above, I don't think that this is quite the right approach, as
this allows one to erroneously handle distinct classes of interrupt
as one another.

I agree that in the case that the percpu-ness of a hwirq number is
not a fixed property of the interrupt-controller, we may need to
defer initialisation until the request (where we can query the DT,
say).

Thanks,
Mark.
On Wed, Aug 24, 2016 at 05:44:44PM -0400, Rich Felker wrote: > On Wed, Aug 24, 2016 at 10:22:13PM +0100, Mark Rutland wrote: > > On Wed, Aug 24, 2016 at 04:52:26PM -0400, Rich Felker wrote: > > > On Wed, Aug 24, 2016 at 10:01:08PM +0200, Arnd Bergmann wrote: > > > > On Wednesday, August 24, 2016 1:40:01 PM CEST Rich Felker wrote: > > > > The __raw_* accessors are used as building blocks for > > > > readl/outl/ioread32/ioread32_be/readl_relaxed/... and they can > > > > usually be used safely on device RAM such as framebuffers but > > > > not much else in portable code. Some architectures also use the > > > > __raw_* accessors for MMIO registers, but that code is generally > > > > not portable. > > > > > > My concern is that, depending on some arcane defines, readl/writel > > > could start doing byte-reversal of values since the CPU is big-endian. > > > arch/sh seems to have them defined in such a way that this doesn't > > > happen, but I was hoping to avoid depending on that. If you think it's > > > safe to assume they do the right thing, though, I'm okay with it. If > > > anything breaks when we try a little-endian version of the CPU/SoC > > > later I can go back and fix it. > > > > I guess this really depends on the expectations you have w.r.t. the endianness > > of devices vs the endianness of CPUs. If you expect the two to always be > > matched, please add a comment in the driver to that effect. > > > > As Arnd and others mentioned, the vastly common case is that the endianness of > > CPUs and devices is not fixed to each other, and devices are typically > > little-endian. > > Since it's not used in other designs, I can't say anything meaningful > about how it hypothetically might be, but the intent on all current > and currently-planned systems is that 32-bit loads/stores be performed > using values that are meaningful as values on the cpu, i.e. with no > byte swapping regardless of cpu endianness. 
The current hardware is > designed as big-endian, and the proposed patches for supporting little > endian just involve flipping the low bits of the address for 16- and > 8-bit accesses (not relevant to mmio registers that are only > accessible in 32-bit units). Ok. Generally speaking, it's preferable (to the Linux community) that devices have some known fixed endianness, and that systems behave in a BE8 fashion (I'll clarify that below per your prior question), i.e. not just flipping the low bits of addresses. That allows common infrastructure like readl/writel to be used consistently and reliably, regardless of what peripherals are integrated into the system, and regardless of the endianness of the CPUs. The kernel has to know which endianness it is (for other reasons at least), and in general the endianness swap that may be required in some cases is less of a performance cost than any infrastructure required to handle varied integration possibilities (e.g. LE, BE, 'native'). It will also avoid having to modify each and every driver that already uses the existing infrastructure... > > > > Endianness is always more complicated than one thinks, and it's > > > > handled differently across architectures, sometimes because of > > > > hardware differences, sometimes for historic reasons that differ > > > > only in the Linux implementation. > > > > > > > > A large majority of devices are fixed to little-endian MMIO > > > > registers (as recommended by PCI), so that is a good default. > > > > > > > > Exceptions to this are: > > > > > > > > * Some architectures (e.g. PowerPC) typically run big-endian code, > > > > and hardware designers decided to make the MMIO registers the > > > > same. In this case, you can generally use ioread32_be/iowrite32_be. > > > > > > This is something like our case, but the mmio registers are accessed > > > as 32-bit values and the endianness does not come into play. > > This really depends on your endianness model, bus, etc.
The above sounds like > > BE32 rather than BE8, assuming I've understood correctly, which is an unusual > > case nowadays. > > I'm not familiar with those classifications, but from what I can tell, > BE32 describes it correctly. I'm not sure if it's an ARM term or a more general one. See [1] for a description. Per your description of flipping address lines, it sounds like you're (planning on) implementing BE32. > I'll see if I can get someone to verify this. Is there a reason it's not > widely used anymore? Perhaps something related to supporting misaligned > word-sized loads/stores? For one thing, BE8 is self-consistent as you scale to larger word sizes, rather than having to be special-cased at that boundary (i.e. one always has to endian swap rather than only swap the 32-bit halves of a 64-bit load with BE32). BE8 is generally seen as the 'true' big-endian, with BE32 being a particular 'middle-endian' variant [2]. Others are better equipped to provide rationale and/or war stories... Thanks, Mark. [1] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html [2] https://en.wikipedia.org/wiki/Endianness#Middle-endian
Rich, On Wed, 24 Aug 2016, Rich Felker wrote: > I don't think that's a correct characterization. Rather the percpu > infrastructure just means something completely different from what you > would expect it to mean. It has nothing to do with the hardware but > rather with kernel-internal choice of whether to do percpu devid > mapping inside the irq infrastructure, and the choice at the > irq-requester side of whether to do this is required to match the > irqchip driver's choice. I explained this better in another email > which I could dig up if necessary, but the essence is that > request_percpu_irq is a misnamed and unusably broken API. I slowly get tired of your permanent ranting about misnamed, misdesigned and unusable code in the kernel. The percpu infrastructure was designed explicitly to reflect the hardware and is not a random kernel-internal choice. It's there to handle the situation where a single interrupt number is actually available on each CPU. ARM has a very clear distinction between PPI (per-processor interrupts) and SPI (shared peripheral interrupts). So we end up with the following interrupt space:

  IRQ    CPU0   CPU1
   0     PPI    PPI
  ...    ...    ...
  15     PPI    PPI
  16     SPI    SPI
  ...
   N     SPI    SPI

The fundamental difference between PPI and SPI is that PPIs originate from the CPU itself or are injected as IPIs targeted to a particular CPU. Each CPU has its own interrupt controller to mask/ack/... PPIs. The SPIs have a shared interrupt controller and therefore are globally shared between all cpus. The percpu infrastructure reflects exactly this scenario and allows us to handle PPIs with the same irq number (which makes a lot of sense) separately on each CPU. So how does this not reflect the hardware? If your particular hardware has the old scheme of separate interrupt numbers for per cpu interrupts, then you can simply use the normal interrupt scheme and request a separate interrupt per cpu.
Just because something does not fit your needs and does not match your SH blinkered worldview does not make it misnamed, misdesigned and unusable. If you want to work with this community you really should start to talk constructively with us and stop this unjustified ranting attitude. Thanks, tglx
On Wednesday, August 24, 2016 11:57:08 PM CEST Arnd Bergmann wrote: > > I'm not familiar with those classifications, but from what I can tell, > > BE32 describes it correctly. I'll see if I can get someone to verify > > this. Is there a reason it's not widely used anymore? Perhaps > > something related to supporting misaligned word-sized loads/stores? > > The main problem I see is that you can't easily use MMIO registers that > are not 32-bit wide -- you end up having to flip not just the register > contents but also the lower bits of the address in order to reach > the right contents of the register. > Actually there is a much more serious problem with BE32/LE32 mode: doing unaligned access to RAM is a huge pain at the HW level, and also when emulating it in the kernel on CPUs that require aligned access. In ARM THUMB2 mode, you can even have unaligned 32-bit instructions. Arnd
On Thu, Aug 25, 2016 at 10:07:08AM +0200, Thomas Gleixner wrote: > Rich, > > On Wed, 24 Aug 2016, Rich Felker wrote: > > I don't think that's a correct characterization. Rather the percpu > > infrastructure just means something completely different from what you > > would expect it to mean. It has nothing to do with the hardware but > > rather with kernel-internal choice of whether to do percpu devid > > mapping inside the irq infrastructure, and the choice at the > > irq-requester side of whether to do this is required to match the > > irqchip driver's choice. I explained this better in another email > > which I could dig up if necessary, but the essence is that > > request_percpu_irq is a misnamed and unusably broken API. > > I slowly get tired of your permanent ranting about misnamed, misdesigned > and unusable code in the kernel. This is the only instance I'm aware of, and my intent was not to rant. If you're aware of other cases where I've done this and it's come across as abrasive to you or others, I'd appreciate it if you could point them out so I can try to avoid doing so in the future. As for this topic, what happened is that, after the first and second time of expressing confusion about how the infrastructure did not make sense and whether I should even be using it, I did not get anything resembling an explanation of why it's the way it is or why I might be wrong in thinking it's a bug, and so I went forward with the assumption that it was just a bug. Now that Mark Rutland has explained it well (and with your additional explanation below in your email), I see what the motivation was, but I still think it could be done in a less-confusing and more-consistent way that doesn't assume ARM-like irq architecture. > If your particular hardware has the old scheme of separate interrupt numbers > for per cpu interrupts, then you can simply use the normal interrupt scheme > and request a separate interrupt per cpu.
Nominally it uses the same range of hardware interrupt numbers for all (presently both) cpus, but some of them get delivered to a specific cpu associated with the event (presently, IPI and timer; IPI is on a fixed number at synthesis time but timer is runtime configurable) while others are conceptually deliverable to either cpu (presently only delivered to cpu0, but that's treated as an implementation detail). It currently works requesting the irq with flags that ensure the handler runs on the same cpu it was delivered on, without using any other percpu irq framework. If you have concerns about ways this could break and want me to make the drivers do something else, I'm open to suggestions. Rich
On Thu, 25 Aug 2016, Rich Felker wrote: > assumption that it was just a bug. Now that Mark Rutland has explained > it well (and with your additional explanation below in your email), I > see what the motivation was, but I still think it could be done in a > less-confusing and more-consistent way that doesn't assume ARM-like > irq architecture. It's not only ARM. Some MIPS Octeon stuff has the same layout and requirements to use a single irq number for interrupts which are delivered on a per cpu basis. Patches are welcome :) > > If your particular hardware has the old scheme of separate interrupt numbers > > for per cpu interrupts, then you can simply use the normal interrupt scheme > > and request a separate interrupt per cpu. > > Nominally it uses the same range of hardware interrupt numbers for all > (presently both) cpus, but some of them get delivered to a specific > cpu associated with the event (presently, IPI and timer; IPI is on a > fixed number at synthesis time but timer is runtime configurable) > while others are conceptually deliverable to either cpu (presently > only delivered to cpu0, but that's treated as an implementation > detail). If I understand correctly, then this is the classic scheme:

  CPU0   IPI0   IRQ-N
  CPU1   IPI1   IRQ-M

These and the timers or whatever are strictly per cpu and therefore not routable. Regular device interrupts can be routed to any CPU by setting the affinity. Correct? > It currently works requesting the irq with flags that ensure the > handler runs on the same cpu it was delivered on, without using any > other percpu irq framework. Which special flag are you referring to? I'm not aware of one. IRQF_PER_CPU is just telling the core that this is a non-routable per cpu interrupt. It's excluded from affinity setting and also on cpu hot unplug the per cpu interrupts are not touched and nothing tries to reroute them to one of the still online cpus. Regarding the interrupt handler.
It runs on the CPU on which the interrupt is delivered and there is nothing you can influence with a flag. > If you have concerns about ways this could break and want me to make the > drivers do something else, I'm open to suggestions. If I understand the hardware halfway right, then using request_irq() with IRQF_PER_CPU for these special interrupts is completely correct. The handler either uses this_cpu_xxx() for accessing the per cpu data related to the interrupt or you can hand in a percpu pointer as dev_id to request_irq() which then is handed to the interrupt function as a cookie. Hope that helps. Thanks, tglx
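Concretely, tglx's suggestion amounts to something like the sketch below. This is a hedged sketch, not a tested patch: it reuses the jcore_timer_interrupt and jcore_pit_percpu names from the posted driver and assumes the alloc_percpu() allocation shown there; the oneshot handling follows the commit message's description of disabling the timer when it fires.

```c
/* Sketch only: the handler resolves this cpu's instance from the
 * percpu dev_id cookie handed to request_irq(). */
static irqreturn_t jcore_timer_interrupt(int irq, void *dev_id)
{
	struct jcore_pit *pit = this_cpu_ptr((struct jcore_pit __percpu *)dev_id);

	/* per the commit message: in oneshot mode just disable the timer */
	if (clockevent_state_oneshot(&pit->ced))
		jcore_pit_disable(pit);

	pit->ced.event_handler(&pit->ced);
	return IRQ_HANDLED;
}

/* ...and registration marks the irq per-cpu (non-routable, exempt from
 * affinity changes and hotplug rerouting) and passes the percpu base
 * as the cookie: */
err = request_irq(pit_irq, jcore_timer_interrupt,
		  IRQF_TIMER | IRQF_PER_CPU, "jcore_pit",
		  jcore_pit_percpu);
```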
On Thu, Aug 25, 2016 at 10:56:50AM -0400, Rich Felker wrote: > On Thu, Aug 25, 2016 at 10:07:08AM +0200, Thomas Gleixner wrote: > > On Wed, 24 Aug 2016, Rich Felker wrote: > As for this topic, what happened is that, after the first and second > time of expressing confusion about how the infrastructure did not make > sense and whether I should even be using it, I did not get anything > resembling an explanation of why it's the way it is or why I might be > wrong in thinking it's a bug, and so I went forward with the > assumption that it was just a bug. Now that Mark Rutland has explained > it well (and with your additional explanation below in your email), I > see what the motivation was, but I still think it could be done in a > less-confusing and more-consistent way that doesn't assume ARM-like > irq architecture. The fun thing is that right until the point you sent this patch, it was consistent for all hardware that we support and were aware of. For others, it is your hardware that is the confusing case. ;) In future, it's better to state 'this doesn't seem to match my hardware' rather than 'this is misdesigned'. Ideally, with a description of how your hardware works, as that's the thing no-one else is familiar with. > > If your particular hardware has the old scheme of separate interrupt numbers > > for per cpu interrupts, then you can simply use the normal interrupt scheme > > and request a separate interrupt per cpu. > > Nominally it uses the same range of hardware interrupt numbers for all > (presently both) cpus, but some of them get delivered to a specific > cpu associated with the event (presently, IPI and timer; IPI is on a > fixed number at synthesis time but timer is runtime configurable) > while others are conceptually deliverable to either cpu (presently > only delivered to cpu0, but that's treated as an implementation > detail).
Given you say it's delivered to the CPU associated with the event (and you have different PIT bases per-cpu), it sounds like your timer interrupt is percpu, it's just that the hwirq number can be chosen by software. > It currently works requesting the irq with flags that ensure the > handler runs on the same cpu it was delivered on, without using any > other percpu irq framework. If you have concerns about ways this could > break and want me to make the drivers do something else, I'm open to > suggestions. As I suggested, I don't think that this is right, and you need some mechanism to describe to the kernel that the interrupt is percpu (e.g. a flag in the interrupt-specifier in DT). Thanks, Mark.
On Thu, Aug 25, 2016 at 05:41:29PM +0200, Thomas Gleixner wrote: > On Thu, 25 Aug 2016, Rich Felker wrote: > > assumption that it was just a bug. Now that Mark Rutland has explained > > it well (and with your additional explanation below in your email), I > > see what the motivation was, but I still think it could be done in a > > less-confusing and more-consistent way that doesn't assume ARM-like > > irq architecture. > > It's not only ARM. Some MIPS Octeon stuff has the same layout and requirements > to use a single irq number for interrupts which are delivered on a per cpu > basis. > > Patches are welcome :) I'm not opposed to working on changes, but based on your below comments I think maybe this (percpu request) is just infrastructure I shouldn't be using. I think the source of my frustration was the repeated (maybe by different people; I don't remember now) suggestions that I use it even when I found that it didn't currently match well with the hardware. > > > If your particular hardware has the old scheme of separate interrupt numbers > > > for per cpu interrupts, then you can simply use the normal interrupt scheme > > > and request a separate interrupt per cpu. > > > > Nominally it uses the same range of hardware interrupt numbers for all > > (presently both) cpus, but some of them get delivered to a specific > > cpu associated with the event (presently, IPI and timer; IPI is on a > > fixed number at synthesis time but timer is runtime configurable) > > while others are conceptually deliverable to either cpu (presently > > only delivered to cpu0, but that's treated as an implementation > > detail). > > If I understand correctly, then this is the classic scheme: > > CPU0 IPI0 IRQ-N > CPU1 IPI1 IRQ-M > > These and the timers or whatever are strictly per cpu and therefore not routable. > Regular device interrupts can be routed to any CPU by setting the > affinity. Correct?
IPI generates hw irq 97 on whichever cpu it's targeted at, and the timer generates whatever hw irq you program it to generate (by convention, currently 72) on the cpu associated with the timer that expired. Treating "cpu0's irq 97" and "cpu1's irq 97" as separate hw irq numbers would be possible at the kernel level (just by using the cpu id as part of the logical hw irq number) but this would require lots of (imo useless) infrastructure/overhead and hard-coded assumptions about which irq numbers are used for percpu events, and it would not model the hardware well. > > It currently works requesting the irq with flags that ensure the > > handler runs on the same cpu it was delivered on, without using any > > other percpu irq framework. > > Which special flag are you referring to? I'm not aware of one. > > IRQF_PER_CPU is just telling the core that this is a non routable per cpu > interrupt. It's excluded from affinity setting and also on cpu hot unplug the > per cpu interrupts are not touched and nothing tries to reroute them to one of > the still online cpus. > > Regarding the interrupt handler. It runs on the CPU on which the interrupt is > delivered and there is nothing you can influence with a flag. OK, I was not clear on whether there was such a guarantee in general but knew there must be one for IRQF_TIMER or IRQF_PER_CPU. (Without knowing the system doesn't do this, it's possible that the softirq/tasklet stuff could migrate handling to a different cpu than the hardware irq was delivered on.) > > If you have concerns about ways this could break and want me to make the > > drivers do something else, I'm open to suggestions. > > If I understand the hardware halfways right, then using request_irq() with > IRQF_PER_CPU for these special interrupts is completely correct. 
> > The handler either uses this_cpu_xxx() for accessing the per cpu data related > to the interrupt or you can hand in a percpu pointer as dev_id to > request_irq() which then is handed to the interrupt function as a cookie. Yes, that's exactly what my driver is doing now, and I'm happy to leave it that way. Can we move forward with that? If so I'll make the other changes requested and submit a new version of the patch. Rich
On Thu, Aug 25, 2016 at 05:38:06PM +0100, Mark Rutland wrote: > On Thu, Aug 25, 2016 at 10:56:50AM -0400, Rich Felker wrote: > > On Thu, Aug 25, 2016 at 10:07:08AM +0200, Thomas Gleixner wrote: > > > On Wed, 24 Aug 2016, Rich Felker wrote: > > As for this topic, what happened is that, after the first and second > > time of expressing confusion about how the infrastructure did not make > > sense and whether I should even be using it, I did not get anything > > resembling an explanation of why it's the way it is or why I might be > > wrong in thinking it's a bug, and so I went forward with the > > assumption that it was just a bug. Now that Mark Rutland has explained > > it well (and with your additional explanation below in your email), I > > see what the motivation was, but I still think it could be done in a > > less-confusing and more-consistent way that doesn't assume ARM-like > > irq architecture. > > The fun thing is that right until the point you sent this patch, it was > consistent for all hardware that we support and were aware of. For others, it > is your hardware that is the confusing case. ;) > > In future, it's better to state 'this doesn't seem to match my hardware' rather > than 'this is misdesigned'. Ideally, with a description of how your hardware > works, as that's the thing no-one else is familiar with. OK. What I meant is "it's not sufficiently general". I usually come at these things not from a particular preconception of how it is/should-be done on some archs I'm familiar with, but from minimizing unnecessary (i.e. not beneficial to performance or correctness or simplicity) assumptions at the subsystem level about how hardware could work. > > > If your particular hardware has the old scheme of separate interrupt numbers > > > for per cpu interrupts, then you can simply use the normal interrupt scheme > > > and request a separate interrupt per cpu.
> > Nominally it uses the same range of hardware interrupt numbers for all > > (presently both) cpus, but some of them get delivered to a specific > > cpu associated with the event (presently, IPI and timer; IPI is on a > > fixed number at synthesis time but timer is runtime configurable) > > while others are conceptually deliverable to either cpu (presently > > only delivered to cpu0, but that's treated as an implementation > > detail). > > Given you say it's delivered to the CPU associated with the event (and you have > different PIT bases per-cpu), it sounds like your timer interrupt is percpu, > it's just that the hwirq number can be chosen by software. It's what I would call percpu in the hardware, but I'm not convinced that the Linux irq subsystem's "percpu" stuff models it in a way that fits the hw, nor that it's in any way necessary. > > It currently works requesting the irq with flags that ensure the > > handler runs on the same cpu it was delivered on, without using any > > other percpu irq framework. If you have concerns about ways this could > > break and want me to make the drivers do something else, I'm open to > > suggestions. > > As I suggested, I don't think that this is right, and you need some mechanism > to describe to the kernel that the interrupt is percpu (e.g. a flag in the > interrupt-specifier in DT). Thomas seemed to think it's okay as-is. Can you describe what you expect could go wrong by using request_irq rather than the ARM-style percpu irq framework? Rich
On Thu, Aug 25, 2016 at 01:51:35PM -0400, Rich Felker wrote: > On Thu, Aug 25, 2016 at 05:38:06PM +0100, Mark Rutland wrote: > > On Thu, Aug 25, 2016 at 10:56:50AM -0400, Rich Felker wrote: > > > On Thu, Aug 25, 2016 at 10:07:08AM +0200, Thomas Gleixner wrote: > > > Nominally it uses the same range of hardware interrupt numbers for all > > > (presently both) cpus, but some of them get delivered to a specific > > > cpu associated with the event (presently, IPI and timer; IPI is on a > > > fixed number at synthesis time but timer is runtime configurable) > > > while others are conceptually deliverable to either cpu (presently > > > only delivered to cpu0, but that's treated as an implementation > > > detail). > > > > Given you say it's delivered to the CPU associated with the event (and you have > > different PIT bases per-cpu), it sounds like your timer interrupt is percpu, > > it's just that the hwirq number can be chosen by software. > > It's what I would call percpu in the hardware, but I'm not convinced > that the Linux irq subsystem's "percpu" stuff models it in a way > that fits the hw, nor that it's in any way necessary. My understanding was that you used the same hwirq number to handle interrupts from per-cpu resources being delivered to their relevant CPUs, independently of each other. That in my mind is a perfect match. The only difference, as I've stated a number of times, seems to be that you can choose the hwirq number from software. > > > It currently works requesting the irq with flags that ensure the > > > handler runs on the same cpu it was delivered on, without using any > > > other percpu irq framework. If you have concerns about ways this could > > > break and want me to make the drivers do something else, I'm open to > > > suggestions. > > > > As I suggested, I don't think that this is right, and you need some mechanism > > to describe to the kernel that the interrupt is percpu (e.g. a flag in the > > interrupt-specifier in DT). 
> > Thomas seemed to think it's okay as-is. Can you describe what you > expect could go wrong by using request_irq rather than the ARM-style > percpu irq framework? The percpu irq code is designed to expect a hwirq number being shared by banked, cpu-local interrupts, the regular request_irq code is not. Even if the latter happens to work today for your use-case, that is not by design. Relying on non-deliberate properties of request_irq makes it far harder for the generic code to be altered in future, with global vs percpu locking, synchronisation, accounting, etc being broken. Thanks, Mark.
On Thu, Aug 25, 2016 at 07:21:13PM +0100, Mark Rutland wrote: > On Thu, Aug 25, 2016 at 01:51:35PM -0400, Rich Felker wrote: > > On Thu, Aug 25, 2016 at 05:38:06PM +0100, Mark Rutland wrote: > > > On Thu, Aug 25, 2016 at 10:56:50AM -0400, Rich Felker wrote: > > > > On Thu, Aug 25, 2016 at 10:07:08AM +0200, Thomas Gleixner wrote: > > > > Nominally it uses the same range of hardware interrupt numbers for all > > > > (presently both) cpus, but some of them get delivered to a specific > > > > cpu associated with the event (presently, IPI and timer; IPI is on a > > > > fixed number at synthesis time but timer is runtime configurable) > > > > while others are conceptually deliverable to either cpu (presently > > > > only delivered to cpu0, but that's treated as an implementation > > > > detail). > > > > > > Given you say it's delivered to the CPU associated with the event (and you have > > > different PIT bases per-cpu), it sounds like your timer interrupt is percpu, > > > it's just that the hwirq number can be chosen by software. > > > > It's what I would call percpu in the hardware, but I'm not convinced > > that the Linux irq subsystem's "percpu" stuff models it in a way > > that fits the hw, nor that it's in any way necessary. > > My understanding was that you used the same hwirq number to handle interrupts > from per-cpu resources being delivered to their relevant CPUs, independently of > each other. > > That in my mind is a perfect match. > > The only difference, as I've stated a number of times, seems to be that you can > choose the hwirq number from software. I don't understand where you're getting that from. The timer irq number can be chosen from software, but by convention the device tree provides an otherwise-unoccupied irq number to be used. The IPI irq cannot be chosen by software; it's hard-coded at synthesis time. Either way, though, I don't see how being able to choose the number from software would impact the situation. 
The reason I don't think the current percpu irq request framework is well-suited is that, unlike ARM, we don't have any particular set of irq numbers that are designated as percpu vs global. In theory the DT could be extended to provide that information somehow, but it's not information you need to actually be able to use the hardware; if it were added, it would just be satisfying the Linux irq subsystem's (imo gratuitous) desire to know. From my perspective, a hwirq being percpu is logically a property that only the driver for the attached hardware needs to be aware of. In fact the drivers using the percpu request already encode this, but there's an unnecessary requirement that the irqchip driver also have this knowledge. > > > > It currently works requesting the irq with flags that ensure the > > > > handler runs on the same cpu it was delivered on, without using any > > > > other percpu irq framework. If you have concerns about ways this could > > > > break and want me to make the drivers do something else, I'm open to > > > > suggestions. > > > > > > As I suggested, I don't think that this is right, and you need some mechanism > > > to describe to the kernel that the interrupt is percpu (e.g. a flag in the > > > interrupt-specifier in DT). > > > > Thomas seemed to think it's okay as-is. Can you describe what you > > expect could go wrong by using request_irq rather than the ARM-style > > percpu irq framework? > > The percpu irq code is designed to expect a hwirq number being shared by > banked, cpu-local interrupts, the regular request_irq code is not. Even if the > latter happens to work today for your use-case, that is not by design. > > Relying on non-deliberate properties of request_irq makes it far harder for the > generic code to be altered in future, with global vs percpu locking, > synchronisation, accounting, etc being broken. 
I can see how the irq subsystem wrongly thinking an irq is percpu when it's actually global could cause problems (missing locking, etc.) but I fail to see how the opposite could. In an abstract sense, percpu irqs are just a special case of global ones. The general infrastructure has no reason to care whether it was delivered on cpuN because the event was associated with cpuN, or because cpuN happened to be where the hw decided to deliver it. Only the actual driver/handler needs to know that the cpu on which it was delivered is meaningful. Rich
On 08/24/2016 07:40 PM, Rich Felker wrote:

[ ... ]

>>> +config CLKSRC_JCORE_PIT
>>> +	bool "J-Core PIT timer driver"
>>> +	depends on OF && (SUPERH || COMPILE_TEST)
>>
>> Even if this is correct, for the sake of consistency, it is preferable
>> to change it to:
>>
>>	bool "J-Core PIT timer driver" if COMPILE_TEST
>>	depends on SUPERH
>>	select CLKSRC_OF
>
> Is this functionally equivalent? If so that's non-obvious to me. What
> about the dropped OF dependency? The intent is that the option should
> always be available for SUPERH targets using OF, otherwise only
> available with COMPILE_TEST.

It is not equivalent, but it is consistent with the other options, where it is not possible to manually set/unset the driver config without COMPILE_TEST.

>>> +static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit)
>>> +{
>>> +	jcore_pit_disable(pit);
>>> +	__raw_writel(delta, pit->base + REG_THROT);
>>> +	__raw_writel(pit->enable_val, pit->base + REG_PITEN);
>>> +	return 0;
>>
>> Why do you need to use __raw_writel ?
>>
>> s/__raw_writel/writel/
>
> I actually tried multiple times to find good resources on policy for
> which form to prefer, but didn't have much luck. My understanding is
> that __raw_writel/__raw_readl always performs a native-endian
> load/store, whereas writel/readl behavior depends on cpu endianness
> and whether the arch declares that "pci bus" (that was the term I
> found in the source, not relevant here) io is endian-swapped or not.
> Can you give me a better explanation of why we might prefer one form
> or the other?

[ skipping this as it was very well commented by Arnd and Mark ]

>>> +	freq = DIV_ROUND_CLOSEST(NSEC_PER_SEC, buspd);
>>> +	pit->periodic_delta = DIV_ROUND_CLOSEST(NSEC_PER_SEC, HZ*buspd);
>>
>> ---> HZ * buspd
>
> OK.
>
>>> +	clockevents_config_and_register(&pit->ced, freq, 1, ULONG_MAX);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int jcore_pit_local_shutdown(unsigned cpu)
>>> +{
>>> +	return 0;
>>> +}
>>
>> This function is useless I think. AFAIU, cpuhp_setup_state can have a
>> NULL function for the cpu teardown.
>
> OK, I wasn't sure if that was permitted.
>
>>> +	jcore_cs.name = "jcore_pit_cs";
>>> +	jcore_cs.rating = 400;
>>> +	jcore_cs.read = jcore_clocksource_read;
>>> +	jcore_cs.mult = 1;
>>> +	jcore_cs.shift = 0;
>>> +	jcore_cs.mask = CLOCKSOURCE_MASK(32);
>>> +	jcore_cs.flags = CLOCK_SOURCE_IS_CONTINUOUS;
>>> +
>>> +	err = clocksource_register_hz(&jcore_cs, NSEC_PER_SEC);
>>> +	if (err) {
>>> +		pr_err("Error registering clocksource device: %d\n", err);
>>> +		return err;
>>> +	}
>>
>> Maybe you can consider replacing the above with:
>>
>>	clocksource_mmio_init(jcore_pit_base, "jcore_pit_cs",
>>			      NSEC_PER_SEC, 32,
>>			      jcore_clocksource_read);
>
> I think you're missing the rating argument. Otherwise it should work,
> but is there a good reason to prefer it? The code is slightly simpler;
> I'm not sure if using kzalloc vs static storage is better or worse.

Probably the clksrc field pointers are pre-calculated at compile time, while the allocation results in an indirection in the code to compute the pointers, but I don't think it is noticeable.

[ ... ]

>>> +	/*
>>> +	 * The J-Core PIT is not hard-wired to a particular IRQ, but
>>> +	 * integrated with the interrupt controller such that the IRQ it
>>> +	 * generates is programmable. The programming interface has a
>>> +	 * legacy field which was an interrupt priority for AIC1, but
>>> +	 * which is OR'd onto bits 2-5 of the generated IRQ number when
>>> +	 * used with J-Core AIC2, so set it to match these bits.
>>> +	 */
>>> +	hwirq = irq_get_irq_data(pit_irq)->hwirq;
>>
>>	irq_hw_number_t hwirq;
>>	hwirq = irqd_to_hwirq(irq_get_irq_data(pit_irq));
>
> OK.
>
>>> +	irqprio = (hwirq >> 2) & PIT_PRIO_MASK;
>>> +	enable_val = (1U << PIT_ENABLE_SHIFT)
>>> +		   | (hwirq << PIT_IRQ_SHIFT)
>>> +		   | (irqprio << PIT_PRIO_SHIFT);
>>
>> I'm missing the connection between the description above and the enable
>> value computed here. Can you elaborate ?
>
> The irqprio field is filled in using a value that matches bits 2 and
> up of hwirq; this is what the comment says and what the code does. Can
> you elaborate on what you don't understand?

The API to compute the 'enable_val'. Having a technical reference manual would help a lot.
diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 5677886..3210ca5 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -407,6 +407,15 @@ config SYS_SUPPORTS_SH_TMU
 config SYS_SUPPORTS_EM_STI
 	bool
 
+config CLKSRC_JCORE_PIT
+	bool "J-Core PIT timer driver"
+	depends on OF && (SUPERH || COMPILE_TEST)
+	depends on GENERIC_CLOCKEVENTS
+	depends on HAS_IOMEM
+	help
+	  This enables build of clocksource and clockevent driver for
+	  the integrated PIT in the J-Core synthesizable, open source SoC.
+
 config SH_TIMER_CMT
 	bool "Renesas CMT timer driver" if COMPILE_TEST
 	depends on GENERIC_CLOCKEVENTS
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index fd9d6df..cf87f40 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_ATMEL_TCB_CLKSRC)	+= tcb_clksrc.o
 obj-$(CONFIG_X86_PM_TIMER)	+= acpi_pm.o
 obj-$(CONFIG_SCx200HR_TIMER)	+= scx200_hrt.o
 obj-$(CONFIG_CS5535_CLOCK_EVENT_SRC)	+= cs5535-clockevt.o
+obj-$(CONFIG_CLKSRC_JCORE_PIT)	+= jcore-pit.o
 obj-$(CONFIG_SH_TIMER_CMT)	+= sh_cmt.o
 obj-$(CONFIG_SH_TIMER_MTU2)	+= sh_mtu2.o
 obj-$(CONFIG_SH_TIMER_TMU)	+= sh_tmu.o
diff --git a/drivers/clocksource/jcore-pit.c b/drivers/clocksource/jcore-pit.c
new file mode 100644
index 0000000..23dee50
--- /dev/null
+++ b/drivers/clocksource/jcore-pit.c
@@ -0,0 +1,242 @@
+/*
+ * J-Core SoC PIT/clocksource driver
+ *
+ * Copyright (C) 2015-2016 Smart Energy Instruments, Inc.
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/interrupt.h>
+#include <linux/clockchips.h>
+#include <linux/clocksource.h>
+#include <linux/sched_clock.h>
+#include <linux/cpu.h>
+#include <linux/cpuhotplug.h>
+#include <linux/of_address.h>
+#include <linux/of_irq.h>
+
+#define PIT_IRQ_SHIFT		12
+#define PIT_PRIO_SHIFT		20
+#define PIT_ENABLE_SHIFT	26
+#define PIT_IRQ_MASK		0x3f
+#define PIT_PRIO_MASK		0xf
+
+#define REG_PITEN		0x00
+#define REG_THROT		0x10
+#define REG_COUNT		0x14
+#define REG_BUSPD		0x18
+#define REG_SECHI		0x20
+#define REG_SECLO		0x24
+#define REG_NSEC		0x28
+
+struct jcore_pit {
+	struct clock_event_device	ced;
+	__iomem void			*base;
+	unsigned long			periodic_delta;
+	unsigned			cpu;
+	u32				enable_val;
+};
+
+static __iomem void		*jcore_pit_base;
+static struct clocksource	jcore_cs;
+struct jcore_pit __percpu	*jcore_pit_percpu;
+
+static notrace u64 jcore_sched_clock_read(void)
+{
+	u32 seclo, nsec, seclo0;
+	__iomem void *base = jcore_pit_base;
+
+	seclo = __raw_readl(base + REG_SECLO);
+	do {
+		seclo0 = seclo;
+		nsec  = __raw_readl(base + REG_NSEC);
+		seclo = __raw_readl(base + REG_SECLO);
+	} while (seclo0 != seclo);
+
+	return seclo * NSEC_PER_SEC + nsec;
+}
+
+static cycle_t jcore_clocksource_read(struct clocksource *cs)
+{
+	return jcore_sched_clock_read();
+}
+
+static int jcore_pit_disable(struct jcore_pit *pit)
+{
+	__raw_writel(0, pit->base + REG_PITEN);
+	return 0;
+}
+
+static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit)
+{
+	jcore_pit_disable(pit);
+	__raw_writel(delta, pit->base + REG_THROT);
+	__raw_writel(pit->enable_val, pit->base + REG_PITEN);
+	return 0;
+}
+
+static int jcore_pit_set_state_shutdown(struct clock_event_device *ced)
+{
+	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
+
+	return jcore_pit_disable(pit);
+}
+
+static int jcore_pit_set_state_oneshot(struct clock_event_device *ced)
+{
+	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
+
+	return jcore_pit_disable(pit);
+}
+
+static int jcore_pit_set_state_periodic(struct clock_event_device *ced)
+{
+	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
+
+	return jcore_pit_set(pit->periodic_delta, pit);
+}
+
+static int jcore_pit_set_next_event(unsigned long delta,
+				    struct clock_event_device *ced)
+{
+	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
+
+	return jcore_pit_set(delta, pit);
+}
+
+static int jcore_pit_local_init(unsigned cpu)
+{
+	struct jcore_pit *pit = this_cpu_ptr(jcore_pit_percpu);
+	unsigned buspd, freq;
+
+	pr_info("Local J-Core PIT init on cpu %u\n", pit->cpu);
+
+	buspd = __raw_readl(pit->base + REG_BUSPD);
+	freq = DIV_ROUND_CLOSEST(NSEC_PER_SEC, buspd);
+	pit->periodic_delta = DIV_ROUND_CLOSEST(NSEC_PER_SEC, HZ*buspd);
+
+	clockevents_config_and_register(&pit->ced, freq, 1, ULONG_MAX);
+
+	return 0;
+}
+
+static int jcore_pit_local_shutdown(unsigned cpu)
+{
+	return 0;
+}
+
+static irqreturn_t jcore_timer_interrupt(int irq, void *dev_id)
+{
+	struct jcore_pit *pit = this_cpu_ptr(dev_id);
+
+	if (clockevent_state_oneshot(&pit->ced))
+		jcore_pit_disable(pit);
+
+	pit->ced.event_handler(&pit->ced);
+
+	return IRQ_HANDLED;
+}
+
+static int __init jcore_pit_init(struct device_node *node)
+{
+	int err;
+	unsigned pit_irq, cpu;
+	unsigned long hwirq;
+	u32 irqprio, enable_val;
+
+	jcore_pit_base = of_iomap(node, 0);
+	if (!jcore_pit_base) {
+		pr_err("Error: Cannot map base address for J-Core PIT\n");
+		return -ENXIO;
+	}
+
+	pit_irq = irq_of_parse_and_map(node, 0);
+	if (!pit_irq) {
+		pr_err("Error: J-Core PIT has no IRQ\n");
+		return -ENXIO;
+	}
+
+	pr_info("Initializing J-Core PIT at %p IRQ %d\n",
+		jcore_pit_base, pit_irq);
+
+	jcore_cs.name = "jcore_pit_cs";
+	jcore_cs.rating = 400;
+	jcore_cs.read = jcore_clocksource_read;
+	jcore_cs.mult = 1;
+	jcore_cs.shift = 0;
+	jcore_cs.mask = CLOCKSOURCE_MASK(32);
+	jcore_cs.flags = CLOCK_SOURCE_IS_CONTINUOUS;
+
+	err = clocksource_register_hz(&jcore_cs, NSEC_PER_SEC);
+	if (err) {
+		pr_err("Error registering clocksource device: %d\n", err);
+		return err;
+	}
+
+	sched_clock_register(jcore_sched_clock_read, 32, NSEC_PER_SEC);
+
+	jcore_pit_percpu = alloc_percpu(struct jcore_pit);
+	if (!jcore_pit_percpu) {
+		pr_err("Failed to allocate memory for clock event device\n");
+		return -ENOMEM;
+	}
+
+	err = request_irq(pit_irq, jcore_timer_interrupt,
+			  IRQF_TIMER | IRQF_PERCPU,
+			  "jcore_pit", jcore_pit_percpu);
+	if (err) {
+		pr_err("pit irq request failed: %d\n", err);
+		return err;
+	}
+
+	/*
+	 * The J-Core PIT is not hard-wired to a particular IRQ, but
+	 * integrated with the interrupt controller such that the IRQ it
+	 * generates is programmable. The programming interface has a
+	 * legacy field which was an interrupt priority for AIC1, but
+	 * which is OR'd onto bits 2-5 of the generated IRQ number when
+	 * used with J-Core AIC2, so set it to match these bits.
+	 */
+	hwirq = irq_get_irq_data(pit_irq)->hwirq;
+	irqprio = (hwirq >> 2) & PIT_PRIO_MASK;
+	enable_val = (1U << PIT_ENABLE_SHIFT)
+		   | (hwirq << PIT_IRQ_SHIFT)
+		   | (irqprio << PIT_PRIO_SHIFT);
+
+	for_each_present_cpu(cpu) {
+		struct jcore_pit *pit = per_cpu_ptr(jcore_pit_percpu, cpu);
+
+		pit->base = of_iomap(node, cpu);
+		if (!pit->base) {
+			pr_err("Unable to map PIT for cpu %u\n", cpu);
+			continue;
+		}
+
+		pit->ced.name = "jcore_pit";
+		pit->ced.features = CLOCK_EVT_FEAT_PERIODIC
+				  | CLOCK_EVT_FEAT_ONESHOT
+				  | CLOCK_EVT_FEAT_PERCPU;
+		pit->ced.cpumask = cpumask_of(cpu);
+		pit->ced.rating = 400;
+		pit->ced.irq = pit_irq;
+		pit->ced.set_state_shutdown = jcore_pit_set_state_shutdown;
+		pit->ced.set_state_periodic = jcore_pit_set_state_periodic;
+		pit->ced.set_state_oneshot = jcore_pit_set_state_oneshot;
+		pit->ced.set_next_event = jcore_pit_set_next_event;
+
+		pit->cpu = cpu;
+		pit->enable_val = enable_val;
+	}
+
+	cpuhp_setup_state(CPUHP_AP_JCORE_TIMER_STARTING,
+			  "AP_JCORE_TIMER_STARTING",
+			  jcore_pit_local_init, jcore_pit_local_shutdown);
+
+	return 0;
+}
+
+CLOCKSOURCE_OF_DECLARE(jcore_pit, "jcore,pit", jcore_pit_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 242bf53..e95ed7d 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -50,6 +50,7 @@ enum cpuhp_state {
 	CPUHP_AP_ARM_ARCH_TIMER_STARTING,
 	CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
 	CPUHP_AP_DUMMY_TIMER_STARTING,
+	CPUHP_AP_JCORE_TIMER_STARTING,
 	CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
 	CPUHP_AP_ARM_TWD_STARTING,
 	CPUHP_AP_METAG_TIMER_STARTING,
At the hardware level, the J-Core PIT is integrated with the interrupt controller, but it is represented as its own device and has an independent programming interface. It provides a 12-bit countdown timer, which is not presently used, and a periodic timer. The interval length for the latter is programmable via a 32-bit throttle register whose units are determined by a bus-period register. The periodic timer is used to implement both periodic and oneshot clock event modes; in oneshot mode the interrupt handler simply disables the timer as soon as it fires.

Despite its device tree node representing an interrupt for the PIT, the actual irq generated is programmable, not hard-wired. The driver is responsible for programming the PIT to generate the hardware irq number that the DT assigns to it.

On SMP configurations, J-Core provides cpu-local instances of the PIT; no broadcast timer is needed. This driver supports the creation of the necessary per-cpu clock_event_device instances. The code has been tested and works on SMP, but will not be usable without additional J-Core SMP-support patches and appropriate hardware capable of running SMP.

A nanosecond-resolution clocksource is provided using the J-Core "RTC" registers, which give a 64-bit seconds count and 32-bit nanoseconds. The driver converts these to a 64-bit nanoseconds count.

Signed-off-by: Rich Felker <dalias@libc.org>
---
 drivers/clocksource/Kconfig     |   9 ++
 drivers/clocksource/Makefile    |   1 +
 drivers/clocksource/jcore-pit.c | 242 ++++++++++++++++++++++++++++++++++++++++
 include/linux/cpuhotplug.h      |   1 +
 4 files changed, 253 insertions(+)
 create mode 100644 drivers/clocksource/jcore-pit.c