Message ID | 0e65c165e3d54a38cbba01603f325dca727274de.1631121222.git.greentime.hu@sifive.com (mailing list archive)
---|---
State | New, archived
Series | riscv: Add vector ISA support
On 9/8/21 1:45 PM, Greentime Hu wrote: > This patch adds task switch support for vector. It supports partial lazy > save and restore mechanism. It also supports all lengths of vlen. > > [guoren@linux.alibaba.com: First available porting to support vector > context switching] > [nick.knight@sifive.com: Rewrite vector.S to support dynamic vlen, xlen and > code refine] > [vincent.chen@sifive.co: Fix the might_sleep issue in vstate_save, > vstate_restore] > Co-developed-by: Nick Knight <nick.knight@sifive.com> > Signed-off-by: Nick Knight <nick.knight@sifive.com> > Co-developed-by: Guo Ren <guoren@linux.alibaba.com> > Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > Co-developed-by: Vincent Chen <vincent.chen@sifive.com> > Signed-off-by: Vincent Chen <vincent.chen@sifive.com> > Signed-off-by: Greentime Hu <greentime.hu@sifive.com> > --- > arch/riscv/include/asm/switch_to.h | 66 +++++++++++++++++++++++ > arch/riscv/kernel/Makefile | 1 + > arch/riscv/kernel/process.c | 38 ++++++++++++++ > arch/riscv/kernel/vector.S | 84 ++++++++++++++++++++++++++++++ > 4 files changed, 189 insertions(+) > create mode 100644 arch/riscv/kernel/vector.S > > diff --git a/arch/riscv/include/asm/switch_to.h b/arch/riscv/include/asm/switch_to.h > index ec83770b3d98..de0573dad78f 100644 > --- a/arch/riscv/include/asm/switch_to.h > +++ b/arch/riscv/include/asm/switch_to.h > @@ -7,10 +7,12 @@ > #define _ASM_RISCV_SWITCH_TO_H > > #include <linux/jump_label.h> > +#include <linux/slab.h> > #include <linux/sched/task_stack.h> > #include <asm/processor.h> > #include <asm/ptrace.h> > #include <asm/csr.h> > +#include <asm/asm-offsets.h> > > #ifdef CONFIG_FPU > extern void __fstate_save(struct task_struct *save_to); > @@ -68,6 +70,68 @@ static __always_inline bool has_fpu(void) { return false; } > #define __switch_to_fpu(__prev, __next) do { } while (0) > #endif > > +#ifdef CONFIG_VECTOR > +extern bool has_vector; > +extern unsigned long riscv_vsize; > +extern void __vstate_save(struct __riscv_v_state *save_to, void *datap); > +extern void __vstate_restore(struct __riscv_v_state *restore_from, void *datap); > + > +static inline void __vstate_clean(struct pt_regs *regs) > +{ > + regs->status = (regs->status & ~(SR_VS)) | SR_VS_CLEAN; > +} > + > +static inline void vstate_off(struct task_struct *task, > + struct pt_regs *regs) > +{ > + regs->status = (regs->status & ~SR_VS) | SR_VS_OFF; > +} > + > +static inline void vstate_save(struct task_struct *task, > + struct pt_regs *regs) > +{ > + if ((regs->status & SR_VS) == SR_VS_DIRTY) { > + struct __riscv_v_state *vstate = &(task->thread.vstate); > + > + __vstate_save(vstate, vstate->datap); > + __vstate_clean(regs); > + } > +} > + > +static inline void vstate_restore(struct task_struct *task, > + struct pt_regs *regs) > +{ > + if ((regs->status & SR_VS) != SR_VS_OFF) { > + struct __riscv_v_state *vstate = &(task->thread.vstate); > + > + /* Allocate space for vector registers. 
*/ > + if (!vstate->datap) { > + vstate->datap = kzalloc(riscv_vsize, GFP_ATOMIC); > + vstate->size = riscv_vsize; > + } > + __vstate_restore(vstate, vstate->datap); > + __vstate_clean(regs); > + } > +} > + > +static inline void __switch_to_vector(struct task_struct *prev, > + struct task_struct *next) > +{ > + struct pt_regs *regs; > + > + regs = task_pt_regs(prev); > + if (unlikely(regs->status & SR_SD)) > + vstate_save(prev, regs); > + vstate_restore(next, task_pt_regs(next)); > +} > + > +#else > +#define has_vector false > +#define vstate_save(task, regs) do { } while (0) > +#define vstate_restore(task, regs) do { } while (0) > +#define __switch_to_vector(__prev, __next) do { } while (0) > +#endif > + > extern struct task_struct *__switch_to(struct task_struct *, > struct task_struct *); > > @@ -77,6 +141,8 @@ do { \ > struct task_struct *__next = (next); \ > if (has_fpu()) \ > __switch_to_fpu(__prev, __next); \ > + if (has_vector) \ > + __switch_to_vector(__prev, __next); \ > ((last) = __switch_to(__prev, __next)); \ > } while (0) > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile > index 3397ddac1a30..344078080839 100644 > --- a/arch/riscv/kernel/Makefile > +++ b/arch/riscv/kernel/Makefile > @@ -40,6 +40,7 @@ obj-$(CONFIG_MMU) += vdso.o vdso/ > > obj-$(CONFIG_RISCV_M_MODE) += traps_misaligned.o > obj-$(CONFIG_FPU) += fpu.o > +obj-$(CONFIG_VECTOR) += vector.o > obj-$(CONFIG_SMP) += smpboot.o > obj-$(CONFIG_SMP) += smp.o > obj-$(CONFIG_SMP) += cpu_ops.o > diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c > index 03ac3aa611f5..0b86e9e531c9 100644 > --- a/arch/riscv/kernel/process.c > +++ b/arch/riscv/kernel/process.c > @@ -95,6 +95,16 @@ void start_thread(struct pt_regs *regs, unsigned long pc, > */ > fstate_restore(current, regs); > } > + > + if (has_vector) { > + regs->status |= SR_VS_INITIAL; > + /* > + * Restore the initial value to the vector register > + * before starting the user program. > + */ > + vstate_restore(current, regs); > + } > + So this will unconditionally enable vector instructions, and allocate memory for vector state, for all processes, regardless of whether vector instructions are used? Given the size of the vector state and potential power and performance implications of enabling the vector engine, it seems like this should treated similarly to Intel AMX on x86. The full discussion of that is here: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ The cover letter for recent Intel AMX patches has a summary of the x86 implementation: https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ If RISC-V were to adopt a similar approach, I think the significant points are: 1. A process (or thread) must specifically request the desire to use vector extensions (perhaps with some new arch_prctl() API), 2. The kernel is free to deny permission, perhaps based on administrative rules or for other reasons, and 3. If a process attempts to use vector extensions before doing the above, the process will die due to an illegal instruction. 
> regs->epc = pc; > regs->sp = sp; > } > @@ -110,15 +120,43 @@ void flush_thread(void) > fstate_off(current, task_pt_regs(current)); > memset(¤t->thread.fstate, 0, sizeof(current->thread.fstate)); > #endif > +#ifdef CONFIG_VECTOR > + /* Reset vector state */ > + vstate_off(current, task_pt_regs(current)); > + memset(¤t->thread.vstate, 0, sizeof(current->thread.vstate)); > +#endif > } > > int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) > { > fstate_save(src, task_pt_regs(src)); > + if (has_vector) > + /* To make sure every dirty vector context is saved. */ > + vstate_save(src, task_pt_regs(src)); > *dst = *src; > + if (has_vector) { > + /* Copy vector context to the forked task from parent. */ > + if ((task_pt_regs(src)->status & SR_VS) != SR_VS_OFF) { > + dst->thread.vstate.datap = kzalloc(riscv_vsize, GFP_KERNEL); > + /* Failed to allocate memory. */ > + if (!dst->thread.vstate.datap) > + return -ENOMEM; > + /* Copy the src vector context to dst. */ > + memcpy(dst->thread.vstate.datap, > + src->thread.vstate.datap, riscv_vsize); > + } > + } > + > return 0; > } > > +void arch_release_task_struct(struct task_struct *tsk) > +{ > + /* Free the vector context of datap. */ > + if (has_vector) > + kfree(tsk->thread.vstate.datap); > +} > + > int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg, > struct task_struct *p, unsigned long tls) > { > diff --git a/arch/riscv/kernel/vector.S b/arch/riscv/kernel/vector.S > new file mode 100644 > index 000000000000..4c880b1c32aa > --- /dev/null > +++ b/arch/riscv/kernel/vector.S > @@ -0,0 +1,84 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Copyright (C) 2012 Regents of the University of California > + * Copyright (C) 2017 SiFive > + * Copyright (C) 2019 Alibaba Group Holding Limited > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License > + * as published by the Free Software Foundation, version 2. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. 
> + */ > + > +#include <linux/linkage.h> > + > +#include <asm/asm.h> > +#include <asm/csr.h> > +#include <asm/asm-offsets.h> > + > +#define vstatep a0 > +#define datap a1 > +#define x_vstart t0 > +#define x_vtype t1 > +#define x_vl t2 > +#define x_vcsr t3 > +#define incr t4 > +#define m_one t5 > +#define status t6 > + > +ENTRY(__vstate_save) > + li status, SR_VS > + csrs sstatus, status > + > + csrr x_vstart, CSR_VSTART > + csrr x_vtype, CSR_VTYPE > + csrr x_vl, CSR_VL > + csrr x_vcsr, CSR_VCSR > + li m_one, -1 > + vsetvli incr, m_one, e8, m8 > + vse8.v v0, (datap) > + add datap, datap, incr > + vse8.v v8, (datap) > + add datap, datap, incr > + vse8.v v16, (datap) > + add datap, datap, incr > + vse8.v v24, (datap) > + > + REG_S x_vstart, RISCV_V_STATE_VSTART(vstatep) > + REG_S x_vtype, RISCV_V_STATE_VTYPE(vstatep) > + REG_S x_vl, RISCV_V_STATE_VL(vstatep) > + REG_S x_vcsr, RISCV_V_STATE_VCSR(vstatep) > + > + csrc sstatus, status > + ret > +ENDPROC(__vstate_save) > + > +ENTRY(__vstate_restore) > + li status, SR_VS > + csrs sstatus, status > + > + li m_one, -1 > + vsetvli incr, m_one, e8, m8 > + vle8.v v0, (datap) > + add datap, datap, incr > + vle8.v v8, (datap) > + add datap, datap, incr > + vle8.v v16, (datap) > + add datap, datap, incr > + vle8.v v24, (datap) > + > + REG_L x_vstart, RISCV_V_STATE_VSTART(vstatep) > + REG_L x_vtype, RISCV_V_STATE_VTYPE(vstatep) > + REG_L x_vl, RISCV_V_STATE_VL(vstatep) > + REG_L x_vcsr, RISCV_V_STATE_VCSR(vstatep) > + vsetvl x0, x_vl, x_vtype > + csrw CSR_VSTART, x_vstart > + csrw CSR_VCSR, x_vcsr > + > + csrc sstatus, status > + ret > +ENDPROC(__vstate_restore) >
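[Editorial note: for concreteness, a hypothetical user-space view of the opt-in model described in points 1-3 above might look like the sketch below. The prctl command and argument names are invented for illustration only; no such interface exists in this patch series, and under the proposed model the kernel would be free to refuse the request.]

```c
#include <sys/prctl.h>
#include <stdio.h>

#ifndef PR_RISCV_V_SET_CONTROL
#define PR_RISCV_V_SET_CONTROL		69	/* assumed command number */
#define PR_RISCV_V_VSTATE_CTRL_ON	2	/* assumed "enable" argument */
#endif

int main(void)
{
	/* Ask the kernel for permission to execute vector instructions.
	 * Under the proposed model the kernel may deny this, e.g. for
	 * administrative or power reasons. */
	if (prctl(PR_RISCV_V_SET_CONTROL, PR_RISCV_V_VSTATE_CTRL_ON, 0, 0, 0) != 0) {
		fprintf(stderr, "vector use not permitted; using scalar fallback\n");
		return 1;
	}

	/* Only after a successful request would it be safe to run code
	 * that touches the V registers; otherwise such code would trap
	 * as an illegal instruction. */
	return 0;
}
```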
On Mon, Sep 13, 2021 at 8:21 PM, Darius Rad <darius@bluespec.com> wrote:
>
> On 9/8/21 1:45 PM, Greentime Hu wrote:
> > This patch adds task switch support for vector. It supports partial lazy
> > save and restore mechanism. It also supports all lengths of vlen.
[....]
> So this will unconditionally enable vector instructions, and allocate
> memory for vector state, for all processes, regardless of whether vector
> instructions are used?
>

Hi Darius,

Yes, it will enable vector if has_vector() is true. We chose to enable it
and allocate memory for user-space programs because we also implement some
common functions in glibc, such as a vector version of memcpy, and those
are called very often by every process. So we assume that a user program
running on a CPU with the vector ISA wants to use vector by default. If we
disable it by default and let the first vector instruction trigger an
illegal-instruction trap, that would be a burden, since almost every
process will use the vector glibc memcpy or something similar.

> Given the size of the vector state and potential power and performance
> implications of enabling the vector engine, it seems like this should be
> treated similarly to Intel AMX on x86. The full discussion of that is
> here:
>
> https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
>
> The cover letter for recent Intel AMX patches has a summary of the x86
> implementation:
>
> https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/
>
> If RISC-V were to adopt a similar approach, I think the significant
> points are:
>
> 1. A process (or thread) must specifically request the desire to use
> vector extensions (perhaps with some new arch_prctl() API),
>
> 2. The kernel is free to deny permission, perhaps based on
> administrative rules or for other reasons, and
>
> 3. If a process attempts to use vector extensions before doing the
> above, the process will die due to an illegal instruction.

Thank you for sharing this, but I am not sure we should treat vector like
AMX on x86. IMHO, compilers might generate code with vector instructions
automatically someday, so maybe we should treat the vector extension like
other extensions. If users know the vector extension is supported on this
CPU and want to use it, it seems we should let them use it directly, just
like other extensions. If they do not know whether it exists, they should
use the library API transparently and let glibc or another library deal
with it. The glibc ifunc feature or multi-lib should be able to choose the
correct implementation.
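[Editorial note: to make the ifunc point concrete, here is a minimal sketch of a resolver picking between scalar and vector memcpy implementations at load time. The HWCAP bit name, the function names, and the use of getauxval() are illustrative assumptions, not glibc's actual internal mechanism (real glibc resolvers obtain hardware capabilities differently).]

```c
#include <stddef.h>
#include <sys/auxv.h>

/* Assumption: the V extension shows up as a single-letter bit in AT_HWCAP,
 * the way the other base RISC-V extensions do. */
#define HWCAP_ISA_V	(1UL << ('v' - 'a'))

static void *memcpy_scalar(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Stand-in only: a real implementation would use V-extension loads/stores. */
static void *memcpy_vector(void *dst, const void *src, size_t n)
{
	return memcpy_scalar(dst, src, n);
}

/* The resolver runs once, at relocation time, and decides which
 * implementation every later call will bind to. */
static void *(*resolve_memcpy(void))(void *, const void *, size_t)
{
	return (getauxval(AT_HWCAP) & HWCAP_ISA_V) ? memcpy_vector
						   : memcpy_scalar;
}

void *my_memcpy(void *dst, const void *src, size_t n)
	__attribute__((ifunc("resolve_memcpy")));
```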
On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote:
> On Mon, Sep 13, 2021 at 8:21 PM, Darius Rad <darius@bluespec.com> wrote:
> >
> > On 9/8/21 1:45 PM, Greentime Hu wrote:
> > > This patch adds task switch support for vector. It supports partial lazy
> > > save and restore mechanism. It also supports all lengths of vlen.
[....]
> > So this will unconditionally enable vector instructions, and allocate
> > memory for vector state, for all processes, regardless of whether vector
> > instructions are used?
>
> Hi Darius,
>
> Yes, it will enable vector if has_vector() is true. [....] If we disable
> it by default and let the first vector instruction trigger an
> illegal-instruction trap, that would be a burden, since almost every
> process will use the vector glibc memcpy or something similar.

Do you have any evidence to support the assertion that almost every process
would use vector operations? One could easily argue that the converse is
true: no existing software uses the vector extension now, so most likely a
process will not be using it.

> > If RISC-V were to adopt a similar approach, I think the significant
> > points are:
[....]
> Thank you for sharing this, but I am not sure we should treat vector like
> AMX on x86. [....] The glibc ifunc feature or multi-lib should be able to
> choose the correct implementation.

What makes me think that the vector extension should be treated like AMX is
that they both (1) have a significant amount of architectural state, and
(2) likely have a significant power and/or area impact on (non-emulated)
designs.

For example, I think it is possible, maybe even likely, that vector
implementations will have one or more of the following behaviors:

1. A single vector unit shared among two or more harts,

2. Additional power consumption when the vector unit is enabled and idle
   versus not being enabled at all,

3. For a system which supports variable operating frequency, a reduction
   in the maximum frequency when the vector unit is enabled, and/or

4. The inability to enter low power states and/or delays to low power
   state transitions when the vector unit is enabled.

None of the above constraints apply to more ordinary extensions like
compressed or the various bit manipulation extensions.

The discussion I linked to has some well reasoned arguments on why
substantial extensions should have a mechanism to request using them from
user space. The discussion was in the context of Intel AMX, but applies to
further x86 extensions, and I think it should also apply to similar
extensions on RISC-V, like vector here.
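[Editorial note: to put a rough number on "a significant amount of architectural state": the V extension defines 32 vector registers of VLEN bits each (plus the vstart/vtype/vl/vcsr CSRs), which is what riscv_vsize covers in this patch. A back-of-the-envelope calculation for a few example VLEN values, which are illustrative and not mandated by the spec:]

```c
#include <stdio.h>

int main(void)
{
	/* Example VLEN values in bits; implementations may choose others. */
	const unsigned vlen_bits[] = { 128, 256, 512, 1024 };

	for (unsigned i = 0; i < sizeof(vlen_bits) / sizeof(vlen_bits[0]); i++) {
		/* 32 vector registers of VLEN bits each, i.e. the v0-v31
		 * save area that riscv_vsize would have to cover. */
		unsigned bytes = 32u * (vlen_bits[i] / 8u);

		printf("VLEN = %4u bits -> %5u bytes of register state per thread\n",
		       vlen_bits[i], bytes);
	}
	return 0;
}
```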
On Wed, Sep 29, 2021 at 11:54 PM Darius Rad <darius@bluespec.com> wrote:
>
> On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote:
[....]
> What makes me think that the vector extension should be treated like AMX is
> that they both (1) have a significant amount of architectural state, and
> (2) likely have a significant power and/or area impact on (non-emulated)
> designs.
[....]
> The discussion I linked to has some well reasoned arguments on why
> substantial extensions should have a mechanism to request using them from
> user space. The discussion was in the context of Intel AMX, but applies to
> further x86 extensions, and I think it should also apply to similar
> extensions on RISC-V, like vector here.
>

There is a possible use case where not all cores support the vector
extension, due to size, area, and power. Perhaps there could be a mechanism
or flow to determine whether an application/thread requires the vector
extension, or to let it specifically request the use of vector extensions.
Such an app/thread would then run only on CPUs with vector capability.

Thanks.

Regards
Ley Foon
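[Editorial note: a minimal sketch of the placement idea above, using the existing sched_setaffinity() interface to keep a vector-using thread on a subset of CPUs. Which harts actually have a vector unit is assumed to be known out of band; the hard-coded CPU numbers are purely illustrative, and nothing in this patch series provides that information.]

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	/* Assumption: CPUs 0 and 1 are the harts with a vector unit.
	 * A real system would discover this from firmware/device tree. */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	CPU_SET(1, &set);

	/* Restrict the calling thread to the vector-capable CPUs. */
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	/* Vector-using code would run here, never migrating to a
	 * non-vector hart. */
	return 0;
}
```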
On Wed, Sep 29, 2021 at 9:28 PM, Darius Rad <darius@bluespec.com> wrote:
>
> On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote:
[....]
> Do you have any evidence to support the assertion that almost every process
> would use vector operations? One could easily argue that the converse is
> true: no existing software uses the vector extension now, so most likely a
> process will not be using it.

Glibc upstreaming is just starting, so you don't see software using the
vector extension yet, and this patchset is tested against that optimized
glibc too. Vincent Chen is working on upstreaming the glibc vector support,
and we will also upstream vector versions of the glibc memcpy, memcmp,
memchr, memmove, memset, strcmp, and strlen. A platform with vector support
will then pick up the vector versions of the mem* and str* functions
automatically, based on ifunc, and a platform without vector will keep
using the original ones automatically. The ifunc mechanism selects the
correct optimized glibc functions.

> What makes me think that the vector extension should be treated like AMX is
> that they both (1) have a significant amount of architectural state, and
> (2) likely have a significant power and/or area impact on (non-emulated)
> designs.
[....]
> The discussion I linked to has some well reasoned arguments on why
> substantial extensions should have a mechanism to request using them from
> user space. The discussion was in the context of Intel AMX, but applies to
> further x86 extensions, and I think it should also apply to similar
> extensions on RISC-V, like vector here.

Have you also checked how the ARM64 SVE/SVE2 implementation is handled in
the Linux kernel? IMHO, the RISC-V vector extension is closer to ARM64 SVE2
than to AMX.
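[Editorial note: for reference, arm64 SVE already exposes some per-thread control to user space through prctl(). The sketch below shows the existing PR_SVE_GET_VL/PR_SVE_SET_VL interface for querying and setting the SVE vector length; whether RISC-V would want an analogous knob is exactly the open question in this thread. The fallback constant definitions are only for older header sets.]

```c
#include <sys/prctl.h>
#include <stdio.h>

#ifndef PR_SVE_SET_VL		/* values from linux/prctl.h */
#define PR_SVE_SET_VL		50
#define PR_SVE_GET_VL		51
#define PR_SVE_VL_LEN_MASK	0xffff
#endif

int main(void)
{
	int vl = prctl(PR_SVE_GET_VL);

	if (vl < 0) {
		perror("PR_SVE_GET_VL (SVE not supported?)");
		return 1;
	}
	printf("current SVE vector length: %d bytes\n", vl & PR_SVE_VL_LEN_MASK);

	/* Ask for a 256-bit (32-byte) vector length for this thread; the
	 * kernel may clamp the request to what the hardware supports. */
	if (prctl(PR_SVE_SET_VL, 32) < 0)
		perror("PR_SVE_SET_VL");

	return 0;
}
```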
Ley Foon Tan <lftan.linux@gmail.com> 於 2021年10月1日 週五 上午10:46寫道: > > On Wed, Sep 29, 2021 at 11:54 PM Darius Rad <darius@bluespec.com> wrote: > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > > > [guoren@linux.alibaba.com: First available porting to support vector > > > > > context switching] > > > > > [nick.knight@sifive.com: Rewrite vector.S to support dynamic vlen, xlen and > > > > > code refine] > > > > > [vincent.chen@sifive.co: Fix the might_sleep issue in vstate_save, > > > > > vstate_restore] > > > > > Co-developed-by: Nick Knight <nick.knight@sifive.com> > > > > > Signed-off-by: Nick Knight <nick.knight@sifive.com> > > > > > Co-developed-by: Guo Ren <guoren@linux.alibaba.com> > > > > > Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > > > > > Co-developed-by: Vincent Chen <vincent.chen@sifive.com> > > > > > Signed-off-by: Vincent Chen <vincent.chen@sifive.com> > > > > > Signed-off-by: Greentime Hu <greentime.hu@sifive.com> > > > > > --- > > > > > arch/riscv/include/asm/switch_to.h | 66 +++++++++++++++++++++++ > > > > > arch/riscv/kernel/Makefile | 1 + > > > > > arch/riscv/kernel/process.c | 38 ++++++++++++++ > > > > > arch/riscv/kernel/vector.S | 84 ++++++++++++++++++++++++++++++ > > > > > 4 files changed, 189 insertions(+) > > > > > create mode 100644 arch/riscv/kernel/vector.S > > > > > > > > > > diff --git a/arch/riscv/include/asm/switch_to.h b/arch/riscv/include/asm/switch_to.h > > > > > index ec83770b3d98..de0573dad78f 100644 > > > > > --- a/arch/riscv/include/asm/switch_to.h > > > > > +++ b/arch/riscv/include/asm/switch_to.h > > > > > @@ -7,10 +7,12 @@ > > > > > #define _ASM_RISCV_SWITCH_TO_H > > > > > > > > > > #include <linux/jump_label.h> > > > > > +#include <linux/slab.h> > > > > > #include <linux/sched/task_stack.h> > > > > > #include <asm/processor.h> > > > > > #include <asm/ptrace.h> > > > > > #include <asm/csr.h> > > > > > +#include <asm/asm-offsets.h> > > > > > > > > > > #ifdef CONFIG_FPU > > > > > extern void __fstate_save(struct task_struct *save_to); > > > > > @@ -68,6 +70,68 @@ static __always_inline bool has_fpu(void) { return false; } > > > > > #define __switch_to_fpu(__prev, __next) do { } while (0) > > > > > #endif > > > > > > > > > > +#ifdef CONFIG_VECTOR > > > > > +extern bool has_vector; > > > > > +extern unsigned long riscv_vsize; > > > > > +extern void __vstate_save(struct __riscv_v_state *save_to, void *datap); > > > > > +extern void __vstate_restore(struct __riscv_v_state *restore_from, void *datap); > > > > > + > > > > > +static inline void __vstate_clean(struct pt_regs *regs) > > > > > +{ > > > > > + regs->status = (regs->status & ~(SR_VS)) | SR_VS_CLEAN; > > > > > +} > > > > > + > > > > > +static inline void vstate_off(struct task_struct *task, > > > > > + struct pt_regs *regs) > > > > > +{ > > > > > + regs->status = (regs->status & ~SR_VS) | SR_VS_OFF; > > > > > +} > > > > > + > > > > > +static inline void vstate_save(struct task_struct *task, > > > > > + struct pt_regs *regs) > > > > > +{ > > > > > + if ((regs->status & SR_VS) == SR_VS_DIRTY) { > > > > > + struct __riscv_v_state *vstate = &(task->thread.vstate); > > > > > + > > > > > + __vstate_save(vstate, vstate->datap); > > > > > + __vstate_clean(regs); > > > > > + } > > > > > +} > > 
> > > + > > > > > +static inline void vstate_restore(struct task_struct *task, > > > > > + struct pt_regs *regs) > > > > > +{ > > > > > + if ((regs->status & SR_VS) != SR_VS_OFF) { > > > > > + struct __riscv_v_state *vstate = &(task->thread.vstate); > > > > > + > > > > > + /* Allocate space for vector registers. */ > > > > > + if (!vstate->datap) { > > > > > + vstate->datap = kzalloc(riscv_vsize, GFP_ATOMIC); > > > > > + vstate->size = riscv_vsize; > > > > > + } > > > > > + __vstate_restore(vstate, vstate->datap); > > > > > + __vstate_clean(regs); > > > > > + } > > > > > +} > > > > > + > > > > > +static inline void __switch_to_vector(struct task_struct *prev, > > > > > + struct task_struct *next) > > > > > +{ > > > > > + struct pt_regs *regs; > > > > > + > > > > > + regs = task_pt_regs(prev); > > > > > + if (unlikely(regs->status & SR_SD)) > > > > > + vstate_save(prev, regs); > > > > > + vstate_restore(next, task_pt_regs(next)); > > > > > +} > > > > > + > > > > > +#else > > > > > +#define has_vector false > > > > > +#define vstate_save(task, regs) do { } while (0) > > > > > +#define vstate_restore(task, regs) do { } while (0) > > > > > +#define __switch_to_vector(__prev, __next) do { } while (0) > > > > > +#endif > > > > > + > > > > > extern struct task_struct *__switch_to(struct task_struct *, > > > > > struct task_struct *); > > > > > > > > > > @@ -77,6 +141,8 @@ do { \ > > > > > struct task_struct *__next = (next); \ > > > > > if (has_fpu()) \ > > > > > __switch_to_fpu(__prev, __next); \ > > > > > + if (has_vector) \ > > > > > + __switch_to_vector(__prev, __next); \ > > > > > ((last) = __switch_to(__prev, __next)); \ > > > > > } while (0) > > > > > > > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile > > > > > index 3397ddac1a30..344078080839 100644 > > > > > --- a/arch/riscv/kernel/Makefile > > > > > +++ b/arch/riscv/kernel/Makefile > > > > > @@ -40,6 +40,7 @@ obj-$(CONFIG_MMU) += vdso.o vdso/ > > > > > > > > > > obj-$(CONFIG_RISCV_M_MODE) += traps_misaligned.o > > > > > obj-$(CONFIG_FPU) += fpu.o > > > > > +obj-$(CONFIG_VECTOR) += vector.o > > > > > obj-$(CONFIG_SMP) += smpboot.o > > > > > obj-$(CONFIG_SMP) += smp.o > > > > > obj-$(CONFIG_SMP) += cpu_ops.o > > > > > diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c > > > > > index 03ac3aa611f5..0b86e9e531c9 100644 > > > > > --- a/arch/riscv/kernel/process.c > > > > > +++ b/arch/riscv/kernel/process.c > > > > > @@ -95,6 +95,16 @@ void start_thread(struct pt_regs *regs, unsigned long pc, > > > > > */ > > > > > fstate_restore(current, regs); > > > > > } > > > > > + > > > > > + if (has_vector) { > > > > > + regs->status |= SR_VS_INITIAL; > > > > > + /* > > > > > + * Restore the initial value to the vector register > > > > > + * before starting the user program. > > > > > + */ > > > > > + vstate_restore(current, regs); > > > > > + } > > > > > + > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > memory for vector state, for all processes, regardless of whether vector > > > > instructions are used? > > > > > > > > > > Hi Darius, > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > choose to enable and allocate memory for user space program is because > > > we also implement some common functions in the glibc such as memcpy > > > vector version and it is called very often by every process. So that > > > we assume if the user program is running in a CPU with vector ISA > > > would like to use vector by default. 
If we disable it by default and > > > make it trigger the illegal instruction, that might be a burden since > > > almost every process will use vector glibc memcpy or something like > > > that. > > > > Do you have any evidence to support the assertion that almost every process > > would use vector operations? One could easily argue that the converse is > > true: no existing software uses the vector extension now, so most likely a > > process will not be using it. > > > > > > > > > Given the size of the vector state and potential power and performance > > > > implications of enabling the vector engine, it seems like this should > > > > treated similarly to Intel AMX on x86. The full discussion of that is > > > > here: > > > > > > > > https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > > > > > > The cover letter for recent Intel AMX patches has a summary of the x86 > > > > implementation: > > > > > > > > https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ > > > > > > > > If RISC-V were to adopt a similar approach, I think the significant > > > > points are: > > > > > > > > 1. A process (or thread) must specifically request the desire to use > > > > vector extensions (perhaps with some new arch_prctl() API), > > > > > > > > 2. The kernel is free to deny permission, perhaps based on > > > > administrative rules or for other reasons, and > > > > > > > > 3. If a process attempts to use vector extensions before doing the > > > > above, the process will die due to an illegal instruction. > > > > > > Thank you for sharing this, but I am not sure if we should treat > > > vector like AMX on x86. IMHO, compiler might generate code with vector > > > instructions automatically someday, maybe we should treat vector > > > extensions like other extensions. > > > If user knows the vector extension is supported in this CPU and he > > > would like to use it, it seems we should let user use it directly just > > > like other extensions. > > > If user don't know it exists or not, user should use the library API > > > transparently and let glibc or other library deal with it. The glibc > > > ifunc feature or multi-lib should be able to choose the correct > > > implementation. > > > > What makes me think that the vector extension should be treated like AMX is > > that they both (1) have a significant amount of architectural state, and > > (2) likely have a significant power and/or area impact on (non-emulated) > > designs. > > > > For example, I think it is possible, maybe even likely, that vector > > implementations will have one or more of the following behaviors: > > > > 1. A single vector unit shared among two or more harts, > > > > 2. Additional power consumption when the vector unit is enabled and idle > > versus not being enabled at all, > > > > 3. For a system which supports variable operating frequency, a reduction > > in the maximum frequency when the vector unit is enabled, and/or > > > > 4. The inability to enter low power states and/or delays to low power > > states transitions when the vector unit is enabled. > > > > None of the above constraints apply to more ordinary extensions like > > compressed or the various bit manipulation extensions. > > > > The discussion I linked to has some well reasoned arguments on why > > substantial extensions should have a mechanism to request using them by > > user space. 
The discussion was in the context of Intel AMX, but applies to > > further x86 extensions, and I think should also apply to similar extensions > > on RISC-V, like vector here. > > > There is possible use case where not all cores support vector > extension due to size, area and power. > Perhaps can have the mechanism or flow to determine the > application/thread require vector extension or it specifically request > the desire to use > vector extensions. Then this app/thread run on cpu with vector > extension capability only. > > IIRC, we assume all harts have the same abilities in Linux because of the SMP assumption. If we had more information about hardware capability, the scheduler could use it to move the right process to the right CPU. Do you have any idea how to implement this in the Linux kernel? Maybe we can put it on the TODO list.
On Mon, Oct 4, 2021 at 8:41 PM Greentime Hu <greentime.hu@sifive.com> wrote: > > Ley Foon Tan <lftan.linux@gmail.com> 於 2021年10月1日 週五 上午10:46寫道: > > > > On Wed, Sep 29, 2021 at 11:54 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > [....] > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > instructions are used? > > > > > > > > > > > > > Hi Darius, > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > choose to enable and allocate memory for user space program is because > > > > we also implement some common functions in the glibc such as memcpy > > > > vector version and it is called very often by every process. So that > > > > we assume if the user program is running in a CPU with vector ISA > > > > would like to use vector by default. If we disable it by default and > > > > make it trigger the illegal instruction, that might be a burden since > > > > almost every process will use vector glibc memcpy or something like > > > > that. > > > > > > Do you have any evidence to support the assertion that almost every process > > > would use vector operations? One could easily argue that the converse is > > > true: no existing software uses the vector extension now, so most likely a > > > process will not be using it. > > > > > > > > > > > > Given the size of the vector state and potential power and performance > > > > > implications of enabling the vector engine, it seems like this should > > > > > treated similarly to Intel AMX on x86. The full discussion of that is > > > > > here: > > > > > > > > > > https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > > > > > > > > The cover letter for recent Intel AMX patches has a summary of the x86 > > > > > implementation: > > > > > > > > > > https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ > > > > > > > > > > If RISC-V were to adopt a similar approach, I think the significant > > > > > points are: > > > > > > > > > > 1. A process (or thread) must specifically request the desire to use > > > > > vector extensions (perhaps with some new arch_prctl() API), > > > > > > > > > > 2. The kernel is free to deny permission, perhaps based on > > > > > administrative rules or for other reasons, and > > > > > > > > > > 3. If a process attempts to use vector extensions before doing the > > > > > above, the process will die due to an illegal instruction. > > > > > > > > Thank you for sharing this, but I am not sure if we should treat > > > > vector like AMX on x86. IMHO, compiler might generate code with vector > > > > instructions automatically someday, maybe we should treat vector > > > > extensions like other extensions. > > > > If user knows the vector extension is supported in this CPU and he > > > > would like to use it, it seems we should let user use it directly just > > > > like other extensions. > > > > If user don't know it exists or not, user should use the library API > > > > transparently and let glibc or other library deal with it. The glibc > > > > ifunc feature or multi-lib should be able to choose the correct > > > > implementation. 
> > > > > > What makes me think that the vector extension should be treated like AMX is > > > that they both (1) have a significant amount of architectural state, and > > > (2) likely have a significant power and/or area impact on (non-emulated) > > > designs. > > > > > > For example, I think it is possible, maybe even likely, that vector > > > implementations will have one or more of the following behaviors: > > > > > > 1. A single vector unit shared among two or more harts, > > > > > > 2. Additional power consumption when the vector unit is enabled and idle > > > versus not being enabled at all, > > > > > > 3. For a system which supports variable operating frequency, a reduction > > > in the maximum frequency when the vector unit is enabled, and/or > > > > > > 4. The inability to enter low power states and/or delays to low power > > > states transitions when the vector unit is enabled. > > > > > > None of the above constraints apply to more ordinary extensions like > > > compressed or the various bit manipulation extensions. > > > > > > The discussion I linked to has some well reasoned arguments on why > > > substantial extensions should have a mechanism to request using them by > > > user space. The discussion was in the context of Intel AMX, but applies to > > > further x86 extensions, and I think should also apply to similar extensions > > > on RISC-V, like vector here. > > > > > There is possible use case where not all cores support vector > > extension due to size, area and power. > > Perhaps can have the mechanism or flow to determine the > > application/thread require vector extension or it specifically request > > the desire to use > > vector extensions. Then this app/thread run on cpu with vector > > extension capability only. > > > > IIRC, we assume all harts has the same ability in Linux because of SMP > assumption. > If we have more information of hw capability and we may use this > information for scheduler to switch the correct process to the correct > CPU. > Do you have any idea how to implement it in Linux kernel? Maybe we can > list in the TODO list. I think we can refer to other arch implementations as reference: 1. ARM64 supports 32-bit thread on asymmetric AArch32 systems. There is a flag in ELF to check, then start the thread on the core that supports 32-bit execution. This patchset is merged to mainline 5.15. https://lore.kernel.org/linux-arm-kernel/20210730112443.23245-8-will@kernel.org/T/ 2. Link shared by Darius, on-demand request implementation on Intel AMX https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ glibc support optimized library functions with vector, this is enabled by default if compiler is with vector extension enabled? If yes, then most of the app required vector core. Regards Ley Foon
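For illustration only: the user-space half of the "run vector-using threads only on vector-capable harts" idea can already be sketched with existing interfaces. The snippet below assumes something this patch set does not provide, namely a way for user space to learn which harts implement the vector extension; the hard-coded vector_capable_cpus[] list is a hypothetical stand-in for whatever the kernel would eventually report, and the open question raised above is exactly how that per-hart capability should be exported.

/*
 * Minimal sketch: pin the current thread to vector-capable harts
 * before it executes any vector instruction. vector_capable_cpus[]
 * is a made-up placeholder; no kernel interface in this patch set
 * reports per-hart vector support.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Hypothetical: assume harts 2 and 3 implement the V extension. */
	const int vector_capable_cpus[] = { 2, 3 };
	cpu_set_t set;
	size_t i;

	CPU_ZERO(&set);
	for (i = 0; i < sizeof(vector_capable_cpus) / sizeof(vector_capable_cpus[0]); i++)
		CPU_SET(vector_capable_cpus[i], &set);

	/* Restrict this thread to the vector-capable harts. */
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return EXIT_FAILURE;
	}

	/* ... vector-using code would run here ... */
	return 0;
}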
On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > > > [guoren@linux.alibaba.com: First available porting to support vector > > > > > context switching] > > > > > [nick.knight@sifive.com: Rewrite vector.S to support dynamic vlen, xlen and > > > > > code refine] > > > > > [vincent.chen@sifive.co: Fix the might_sleep issue in vstate_save, > > > > > vstate_restore] > > > > > Co-developed-by: Nick Knight <nick.knight@sifive.com> > > > > > Signed-off-by: Nick Knight <nick.knight@sifive.com> > > > > > Co-developed-by: Guo Ren <guoren@linux.alibaba.com> > > > > > Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > > > > > Co-developed-by: Vincent Chen <vincent.chen@sifive.com> > > > > > Signed-off-by: Vincent Chen <vincent.chen@sifive.com> > > > > > Signed-off-by: Greentime Hu <greentime.hu@sifive.com> > > > > > --- > > > > > arch/riscv/include/asm/switch_to.h | 66 +++++++++++++++++++++++ > > > > > arch/riscv/kernel/Makefile | 1 + > > > > > arch/riscv/kernel/process.c | 38 ++++++++++++++ > > > > > arch/riscv/kernel/vector.S | 84 ++++++++++++++++++++++++++++++ > > > > > 4 files changed, 189 insertions(+) > > > > > create mode 100644 arch/riscv/kernel/vector.S > > > > > > > > > > diff --git a/arch/riscv/include/asm/switch_to.h b/arch/riscv/include/asm/switch_to.h > > > > > index ec83770b3d98..de0573dad78f 100644 > > > > > --- a/arch/riscv/include/asm/switch_to.h > > > > > +++ b/arch/riscv/include/asm/switch_to.h > > > > > @@ -7,10 +7,12 @@ > > > > > #define _ASM_RISCV_SWITCH_TO_H > > > > > > > > > > #include <linux/jump_label.h> > > > > > +#include <linux/slab.h> > > > > > #include <linux/sched/task_stack.h> > > > > > #include <asm/processor.h> > > > > > #include <asm/ptrace.h> > > > > > #include <asm/csr.h> > > > > > +#include <asm/asm-offsets.h> > > > > > > > > > > #ifdef CONFIG_FPU > > > > > extern void __fstate_save(struct task_struct *save_to); > > > > > @@ -68,6 +70,68 @@ static __always_inline bool has_fpu(void) { return false; } > > > > > #define __switch_to_fpu(__prev, __next) do { } while (0) > > > > > #endif > > > > > > > > > > +#ifdef CONFIG_VECTOR > > > > > +extern bool has_vector; > > > > > +extern unsigned long riscv_vsize; > > > > > +extern void __vstate_save(struct __riscv_v_state *save_to, void *datap); > > > > > +extern void __vstate_restore(struct __riscv_v_state *restore_from, void *datap); > > > > > + > > > > > +static inline void __vstate_clean(struct pt_regs *regs) > > > > > +{ > > > > > + regs->status = (regs->status & ~(SR_VS)) | SR_VS_CLEAN; > > > > > +} > > > > > + > > > > > +static inline void vstate_off(struct task_struct *task, > > > > > + struct pt_regs *regs) > > > > > +{ > > > > > + regs->status = (regs->status & ~SR_VS) | SR_VS_OFF; > > > > > +} > > > > > + > > > > > +static inline void vstate_save(struct task_struct *task, > > > > > + struct pt_regs *regs) > > > > > +{ > > > > > + if ((regs->status & SR_VS) == SR_VS_DIRTY) { > > > > > + struct __riscv_v_state *vstate = &(task->thread.vstate); > > > > > + > > > > > + __vstate_save(vstate, vstate->datap); > > > > > + __vstate_clean(regs); > > > > > + } > > > > > +} > > > > > + > > > > > 
+static inline void vstate_restore(struct task_struct *task, > > > > > + struct pt_regs *regs) > > > > > +{ > > > > > + if ((regs->status & SR_VS) != SR_VS_OFF) { > > > > > + struct __riscv_v_state *vstate = &(task->thread.vstate); > > > > > + > > > > > + /* Allocate space for vector registers. */ > > > > > + if (!vstate->datap) { > > > > > + vstate->datap = kzalloc(riscv_vsize, GFP_ATOMIC); > > > > > + vstate->size = riscv_vsize; > > > > > + } > > > > > + __vstate_restore(vstate, vstate->datap); > > > > > + __vstate_clean(regs); > > > > > + } > > > > > +} > > > > > + > > > > > +static inline void __switch_to_vector(struct task_struct *prev, > > > > > + struct task_struct *next) > > > > > +{ > > > > > + struct pt_regs *regs; > > > > > + > > > > > + regs = task_pt_regs(prev); > > > > > + if (unlikely(regs->status & SR_SD)) > > > > > + vstate_save(prev, regs); > > > > > + vstate_restore(next, task_pt_regs(next)); > > > > > +} > > > > > + > > > > > +#else > > > > > +#define has_vector false > > > > > +#define vstate_save(task, regs) do { } while (0) > > > > > +#define vstate_restore(task, regs) do { } while (0) > > > > > +#define __switch_to_vector(__prev, __next) do { } while (0) > > > > > +#endif > > > > > + > > > > > extern struct task_struct *__switch_to(struct task_struct *, > > > > > struct task_struct *); > > > > > > > > > > @@ -77,6 +141,8 @@ do { \ > > > > > struct task_struct *__next = (next); \ > > > > > if (has_fpu()) \ > > > > > __switch_to_fpu(__prev, __next); \ > > > > > + if (has_vector) \ > > > > > + __switch_to_vector(__prev, __next); \ > > > > > ((last) = __switch_to(__prev, __next)); \ > > > > > } while (0) > > > > > > > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile > > > > > index 3397ddac1a30..344078080839 100644 > > > > > --- a/arch/riscv/kernel/Makefile > > > > > +++ b/arch/riscv/kernel/Makefile > > > > > @@ -40,6 +40,7 @@ obj-$(CONFIG_MMU) += vdso.o vdso/ > > > > > > > > > > obj-$(CONFIG_RISCV_M_MODE) += traps_misaligned.o > > > > > obj-$(CONFIG_FPU) += fpu.o > > > > > +obj-$(CONFIG_VECTOR) += vector.o > > > > > obj-$(CONFIG_SMP) += smpboot.o > > > > > obj-$(CONFIG_SMP) += smp.o > > > > > obj-$(CONFIG_SMP) += cpu_ops.o > > > > > diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c > > > > > index 03ac3aa611f5..0b86e9e531c9 100644 > > > > > --- a/arch/riscv/kernel/process.c > > > > > +++ b/arch/riscv/kernel/process.c > > > > > @@ -95,6 +95,16 @@ void start_thread(struct pt_regs *regs, unsigned long pc, > > > > > */ > > > > > fstate_restore(current, regs); > > > > > } > > > > > + > > > > > + if (has_vector) { > > > > > + regs->status |= SR_VS_INITIAL; > > > > > + /* > > > > > + * Restore the initial value to the vector register > > > > > + * before starting the user program. > > > > > + */ > > > > > + vstate_restore(current, regs); > > > > > + } > > > > > + > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > memory for vector state, for all processes, regardless of whether vector > > > > instructions are used? > > > > > > > > > > Hi Darius, > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > choose to enable and allocate memory for user space program is because > > > we also implement some common functions in the glibc such as memcpy > > > vector version and it is called very often by every process. So that > > > we assume if the user program is running in a CPU with vector ISA > > > would like to use vector by default. 
If we disable it by default and > > > make it trigger the illegal instruction, that might be a burden since > > > almost every process will use vector glibc memcpy or something like > > > that. > > > > Do you have any evidence to support the assertion that almost every process > > would use vector operations? One could easily argue that the converse is > > true: no existing software uses the vector extension now, so most likely a > > process will not be using it. > > Glibc ustreaming is just starting so you didn't see software using the > vector extension now and this patchset is testing based on those > optimized glibc too. > Vincent Chen is working on the glibc vector support upstreaming and we > will also upstream the vector version glibc memcpy, memcmp, memchr, > memmove, memset, strcmp, strlen. > Then we will see platform with vector support can use vector version > mem* and str* functions automatically based on ifunc and platform > without vector will use the original one automatically. These could be > done to select the correct optimized glibc functions by ifunc > mechanism. > > > > > > > > > > Given the size of the vector state and potential power and performance > > > > implications of enabling the vector engine, it seems like this should > > > > treated similarly to Intel AMX on x86. The full discussion of that is > > > > here: > > > > > > > > https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > > > > > > The cover letter for recent Intel AMX patches has a summary of the x86 > > > > implementation: > > > > > > > > https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ > > > > > > > > If RISC-V were to adopt a similar approach, I think the significant > > > > points are: > > > > > > > > 1. A process (or thread) must specifically request the desire to use > > > > vector extensions (perhaps with some new arch_prctl() API), > > > > > > > > 2. The kernel is free to deny permission, perhaps based on > > > > administrative rules or for other reasons, and > > > > > > > > 3. If a process attempts to use vector extensions before doing the > > > > above, the process will die due to an illegal instruction. > > > > > > Thank you for sharing this, but I am not sure if we should treat > > > vector like AMX on x86. IMHO, compiler might generate code with vector > > > instructions automatically someday, maybe we should treat vector > > > extensions like other extensions. > > > If user knows the vector extension is supported in this CPU and he > > > would like to use it, it seems we should let user use it directly just > > > like other extensions. > > > If user don't know it exists or not, user should use the library API > > > transparently and let glibc or other library deal with it. The glibc > > > ifunc feature or multi-lib should be able to choose the correct > > > implementation. > > > > What makes me think that the vector extension should be treated like AMX is > > that they both (1) have a significant amount of architectural state, and > > (2) likely have a significant power and/or area impact on (non-emulated) > > designs. > > > > For example, I think it is possible, maybe even likely, that vector > > implementations will have one or more of the following behaviors: > > > > 1. A single vector unit shared among two or more harts, > > > > 2. Additional power consumption when the vector unit is enabled and idle > > versus not being enabled at all, > > > > 3. 
For a system which supports variable operating frequency, a reduction > > in the maximum frequency when the vector unit is enabled, and/or > > > > 4. The inability to enter low power states and/or delays to low power > > states transitions when the vector unit is enabled. > > > > None of the above constraints apply to more ordinary extensions like > > compressed or the various bit manipulation extensions. > > > > The discussion I linked to has some well reasoned arguments on why > > substantial extensions should have a mechanism to request using them by > > user space. The discussion was in the context of Intel AMX, but applies to > > further x86 extensions, and I think should also apply to similar extensions > > on RISC-V, like vector here. > > Have you ever checked the SVE/SVE2 of ARM64 implementation in Linux kernel too? > IMHO, the vector of RISCV should be closer to the SVE2 of ARM64. For SVE on arm64, memory is only allocated and the extension is only enabled when a process is actively using it, which is not what this patch set does. If the memory allocation for state memory fails, it triggers a BUG(); there is no graceful way to report this to the application. To do something similar for RISC-V, you will need to write an illegal instruction handler to retrieve the faulting instruction and partially decode it enough to determine it is a vector instruction. That seems needlessly complicated, doesn't provide a way to gracefully report an error if memory allocation fails, and doesn't provide any of the other benefits that a defined API to request use of the vector extension would provide. Did you read the discussion on Intel AMX support that I previously linked to? There are well reasoned arguments why it is beneficial to require that a process request access to substantial extensions, like RISC-V vector.
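To make the SVE-style path Darius describes concrete, here is a rough, non-authoritative sketch of what a first-use trap could look like. It reuses names from this patch set (has_vector, riscv_vsize, struct __riscv_v_state, SR_VS, SR_VS_INITIAL), but the hook itself and the insn_is_vector() helper are hypothetical, the instruction decode is elided, and error handling is simplified; the point is only to show where the allocation, and the reporting of an allocation failure, would move.

/*
 * Hypothetical fragment (not part of this patch set): allocate vector
 * state lazily from the illegal-instruction trap. The task starts
 * with VS off, so its first vector instruction traps here.
 */
static bool riscv_v_first_use(struct pt_regs *regs, u32 insn)
{
	struct __riscv_v_state *vstate = &current->thread.vstate;

	/* insn_is_vector() is a hypothetical decode helper. */
	if (!has_vector || !insn_is_vector(insn))
		return false;	/* not ours; fall through to SIGILL */

	if (!vstate->datap) {
		/* The trap came from user context, so a sleeping
		 * allocation is possible here, unlike in __switch_to(). */
		vstate->datap = kzalloc(riscv_vsize, GFP_KERNEL);
		if (!vstate->datap) {
			/* Report the failure to the task instead of BUG(). */
			force_sig(SIGBUS);
			return true;
		}
		vstate->size = riscv_vsize;
	}

	/* Enable V for this task; returning without advancing epc
	 * re-executes the trapping instruction. */
	regs->status = (regs->status & ~SR_VS) | SR_VS_INITIAL;
	return true;
}

Whether such a failure should be a signal or an error returned from an explicit request API is one of the questions the AMX thread referenced above went through.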
Ley Foon Tan <lftan.linux@gmail.com> 於 2021年10月5日 週二 上午10:12寫道: > > On Mon, Oct 4, 2021 at 8:41 PM Greentime Hu <greentime.hu@sifive.com> wrote: > > > > Ley Foon Tan <lftan.linux@gmail.com> 於 2021年10月1日 週五 上午10:46寫道: > > > > > > On Wed, Sep 29, 2021 at 11:54 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > [....] > > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > instructions are used? > > > > > > > > > > > > > > > > Hi Darius, > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > choose to enable and allocate memory for user space program is because > > > > > we also implement some common functions in the glibc such as memcpy > > > > > vector version and it is called very often by every process. So that > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > would like to use vector by default. If we disable it by default and > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > almost every process will use vector glibc memcpy or something like > > > > > that. > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > would use vector operations? One could easily argue that the converse is > > > > true: no existing software uses the vector extension now, so most likely a > > > > process will not be using it. > > > > > > > > > > > > > > > Given the size of the vector state and potential power and performance > > > > > > implications of enabling the vector engine, it seems like this should > > > > > > treated similarly to Intel AMX on x86. The full discussion of that is > > > > > > here: > > > > > > > > > > > > https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > > > > > > > > > > The cover letter for recent Intel AMX patches has a summary of the x86 > > > > > > implementation: > > > > > > > > > > > > https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ > > > > > > > > > > > > If RISC-V were to adopt a similar approach, I think the significant > > > > > > points are: > > > > > > > > > > > > 1. A process (or thread) must specifically request the desire to use > > > > > > vector extensions (perhaps with some new arch_prctl() API), > > > > > > > > > > > > 2. The kernel is free to deny permission, perhaps based on > > > > > > administrative rules or for other reasons, and > > > > > > > > > > > > 3. If a process attempts to use vector extensions before doing the > > > > > > above, the process will die due to an illegal instruction. > > > > > > > > > > Thank you for sharing this, but I am not sure if we should treat > > > > > vector like AMX on x86. IMHO, compiler might generate code with vector > > > > > instructions automatically someday, maybe we should treat vector > > > > > extensions like other extensions. > > > > > If user knows the vector extension is supported in this CPU and he > > > > > would like to use it, it seems we should let user use it directly just > > > > > like other extensions. > > > > > If user don't know it exists or not, user should use the library API > > > > > transparently and let glibc or other library deal with it. 
The glibc > > > > > ifunc feature or multi-lib should be able to choose the correct > > > > > implementation. > > > > > > > > What makes me think that the vector extension should be treated like AMX is > > > > that they both (1) have a significant amount of architectural state, and > > > > (2) likely have a significant power and/or area impact on (non-emulated) > > > > designs. > > > > > > > > For example, I think it is possible, maybe even likely, that vector > > > > implementations will have one or more of the following behaviors: > > > > > > > > 1. A single vector unit shared among two or more harts, > > > > > > > > 2. Additional power consumption when the vector unit is enabled and idle > > > > versus not being enabled at all, > > > > > > > > 3. For a system which supports variable operating frequency, a reduction > > > > in the maximum frequency when the vector unit is enabled, and/or > > > > > > > > 4. The inability to enter low power states and/or delays to low power > > > > states transitions when the vector unit is enabled. > > > > > > > > None of the above constraints apply to more ordinary extensions like > > > > compressed or the various bit manipulation extensions. > > > > > > > > The discussion I linked to has some well reasoned arguments on why > > > > substantial extensions should have a mechanism to request using them by > > > > user space. The discussion was in the context of Intel AMX, but applies to > > > > further x86 extensions, and I think should also apply to similar extensions > > > > on RISC-V, like vector here. > > > > > > > There is possible use case where not all cores support vector > > > extension due to size, area and power. > > > Perhaps can have the mechanism or flow to determine the > > > application/thread require vector extension or it specifically request > > > the desire to use > > > vector extensions. Then this app/thread run on cpu with vector > > > extension capability only. > > > > > > > IIRC, we assume all harts has the same ability in Linux because of SMP > > assumption. > > If we have more information of hw capability and we may use this > > information for scheduler to switch the correct process to the correct > > CPU. > > Do you have any idea how to implement it in Linux kernel? Maybe we can > > list in the TODO list. > I think we can refer to other arch implementations as reference: > > 1. ARM64 supports 32-bit thread on asymmetric AArch32 systems. There > is a flag in ELF to check, then start the thread on the core that > supports 32-bit execution. This patchset is merged to mainline 5.15. > https://lore.kernel.org/linux-arm-kernel/20210730112443.23245-8-will@kernel.org/T/ Wow! This is useful for AMP. > > 2. Link shared by Darius, on-demand request implementation on Intel AMX > https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ > > glibc support optimized library functions with vector, this is enabled > by default if compiler is with vector extension enabled? If yes, then > most of the app required vector core. As I mentioned earlier, glibc ifunc will solve this issue. The Linux/glibc can run on platform with vector or without vector and glibc will use the information get from Linux kernel and using ifunc to decide whether it should use the vector version or not. Which means even your toolchain has vector glibc support and your Linux kernel told the glibc this platform doesn't support vector then the ifunc mechanism will choose the non-vector version ones.
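As a concrete illustration of the ifunc mechanism Greentime refers to, the sketch below shows the usual GNU ifunc pattern: one exported symbol whose implementation is chosen once, at load time, by a resolver. How the kernel would report vector support to user space is an assumption here; the 'V' bit in AT_HWCAP follows the existing single-letter convention but is not defined by this patch set, and the "vector" variant is a stand-in rather than a real RVV memcpy.

/*
 * Sketch of glibc-style ifunc dispatch. The AT_HWCAP 'V' bit is an
 * assumption; real glibc resolvers also run very early, so calling
 * getauxval() here is a simplification for illustration.
 */
#include <stddef.h>
#include <stdio.h>
#include <sys/auxv.h>

static void *memcpy_scalar(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

static void *memcpy_vector(void *dst, const void *src, size_t n)
{
	/* A real version would use RVV; a byte copy stands in so the
	 * sketch stays self-contained. */
	return memcpy_scalar(dst, src, n);
}

/* Resolver runs once when the dynamic linker binds my_memcpy. */
static void *(*resolve_my_memcpy(void))(void *, const void *, size_t)
{
	unsigned long hwcap = getauxval(AT_HWCAP);

	/* Assumption: 'V' reported like the other single-letter ISA bits. */
	if (hwcap & (1UL << ('V' - 'A')))
		return memcpy_vector;
	return memcpy_scalar;
}

void *my_memcpy(void *dst, const void *src, size_t n)
	__attribute__((ifunc("resolve_my_memcpy")));

int main(void)
{
	char buf[16];

	my_memcpy(buf, "hello, vector", 14);
	puts(buf);
	return 0;
}

With this pattern the same binary runs on both vector and non-vector platforms, and the choice is made from whatever capability information the kernel exposes, which is the behavior Greentime describes.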
On Tue, Oct 5, 2021 at 11:47 PM Greentime Hu <greentime.hu@sifive.com> wrote: > > Ley Foon Tan <lftan.linux@gmail.com> 於 2021年10月5日 週二 上午10:12寫道: > > > > On Mon, Oct 4, 2021 at 8:41 PM Greentime Hu <greentime.hu@sifive.com> wrote: > > > > > > Ley Foon Tan <lftan.linux@gmail.com> 於 2021年10月1日 週五 上午10:46寫道: > > > > > > > > On Wed, Sep 29, 2021 at 11:54 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > [....] > > > > > > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > instructions are used? > > > > > > > > > > > > > > > > > > > Hi Darius, > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > choose to enable and allocate memory for user space program is because > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > vector version and it is called very often by every process. So that > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > would like to use vector by default. If we disable it by default and > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > that. > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > would use vector operations? One could easily argue that the converse is > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > process will not be using it. > > > > > > > > > > > > > > > > > > Given the size of the vector state and potential power and performance > > > > > > > implications of enabling the vector engine, it seems like this should > > > > > > > treated similarly to Intel AMX on x86. The full discussion of that is > > > > > > > here: > > > > > > > > > > > > > > https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ > > > > > > > > > > > > > > The cover letter for recent Intel AMX patches has a summary of the x86 > > > > > > > implementation: > > > > > > > > > > > > > > https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ > > > > > > > > > > > > > > If RISC-V were to adopt a similar approach, I think the significant > > > > > > > points are: > > > > > > > > > > > > > > 1. A process (or thread) must specifically request the desire to use > > > > > > > vector extensions (perhaps with some new arch_prctl() API), > > > > > > > > > > > > > > 2. The kernel is free to deny permission, perhaps based on > > > > > > > administrative rules or for other reasons, and > > > > > > > > > > > > > > 3. If a process attempts to use vector extensions before doing the > > > > > > > above, the process will die due to an illegal instruction. > > > > > > > > > > > > Thank you for sharing this, but I am not sure if we should treat > > > > > > vector like AMX on x86. IMHO, compiler might generate code with vector > > > > > > instructions automatically someday, maybe we should treat vector > > > > > > extensions like other extensions. 
> > > > > > If user knows the vector extension is supported in this CPU and he > > > > > > would like to use it, it seems we should let user use it directly just > > > > > > like other extensions. > > > > > > If user don't know it exists or not, user should use the library API > > > > > > transparently and let glibc or other library deal with it. The glibc > > > > > > ifunc feature or multi-lib should be able to choose the correct > > > > > > implementation. > > > > > > > > > > What makes me think that the vector extension should be treated like AMX is > > > > > that they both (1) have a significant amount of architectural state, and > > > > > (2) likely have a significant power and/or area impact on (non-emulated) > > > > > designs. > > > > > > > > > > For example, I think it is possible, maybe even likely, that vector > > > > > implementations will have one or more of the following behaviors: > > > > > > > > > > 1. A single vector unit shared among two or more harts, > > > > > > > > > > 2. Additional power consumption when the vector unit is enabled and idle > > > > > versus not being enabled at all, > > > > > > > > > > 3. For a system which supports variable operating frequency, a reduction > > > > > in the maximum frequency when the vector unit is enabled, and/or > > > > > > > > > > 4. The inability to enter low power states and/or delays to low power > > > > > states transitions when the vector unit is enabled. > > > > > > > > > > None of the above constraints apply to more ordinary extensions like > > > > > compressed or the various bit manipulation extensions. > > > > > > > > > > The discussion I linked to has some well reasoned arguments on why > > > > > substantial extensions should have a mechanism to request using them by > > > > > user space. The discussion was in the context of Intel AMX, but applies to > > > > > further x86 extensions, and I think should also apply to similar extensions > > > > > on RISC-V, like vector here. > > > > > > > > > There is possible use case where not all cores support vector > > > > extension due to size, area and power. > > > > Perhaps can have the mechanism or flow to determine the > > > > application/thread require vector extension or it specifically request > > > > the desire to use > > > > vector extensions. Then this app/thread run on cpu with vector > > > > extension capability only. > > > > > > > > > > IIRC, we assume all harts has the same ability in Linux because of SMP > > > assumption. > > > If we have more information of hw capability and we may use this > > > information for scheduler to switch the correct process to the correct > > > CPU. > > > Do you have any idea how to implement it in Linux kernel? Maybe we can > > > list in the TODO list. > > I think we can refer to other arch implementations as reference: > > > > 1. ARM64 supports 32-bit thread on asymmetric AArch32 systems. There > > is a flag in ELF to check, then start the thread on the core that > > supports 32-bit execution. This patchset is merged to mainline 5.15. > > https://lore.kernel.org/linux-arm-kernel/20210730112443.23245-8-will@kernel.org/T/ > > Wow! This is useful for AMP. > > > > > 2. Link shared by Darius, on-demand request implementation on Intel AMX > > https://lore.kernel.org/lkml/20210825155413.19673-1-chang.seok.bae@intel.com/ > > > > glibc support optimized library functions with vector, this is enabled > > by default if compiler is with vector extension enabled? If yes, then > > most of the app required vector core. 
> > As I mentioned earlier, glibc ifunc will solve this issue. The > Linux/glibc can run on platform with vector or without vector and > glibc will use the information get from Linux kernel and using ifunc > to decide whether it should use the vector version or not. > Which means even your toolchain has vector glibc support and your > Linux kernel told the glibc this platform doesn't support vector then > the ifunc mechanism will choose the non-vector version ones. Okay. Then the Linux kernel needs to report vector capability as a per-core feature if not all SMP cores support vector.
Hello Darius, On Tue, 5 Oct 2021, Darius Rad wrote: > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > save and restore mechanism. It also supports all lengths of vlen. [ ... ] > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > instructions are used? > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > choose to enable and allocate memory for user space program is because > > > > we also implement some common functions in the glibc such as memcpy > > > > vector version and it is called very often by every process. So that > > > > we assume if the user program is running in a CPU with vector ISA > > > > would like to use vector by default. If we disable it by default and > > > > make it trigger the illegal instruction, that might be a burden since > > > > almost every process will use vector glibc memcpy or something like > > > > that. > > > > > > Do you have any evidence to support the assertion that almost every process > > > would use vector operations? One could easily argue that the converse is > > > true: no existing software uses the vector extension now, so most likely a > > > process will not be using it. > > > > Glibc ustreaming is just starting so you didn't see software using the > > vector extension now and this patchset is testing based on those > > optimized glibc too. Vincent Chen is working on the glibc vector > > support upstreaming and we will also upstream the vector version glibc > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > see platform with vector support can use vector version mem* and str* > > functions automatically based on ifunc and platform without vector > > will use the original one automatically. These could be done to select > > the correct optimized glibc functions by ifunc mechanism. In your reply, I noticed that you didn't address Greentime's response here. But this looks like the key issue. If common library functions are vector-accelerated, wouldn't it make sense that almost every process would wind up using vector instructions? And thus there wouldn't be much point to skipping the vector context memory allocation? - Paul
On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > Hello Darius, > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > [ ... ] > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > instructions are used? > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > choose to enable and allocate memory for user space program is because > > > > > we also implement some common functions in the glibc such as memcpy > > > > > vector version and it is called very often by every process. So that > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > would like to use vector by default. If we disable it by default and > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > almost every process will use vector glibc memcpy or something like > > > > > that. > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > would use vector operations? One could easily argue that the converse is > > > > true: no existing software uses the vector extension now, so most likely a > > > > process will not be using it. > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > vector extension now and this patchset is testing based on those > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > support upstreaming and we will also upstream the vector version glibc > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > see platform with vector support can use vector version mem* and str* > > > functions automatically based on ifunc and platform without vector > > > will use the original one automatically. These could be done to select > > > the correct optimized glibc functions by ifunc mechanism. > > In your reply, I noticed that you didn't address Greentime's response > here. But this looks like the key issue. If common library functions are > vector-accelerated, wouldn't it make sense that almost every process would > wind up using vector instructions? And thus there wouldn't be much point > to skipping the vector context memory allocation? > This issue was addressed in the thread regarding Intel AMX I linked to in a previous message. I don't agree that this is the key issue; it is one of a number of issues. What if I don't want to take the potential power/frequency hit for the vector unit for a workload that, at best, uses it for the occasional memcpy? What if the allocation fails, how will that get reported to user space (hint: not well)? According to Greentime, RISC-V vector is similar to ARM SVE, which allocates memory for context state on first use and not unconditionally for all processes. // darius
On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > Hello Darius, > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > [ ... ] > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > instructions are used? > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > choose to enable and allocate memory for user space program is because > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > vector version and it is called very often by every process. So that > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > would like to use vector by default. If we disable it by default and > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > that. > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > would use vector operations? One could easily argue that the converse is > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > process will not be using it. > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > vector extension now and this patchset is testing based on those > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > support upstreaming and we will also upstream the vector version glibc > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > > see platform with vector support can use vector version mem* and str* > > > > functions automatically based on ifunc and platform without vector > > > > will use the original one automatically. These could be done to select > > > > the correct optimized glibc functions by ifunc mechanism. > > > > In your reply, I noticed that you didn't address Greentime's response > > here. But this looks like the key issue. If common library functions are > > vector-accelerated, wouldn't it make sense that almost every process would > > wind up using vector instructions? And thus there wouldn't be much point > > to skipping the vector context memory allocation? > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > previous message. I don't agree that this is the key issue; it is one of a > number of issues. What if I don't want to take the potential > power/frequency hit for the vector unit for a workload that, at best, uses > it for the occasional memcpy? What if the allocation fails, how will that Hi Darius, The memcpy function seems not to be occasionally used in the programs because many functions in Glibc use memcpy() to complete the memory copy. 
I use the following simple case as an example: test.c: void main(void) { return; } Then we compile it with "gcc test.c -o a.out" and execute it. During execution, memcpy() is called even though the program itself never calls it, because many libc initialization functions run before the user-defined main function. One example is __libc_setup_tls(), which is called by __libc_start_main(); __libc_setup_tls() uses memcpy() while creating the Dynamic Thread Vector (DTV). Therefore, I think memcpy() is widely used in most programs. > get reported to user space (hint: not well)? According to Greentime, > RISC-V vector is similar to ARM SVE, which allocates memory for context > state on first use and not unconditionally for all processes. > > // darius >
On Fri, Oct 22, 2021 at 11:52:01AM +0800, Vincent Chen wrote: > On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > > Hello Darius, > > > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > [ ... ] > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > > instructions are used? > > > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > > choose to enable and allocate memory for user space program is because > > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > > vector version and it is called very often by every process. So that > > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > > would like to use vector by default. If we disable it by default and > > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > > that. > > > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > > would use vector operations? One could easily argue that the converse is > > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > > process will not be using it. > > > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > > vector extension now and this patchset is testing based on those > > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > > support upstreaming and we will also upstream the vector version glibc > > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > > > see platform with vector support can use vector version mem* and str* > > > > > functions automatically based on ifunc and platform without vector > > > > > will use the original one automatically. These could be done to select > > > > > the correct optimized glibc functions by ifunc mechanism. > > > > > > In your reply, I noticed that you didn't address Greentime's response > > > here. But this looks like the key issue. If common library functions are > > > vector-accelerated, wouldn't it make sense that almost every process would > > > wind up using vector instructions? And thus there wouldn't be much point > > > to skipping the vector context memory allocation? > > > > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > > previous message. I don't agree that this is the key issue; it is one of a > > number of issues. What if I don't want to take the potential > > power/frequency hit for the vector unit for a workload that, at best, uses > > it for the occasional memcpy? 
What if the allocation fails, how will that > > Hi Darius, > The memcpy function seems not to be occasionally used in the programs > because many functions in Glibc use memcpy() to complete the memory > copy. I use the following simple case as an example. > test.c > void main(void) { > return; > } > Then, we compile it by "gcc test.c -o a.out" and execute it. In the > execution, the memcpy() has been called unexpectedly. It is because > many libc initialized functions will be executed before entering the > user-defined main function. One of the example is __libc_setup_tls(), > which is called by __libc_start_main(). The __libc_setup_tls() will > use memcpy() during the process of creating the Dynamic Thread Vector > (DTV). > > Therefore, I think the memcpy() is widely used in most programs. > You're missing my point. Not every (any?) program spends a majority of the time doing memcpy(), and even if a program did, all of my points are still valid. Please read the discussion in the thread I referenced and the questions in my prior message. > > get reported to user space (hint: not well)? According to Greentime, > > RISC-V vector is similar to ARM SVE, which allocates memory for context > > state on first use and not unconditionally for all processes. > >
Darius Rad <darius@bluespec.com> 於 2021年10月22日 週五 下午6:40寫道: > > On Fri, Oct 22, 2021 at 11:52:01AM +0800, Vincent Chen wrote: > > On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > > > Hello Darius, > > > > > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > [ ... ] > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > > > instructions are used? > > > > > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > > > choose to enable and allocate memory for user space program is because > > > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > > > vector version and it is called very often by every process. So that > > > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > > > would like to use vector by default. If we disable it by default and > > > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > > > that. > > > > > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > > > would use vector operations? One could easily argue that the converse is > > > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > > > process will not be using it. > > > > > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > > > vector extension now and this patchset is testing based on those > > > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > > > support upstreaming and we will also upstream the vector version glibc > > > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > > > > see platform with vector support can use vector version mem* and str* > > > > > > functions automatically based on ifunc and platform without vector > > > > > > will use the original one automatically. These could be done to select > > > > > > the correct optimized glibc functions by ifunc mechanism. > > > > > > > > In your reply, I noticed that you didn't address Greentime's response > > > > here. But this looks like the key issue. If common library functions are > > > > vector-accelerated, wouldn't it make sense that almost every process would > > > > wind up using vector instructions? And thus there wouldn't be much point > > > > to skipping the vector context memory allocation? > > > > > > > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > > > previous message. I don't agree that this is the key issue; it is one of a > > > number of issues. 
What if I don't want to take the potential > > > power/frequency hit for the vector unit for a workload that, at best, uses > > > it for the occasional memcpy? What if the allocation fails, how will that > > > > Hi Darius, > > The memcpy function seems not to be occasionally used in the programs > > because many functions in Glibc use memcpy() to complete the memory > > copy. I use the following simple case as an example. > > test.c > > void main(void) { > > return; > > } > > Then, we compile it by "gcc test.c -o a.out" and execute it. In the > > execution, the memcpy() has been called unexpectedly. It is because > > many libc initialized functions will be executed before entering the > > user-defined main function. One of the example is __libc_setup_tls(), > > which is called by __libc_start_main(). The __libc_setup_tls() will > > use memcpy() during the process of creating the Dynamic Thread Vector > > (DTV). > > > > Therefore, I think the memcpy() is widely used in most programs. > > > > You're missing my point. Not every (any?) program spends a majority of the > time doing memcpy(), and even if a program did, all of my points are still > valid. > > Please read the discussion in the thread I referenced and the questions in > my prior message. > Hi Darius, As I mentioned before, we want to treat the vector ISA like a general-purpose ISA rather than a special-purpose IP block. A user program should be able to use it transparently, just like the FPU. The use case you describe treats vector like a special-purpose IP: the user program has to ask the kernel for permission before using it, and that is not what we want to do in this patchset.
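Purely to make the two positions concrete for comparison, the opt-in model Darius is pointing at (and which the x86 AMX thread converged on) might look roughly like the following from user space. The PR_RISCV_V_ENABLE request and its number are invented for this sketch; no such interface exists in this patch set or in the kernel being discussed.

/*
 * Hypothetical sketch of an opt-in flow. PR_RISCV_V_ENABLE is a
 * made-up request standing in for whatever prctl()/arch-specific
 * call the kernel might eventually define.
 */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_RISCV_V_ENABLE
#define PR_RISCV_V_ENABLE 0x52560001	/* invented request number */
#endif

int main(void)
{
	/* Ask the kernel for permission before executing any vector
	 * instruction; the kernel is free to refuse. */
	if (prctl(PR_RISCV_V_ENABLE, 0, 0, 0, 0) != 0) {
		/* Fall back to the scalar code path. */
		fprintf(stderr, "vector not available, using scalar path\n");
		return 0;
	}

	/* Vector-using code (e.g. an RVV memcpy) would go here. */
	return 0;
}

Under Greentime's model, by contrast, no such request is needed and the glibc ifunc resolvers simply pick the vector routines whenever the platform reports vector support.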
On Mon, Oct 25, 2021 at 12:47:49PM +0800, Greentime Hu wrote: > Darius Rad <darius@bluespec.com> 於 2021年10月22日 週五 下午6:40寫道: > > > > On Fri, Oct 22, 2021 at 11:52:01AM +0800, Vincent Chen wrote: > > > On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > > > > Hello Darius, > > > > > > > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > > > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > > > > instructions are used? > > > > > > > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > > > > choose to enable and allocate memory for user space program is because > > > > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > > > > vector version and it is called very often by every process. So that > > > > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > > > > would like to use vector by default. If we disable it by default and > > > > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > > > > that. > > > > > > > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > > > > would use vector operations? One could easily argue that the converse is > > > > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > > > > process will not be using it. > > > > > > > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > > > > vector extension now and this patchset is testing based on those > > > > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > > > > support upstreaming and we will also upstream the vector version glibc > > > > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > > > > > see platform with vector support can use vector version mem* and str* > > > > > > > functions automatically based on ifunc and platform without vector > > > > > > > will use the original one automatically. These could be done to select > > > > > > > the correct optimized glibc functions by ifunc mechanism. > > > > > > > > > > In your reply, I noticed that you didn't address Greentime's response > > > > > here. But this looks like the key issue. If common library functions are > > > > > vector-accelerated, wouldn't it make sense that almost every process would > > > > > wind up using vector instructions? And thus there wouldn't be much point > > > > > to skipping the vector context memory allocation? 
> > > > > > > > > > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > > > > previous message. I don't agree that this is the key issue; it is one of a > > > > number of issues. What if I don't want to take the potential > > > > power/frequency hit for the vector unit for a workload that, at best, uses > > > > it for the occasional memcpy? What if the allocation fails, how will that > > > > > > Hi Darius, > > > The memcpy function seems not to be occasionally used in the programs > > > because many functions in Glibc use memcpy() to complete the memory > > > copy. I use the following simple case as an example. > > > test.c > > > void main(void) { > > > return; > > > } > > > Then, we compile it by "gcc test.c -o a.out" and execute it. In the > > > execution, the memcpy() has been called unexpectedly. It is because > > > many libc initialized functions will be executed before entering the > > > user-defined main function. One of the example is __libc_setup_tls(), > > > which is called by __libc_start_main(). The __libc_setup_tls() will > > > use memcpy() during the process of creating the Dynamic Thread Vector > > > (DTV). > > > > > > Therefore, I think the memcpy() is widely used in most programs. > > > > > > > You're missing my point. Not every (any?) program spends a majority of the > > time doing memcpy(), and even if a program did, all of my points are still > > valid. > > > > Please read the discussion in the thread I referenced and the questions in > > my prior message. > > > > Hi Darius, > > As I mentioned before, we want to treat vector ISA like a general ISA > instead of a specific IP. User program should be able to use it > transparently just like FPU. > It seems that the use case you want is asking user to use vector like > a specific IP, user program should ask kernel before they use it and > that is not what we want to do in this patchset. > Hi Greentime, Right. But beyond what I want to do or what you want to do, is what *should* Linux do? I have attempted to provide evidence to support my position. You have not responded to or addressed the majority of my questions, which is concerning to me. // darius
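For concreteness, the "ask the kernel before using it" model at issue here is roughly what x86 was adopting for AMX at the time: an arch_prctl() permission request that the kernel may refuse. A purely hypothetical user-space sketch of such an opt-in for RISC-V vector follows; the prctl request number and constants are invented for illustration and exist neither in this patchset nor in the kernel being discussed.

#include <stdio.h>
#include <sys/prctl.h>

/* Invented values, for illustration only. */
#define PR_RISCV_V_SET_CONTROL     1000
#define PR_RISCV_V_VSTATE_CTRL_ON     1

int main(void)
{
        /*
         * Opt-in model: request vector support before executing any vector
         * instruction; the kernel is free to refuse, e.g. for policy reasons.
         */
        if (prctl(PR_RISCV_V_SET_CONTROL, PR_RISCV_V_VSTATE_CTRL_ON, 0, 0, 0)) {
                perror("vector not permitted");
                return 1;
        }

        /* Only after a successful request would vector code be legal here. */
        puts("vector enabled for this process");
        return 0;
}

The patchset under review takes the opposite position: every task gets SR_VS_INITIAL and a vstate buffer at start_thread() time, with no opt-in call at all.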
Darius Rad <darius@bluespec.com> 於 2021年10月26日 週二 上午12:22寫道: > > On Mon, Oct 25, 2021 at 12:47:49PM +0800, Greentime Hu wrote: > > Darius Rad <darius@bluespec.com> 於 2021年10月22日 週五 下午6:40寫道: > > > > > > On Fri, Oct 22, 2021 at 11:52:01AM +0800, Vincent Chen wrote: > > > > On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > > > > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > > > > > Hello Darius, > > > > > > > > > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > > > > > instructions are used? > > > > > > > > > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > > > > > choose to enable and allocate memory for user space program is because > > > > > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > > > > > vector version and it is called very often by every process. So that > > > > > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > > > > > would like to use vector by default. If we disable it by default and > > > > > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > > > > > would use vector operations? One could easily argue that the converse is > > > > > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > > > > > process will not be using it. > > > > > > > > > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > > > > > vector extension now and this patchset is testing based on those > > > > > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > > > > > support upstreaming and we will also upstream the vector version glibc > > > > > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > > > > > > see platform with vector support can use vector version mem* and str* > > > > > > > > functions automatically based on ifunc and platform without vector > > > > > > > > will use the original one automatically. These could be done to select > > > > > > > > the correct optimized glibc functions by ifunc mechanism. > > > > > > > > > > > > In your reply, I noticed that you didn't address Greentime's response > > > > > > here. But this looks like the key issue. 
If common library functions are > > > > > > vector-accelerated, wouldn't it make sense that almost every process would > > > > > > wind up using vector instructions? And thus there wouldn't be much point > > > > > > to skipping the vector context memory allocation? > > > > > > > > > > > > > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > > > > > previous message. I don't agree that this is the key issue; it is one of a > > > > > number of issues. What if I don't want to take the potential > > > > > power/frequency hit for the vector unit for a workload that, at best, uses > > > > > it for the occasional memcpy? What if the allocation fails, how will that > > > > > > > > Hi Darius, > > > > The memcpy function seems not to be occasionally used in the programs > > > > because many functions in Glibc use memcpy() to complete the memory > > > > copy. I use the following simple case as an example. > > > > test.c > > > > void main(void) { > > > > return; > > > > } > > > > Then, we compile it by "gcc test.c -o a.out" and execute it. In the > > > > execution, the memcpy() has been called unexpectedly. It is because > > > > many libc initialized functions will be executed before entering the > > > > user-defined main function. One of the example is __libc_setup_tls(), > > > > which is called by __libc_start_main(). The __libc_setup_tls() will > > > > use memcpy() during the process of creating the Dynamic Thread Vector > > > > (DTV). > > > > > > > > Therefore, I think the memcpy() is widely used in most programs. > > > > > > > > > > You're missing my point. Not every (any?) program spends a majority of the > > > time doing memcpy(), and even if a program did, all of my points are still > > > valid. > > > > > > Please read the discussion in the thread I referenced and the questions in > > > my prior message. > > > > > > > Hi Darius, > > > > As I mentioned before, we want to treat vector ISA like a general ISA > > instead of a specific IP. User program should be able to use it > > transparently just like FPU. > > It seems that the use case you want is asking user to use vector like > > a specific IP, user program should ask kernel before they use it and > > that is not what we want to do in this patchset. > > > > Hi Greentime, > > Right. > > But beyond what I want to do or what you want to do, is what *should* Linux > do? I have attempted to provide evidence to support my position. You have > not responded to or addressed the majority of my questions, which is > concerning to me. Hi Darius, What is your majority questions?
Hi Darius, Am Freitag, 22. Oktober 2021, 12:40:46 CEST schrieb Darius Rad: > On Fri, Oct 22, 2021 at 11:52:01AM +0800, Vincent Chen wrote: > > On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > > > Hello Darius, > > > > > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > [ ... ] > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > > > instructions are used? > > > > > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > > > choose to enable and allocate memory for user space program is because > > > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > > > vector version and it is called very often by every process. So that > > > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > > > would like to use vector by default. If we disable it by default and > > > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > > > that. > > > > > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > > > would use vector operations? One could easily argue that the converse is > > > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > > > process will not be using it. > > > > > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > > > vector extension now and this patchset is testing based on those > > > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > > > support upstreaming and we will also upstream the vector version glibc > > > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > > > > see platform with vector support can use vector version mem* and str* > > > > > > functions automatically based on ifunc and platform without vector > > > > > > will use the original one automatically. These could be done to select > > > > > > the correct optimized glibc functions by ifunc mechanism. > > > > > > > > In your reply, I noticed that you didn't address Greentime's response > > > > here. But this looks like the key issue. If common library functions are > > > > vector-accelerated, wouldn't it make sense that almost every process would > > > > wind up using vector instructions? And thus there wouldn't be much point > > > > to skipping the vector context memory allocation? > > > > > > > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > > > previous message. I don't agree that this is the key issue; it is one of a > > > number of issues. 
What if I don't want to take the potential > > > power/frequency hit for the vector unit for a workload that, at best, uses > > > it for the occasional memcpy? What if the allocation fails, how will that > > > > Hi Darius, > > The memcpy function seems not to be occasionally used in the programs > > because many functions in Glibc use memcpy() to complete the memory > > copy. I use the following simple case as an example. > > test.c > > void main(void) { > > return; > > } > > Then, we compile it by "gcc test.c -o a.out" and execute it. In the > > execution, the memcpy() has been called unexpectedly. It is because > > many libc initialized functions will be executed before entering the > > user-defined main function. One of the example is __libc_setup_tls(), > > which is called by __libc_start_main(). The __libc_setup_tls() will > > use memcpy() during the process of creating the Dynamic Thread Vector > > (DTV). > > > > Therefore, I think the memcpy() is widely used in most programs. > > > > You're missing my point. Not every (any?) program spends a majority of the > time doing memcpy(), and even if a program did, all of my points are still > valid. > > Please read the discussion in the thread I referenced and the questions in > my prior message. for people reading along at home, do have a different link by chance? I.e. the link to https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/ is not a known message-id on lore.kernel.org it seems. Thanks Heiko > > > get reported to user space (hint: not well)? According to Greentime, > > > RISC-V vector is similar to ARM SVE, which allocates memory for context > > > state on first use and not unconditionally for all processes. > > > > > _______________________________________________ > linux-riscv mailing list > linux-riscv@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv >
On Tue, Oct 26, 2021 at 12:44:31PM +0800, Greentime Hu wrote: > Darius Rad <darius@bluespec.com> 於 2021年10月26日 週二 上午12:22寫道: > > > > On Mon, Oct 25, 2021 at 12:47:49PM +0800, Greentime Hu wrote: > > > Darius Rad <darius@bluespec.com> 於 2021年10月22日 週五 下午6:40寫道: > > > > > > > > On Fri, Oct 22, 2021 at 11:52:01AM +0800, Vincent Chen wrote: > > > > > On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > > > > > > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > > > > > > Hello Darius, > > > > > > > > > > > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > > > > > > instructions are used? > > > > > > > > > > > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > > > > > > choose to enable and allocate memory for user space program is because > > > > > > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > > > > > > vector version and it is called very often by every process. So that > > > > > > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > > > > > > would like to use vector by default. If we disable it by default and > > > > > > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > > > > > > would use vector operations? One could easily argue that the converse is > > > > > > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > > > > > > process will not be using it. > > > > > > > > > > > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > > > > > > vector extension now and this patchset is testing based on those > > > > > > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > > > > > > support upstreaming and we will also upstream the vector version glibc > > > > > > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > > > > > > > see platform with vector support can use vector version mem* and str* > > > > > > > > > functions automatically based on ifunc and platform without vector > > > > > > > > > will use the original one automatically. These could be done to select > > > > > > > > > the correct optimized glibc functions by ifunc mechanism. 
> > > > > > > > > > > > > > In your reply, I noticed that you didn't address Greentime's response > > > > > > > here. But this looks like the key issue. If common library functions are > > > > > > > vector-accelerated, wouldn't it make sense that almost every process would > > > > > > > wind up using vector instructions? And thus there wouldn't be much point > > > > > > > to skipping the vector context memory allocation? > > > > > > > > > > > > > > > > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > > > > > > previous message. I don't agree that this is the key issue; it is one of a > > > > > > number of issues. What if I don't want to take the potential > > > > > > power/frequency hit for the vector unit for a workload that, at best, uses > > > > > > it for the occasional memcpy? What if the allocation fails, how will that > > > > > > > > > > Hi Darius, > > > > > The memcpy function seems not to be occasionally used in the programs > > > > > because many functions in Glibc use memcpy() to complete the memory > > > > > copy. I use the following simple case as an example. > > > > > test.c > > > > > void main(void) { > > > > > return; > > > > > } > > > > > Then, we compile it by "gcc test.c -o a.out" and execute it. In the > > > > > execution, the memcpy() has been called unexpectedly. It is because > > > > > many libc initialized functions will be executed before entering the > > > > > user-defined main function. One of the example is __libc_setup_tls(), > > > > > which is called by __libc_start_main(). The __libc_setup_tls() will > > > > > use memcpy() during the process of creating the Dynamic Thread Vector > > > > > (DTV). > > > > > > > > > > Therefore, I think the memcpy() is widely used in most programs. > > > > > > > > > > > > > You're missing my point. Not every (any?) program spends a majority of the > > > > time doing memcpy(), and even if a program did, all of my points are still > > > > valid. > > > > > > > > Please read the discussion in the thread I referenced and the questions in > > > > my prior message. > > > > > > > > > > Hi Darius, > > > > > > As I mentioned before, we want to treat vector ISA like a general ISA > > > instead of a specific IP. User program should be able to use it > > > transparently just like FPU. > > > It seems that the use case you want is asking user to use vector like > > > a specific IP, user program should ask kernel before they use it and > > > that is not what we want to do in this patchset. > > > > > > > Hi Greentime, > > > > Right. > > > > But beyond what I want to do or what you want to do, is what *should* Linux > > do? I have attempted to provide evidence to support my position. You have > > not responded to or addressed the majority of my questions, which is > > concerning to me. > > Hi Darius, > > What is your majority questions? > 1. How will memory allocation failures for context state memory be reported to user space? 2. How will a system administrator (i.e., the user) be able to effectively manage a system where the vector unit, which could have a considerable area and/or power impact to the system, has one or more of the following properties: a. A single vector unit shared among two or more harts, b. Additional power consumption when the vector unit is enabled and idle versus not being enabled at all, c. For a system which supports variable operating frequency, a reduction in the maximum frequency when the vector unit is enabled, and/or d. 
The inability to enter low power states, and/or delays in transitions to low
power states, when the vector unit is enabled.

3. You contend that the RISC-V V-extension resembles ARM SVE/SVE2, at least
more than Intel AMX. I do not agree, but nevertheless, why then does this
patchset not behave similarly to SVE? On arm64, SVE is only enabled and
memory is only allocated on first use, *not* unconditionally for all tasks.

// darius
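For comparison, the arm64/SVE behaviour referred to above only enables the unit and allocates state when the first vector instruction traps. A minimal sketch of what the analogous first-use path could look like on RISC-V is below; the function names, the wiring into the illegal-instruction trap, and the failure policy are all assumptions made for illustration, not something this series implements.

#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/slab.h>
#include <asm/csr.h>
#include <asm/switch_to.h>      /* riscv_vsize, SR_VS_* from this series */

static int vstate_alloc(struct task_struct *task)
{
        if (task->thread.vstate.datap)
                return 0;

        task->thread.vstate.datap = kzalloc(riscv_vsize, GFP_KERNEL);
        if (!task->thread.vstate.datap)
                return -ENOMEM;

        task->thread.vstate.size = riscv_vsize;
        return 0;
}

/* Hypothetical hook, called from the illegal-instruction trap path. */
static bool riscv_v_first_use_handler(struct pt_regs *regs)
{
        if ((regs->status & SR_VS) != SR_VS_OFF)
                return false;           /* not a first-use trap */

        if (vstate_alloc(current)) {
                force_sig(SIGSEGV);     /* the failure policy is the open question */
                return true;
        }

        /* Turn the unit on with a fresh, initial state for this task. */
        regs->status = (regs->status & ~SR_VS) | SR_VS_INITIAL;
        return true;
}

The trade-off Greentime raises later in the thread is that, with a vector-enabled glibc, this trap would fire during the startup of essentially every process.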
Darius Rad <darius@bluespec.com> 於 2021年10月27日 週三 下午8:58寫道: > > On Tue, Oct 26, 2021 at 12:44:31PM +0800, Greentime Hu wrote: > > Darius Rad <darius@bluespec.com> 於 2021年10月26日 週二 上午12:22寫道: > > > > > > On Mon, Oct 25, 2021 at 12:47:49PM +0800, Greentime Hu wrote: > > > > Darius Rad <darius@bluespec.com> 於 2021年10月22日 週五 下午6:40寫道: > > > > > > > > > > On Fri, Oct 22, 2021 at 11:52:01AM +0800, Vincent Chen wrote: > > > > > > On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > > > > > > > > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > > > > > > > Hello Darius, > > > > > > > > > > > > > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > > > > > > > instructions are used? > > > > > > > > > > > > > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > > > > > > > choose to enable and allocate memory for user space program is because > > > > > > > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > > > > > > > vector version and it is called very often by every process. So that > > > > > > > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > > > > > > > would like to use vector by default. If we disable it by default and > > > > > > > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > > > > > > > would use vector operations? One could easily argue that the converse is > > > > > > > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > > > > > > > process will not be using it. > > > > > > > > > > > > > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > > > > > > > vector extension now and this patchset is testing based on those > > > > > > > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > > > > > > > support upstreaming and we will also upstream the vector version glibc > > > > > > > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. Then we will > > > > > > > > > > see platform with vector support can use vector version mem* and str* > > > > > > > > > > functions automatically based on ifunc and platform without vector > > > > > > > > > > will use the original one automatically. 
These could be done to select > > > > > > > > > > the correct optimized glibc functions by ifunc mechanism. > > > > > > > > > > > > > > > > In your reply, I noticed that you didn't address Greentime's response > > > > > > > > here. But this looks like the key issue. If common library functions are > > > > > > > > vector-accelerated, wouldn't it make sense that almost every process would > > > > > > > > wind up using vector instructions? And thus there wouldn't be much point > > > > > > > > to skipping the vector context memory allocation? > > > > > > > > > > > > > > > > > > > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > > > > > > > previous message. I don't agree that this is the key issue; it is one of a > > > > > > > number of issues. What if I don't want to take the potential > > > > > > > power/frequency hit for the vector unit for a workload that, at best, uses > > > > > > > it for the occasional memcpy? What if the allocation fails, how will that > > > > > > > > > > > > Hi Darius, > > > > > > The memcpy function seems not to be occasionally used in the programs > > > > > > because many functions in Glibc use memcpy() to complete the memory > > > > > > copy. I use the following simple case as an example. > > > > > > test.c > > > > > > void main(void) { > > > > > > return; > > > > > > } > > > > > > Then, we compile it by "gcc test.c -o a.out" and execute it. In the > > > > > > execution, the memcpy() has been called unexpectedly. It is because > > > > > > many libc initialized functions will be executed before entering the > > > > > > user-defined main function. One of the example is __libc_setup_tls(), > > > > > > which is called by __libc_start_main(). The __libc_setup_tls() will > > > > > > use memcpy() during the process of creating the Dynamic Thread Vector > > > > > > (DTV). > > > > > > > > > > > > Therefore, I think the memcpy() is widely used in most programs. > > > > > > > > > > > > > > > > You're missing my point. Not every (any?) program spends a majority of the > > > > > time doing memcpy(), and even if a program did, all of my points are still > > > > > valid. > > > > > > > > > > Please read the discussion in the thread I referenced and the questions in > > > > > my prior message. > > > > > > > > > > > > > Hi Darius, > > > > > > > > As I mentioned before, we want to treat vector ISA like a general ISA > > > > instead of a specific IP. User program should be able to use it > > > > transparently just like FPU. > > > > It seems that the use case you want is asking user to use vector like > > > > a specific IP, user program should ask kernel before they use it and > > > > that is not what we want to do in this patchset. > > > > > > > > > > Hi Greentime, > > > > > > Right. > > > > > > But beyond what I want to do or what you want to do, is what *should* Linux > > > do? I have attempted to provide evidence to support my position. You have > > > not responded to or addressed the majority of my questions, which is > > > concerning to me. > > > > Hi Darius, > > > > What is your majority questions? > > > > 1. How will memory allocation failures for context state memory be reported > to user space? it will return -ENOMEM for some cases or show the warning messages for some cases. We know it's not perfect, we should enhance that in the future, but let's take an example: 256 bits vector length system. 256 bits * 32 registers /8 = 1KB. > 2. 
> How will a system administrator (i.e., the user) be able to effectively
> manage a system where the vector unit, which could have a considerable area
> and/or power impact to the system, has one or more of the following
> properties:

As I mentioned before, we would like to let users use vector transparently,
just like the FPU or other extensions.
If a user knows that the system supports vector and uses intrinsics,
assembly code, or compiler-generated vector code, they can simply use it,
just like the FPU.
If a user doesn't know whether the system supports vector, they can rely on
glibc, or use ifunc in their own libraries, to detect vector support
dynamically.

> a. A single vector unit shared among two or more harts,
>
> b. Additional power consumption when the vector unit is enabled and idle
> versus not being enabled at all,
>
> c. For a system which supports variable operating frequency, a reduction
> in the maximum frequency when the vector unit is enabled, and/or
>
> d. The inability to enter low power states, and/or delays in transitions
> to low power states, when the vector unit is enabled.

We also don't support that kind of system (a single vector unit shared among
two or more harts) in this implementation. I'll add more assumptions in the
next version of the patches.
For the frequency and power issues, I also won't treat them as a special
case, since we want to treat vector as a normal ISA extension rather than a
specific IP.

> 3. You contend that the RISC-V V-extension resembles ARM SVE/SVE2, at least
> more than Intel AMX. I do not agree, but nevertheless, why then does this
> patchset not behave similarly to SVE? On arm64, SVE is only enabled and
> memory is only allocated on first use, *not* unconditionally for all tasks.

As we mentioned before, almost every user space program will use glibc
ld.so/libc.so, and those libraries will use the vector-optimized routines on
a system with vector support.
That's why we don't let the first vector instruction trigger an illegal
instruction exception: vector registers will be used at the very beginning
of the program, and handling that exception is also heavier because the trap
goes through M-mode before S-mode, which means context is saved and restored
twice, once in each mode. Since the vstate will be allocated eventually, why
not avoid those two save/restore round trips for handling the illegal
instruction exception. And if the system doesn't support vector, glibc won't
use the vector routines and the Linux kernel won't allocate vstate for any
process either.
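The ifunc mechanism Greentime refers to is what makes this selection automatic: the dynamic linker runs a resolver once per symbol and binds memcpy() to either the vector or the scalar implementation. A minimal user-library sketch is below; the HWCAP_RV_ISA_V spelling and the two implementation symbols are placeholders, and glibc's own resolvers obtain hwcap through internal plumbing rather than getauxval().

#include <stddef.h>
#include <sys/auxv.h>

/* Assumed encoding of the 'V' bit in AT_HWCAP; placeholder name. */
#define HWCAP_RV_ISA_V  (1UL << ('V' - 'A'))

typedef void *memcpy_fn(void *dst, const void *src, size_t n);

/* Two hypothetical implementations, e.g. provided in assembly elsewhere. */
extern memcpy_fn __memcpy_scalar;
extern memcpy_fn __memcpy_vector;

/*
 * The dynamic linker runs the resolver once and binds my_memcpy to its
 * return value, so the vector/scalar choice costs nothing per call.
 */
static memcpy_fn *resolve_memcpy(void)
{
        if (getauxval(AT_HWCAP) & HWCAP_RV_ISA_V)
                return __memcpy_vector;
        return __memcpy_scalar;
}

void *my_memcpy(void *dst, const void *src, size_t n)
        __attribute__((ifunc("resolve_memcpy")));

With this in place, a program that merely calls memcpy() picks up the vector path on capable hardware and the scalar path elsewhere, without recompilation, which is the transparency being argued for here.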
On Tue, Nov 09, 2021 at 05:49:03PM +0800, Greentime Hu wrote: > Darius Rad <darius@bluespec.com> 於 2021年10月27日 週三 下午8:58寫道: > > > > On Tue, Oct 26, 2021 at 12:44:31PM +0800, Greentime Hu wrote: > > > Darius Rad <darius@bluespec.com> 於 2021年10月26日 週二 上午12:22寫道: > > > > > > > > On Mon, Oct 25, 2021 at 12:47:49PM +0800, Greentime Hu wrote: > > > > > Darius Rad <darius@bluespec.com> 於 2021年10月22日 週五 下午6:40寫道: > > > > > > > > > > > > On Fri, Oct 22, 2021 at 11:52:01AM +0800, Vincent Chen wrote: > > > > > > > On Thu, Oct 21, 2021 at 6:50 PM Darius Rad <darius@bluespec.com> wrote: > > > > > > > > > > > > > > > > On Wed, Oct 20, 2021 at 06:01:31PM -0700, Paul Walmsley wrote: > > > > > > > > > Hello Darius, > > > > > > > > > > > > > > > > > > On Tue, 5 Oct 2021, Darius Rad wrote: > > > > > > > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 08:36:30PM +0800, Greentime Hu wrote: > > > > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月29日 週三 下午9:28寫道: > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 28, 2021 at 10:56:52PM +0800, Greentime Hu wrote: > > > > > > > > > > > > > Darius Rad <darius@bluespec.com> 於 2021年9月13日 週一 下午8:21寫道: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 9/8/21 1:45 PM, Greentime Hu wrote: > > > > > > > > > > > > > > > This patch adds task switch support for vector. It supports partial lazy > > > > > > > > > > > > > > > save and restore mechanism. It also supports all lengths of vlen. > > > > > > > > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > > > > > > > > > > > So this will unconditionally enable vector instructions, and allocate > > > > > > > > > > > > > > memory for vector state, for all processes, regardless of whether vector > > > > > > > > > > > > > > instructions are used? > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, it will enable vector if has_vector() is true. The reason that we > > > > > > > > > > > > > choose to enable and allocate memory for user space program is because > > > > > > > > > > > > > we also implement some common functions in the glibc such as memcpy > > > > > > > > > > > > > vector version and it is called very often by every process. So that > > > > > > > > > > > > > we assume if the user program is running in a CPU with vector ISA > > > > > > > > > > > > > would like to use vector by default. If we disable it by default and > > > > > > > > > > > > > make it trigger the illegal instruction, that might be a burden since > > > > > > > > > > > > > almost every process will use vector glibc memcpy or something like > > > > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > > > Do you have any evidence to support the assertion that almost every process > > > > > > > > > > > > would use vector operations? One could easily argue that the converse is > > > > > > > > > > > > true: no existing software uses the vector extension now, so most likely a > > > > > > > > > > > > process will not be using it. > > > > > > > > > > > > > > > > > > > > > > Glibc ustreaming is just starting so you didn't see software using the > > > > > > > > > > > vector extension now and this patchset is testing based on those > > > > > > > > > > > optimized glibc too. Vincent Chen is working on the glibc vector > > > > > > > > > > > support upstreaming and we will also upstream the vector version glibc > > > > > > > > > > > memcpy, memcmp, memchr, memmove, memset, strcmp, strlen. 
Then we will > > > > > > > > > > > see platform with vector support can use vector version mem* and str* > > > > > > > > > > > functions automatically based on ifunc and platform without vector > > > > > > > > > > > will use the original one automatically. These could be done to select > > > > > > > > > > > the correct optimized glibc functions by ifunc mechanism. > > > > > > > > > > > > > > > > > > In your reply, I noticed that you didn't address Greentime's response > > > > > > > > > here. But this looks like the key issue. If common library functions are > > > > > > > > > vector-accelerated, wouldn't it make sense that almost every process would > > > > > > > > > wind up using vector instructions? And thus there wouldn't be much point > > > > > > > > > to skipping the vector context memory allocation? > > > > > > > > > > > > > > > > > > > > > > > > > This issue was addressed in the thread regarding Intel AMX I linked to in a > > > > > > > > previous message. I don't agree that this is the key issue; it is one of a > > > > > > > > number of issues. What if I don't want to take the potential > > > > > > > > power/frequency hit for the vector unit for a workload that, at best, uses > > > > > > > > it for the occasional memcpy? What if the allocation fails, how will that > > > > > > > > > > > > > > Hi Darius, > > > > > > > The memcpy function seems not to be occasionally used in the programs > > > > > > > because many functions in Glibc use memcpy() to complete the memory > > > > > > > copy. I use the following simple case as an example. > > > > > > > test.c > > > > > > > void main(void) { > > > > > > > return; > > > > > > > } > > > > > > > Then, we compile it by "gcc test.c -o a.out" and execute it. In the > > > > > > > execution, the memcpy() has been called unexpectedly. It is because > > > > > > > many libc initialized functions will be executed before entering the > > > > > > > user-defined main function. One of the example is __libc_setup_tls(), > > > > > > > which is called by __libc_start_main(). The __libc_setup_tls() will > > > > > > > use memcpy() during the process of creating the Dynamic Thread Vector > > > > > > > (DTV). > > > > > > > > > > > > > > Therefore, I think the memcpy() is widely used in most programs. > > > > > > > > > > > > > > > > > > > You're missing my point. Not every (any?) program spends a majority of the > > > > > > time doing memcpy(), and even if a program did, all of my points are still > > > > > > valid. > > > > > > > > > > > > Please read the discussion in the thread I referenced and the questions in > > > > > > my prior message. > > > > > > > > > > > > > > > > Hi Darius, > > > > > > > > > > As I mentioned before, we want to treat vector ISA like a general ISA > > > > > instead of a specific IP. User program should be able to use it > > > > > transparently just like FPU. > > > > > It seems that the use case you want is asking user to use vector like > > > > > a specific IP, user program should ask kernel before they use it and > > > > > that is not what we want to do in this patchset. > > > > > > > > > > > > > Hi Greentime, > > > > > > > > Right. > > > > > > > > But beyond what I want to do or what you want to do, is what *should* Linux > > > > do? I have attempted to provide evidence to support my position. You have > > > > not responded to or addressed the majority of my questions, which is > > > > concerning to me. > > > > > > Hi Darius, > > > > > > What is your majority questions? > > > > > > > 1. 
> > How will memory allocation failures for context state memory be reported
> > to user space?
>
> it will return -ENOMEM for some cases or show the warning messages for
> some cases.
> We know it's not perfect, we should enhance that in the future, but
> let's take an example: 256 bits vector length system. 256 bits * 32
> registers / 8 = 1KB.

When you say "show the warning message", I assume you mean the kernel will
log a message, which is not reported to user space. User space will only see
a process unexpectedly die.

As you say, that is not great, and could be done better. I would be
interested in knowing how you think that could be improved without needing a
user space API, or why it will be acceptable to break the user space API
later.

> > 2. How will a system administrator (i.e., the user) be able to effectively
> > manage a system where the vector unit, which could have a considerable area
> > and/or power impact to the system, has one or more of the following
> > properties:
>
> As I mentioned before, we would like to let users use vector transparently,
> just like the FPU or other extensions.
> If a user knows that the system supports vector and uses intrinsics,
> assembly code, or compiler-generated vector code, they can simply use it,
> just like the FPU.
> If a user doesn't know whether the system supports vector, they can rely on
> glibc, or use ifunc in their own libraries, to detect vector support
> dynamically.
>
> > a. A single vector unit shared among two or more harts,
> >
> > b. Additional power consumption when the vector unit is enabled and idle
> > versus not being enabled at all,
> >
> > c. For a system which supports variable operating frequency, a reduction
> > in the maximum frequency when the vector unit is enabled, and/or
> >
> > d. The inability to enter low power states, and/or delays in transitions
> > to low power states, when the vector unit is enabled.
>
> We also don't support that kind of system (a single vector unit shared among
> two or more harts) in this implementation. I'll add more assumptions in the
> next version of the patches.
> For the frequency and power issues, I also won't treat them as a special
> case, since we want to treat vector as a normal ISA extension rather than a
> specific IP.

The problem is that it will likely be impossible to support such systems
without changing the user space API. If you add an API along the lines of
what I suggested, even if there is not initially support to completely
handle such systems, that support could be added in the future without
changes to user space. If such an API is not in place now, that support
cannot be added without breaking user space.

> > 3. You contend that the RISC-V V-extension resembles ARM SVE/SVE2, at least
> > more than Intel AMX. I do not agree, but nevertheless, why then does this
> > patchset not behave similarly to SVE? On arm64, SVE is only enabled and
> > memory is only allocated on first use, *not* unconditionally for all tasks.
>
> As we mentioned before, almost every user space program will use glibc

As I mentioned before, I do not agree that every user space program *will*
use vector. glibc is not the only C library used in Linux. Whether such
support as you are alluding to gets accepted to glibc also remains to be
seen.

> ld.so/libc.so, and those libraries will use the vector-optimized routines on
> a system with vector support.
> That's why we don't let the first vector instruction trigger an illegal
> instruction exception: vector registers will be used at the very beginning
> of the program, and handling that exception is also heavier because the trap
> goes through M-mode before S-mode, which means context is saved and restored
> twice, once in each mode. Since the vstate will be allocated eventually, why
> not avoid those two save/restore round trips for handling the illegal
> instruction exception. And if the system doesn't support vector, glibc won't
> use the vector routines and the Linux kernel won't allocate vstate for any
> process either.

This is not convincing. You are saying that a single illegal instruction
exception, for a process that actively desires to use vector instructions,
is heavier weight than an unconditional allocation of up to 256 KiB per
process, even for those processes that do not use vector.
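To put numbers behind the 1 KiB and 256 KiB figures in this exchange: the allocation at issue is riscv_vsize = 32 registers * VLEN/8 bytes per task, so a 256-bit VLEN gives Greentime's 1 KiB, and the 65536-bit ceiling the V specification allows gives roughly the 256 KiB mentioned above. The throwaway program below (not kernel code) just tabulates that scaling.

#include <stdio.h>

int main(void)
{
        const unsigned long vlen_bits[] = { 128, 256, 512, 4096, 65536 };

        for (size_t i = 0; i < sizeof(vlen_bits) / sizeof(vlen_bits[0]); i++) {
                unsigned long vlenb = vlen_bits[i] / 8;   /* bytes in one vector register */
                unsigned long vsize = 32 * vlenb;         /* v0..v31, i.e. riscv_vsize */

                printf("VLEN = %5lu bits -> %6lu bytes of vector state per task\n",
                       vlen_bits[i], vsize);
        }
        return 0;
}

The disagreement in the thread is not about this arithmetic but about who pays it: every task unconditionally, or only tasks that actually execute vector instructions.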
diff --git a/arch/riscv/include/asm/switch_to.h b/arch/riscv/include/asm/switch_to.h
index ec83770b3d98..de0573dad78f 100644
--- a/arch/riscv/include/asm/switch_to.h
+++ b/arch/riscv/include/asm/switch_to.h
@@ -7,10 +7,12 @@
 #define _ASM_RISCV_SWITCH_TO_H
 
 #include <linux/jump_label.h>
+#include <linux/slab.h>
 #include <linux/sched/task_stack.h>
 #include <asm/processor.h>
 #include <asm/ptrace.h>
 #include <asm/csr.h>
+#include <asm/asm-offsets.h>
 
 #ifdef CONFIG_FPU
 extern void __fstate_save(struct task_struct *save_to);
@@ -68,6 +70,68 @@ static __always_inline bool has_fpu(void) { return false; }
 
 #define __switch_to_fpu(__prev, __next) do { } while (0)
 #endif
 
+#ifdef CONFIG_VECTOR
+extern bool has_vector;
+extern unsigned long riscv_vsize;
+extern void __vstate_save(struct __riscv_v_state *save_to, void *datap);
+extern void __vstate_restore(struct __riscv_v_state *restore_from, void *datap);
+
+static inline void __vstate_clean(struct pt_regs *regs)
+{
+        regs->status = (regs->status & ~(SR_VS)) | SR_VS_CLEAN;
+}
+
+static inline void vstate_off(struct task_struct *task,
+                              struct pt_regs *regs)
+{
+        regs->status = (regs->status & ~SR_VS) | SR_VS_OFF;
+}
+
+static inline void vstate_save(struct task_struct *task,
+                               struct pt_regs *regs)
+{
+        if ((regs->status & SR_VS) == SR_VS_DIRTY) {
+                struct __riscv_v_state *vstate = &(task->thread.vstate);
+
+                __vstate_save(vstate, vstate->datap);
+                __vstate_clean(regs);
+        }
+}
+
+static inline void vstate_restore(struct task_struct *task,
+                                  struct pt_regs *regs)
+{
+        if ((regs->status & SR_VS) != SR_VS_OFF) {
+                struct __riscv_v_state *vstate = &(task->thread.vstate);
+
+                /* Allocate space for vector registers. */
+                if (!vstate->datap) {
+                        vstate->datap = kzalloc(riscv_vsize, GFP_ATOMIC);
+                        vstate->size = riscv_vsize;
+                }
+                __vstate_restore(vstate, vstate->datap);
+                __vstate_clean(regs);
+        }
+}
+
+static inline void __switch_to_vector(struct task_struct *prev,
+                                      struct task_struct *next)
+{
+        struct pt_regs *regs;
+
+        regs = task_pt_regs(prev);
+        if (unlikely(regs->status & SR_SD))
+                vstate_save(prev, regs);
+        vstate_restore(next, task_pt_regs(next));
+}
+
+#else
+#define has_vector false
+#define vstate_save(task, regs) do { } while (0)
+#define vstate_restore(task, regs) do { } while (0)
+#define __switch_to_vector(__prev, __next) do { } while (0)
+#endif
+
 extern struct task_struct *__switch_to(struct task_struct *,
                                        struct task_struct *);
@@ -77,6 +141,8 @@ do {                                              \
         struct task_struct *__next = (next);          \
         if (has_fpu())                                \
                 __switch_to_fpu(__prev, __next);      \
+        if (has_vector)                               \
+                __switch_to_vector(__prev, __next);   \
         ((last) = __switch_to(__prev, __next));       \
 } while (0)
 
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 3397ddac1a30..344078080839 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_MMU) += vdso.o vdso/
 
 obj-$(CONFIG_RISCV_M_MODE)      += traps_misaligned.o
 obj-$(CONFIG_FPU)               += fpu.o
+obj-$(CONFIG_VECTOR)            += vector.o
 obj-$(CONFIG_SMP)               += smpboot.o
 obj-$(CONFIG_SMP)               += smp.o
 obj-$(CONFIG_SMP)               += cpu_ops.o
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 03ac3aa611f5..0b86e9e531c9 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -95,6 +95,16 @@ void start_thread(struct pt_regs *regs, unsigned long pc,
                  */
                 fstate_restore(current, regs);
         }
+
+       if (has_vector) {
+               regs->status |= SR_VS_INITIAL;
+               /*
+                * Restore the initial value to the vector register
+                * before starting the user program.
+                */
+               vstate_restore(current, regs);
+       }
+
         regs->epc = pc;
         regs->sp = sp;
 }
@@ -110,15 +120,43 @@ void flush_thread(void)
         fstate_off(current, task_pt_regs(current));
         memset(&current->thread.fstate, 0, sizeof(current->thread.fstate));
 #endif
+#ifdef CONFIG_VECTOR
+       /* Reset vector state */
+       vstate_off(current, task_pt_regs(current));
+       memset(&current->thread.vstate, 0, sizeof(current->thread.vstate));
+#endif
 }
 
 int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 {
         fstate_save(src, task_pt_regs(src));
+       if (has_vector)
+               /* To make sure every dirty vector context is saved. */
+               vstate_save(src, task_pt_regs(src));
         *dst = *src;
+       if (has_vector) {
+               /* Copy vector context to the forked task from parent. */
+               if ((task_pt_regs(src)->status & SR_VS) != SR_VS_OFF) {
+                       dst->thread.vstate.datap = kzalloc(riscv_vsize, GFP_KERNEL);
+                       /* Failed to allocate memory. */
+                       if (!dst->thread.vstate.datap)
+                               return -ENOMEM;
+                       /* Copy the src vector context to dst. */
+                       memcpy(dst->thread.vstate.datap,
+                              src->thread.vstate.datap, riscv_vsize);
+               }
+       }
+
         return 0;
 }
 
+void arch_release_task_struct(struct task_struct *tsk)
+{
+       /* Free the vector context of datap. */
+       if (has_vector)
+               kfree(tsk->thread.vstate.datap);
+}
+
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
                 struct task_struct *p, unsigned long tls)
 {
diff --git a/arch/riscv/kernel/vector.S b/arch/riscv/kernel/vector.S
new file mode 100644
index 000000000000..4c880b1c32aa
--- /dev/null
+++ b/arch/riscv/kernel/vector.S
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2012 Regents of the University of California
+ * Copyright (C) 2017 SiFive
+ * Copyright (C) 2019 Alibaba Group Holding Limited
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation, version 2.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/linkage.h>
+
+#include <asm/asm.h>
+#include <asm/csr.h>
+#include <asm/asm-offsets.h>
+
+#define vstatep  a0
+#define datap    a1
+#define x_vstart t0
+#define x_vtype  t1
+#define x_vl     t2
+#define x_vcsr   t3
+#define incr     t4
+#define m_one    t5
+#define status   t6
+
+ENTRY(__vstate_save)
+       li status, SR_VS
+       csrs sstatus, status
+
+       csrr x_vstart, CSR_VSTART
+       csrr x_vtype, CSR_VTYPE
+       csrr x_vl, CSR_VL
+       csrr x_vcsr, CSR_VCSR
+       li m_one, -1
+       vsetvli incr, m_one, e8, m8
+       vse8.v v0, (datap)
+       add datap, datap, incr
+       vse8.v v8, (datap)
+       add datap, datap, incr
+       vse8.v v16, (datap)
+       add datap, datap, incr
+       vse8.v v24, (datap)
+
+       REG_S x_vstart, RISCV_V_STATE_VSTART(vstatep)
+       REG_S x_vtype, RISCV_V_STATE_VTYPE(vstatep)
+       REG_S x_vl, RISCV_V_STATE_VL(vstatep)
+       REG_S x_vcsr, RISCV_V_STATE_VCSR(vstatep)
+
+       csrc sstatus, status
+       ret
+ENDPROC(__vstate_save)
+
+ENTRY(__vstate_restore)
+       li status, SR_VS
+       csrs sstatus, status
+
+       li m_one, -1
+       vsetvli incr, m_one, e8, m8
+       vle8.v v0, (datap)
+       add datap, datap, incr
+       vle8.v v8, (datap)
+       add datap, datap, incr
+       vle8.v v16, (datap)
+       add datap, datap, incr
+       vle8.v v24, (datap)
+
+       REG_L x_vstart, RISCV_V_STATE_VSTART(vstatep)
+       REG_L x_vtype, RISCV_V_STATE_VTYPE(vstatep)
+       REG_L x_vl, RISCV_V_STATE_VL(vstatep)
+       REG_L x_vcsr, RISCV_V_STATE_VCSR(vstatep)
+       vsetvl x0, x_vl, x_vtype
+       csrw CSR_VSTART, x_vstart
+       csrw CSR_VCSR, x_vcsr
+
+       csrc sstatus, status
+       ret
+ENDPROC(__vstate_restore)
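riscv_vsize is only declared extern in the hunks above; its definition lives in another patch of this series. A plausible shape for that probe is sketched below purely for orientation; the function name, the call site, and the reliance on a CSR_VLENB definition added elsewhere in the series are assumptions.

/* Sketch only: how riscv_vsize could be derived once at boot. */
unsigned long riscv_vsize __read_mostly;

static void __init riscv_fill_vsize(void)
{
        /* The vlenb CSR holds VLEN/8: one vector register's width in bytes. */
        unsigned long vlenb = csr_read(CSR_VLENB);

        /* Space for all 32 architectural vector registers, v0..v31. */
        riscv_vsize = vlenb * 32;
}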