diff mbox series

[v2,2/6] RISC-V: Add a syscall for HW probing

Message ID 20230206201455.1790329-3-evan@rivosinc.com (mailing list archive)
State Superseded
Delegated to: Palmer Dabbelt
Headers show
Series RISC-V Hardware Probing User Interface | expand

Checks

Context Check Description
conchuod/tree_selection fail Failed to apply to next/pending-fixes or riscv/for-next

Commit Message

Evan Green Feb. 6, 2023, 8:14 p.m. UTC
We don't have enough space for these all in ELF_HWCAP{,2} and there's no
system call that quite does this, so let's just provide an arch-specific
one to probe for hardware capabilities.  This currently just provides
m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
the future.

Co-developed-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Evan Green <evan@rivosinc.com>

---

Changes in v2:
 - Changed the interface to look more like poll(). Rather than supplying
   key_offset and getting back an array of values with numerically
   contiguous keys, have the user pre-fill the key members of the array,
   and the kernel will fill in the corresponding values. For any key it
   doesn't recognize, it will set the key of that element to -1. This
   allows usermode to quickly ask for exactly the elements it cares
   about, and not get bogged down in a back and forth about newer keys
   that older kernels might not recognize. In other words, the kernel
   can communicate that it doesn't recognize some of the keys while
   still providing the data for the keys it does know.
 - Added a shortcut to the cpuset parameters that if a size of 0 and
   NULL is provided for the CPU set, the kernel will use a cpu mask of
   all online CPUs. This is convenient because I suspect most callers
   will only want to act on a feature if it's supported on all CPUs, and
   it's a headache to dynamically allocate an array of all 1s, not to
   mention a waste to have the kernel loop over all of the offline bits.


---
 Documentation/riscv/hwprobe.rst       |  37 +++++++
 Documentation/riscv/index.rst         |   1 +
 arch/riscv/include/asm/hwprobe.h      |  13 +++
 arch/riscv/include/asm/syscall.h      |   3 +
 arch/riscv/include/uapi/asm/hwprobe.h |  25 +++++
 arch/riscv/include/uapi/asm/unistd.h  |   8 ++
 arch/riscv/kernel/cpu.c               |   3 +-
 arch/riscv/kernel/sys_riscv.c         | 146 +++++++++++++++++++++++++-
 8 files changed, 234 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/riscv/hwprobe.rst
 create mode 100644 arch/riscv/include/asm/hwprobe.h
 create mode 100644 arch/riscv/include/uapi/asm/hwprobe.h

Comments

Greg Kroah-Hartman Feb. 7, 2023, 6:13 a.m. UTC | #1
On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> system call that quite does this, so let's just provide an arch-specific
> one to probe for hardware capabilities.  This currently just provides
> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> the future.

Ick, this is exactly what sysfs is designed to export in a sane way.
Why not just use that instead?  The "key" would be the filename, and the
value the value read from the filename.  If the key is not present, the
file is not present and it's obvious what is happening, no fancy parsing
and ABI issues at all.

Bonus is that you will also properly document all valid key/value pairs
in Documentation/ABI/ when you do this, so it reinforces what the code
should be doing correctly.

thanks,

greg k-h
Conor Dooley Feb. 7, 2023, 6:32 a.m. UTC | #2
Hey Evan, Greg,


On 7 February 2023 06:13:39 GMT, Greg KH <gregkh@linuxfoundation.org> wrote:
>On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
>> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
>> system call that quite does this, so let's just provide an arch-specific
>> one to probe for hardware capabilities.  This currently just provides
>> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
>> the future.
>
>Ick, this is exactly what sysfs is designed to export in a sane way.
>Why not just use that instead?  The "key" would be the filename, and the
>value the value read from the filename.  If the key is not present, the
>file is not present and it's obvious what is happening, no fancy parsing
>and ABI issues at all.

https://lore.kernel.org/linux-riscv/20221201160614.xpomlqq2fzpzfmcm@kamzik/

This is the sysfs interface that I mentioned drew
suggested on the v1.
I think it fits ~perfectly with what Greg is suggesting too.

>
>Bonus is that you will also properly document all valid key/value pairs
>in Documentation/ABI/ when you do this, so it reinforces what the code
>should be doing correctly.
>
>thanks,
>
>greg k-h
kernel test robot Feb. 7, 2023, 11:16 p.m. UTC | #3
Hi Evan,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on shuah-kselftest/next]
[also build test WARNING on shuah-kselftest/fixes robh/for-next soc/for-next linus/master v6.2-rc7 next-20230207]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Evan-Green/RISC-V-Move-struct-riscv_cpuinfo-to-new-header/20230207-041746
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20230206201455.1790329-3-evan%40rivosinc.com
patch subject: [PATCH v2 2/6] RISC-V: Add a syscall for HW probing
reproduce:
        # https://github.com/intel-lab-lkp/linux/commit/4a91a9ca5d81029b702e6e74c8f2015cf50af0ae
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Evan-Green/RISC-V-Move-struct-riscv_cpuinfo-to-new-header/20230207-041746
        git checkout 4a91a9ca5d81029b702e6e74c8f2015cf50af0ae
        make menuconfig
        # enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, CONFIG_WARN_ABI_ERRORS
        make htmldocs

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> Documentation/riscv/hwprobe.rst:33: WARNING: Field list ends without a blank line; unexpected unindent.

vim +33 Documentation/riscv/hwprobe.rst

    31	
    32	* :RISCV_HWPROBE_KEY_MVENDORID:: Contains the value of :mvendorid:, as per the
  > 33	  ISA specifications.
Evan Green Feb. 9, 2023, 5:09 p.m. UTC | #4
On Mon, Feb 6, 2023 at 10:32 PM Conor Dooley <conor@kernel.org> wrote:
>
> Hey Evan, Greg,
>
>
> On 7 February 2023 06:13:39 GMT, Greg KH <gregkh@linuxfoundation.org> wrote:
> >On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
> >> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> >> system call that quite does this, so let's just provide an arch-specific
> >> one to probe for hardware capabilities.  This currently just provides
> >> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> >> the future.
> >
> >Ick, this is exactly what sysfs is designed to export in a sane way.
> >Why not just use that instead?  The "key" would be the filename, and the
> >value the value read from the filename.  If the key is not present, the
> >file is not present and it's obvious what is happening, no fancy parsing
> >and ABI issues at all.
>
> https://lore.kernel.org/linux-riscv/20221201160614.xpomlqq2fzpzfmcm@kamzik/
>
> This is the sysfs interface that I mentioned drew
> suggested on the v1.
> I think it fits ~perfectly with what Greg is suggesting too.

Whoops, I'll admit I missed that comment when I reviewed the feedback
from v1. I spent some time thinking about sysfs. The problem is this
interface will be needed in places like very early program startup. If
we're trying to use this in places like the ifunc selector to decide
which memcpy to use, having to go open and read a fistful of files is
going to be complex that early, and rough on performance.

Really this is data that would go great in the aux vector, except
there's probably too much of it to justify preparing and copying into
every new process. You could point the aux vector into a vDSO data
area. This has the advantage of great performance and no syscall, but
has the disadvantages of making that data ABI, and requiring it all to
be known up front (eg the kernel can't compute any answers on the
fly).

After discussions with Palmer, my plan for the next version is to move
this into a vDSO function plus a syscall. Private vDSO data will be
prepped with common answers for the "all CPUs" case, avoiding the need
for a syscall in most cases and making this fast. Since the data is
hidden behind the vdso function, it's not ABI, which is a plus. Then
the vdso function can fall back to the syscall for cases with exotic
CPU masks or keys that are unknown/expensive to compute at runtime.

-Evan
Greg Kroah-Hartman Feb. 9, 2023, 5:13 p.m. UTC | #5
On Thu, Feb 09, 2023 at 09:09:16AM -0800, Evan Green wrote:
> On Mon, Feb 6, 2023 at 10:32 PM Conor Dooley <conor@kernel.org> wrote:
> >
> > Hey Evan, Greg,
> >
> >
> > On 7 February 2023 06:13:39 GMT, Greg KH <gregkh@linuxfoundation.org> wrote:
> > >On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
> > >> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> > >> system call that quite does this, so let's just provide an arch-specific
> > >> one to probe for hardware capabilities.  This currently just provides
> > >> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> > >> the future.
> > >
> > >Ick, this is exactly what sysfs is designed to export in a sane way.
> > >Why not just use that instead?  The "key" would be the filename, and the
> > >value the value read from the filename.  If the key is not present, the
> > >file is not present and it's obvious what is happening, no fancy parsing
> > >and ABI issues at all.
> >
> > https://lore.kernel.org/linux-riscv/20221201160614.xpomlqq2fzpzfmcm@kamzik/
> >
> > This is the sysfs interface that I mentioned drew
> > suggested on the v1.
> > I think it fits ~perfectly with what Greg is suggesting too.
> 
> Whoops, I'll admit I missed that comment when I reviewed the feedback
> from v1. I spent some time thinking about sysfs. The problem is this
> interface will be needed in places like very early program startup. If
> we're trying to use this in places like the ifunc selector to decide
> which memcpy to use, having to go open and read a fistful of files is
> going to be complex that early, and rough on performance.

How is it going to be any different on "performance" than a syscall?  Or
complex?  It should be almost identical overall as this is all in-ram
and not any real I/o is happening.  You are limited only by the speed of
your cpu.

> Really this is data that would go great in the aux vector, except
> there's probably too much of it to justify preparing and copying into
> every new process. You could point the aux vector into a vDSO data
> area. This has the advantage of great performance and no syscall, but
> has the disadvantages of making that data ABI, and requiring it all to
> be known up front (eg the kernel can't compute any answers on the
> fly).
> 
> After discussions with Palmer, my plan for the next version is to move
> this into a vDSO function plus a syscall. Private vDSO data will be
> prepped with common answers for the "all CPUs" case, avoiding the need
> for a syscall in most cases and making this fast. Since the data is
> hidden behind the vdso function, it's not ABI, which is a plus. Then
> the vdso function can fall back to the syscall for cases with exotic
> CPU masks or keys that are unknown/expensive to compute at runtime.

I still think that's wrong, as you are wanting a set of key/values here,
which is exactly what sysfs is designed for.

Please benchmark this first.  Heck, if you don't like the
open/read/close syscall overhead, use my readfile() syscall patch that I
keep proposing every 6 months or so to remove that overhead.  That would
be a good reason to get that code accepted finally :)

thanks,

greg k-h
Jessica Clarke Feb. 9, 2023, 5:22 p.m. UTC | #6
On 9 Feb 2023, at 17:13, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Thu, Feb 09, 2023 at 09:09:16AM -0800, Evan Green wrote:
>> On Mon, Feb 6, 2023 at 10:32 PM Conor Dooley <conor@kernel.org> wrote:
>>> 
>>> Hey Evan, Greg,
>>> 
>>> 
>>> On 7 February 2023 06:13:39 GMT, Greg KH <gregkh@linuxfoundation.org> wrote:
>>>> On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
>>>>> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
>>>>> system call that quite does this, so let's just provide an arch-specific
>>>>> one to probe for hardware capabilities.  This currently just provides
>>>>> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
>>>>> the future.
>>>> 
>>>> Ick, this is exactly what sysfs is designed to export in a sane way.
>>>> Why not just use that instead?  The "key" would be the filename, and the
>>>> value the value read from the filename.  If the key is not present, the
>>>> file is not present and it's obvious what is happening, no fancy parsing
>>>> and ABI issues at all.
>>> 
>>> https://lore.kernel.org/linux-riscv/20221201160614.xpomlqq2fzpzfmcm@kamzik/
>>> 
>>> This is the sysfs interface that I mentioned drew
>>> suggested on the v1.
>>> I think it fits ~perfectly with what Greg is suggesting too.
>> 
>> Whoops, I'll admit I missed that comment when I reviewed the feedback
>> from v1. I spent some time thinking about sysfs. The problem is this
>> interface will be needed in places like very early program startup. If
>> we're trying to use this in places like the ifunc selector to decide
>> which memcpy to use, having to go open and read a fistful of files is
>> going to be complex that early, and rough on performance.
> 
> How is it going to be any different on "performance" than a syscall?  Or
> complex?  It should be almost identical overall as this is all in-ram
> and not any real I/o is happening.  You are limited only by the speed of
> your cpu.
> 
>> Really this is data that would go great in the aux vector, except
>> there's probably too much of it to justify preparing and copying into
>> every new process. You could point the aux vector into a vDSO data
>> area. This has the advantage of great performance and no syscall, but
>> has the disadvantages of making that data ABI, and requiring it all to
>> be known up front (eg the kernel can't compute any answers on the
>> fly).
>> 
>> After discussions with Palmer, my plan for the next version is to move
>> this into a vDSO function plus a syscall. Private vDSO data will be
>> prepped with common answers for the "all CPUs" case, avoiding the need
>> for a syscall in most cases and making this fast. Since the data is
>> hidden behind the vdso function, it's not ABI, which is a plus. Then
>> the vdso function can fall back to the syscall for cases with exotic
>> CPU masks or keys that are unknown/expensive to compute at runtime.
> 
> I still think that's wrong, as you are wanting a set of key/values here,
> which is exactly what sysfs is designed for.

But this needs to be a RISC-V standard interface that can be programmed
against, not something tied to highly Linux-specific things like sysfs.
You’re free to implement that interface with sysfs, but exposing that
as *the* interface to use would be terrible for portability.

Jess

> Please benchmark this first.  Heck, if you don't like the
> open/read/close syscall overhead, use my readfile() syscall patch that I
> keep proposing every 6 months or so to remove that overhead.  That would
> be a good reason to get that code accepted finally :)
> 
> thanks,
> 
> greg k-h
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Evan Green Feb. 9, 2023, 6:41 p.m. UTC | #7
On Thu, Feb 9, 2023 at 9:13 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Feb 09, 2023 at 09:09:16AM -0800, Evan Green wrote:
> > On Mon, Feb 6, 2023 at 10:32 PM Conor Dooley <conor@kernel.org> wrote:
> > >
> > > Hey Evan, Greg,
> > >
> > >
> > > On 7 February 2023 06:13:39 GMT, Greg KH <gregkh@linuxfoundation.org> wrote:
> > > >On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
> > > >> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> > > >> system call that quite does this, so let's just provide an arch-specific
> > > >> one to probe for hardware capabilities.  This currently just provides
> > > >> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> > > >> the future.
> > > >
> > > >Ick, this is exactly what sysfs is designed to export in a sane way.
> > > >Why not just use that instead?  The "key" would be the filename, and the
> > > >value the value read from the filename.  If the key is not present, the
> > > >file is not present and it's obvious what is happening, no fancy parsing
> > > >and ABI issues at all.
> > >
> > > https://lore.kernel.org/linux-riscv/20221201160614.xpomlqq2fzpzfmcm@kamzik/
> > >
> > > This is the sysfs interface that I mentioned drew
> > > suggested on the v1.
> > > I think it fits ~perfectly with what Greg is suggesting too.
> >
> > Whoops, I'll admit I missed that comment when I reviewed the feedback
> > from v1. I spent some time thinking about sysfs. The problem is this
> > interface will be needed in places like very early program startup. If
> > we're trying to use this in places like the ifunc selector to decide
> > which memcpy to use, having to go open and read a fistful of files is
> > going to be complex that early, and rough on performance.
>
> How is it going to be any different on "performance" than a syscall?  Or
> complex?  It should be almost identical overall as this is all in-ram
> and not any real I/o is happening.  You are limited only by the speed of
> your cpu.

At best sysfs is 1 syscall per key, whereas this version of the
interface lets you query all the keys you're interested in with a
single syscall. With the
proposed vdso version, we'd be down to ~0 syscalls for most queries.
The complexity aspect is mostly a reference to having to do a bunch of
open/read/parse/close operations at a time when mem* operations are
still being set up. Since this is something that may get run on every
program invocation, it seems worth it to be able to get fast and
simple queries even if it's a slightly separated interface.


-Evan
Greg Kroah-Hartman Feb. 10, 2023, 6:48 a.m. UTC | #8
On Thu, Feb 09, 2023 at 05:22:09PM +0000, Jessica Clarke wrote:
> On 9 Feb 2023, at 17:13, Greg KH <gregkh@linuxfoundation.org> wrote:
> > On Thu, Feb 09, 2023 at 09:09:16AM -0800, Evan Green wrote:
> >> On Mon, Feb 6, 2023 at 10:32 PM Conor Dooley <conor@kernel.org> wrote:
> >>> 
> >>> Hey Evan, Greg,
> >>> 
> >>> 
> >>> On 7 February 2023 06:13:39 GMT, Greg KH <gregkh@linuxfoundation.org> wrote:
> >>>> On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
> >>>>> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> >>>>> system call that quite does this, so let's just provide an arch-specific
> >>>>> one to probe for hardware capabilities.  This currently just provides
> >>>>> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> >>>>> the future.
> >>>> 
> >>>> Ick, this is exactly what sysfs is designed to export in a sane way.
> >>>> Why not just use that instead?  The "key" would be the filename, and the
> >>>> value the value read from the filename.  If the key is not present, the
> >>>> file is not present and it's obvious what is happening, no fancy parsing
> >>>> and ABI issues at all.
> >>> 
> >>> https://lore.kernel.org/linux-riscv/20221201160614.xpomlqq2fzpzfmcm@kamzik/
> >>> 
> >>> This is the sysfs interface that I mentioned drew
> >>> suggested on the v1.
> >>> I think it fits ~perfectly with what Greg is suggesting too.
> >> 
> >> Whoops, I'll admit I missed that comment when I reviewed the feedback
> >> from v1. I spent some time thinking about sysfs. The problem is this
> >> interface will be needed in places like very early program startup. If
> >> we're trying to use this in places like the ifunc selector to decide
> >> which memcpy to use, having to go open and read a fistful of files is
> >> going to be complex that early, and rough on performance.
> > 
> > How is it going to be any different on "performance" than a syscall?  Or
> > complex?  It should be almost identical overall as this is all in-ram
> > and not any real I/o is happening.  You are limited only by the speed of
> > your cpu.
> > 
> >> Really this is data that would go great in the aux vector, except
> >> there's probably too much of it to justify preparing and copying into
> >> every new process. You could point the aux vector into a vDSO data
> >> area. This has the advantage of great performance and no syscall, but
> >> has the disadvantages of making that data ABI, and requiring it all to
> >> be known up front (eg the kernel can't compute any answers on the
> >> fly).
> >> 
> >> After discussions with Palmer, my plan for the next version is to move
> >> this into a vDSO function plus a syscall. Private vDSO data will be
> >> prepped with common answers for the "all CPUs" case, avoiding the need
> >> for a syscall in most cases and making this fast. Since the data is
> >> hidden behind the vdso function, it's not ABI, which is a plus. Then
> >> the vdso function can fall back to the syscall for cases with exotic
> >> CPU masks or keys that are unknown/expensive to compute at runtime.
> > 
> > I still think that's wrong, as you are wanting a set of key/values here,
> > which is exactly what sysfs is designed for.
> 
> But this needs to be a RISC-V standard interface that can be programmed
> against, not something tied to highly Linux-specific things like sysfs.
> You’re free to implement that interface with sysfs, but exposing that
> as *the* interface to use would be terrible for portability.

A vdso and a new kernel syscall is also a highly Linux-specific thing,
so I do not understand the objection here at all.  You're going to have
to wrap all of this up in some sort of common userspace library code
anyway, and that will have to handle all of the different operating
system implementations.

Also, frankly, I don't care about non-Linux implementations, so that
isn't a valid argument here :)

thanks,

greg k-h
Greg Kroah-Hartman Feb. 10, 2023, 6:50 a.m. UTC | #9
On Thu, Feb 09, 2023 at 10:41:51AM -0800, Evan Green wrote:
> On Thu, Feb 9, 2023 at 9:13 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Feb 09, 2023 at 09:09:16AM -0800, Evan Green wrote:
> > > On Mon, Feb 6, 2023 at 10:32 PM Conor Dooley <conor@kernel.org> wrote:
> > > >
> > > > Hey Evan, Greg,
> > > >
> > > >
> > > > On 7 February 2023 06:13:39 GMT, Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > >On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
> > > > >> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> > > > >> system call that quite does this, so let's just provide an arch-specific
> > > > >> one to probe for hardware capabilities.  This currently just provides
> > > > >> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> > > > >> the future.
> > > > >
> > > > >Ick, this is exactly what sysfs is designed to export in a sane way.
> > > > >Why not just use that instead?  The "key" would be the filename, and the
> > > > >value the value read from the filename.  If the key is not present, the
> > > > >file is not present and it's obvious what is happening, no fancy parsing
> > > > >and ABI issues at all.
> > > >
> > > > https://lore.kernel.org/linux-riscv/20221201160614.xpomlqq2fzpzfmcm@kamzik/
> > > >
> > > > This is the sysfs interface that I mentioned drew
> > > > suggested on the v1.
> > > > I think it fits ~perfectly with what Greg is suggesting too.
> > >
> > > Whoops, I'll admit I missed that comment when I reviewed the feedback
> > > from v1. I spent some time thinking about sysfs. The problem is this
> > > interface will be needed in places like very early program startup. If
> > > we're trying to use this in places like the ifunc selector to decide
> > > which memcpy to use, having to go open and read a fistful of files is
> > > going to be complex that early, and rough on performance.
> >
> > How is it going to be any different on "performance" than a syscall?  Or
> > complex?  It should be almost identical overall as this is all in-ram
> > and not any real I/o is happening.  You are limited only by the speed of
> > your cpu.
> 
> At best sysfs is 1 syscall per key, whereas this version of the
> interface lets you query all the keys you're interested in with a
> single syscall. With the
> proposed vdso version, we'd be down to ~0 syscalls for most queries.
> The complexity aspect is mostly a reference to having to do a bunch of
> open/read/parse/close operations at a time when mem* operations are
> still being set up. Since this is something that may get run on every
> program invocation, it seems worth it to be able to get fast and
> simple queries even if it's a slightly separated interface.

I'd be interested in the real benchmark numbers and seeing the userspace
and kernel code before arguing this too much.

Again, ignoring the lessons of the past is generally considered an
unwise decision, but hey, you do you, it's something that you are going
to have to maintain for 40+ years going forward, not me :)

good luck!

greg k-h
Conor Dooley Feb. 14, 2023, 11:51 p.m. UTC | #10
Hey Evan,

Just as a preface, I'm reviewing this lot from a position of ignorance
on what glibc wants so I'll refrain from commenting on call itself.
I figure that since the commentary has kinda died on the sysfs front &
Palmer seems to be still into the syscall stuff, that we're pushing on
with this approach...

On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> system call that quite does this, so let's just provide an arch-specific
> one to probe for hardware capabilities.  This currently just provides
> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> the future.
> 
> Co-developed-by: Palmer Dabbelt <palmer@rivosinc.com>
> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
> Signed-off-by: Evan Green <evan@rivosinc.com>
> 
> ---
> 
> Changes in v2:
>  - Changed the interface to look more like poll(). Rather than supplying
>    key_offset and getting back an array of values with numerically
>    contiguous keys, have the user pre-fill the key members of the array,
>    and the kernel will fill in the corresponding values. For any key it
>    doesn't recognize, it will set the key of that element to -1. This
>    allows usermode to quickly ask for exactly the elements it cares
>    about, and not get bogged down in a back and forth about newer keys
>    that older kernels might not recognize. In other words, the kernel
>    can communicate that it doesn't recognize some of the keys while
>    still providing the data for the keys it does know.
>  - Added a shortcut to the cpuset parameters that if a size of 0 and
>    NULL is provided for the CPU set, the kernel will use a cpu mask of
>    all online CPUs. This is convenient because I suspect most callers
>    will only want to act on a feature if it's supported on all CPUs, and
>    it's a headache to dynamically allocate an array of all 1s, not to
>    mention a waste to have the kernel loop over all of the offline bits.
> 
> 
> ---
>  Documentation/riscv/hwprobe.rst       |  37 +++++++
>  Documentation/riscv/index.rst         |   1 +
>  arch/riscv/include/asm/hwprobe.h      |  13 +++
>  arch/riscv/include/asm/syscall.h      |   3 +
>  arch/riscv/include/uapi/asm/hwprobe.h |  25 +++++
>  arch/riscv/include/uapi/asm/unistd.h  |   8 ++
>  arch/riscv/kernel/cpu.c               |   3 +-
>  arch/riscv/kernel/sys_riscv.c         | 146 +++++++++++++++++++++++++-
>  8 files changed, 234 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/riscv/hwprobe.rst
>  create mode 100644 arch/riscv/include/asm/hwprobe.h
>  create mode 100644 arch/riscv/include/uapi/asm/hwprobe.h
> 
> diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> new file mode 100644
> index 000000000000..97771090e972
> --- /dev/null
> +++ b/Documentation/riscv/hwprobe.rst
> @@ -0,0 +1,37 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +RISC-V Hardware Probing Interface
> +---------------------------------
> +
> +The RISC-V hardware probing interface is based around a single syscall, which
> +is defined in <asm/hwprobe.h>::
> +
> +    struct riscv_hwprobe {
> +        __s64 key;
> +        __u64 value;
> +    };
> +
> +    long sys_riscv_hwprobe(struct riscv_hwprobe *pairs, size_t pair_count,
> +                           size_t cpu_count, cpu_set_t *cpus,
> +                           unsigned long flags);
> +
> +The arguments are split into three groups: an array of key-value pairs, a CPU
> +set, and some flags.  The key-value pairs are supplied with a count.  Userspace
> +must prepopulate the key field for each element, and the kernel will fill in the
> +value if the key is recognized.  If a key is unknown to the kernel, its key
> +field will be cleared to -1, and its value set to 0.  The CPU set is defined by
> +CPU_SET(3), the indicated features will be supported on all CPUs in the set.
> +Usermode can supply NULL for cpus and 0 for cpu_count as a shortcut for all
> +online CPUs. There are currently no flags, this value must be zero for future
> +compatibility.
> +
> +On success 0 is returned, on failure a negative error code is returned.
> +
> +The following keys are defined:
> +
> +* :RISCV_HWPROBE_KEY_MVENDORID:: Contains the value of :mvendorid:, as per the
> +  ISA specifications.

"per the ISA specifications" sounds like dangerous wording to me! ;)

> +* :RISCV_HWPROBE_KEY_MARCHID:: Contains the value of :marchid:, as per the ISA
> +  specifications.
> +* :RISCV_HWPROBE_KEY_MIMPLID:: Contains the value of :mimplid:, as per the ISA
> +  specifications.

> diff --git a/arch/riscv/include/uapi/asm/hwprobe.h b/arch/riscv/include/uapi/asm/hwprobe.h
> new file mode 100644
> index 000000000000..591802047460
> --- /dev/null
> +++ b/arch/riscv/include/uapi/asm/hwprobe.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Copyright 2022 Rivos, Inc
> + */
> +
> +#ifndef _UAPI_ASM_HWPROBE_H
> +#define _UAPI_ASM_HWPROBE_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * Interface for probing hardware capabilities from userspace, see
> + * Documentation/riscv/hwprobe.rst for more information.
> + */
> +struct riscv_hwprobe {
> +	__s64 key;
> +	__u64 value;
> +};
> +
> +#define RISCV_HWPROBE_KEY_MVENDORID	0
> +#define RISCV_HWPROBE_KEY_MARCHID	1
> +#define RISCV_HWPROBE_KEY_MIMPID	2
> +/* Increase RISCV_HWPROBE_MAX_KEY when adding items. */

Can't wait for that to get forgotten!

> diff --git a/arch/riscv/include/uapi/asm/unistd.h b/arch/riscv/include/uapi/asm/unistd.h
> index 73d7cdd2ec49..37d47302322a 100644
> --- a/arch/riscv/include/uapi/asm/unistd.h
> +++ b/arch/riscv/include/uapi/asm/unistd.h
> @@ -43,3 +43,11 @@
>  #define __NR_riscv_flush_icache (__NR_arch_specific_syscall + 15)
>  #endif
>  __SYSCALL(__NR_riscv_flush_icache, sys_riscv_flush_icache)
> +
> +/*
> + * Allows userspace to probe

That comment looks chopped off midway through.

> diff --git a/arch/riscv/kernel/sys_riscv.c b/arch/riscv/kernel/sys_riscv.c
> index 5d3f2fbeb33c..868a12384f5a 100644
> --- a/arch/riscv/kernel/sys_riscv.c
> +++ b/arch/riscv/kernel/sys_riscv.c
> @@ -6,8 +6,11 @@
>   */
>  
>  #include <linux/syscalls.h>
> -#include <asm/unistd.h>
>  #include <asm/cacheflush.h>
> +#include <asm/cpufeature.h>
> +#include <asm/hwprobe.h>
> +#include <asm/uaccess.h>
> +#include <asm/unistd.h>
>  #include <asm-generic/mman-common.h>
>  
>  static long riscv_sys_mmap(unsigned long addr, unsigned long len,
> @@ -69,3 +72,144 @@ SYSCALL_DEFINE3(riscv_flush_icache, uintptr_t, start, uintptr_t, end,
>  
>  	return 0;
>  }
> +
> +/*
> + * The hwprobe interface, for allowing userspace to probe to see which features
> + * are supported by the hardware.  See Documentation/riscv/hwprobe.rst for more
> + * details.
> + */
> +static int set_hwprobe(struct riscv_hwprobe __user *pair, u64 val)
> +{
> +	long ret;
> +
> +	ret = put_user(val, &pair->value);
> +	if (ret < 0)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +static long hwprobe_mid(struct riscv_hwprobe __user *pair, size_t key,

ngl, it took me far too long to figure out that "mid" was not some
shortening of "middle". I don't have any suggestions though, other than
using cpu_id to match your variable and I think it becomes clearer by the
end of the series anyway when the access alignment stuff appears.

> +			cpumask_t *cpus)
> +{
> +	long cpu, id;
> +	bool first, valid;
> +
> +	first = true;
> +	valid = false;

Just make that
boot first = true, valid = false;
no?

> +	for_each_cpu(cpu, cpus) {
> +		struct riscv_cpuinfo *ci = per_cpu_ptr(&riscv_cpuinfo, cpu);
> +		long cpu_id;
> +
> +		switch (key) {
> +		case RISCV_HWPROBE_KEY_MVENDORID:
> +			cpu_id = ci->mvendorid;

Are you intentionally avoiding using riscv_cached_mfooid()?

> +			break;
> +		case RISCV_HWPROBE_KEY_MIMPID:
> +			cpu_id = ci->mimpid;
> +			break;
> +		case RISCV_HWPROBE_KEY_MARCHID:
> +			cpu_id = ci->marchid;
> +			break;
> +		}
> +
> +		if (first) {
> +			id = cpu_id;
> +			valid = true;

So, we only end up in this function if there are CPUs in the set.
Does that mean we can assume validity by default and just define valid =
true from the get go?

> +		}
> +
> +		if (id != cpu_id)
> +			valid = false;
> +	}
> +
> +	/*
> +	 * put_user() returns 0 on success, so use 1 to indicate it wasn't
> +	 * called and we should skip having incremented the output.
> +	 */
> +	if (!valid)
> +		return 1;

I'm not sure why you're returning 1 here. If the id is deemed to be
invalid, why aren't we treating it as a "real" error.
TL;DR I don't understand the comment explaining this.

> +
> +	return set_hwprobe(pair, id);
> +}
> +
> +static
> +long do_riscv_hwprobe(struct riscv_hwprobe __user *pairs, long pair_count,

The files not wrapped at 80 anyway, so why not put this on the same line
as `static`?

> +		      long cpu_count, unsigned long __user *cpus_user,
> +		      unsigned long flags)
> +{
> +	size_t out;
> +	s64 key;
> +	long ret;
> +	struct cpumask cpus;
> +
> +	/* Check the reserved flags. */
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	/*
> +	 * The only supported values must be the same on all CPUs. Allow

"The only supported values" doesn't really make much sense. Is that
intended to be read as meaning that we only report the features
supported by all CPUs in the set?

> +	 * userspace to specify NULL and 0 as a shortcut to all online CPUs.
> +	 */
> +	cpumask_clear(&cpus);
> +	if ((cpu_count == 0) && (cpus_user == NULL)) {

Is this not just `if (!cpu_count && !cpus_user)`?

> +		cpumask_copy(&cpus, cpu_online_mask);
> +	} else {
> +		if (cpu_count > cpumask_size())
> +			cpu_count = cpumask_size();

nit: newline here please, helps my poor auld eyes out on the days
they're not doing too well!

Cheers,
Conor.

> +		ret = copy_from_user(&cpus, cpus_user, cpu_count);
> +		if (!ret)
> +			return -EFAULT;
> +
> +		/*
> +		 * Userspace must provide at least one online CPU, without that there's
> +		 * no way to define what is supported.
> +		 */
> +		cpumask_and(&cpus, &cpus, cpu_online_mask);
> +		if (cpumask_empty(&cpus))
> +			return -EINVAL;
> +	}
> +
> +	for (out = 0; out < pair_count; out++, pairs++) {
> +		long ret;
> +
> +		if (get_user(key, &pairs->key))
> +			return -EFAULT;
> +
> +		switch (key) {
> +		case RISCV_HWPROBE_KEY_MVENDORID:
> +		case RISCV_HWPROBE_KEY_MARCHID:
> +		case RISCV_HWPROBE_KEY_MIMPID:
> +			ret = hwprobe_mid(pairs, key, &cpus);
> +			break;
> +
> +		/*
> +		 * For forward compatibility, unknown keys don't fail the whole
> +		 * call, but get their element key set to -1 and value set to 0
> +		 * indicating they're unrecognized.
> +		 */
> +		default:
> +			ret = put_user(-1, &pairs->key);
> +			if (ret < 0)
> +				return ret;
> +
> +			ret = set_hwprobe(pairs, 0);
> +			if (ret)
> +				return ret;
> +
> +			break;
> +		}
> +
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	return 0;
> +
> +}
> +
> +SYSCALL_DEFINE5(riscv_hwprobe, uintptr_t, pairs, uintptr_t, pair_count,
> +		uintptr_t, cpu_count, uintptr_t, cpus, uintptr_t, flags)
> +{
> +	return do_riscv_hwprobe((void __user *)pairs, pair_count, cpu_count,
> +				(void __user *)cpus, flags);
> +}
> -- 
> 2.25.1
>
Andrew Jones Feb. 15, 2023, 8:04 a.m. UTC | #11
On Tue, Feb 14, 2023 at 11:51:17PM +0000, Conor Dooley wrote:
> Hey Evan,
> 
> Just as a preface, I'm reviewing this lot from a position of ignorance
> on what glibc wants so I'll refrain from commenting on call itself.
> I figure that since the commentary has kinda died on the sysfs front &
> Palmer seems to be still into the syscall stuff, that we're pushing on
> with this approach...

If I can find time (that's a big if, atm) I'd be willing to do a sysfs
prototype just for comparison/benchmarking purposes, even if the chances
it gets merged are low. But, time...

Thanks,
drew
Arnd Bergmann Feb. 15, 2023, 9:56 a.m. UTC | #12
On Mon, Feb 6, 2023, at 21:14, Evan Green wrote:
> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> system call that quite does this, so let's just provide an arch-specific
> one to probe for hardware capabilities.  This currently just provides
> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> the future.
>
> Co-developed-by: Palmer Dabbelt <palmer@rivosinc.com>
> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
> Signed-off-by: Evan Green <evan@rivosinc.com>

I'm not sure I understand the problem with
AT_HWCAP. While the bits in AT_HWCAP and AT_HWCAP2
are limited, I don't see us running out of new
AT_* words to use for additional bits. Presumably
the kernel would already have to know about the
name of each supported HW feature and could assign
a unique bit number to them.

      Arnd
Evan Green Feb. 15, 2023, 8:49 p.m. UTC | #13
On Tue, Feb 14, 2023 at 3:51 PM Conor Dooley <conor@kernel.org> wrote:
>
> Hey Evan,
>
> Just as a preface, I'm reviewing this lot from a position of ignorance
> on what glibc wants so I'll refrain from commenting on call itself.
> I figure that since the commentary has kinda died on the sysfs front &
> Palmer seems to be still into the syscall stuff, that we're pushing on
> with this approach...

Yep, sounds good.

>
> On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:
> > We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> > system call that quite does this, so let's just provide an arch-specific
> > one to probe for hardware capabilities.  This currently just provides
> > m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> > the future.
> >
> > Co-developed-by: Palmer Dabbelt <palmer@rivosinc.com>
> > Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
> > Signed-off-by: Evan Green <evan@rivosinc.com>
> >
> > ---
> >
> > Changes in v2:
> >  - Changed the interface to look more like poll(). Rather than supplying
> >    key_offset and getting back an array of values with numerically
> >    contiguous keys, have the user pre-fill the key members of the array,
> >    and the kernel will fill in the corresponding values. For any key it
> >    doesn't recognize, it will set the key of that element to -1. This
> >    allows usermode to quickly ask for exactly the elements it cares
> >    about, and not get bogged down in a back and forth about newer keys
> >    that older kernels might not recognize. In other words, the kernel
> >    can communicate that it doesn't recognize some of the keys while
> >    still providing the data for the keys it does know.
> >  - Added a shortcut to the cpuset parameters that if a size of 0 and
> >    NULL is provided for the CPU set, the kernel will use a cpu mask of
> >    all online CPUs. This is convenient because I suspect most callers
> >    will only want to act on a feature if it's supported on all CPUs, and
> >    it's a headache to dynamically allocate an array of all 1s, not to
> >    mention a waste to have the kernel loop over all of the offline bits.
> >
> >
> > ---
> >  Documentation/riscv/hwprobe.rst       |  37 +++++++
> >  Documentation/riscv/index.rst         |   1 +
> >  arch/riscv/include/asm/hwprobe.h      |  13 +++
> >  arch/riscv/include/asm/syscall.h      |   3 +
> >  arch/riscv/include/uapi/asm/hwprobe.h |  25 +++++
> >  arch/riscv/include/uapi/asm/unistd.h  |   8 ++
> >  arch/riscv/kernel/cpu.c               |   3 +-
> >  arch/riscv/kernel/sys_riscv.c         | 146 +++++++++++++++++++++++++-
> >  8 files changed, 234 insertions(+), 2 deletions(-)
> >  create mode 100644 Documentation/riscv/hwprobe.rst
> >  create mode 100644 arch/riscv/include/asm/hwprobe.h
> >  create mode 100644 arch/riscv/include/uapi/asm/hwprobe.h
> >
> > diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> > new file mode 100644
> > index 000000000000..97771090e972
> > --- /dev/null
> > +++ b/Documentation/riscv/hwprobe.rst
> > @@ -0,0 +1,37 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +RISC-V Hardware Probing Interface
> > +---------------------------------
> > +
> > +The RISC-V hardware probing interface is based around a single syscall, which
> > +is defined in <asm/hwprobe.h>::
> > +
> > +    struct riscv_hwprobe {
> > +        __s64 key;
> > +        __u64 value;
> > +    };
> > +
> > +    long sys_riscv_hwprobe(struct riscv_hwprobe *pairs, size_t pair_count,
> > +                           size_t cpu_count, cpu_set_t *cpus,
> > +                           unsigned long flags);
> > +
> > +The arguments are split into three groups: an array of key-value pairs, a CPU
> > +set, and some flags.  The key-value pairs are supplied with a count.  Userspace
> > +must prepopulate the key field for each element, and the kernel will fill in the
> > +value if the key is recognized.  If a key is unknown to the kernel, its key
> > +field will be cleared to -1, and its value set to 0.  The CPU set is defined by
> > +CPU_SET(3), the indicated features will be supported on all CPUs in the set.
> > +Usermode can supply NULL for cpus and 0 for cpu_count as a shortcut for all
> > +online CPUs. There are currently no flags, this value must be zero for future
> > +compatibility.
> > +
> > +On success 0 is returned, on failure a negative error code is returned.
> > +
> > +The following keys are defined:
> > +
> > +* :RISCV_HWPROBE_KEY_MVENDORID:: Contains the value of :mvendorid:, as per the
> > +  ISA specifications.
>
> "per the ISA specifications" sounds like dangerous wording to me! ;)

I can replace "per the ISA specifications" with "as defined by the
RISC-V privileged architecture specification" to try and make that
more crisp.

>
> > +* :RISCV_HWPROBE_KEY_MARCHID:: Contains the value of :marchid:, as per the ISA
> > +  specifications.
> > +* :RISCV_HWPROBE_KEY_MIMPLID:: Contains the value of :mimplid:, as per the ISA
> > +  specifications.
>
> > diff --git a/arch/riscv/include/uapi/asm/hwprobe.h b/arch/riscv/include/uapi/asm/hwprobe.h
> > new file mode 100644
> > index 000000000000..591802047460
> > --- /dev/null
> > +++ b/arch/riscv/include/uapi/asm/hwprobe.h
> > @@ -0,0 +1,25 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + * Copyright 2022 Rivos, Inc
> > + */
> > +
> > +#ifndef _UAPI_ASM_HWPROBE_H
> > +#define _UAPI_ASM_HWPROBE_H
> > +
> > +#include <linux/types.h>
> > +
> > +/*
> > + * Interface for probing hardware capabilities from userspace, see
> > + * Documentation/riscv/hwprobe.rst for more information.
> > + */
> > +struct riscv_hwprobe {
> > +     __s64 key;
> > +     __u64 value;
> > +};
> > +
> > +#define RISCV_HWPROBE_KEY_MVENDORID  0
> > +#define RISCV_HWPROBE_KEY_MARCHID    1
> > +#define RISCV_HWPROBE_KEY_MIMPID     2
> > +/* Increase RISCV_HWPROBE_MAX_KEY when adding items. */
>
> Can't wait for that to get forgotten!

I know. I could add an if (pair->key > RISCV_HWPROBE_MAX_KEY) goto
unrecognized_key, with a label at the default switch case, which would
effectively be a runtime guard against it. I opted not to as it's
aesthetically harsh, but anyone can holler if they want it.

>
> > diff --git a/arch/riscv/include/uapi/asm/unistd.h b/arch/riscv/include/uapi/asm/unistd.h
> > index 73d7cdd2ec49..37d47302322a 100644
> > --- a/arch/riscv/include/uapi/asm/unistd.h
> > +++ b/arch/riscv/include/uapi/asm/unistd.h
> > @@ -43,3 +43,11 @@
> >  #define __NR_riscv_flush_icache (__NR_arch_specific_syscall + 15)
> >  #endif
> >  __SYSCALL(__NR_riscv_flush_icache, sys_riscv_flush_icache)
> > +
> > +/*
> > + * Allows userspace to probe
>
> That comment looks chopped off midway through.

Whoops yes I

>
> > diff --git a/arch/riscv/kernel/sys_riscv.c b/arch/riscv/kernel/sys_riscv.c
> > index 5d3f2fbeb33c..868a12384f5a 100644
> > --- a/arch/riscv/kernel/sys_riscv.c
> > +++ b/arch/riscv/kernel/sys_riscv.c
> > @@ -6,8 +6,11 @@
> >   */
> >
> >  #include <linux/syscalls.h>
> > -#include <asm/unistd.h>
> >  #include <asm/cacheflush.h>
> > +#include <asm/cpufeature.h>
> > +#include <asm/hwprobe.h>
> > +#include <asm/uaccess.h>
> > +#include <asm/unistd.h>
> >  #include <asm-generic/mman-common.h>
> >
> >  static long riscv_sys_mmap(unsigned long addr, unsigned long len,
> > @@ -69,3 +72,144 @@ SYSCALL_DEFINE3(riscv_flush_icache, uintptr_t, start, uintptr_t, end,
> >
> >       return 0;
> >  }
> > +
> > +/*
> > + * The hwprobe interface, for allowing userspace to probe to see which features
> > + * are supported by the hardware.  See Documentation/riscv/hwprobe.rst for more
> > + * details.
> > + */
> > +static int set_hwprobe(struct riscv_hwprobe __user *pair, u64 val)
> > +{
> > +     long ret;
> > +
> > +     ret = put_user(val, &pair->value);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     return 0;
> > +}
> > +
> > +static long hwprobe_mid(struct riscv_hwprobe __user *pair, size_t key,
>
> ngl, it took me far too long to figure out that "mid" was not some
> shortening of "middle". I don't have any suggestions though, other than
> using cpu_id to match your variable and I think it becomes clearer by the
> end of the series anyway when the access alignment stuff appears.

Agreed. I'll rename this to hwprobe_arch_id().

>
> > +                     cpumask_t *cpus)
> > +{
> > +     long cpu, id;
> > +     bool first, valid;
> > +
> > +     first = true;
> > +     valid = false;
>
> Just make that
> boot first = true, valid = false;
> no?
>
> > +     for_each_cpu(cpu, cpus) {
> > +             struct riscv_cpuinfo *ci = per_cpu_ptr(&riscv_cpuinfo, cpu);
> > +             long cpu_id;
> > +
> > +             switch (key) {
> > +             case RISCV_HWPROBE_KEY_MVENDORID:
> > +                     cpu_id = ci->mvendorid;
>
> Are you intentionally avoiding using riscv_cached_mfooid()?

This might have been done with the expectation that there would be
more members in that struct to query, so grabbing the pointer in a
higher scope made sense. It didn't evolve that way, so you're right I
should just use those functions.

>
> > +                     break;
> > +             case RISCV_HWPROBE_KEY_MIMPID:
> > +                     cpu_id = ci->mimpid;
> > +                     break;
> > +             case RISCV_HWPROBE_KEY_MARCHID:
> > +                     cpu_id = ci->marchid;
> > +                     break;
> > +             }
> > +
> > +             if (first) {
> > +                     id = cpu_id;
> > +                     valid = true;
>
> So, we only end up in this function if there are CPUs in the set.
> Does that mean we can assume validity by default and just define valid =
> true from the get go?

The valid bool was meant to remember if we had come across a value
that was different from the others. But it can be eliminated with a
different code flow. For the next spin I removed valid altogether and
just break out of the loop if we find a value that's different, since
we already know what the return value will be and don't need to keep
iterating.

>
> > +             }
> > +
> > +             if (id != cpu_id)
> > +                     valid = false;
> > +     }
> > +
> > +     /*
> > +      * put_user() returns 0 on success, so use 1 to indicate it wasn't
> > +      * called and we should skip having incremented the output.
> > +      */
> > +     if (!valid)
> > +             return 1;
>
> I'm not sure why you're returning 1 here. If the id is deemed to be
> invalid, why aren't we treating it as a "real" error.
> TL;DR I don't understand the comment explaining this.

You're right, this is a mistake, leftover from v1 when the return
value signified the number of entries populated. For v3 I changed the
return type of this function to void, and just stick -1 in value if
the requested key isn't consistent across the given cpu set. That way
the syscall can still populate and return all the other values you
requested while simultaneously indicating "I can't give you a single
answer for that cpuset" for these vendor/arch/impl id elements.

>
> > +
> > +     return set_hwprobe(pair, id);
> > +}
> > +
> > +static
> > +long do_riscv_hwprobe(struct riscv_hwprobe __user *pairs, long pair_count,
>
> The files not wrapped at 80 anyway, so why not put this on the same line
> as `static`?

It actually mostly is wrapped at 80, except for another mistake I
made. But I can break up the line between the parameters which I think
is closer to what people are used to seeing.

>
> > +                   long cpu_count, unsigned long __user *cpus_user,
> > +                   unsigned long flags)
> > +{
> > +     size_t out;
> > +     s64 key;
> > +     long ret;
> > +     struct cpumask cpus;
> > +
> > +     /* Check the reserved flags. */
> > +     if (flags != 0)
> > +             return -EINVAL;
> > +
> > +     /*
> > +      * The only supported values must be the same on all CPUs. Allow
>
> "The only supported values" doesn't really make much sense. Is that
> intended to be read as meaning that we only report the features
> supported by all CPUs in the set?

Yeah let me change that to:
/*
* The interface supports taking in a CPU mask, and returns values that
* are consistent across that mask. Allow userspace to specify NULL and
* 0 as a shortcut to all online CPUs.
*/


>
> > +      * userspace to specify NULL and 0 as a shortcut to all online CPUs.
> > +      */
> > +     cpumask_clear(&cpus);
> > +     if ((cpu_count == 0) && (cpus_user == NULL)) {
>
> Is this not just `if (!cpu_count && !cpus_user)`?
>
> > +             cpumask_copy(&cpus, cpu_online_mask);
> > +     } else {
> > +             if (cpu_count > cpumask_size())
> > +                     cpu_count = cpumask_size();
>
> nit: newline here please, helps my poor auld eyes out on the days
> they're not doing too well!
>
> Cheers,
> Conor.
>
> > +             ret = copy_from_user(&cpus, cpus_user, cpu_count);
> > +             if (!ret)
> > +                     return -EFAULT;
> > +
> > +             /*
> > +              * Userspace must provide at least one online CPU, without that there's
> > +              * no way to define what is supported.
> > +              */
> > +             cpumask_and(&cpus, &cpus, cpu_online_mask);
> > +             if (cpumask_empty(&cpus))
> > +                     return -EINVAL;
> > +     }
> > +
> > +     for (out = 0; out < pair_count; out++, pairs++) {
> > +             long ret;
> > +
> > +             if (get_user(key, &pairs->key))
> > +                     return -EFAULT;
> > +
> > +             switch (key) {
> > +             case RISCV_HWPROBE_KEY_MVENDORID:
> > +             case RISCV_HWPROBE_KEY_MARCHID:
> > +             case RISCV_HWPROBE_KEY_MIMPID:
> > +                     ret = hwprobe_mid(pairs, key, &cpus);
> > +                     break;
> > +
> > +             /*
> > +              * For forward compatibility, unknown keys don't fail the whole
> > +              * call, but get their element key set to -1 and value set to 0
> > +              * indicating they're unrecognized.
> > +              */
> > +             default:
> > +                     ret = put_user(-1, &pairs->key);
> > +                     if (ret < 0)
> > +                             return ret;
> > +
> > +                     ret = set_hwprobe(pairs, 0);
> > +                     if (ret)
> > +                             return ret;
> > +
> > +                     break;
> > +             }
> > +
> > +             if (ret < 0)
> > +                     return ret;
> > +     }
> > +
> > +     return 0;
> > +
> > +}
> > +
> > +SYSCALL_DEFINE5(riscv_hwprobe, uintptr_t, pairs, uintptr_t, pair_count,
> > +             uintptr_t, cpu_count, uintptr_t, cpus, uintptr_t, flags)
> > +{
> > +     return do_riscv_hwprobe((void __user *)pairs, pair_count, cpu_count,
> > +                             (void __user *)cpus, flags);
> > +}
> > --
> > 2.25.1
> >
Conor Dooley Feb. 15, 2023, 9:10 p.m. UTC | #14
heh, this came in right as I went to check out by branch with this on it
and look at the rest of the series.

On Wed, Feb 15, 2023 at 12:49:35PM -0800, Evan Green wrote:
> On Tue, Feb 14, 2023 at 3:51 PM Conor Dooley <conor@kernel.org> wrote:
> > On Mon, Feb 06, 2023 at 12:14:51PM -0800, Evan Green wrote:

> > > +On success 0 is returned, on failure a negative error code is returned.
> > > +
> > > +The following keys are defined:
> > > +
> > > +* :RISCV_HWPROBE_KEY_MVENDORID:: Contains the value of :mvendorid:, as per the
> > > +  ISA specifications.
> >
> > "per the ISA specifications" sounds like dangerous wording to me! ;)
> 
> I can replace "per the ISA specifications" with "as defined by the
> RISC-V privileged architecture specification" to try and make that
> more crisp.

Meh was a comment about not trusting the ISA specs, not an attempt to
be a pedant!

> > > +#define RISCV_HWPROBE_KEY_MVENDORID  0
> > > +#define RISCV_HWPROBE_KEY_MARCHID    1
> > > +#define RISCV_HWPROBE_KEY_MIMPID     2
> > > +/* Increase RISCV_HWPROBE_MAX_KEY when adding items. */
> >
> > Can't wait for that to get forgotten!
> 
> I know. I could add an if (pair->key > RISCV_HWPROBE_MAX_KEY) goto
> unrecognized_key, with a label at the default switch case, which would
> effectively be a runtime guard against it. I opted not to as it's
> aesthetically harsh, but anyone can holler if they want it.

The other question to ask is, do we need RISCV_HWPROBE_MAX_KEY?
What's it for?

> > > diff --git a/arch/riscv/include/uapi/asm/unistd.h b/arch/riscv/include/uapi/asm/unistd.h
> > > index 73d7cdd2ec49..37d47302322a 100644
> > > --- a/arch/riscv/include/uapi/asm/unistd.h
> > > +++ b/arch/riscv/include/uapi/asm/unistd.h
> > > @@ -43,3 +43,11 @@
> > >  #define __NR_riscv_flush_icache (__NR_arch_specific_syscall + 15)
> > >  #endif
> > >  __SYSCALL(__NR_riscv_flush_icache, sys_riscv_flush_icache)
> > > +
> > > +/*
> > > + * Allows userspace to probe
> >
> > That comment looks chopped off midway through.
> 
> Whoops yes I

If you could flesh it
Evan Green Feb. 15, 2023, 9:14 p.m. UTC | #15
On Wed, Feb 15, 2023 at 1:57 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Mon, Feb 6, 2023, at 21:14, Evan Green wrote:
> > We don't have enough space for these all in ELF_HWCAP{,2} and there's no
> > system call that quite does this, so let's just provide an arch-specific
> > one to probe for hardware capabilities.  This currently just provides
> > m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
> > the future.
> >
> > Co-developed-by: Palmer Dabbelt <palmer@rivosinc.com>
> > Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
> > Signed-off-by: Evan Green <evan@rivosinc.com>
>
> I'm not sure I understand the problem with
> AT_HWCAP. While the bits in AT_HWCAP and AT_HWCAP2
> are limited, I don't see us running out of new
> AT_* words to use for additional bits. Presumably
> the kernel would already have to know about the
> name of each supported HW feature and could assign
> a unique bit number to them.

Palmer can probably speak to this with more authority, but my
understanding about the motivation for an approach like this goes
something like:
 * With the nature of RISC-V, we expect a lot of these types of bits
and bobs, many more than we've seen with the likes of x86 and ARM.
 * We also expect in some cases these values to be inconsistent across CPUs.
 * While we could copy all that data into the aux vector every time,
it starts to look like a lot of data, not all programs care about all
of it, and a lot of it is static, making all the copying wasteful.
 * Another option that would solve most of this would be to point to a
vDSO data area from the aux vector. This solves the copy complaints,
but makes that vDSO data ABI, and requires it all to be known up
front.
 * So, a syscall with a vDSO function in front of it seemed like a
good combination of speed and flexibility.

You're certainly right that HWCAPn would work for what we're exposing
today, so the question probably comes down to our relative predictions
of how this data will grow.

-Evan
Jessica Clarke Feb. 15, 2023, 10:43 p.m. UTC | #16
On 15 Feb 2023, at 21:14, Evan Green <evan@rivosinc.com> wrote:
> 
> On Wed, Feb 15, 2023 at 1:57 AM Arnd Bergmann <arnd@arndb.de> wrote:
>> 
>> On Mon, Feb 6, 2023, at 21:14, Evan Green wrote:
>>> We don't have enough space for these all in ELF_HWCAP{,2} and there's no
>>> system call that quite does this, so let's just provide an arch-specific
>>> one to probe for hardware capabilities.  This currently just provides
>>> m{arch,imp,vendor}id, but with the key-value pairs we can pass more in
>>> the future.
>>> 
>>> Co-developed-by: Palmer Dabbelt <palmer@rivosinc.com>
>>> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
>>> Signed-off-by: Evan Green <evan@rivosinc.com>
>> 
>> I'm not sure I understand the problem with
>> AT_HWCAP. While the bits in AT_HWCAP and AT_HWCAP2
>> are limited, I don't see us running out of new
>> AT_* words to use for additional bits. Presumably
>> the kernel would already have to know about the
>> name of each supported HW feature and could assign
>> a unique bit number to them.
> 
> Palmer can probably speak to this with more authority, but my
> understanding about the motivation for an approach like this goes
> something like:
> * With the nature of RISC-V, we expect a lot of these types of bits
> and bobs, many more than we've seen with the likes of x86 and ARM.

We’re already at (I think) 51 standard user-level extensions that LLVM
knows about.

> * We also expect in some cases these values to be inconsistent across CPUs.

That’s also true of some Arm SoCs.

> * While we could copy all that data into the aux vector every time,
> it starts to look like a lot of data, not all programs care about all
> of it, and a lot of it is static, making all the copying wasteful.

Bitvectors are pretty cheap, this is negligible.

> * Another option that would solve most of this would be to point to a
> vDSO data area from the aux vector. This solves the copy complaints,
> but makes that vDSO data ABI, and requires it all to be known up
> front.

That doesn't seem like a huge deal, other than my usual point of
needing a standardised portable cross-platform API for this, so that
shouldn’t be “the” generic interface programmed against by applications.

> * So, a syscall with a vDSO function in front of it seemed like a
> good combination of speed and flexibility.
> 
> You're certainly right that HWCAPn would work for what we're exposing
> today, so the question probably comes down to our relative predictions
> of how this data will grow.

The other big problem is vendor extensions.

Jess

> -Evan
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Arnd Bergmann Feb. 16, 2023, 1:28 p.m. UTC | #17
On Wed, Feb 15, 2023, at 23:43, Jessica Clarke wrote:
> On 15 Feb 2023, at 21:14, Evan Green <evan@rivosinc.com> wrote:
>> On Wed, Feb 15, 2023 at 1:57 AM Arnd Bergmann <arnd@arndb.de> wrote:
>> Palmer can probably speak to this with more authority, but my
>> understanding about the motivation for an approach like this goes
>> something like:
>> * With the nature of RISC-V, we expect a lot of these types of bits
>> and bobs, many more than we've seen with the likes of x86 and ARM.
>
> We’re already at (I think) 51 standard user-level extensions that LLVM
> knows about.

Do you have an estimate of how many of these require kernel support
beyond identifying the extensions?

>> * We also expect in some cases these values to be inconsistent across CPUs.
>
> That’s also true of some Arm SoCs.

Right, but it's also something that we should not encourage, or
need to make easy to use. On arm64, the kernel support for having
asymmetric aarch32 mode was kept to an absolute minimum, and an
application is expected to get the information from /proc/cpuinfo
before pinning down a task to the correct subset of all CPUs.

>> * So, a syscall with a vDSO function in front of it seemed like a
>> good combination of speed and flexibility.
>> 
>> You're certainly right that HWCAPn would work for what we're exposing
>> today, so the question probably comes down to our relative predictions
>> of how this data will grow.
>
> The other big problem is vendor extensions.

My biggest concern is how this would be synchronized between the
interfaces that are available to users. What we have on other architectures
is a set of string identifiers in /proc/cpuinfo and a bitmask in HWCAP.
Ideally these are added in pairs so the information available to shell
scripts in human readers is the same that is available in the auxvec
data.

Adding a third interface with the same information or a superset
requires more work in ensuring that each extension is available
in exactly the right places. Ideally I think there should be only
one table of possible CPU features so nobody has to make the
decision about which ones are important enough to add to one
interface or another.

     Arnd
Evan Green Feb. 16, 2023, 11:18 p.m. UTC | #18
On Thu, Feb 16, 2023 at 5:30 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Wed, Feb 15, 2023, at 23:43, Jessica Clarke wrote:
> > On 15 Feb 2023, at 21:14, Evan Green <evan@rivosinc.com> wrote:
> >> On Wed, Feb 15, 2023 at 1:57 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >> Palmer can probably speak to this with more authority, but my
> >> understanding about the motivation for an approach like this goes
> >> something like:
> >> * With the nature of RISC-V, we expect a lot of these types of bits
> >> and bobs, many more than we've seen with the likes of x86 and ARM.
> >
> > We’re already at (I think) 51 standard user-level extensions that LLVM
> > knows about.
>
> Do you have an estimate of how many of these require kernel support
> beyond identifying the extensions?
>
> >> * We also expect in some cases these values to be inconsistent across CPUs.
> >
> > That’s also true of some Arm SoCs.
>
> Right, but it's also something that we should not encourage, or
> need to make easy to use. On arm64, the kernel support for having
> asymmetric aarch32 mode was kept to an absolute minimum, and an
> application is expected to get the information from /proc/cpuinfo
> before pinning down a task to the correct subset of all CPUs.
>
> >> * So, a syscall with a vDSO function in front of it seemed like a
> >> good combination of speed and flexibility.
> >>
> >> You're certainly right that HWCAPn would work for what we're exposing
> >> today, so the question probably comes down to our relative predictions
> >> of how this data will grow.
> >
> > The other big problem is vendor extensions.

Since the key range can grow without accruing a process startup time
penalty, this proposal handles vendor extensions fairly well, no?
(Contrasting at least against hwcap bits, which once allocated have to
be copied into every new process forever).

>
> My biggest concern is how this would be synchronized between the
> interfaces that are available to users. What we have on other architectures
> is a set of string identifiers in /proc/cpuinfo and a bitmask in HWCAP.
> Ideally these are added in pairs so the information available to shell
> scripts in human readers is the same that is available in the auxvec
> data.
>
> Adding a third interface with the same information or a superset
> requires more work in ensuring that each extension is available
> in exactly the right places. Ideally I think there should be only
> one table of possible CPU features so nobody has to make the
> decision about which ones are important enough to add to one
> interface or another.

That makes sense. In this case, the proposal is that RISC-V would use
this mechanism and generally abandon hwcap. So my hope is there should
still only be about 2 spots to maintain.
-Evan
diff mbox series

Patch

diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
new file mode 100644
index 000000000000..97771090e972
--- /dev/null
+++ b/Documentation/riscv/hwprobe.rst
@@ -0,0 +1,37 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+RISC-V Hardware Probing Interface
+---------------------------------
+
+The RISC-V hardware probing interface is based around a single syscall, which
+is defined in <asm/hwprobe.h>::
+
+    struct riscv_hwprobe {
+        __s64 key;
+        __u64 value;
+    };
+
+    long sys_riscv_hwprobe(struct riscv_hwprobe *pairs, size_t pair_count,
+                           size_t cpu_count, cpu_set_t *cpus,
+                           unsigned long flags);
+
+The arguments are split into three groups: an array of key-value pairs, a CPU
+set, and some flags.  The key-value pairs are supplied with a count.  Userspace
+must prepopulate the key field for each element, and the kernel will fill in the
+value if the key is recognized.  If a key is unknown to the kernel, its key
+field will be cleared to -1, and its value set to 0.  The CPU set is defined by
+CPU_SET(3), the indicated features will be supported on all CPUs in the set.
+Usermode can supply NULL for cpus and 0 for cpu_count as a shortcut for all
+online CPUs. There are currently no flags, this value must be zero for future
+compatibility.
+
+On success 0 is returned, on failure a negative error code is returned.
+
+The following keys are defined:
+
+* :RISCV_HWPROBE_KEY_MVENDORID:: Contains the value of :mvendorid:, as per the
+  ISA specifications.
+* :RISCV_HWPROBE_KEY_MARCHID:: Contains the value of :marchid:, as per the ISA
+  specifications.
+* :RISCV_HWPROBE_KEY_MIMPLID:: Contains the value of :mimplid:, as per the ISA
+  specifications.
diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
index 2e5b18fbb145..175a91db0200 100644
--- a/Documentation/riscv/index.rst
+++ b/Documentation/riscv/index.rst
@@ -7,6 +7,7 @@  RISC-V architecture
 
     boot-image-header
     vm-layout
+    hwprobe
     patch-acceptance
     uabi
 
diff --git a/arch/riscv/include/asm/hwprobe.h b/arch/riscv/include/asm/hwprobe.h
new file mode 100644
index 000000000000..08d1c3bdd78a
--- /dev/null
+++ b/arch/riscv/include/asm/hwprobe.h
@@ -0,0 +1,13 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright 2022 Rivos, Inc
+ */
+
+#ifndef _ASM_HWPROBE_H
+#define _ASM_HWPROBE_H
+
+#include <uapi/asm/hwprobe.h>
+
+#define RISCV_HWPROBE_MAX_KEY 2
+
+#endif
diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
index 384a63b86420..78a6302ef711 100644
--- a/arch/riscv/include/asm/syscall.h
+++ b/arch/riscv/include/asm/syscall.h
@@ -75,4 +75,7 @@  static inline int syscall_get_arch(struct task_struct *task)
 }
 
 asmlinkage long sys_riscv_flush_icache(uintptr_t, uintptr_t, uintptr_t);
+
+asmlinkage long sys_riscv_hwprobe(uintptr_t, uintptr_t, uintptr_t, uintptr_t,
+				  uintptr_t, uintptr_t);
 #endif	/* _ASM_RISCV_SYSCALL_H */
diff --git a/arch/riscv/include/uapi/asm/hwprobe.h b/arch/riscv/include/uapi/asm/hwprobe.h
new file mode 100644
index 000000000000..591802047460
--- /dev/null
+++ b/arch/riscv/include/uapi/asm/hwprobe.h
@@ -0,0 +1,25 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright 2022 Rivos, Inc
+ */
+
+#ifndef _UAPI_ASM_HWPROBE_H
+#define _UAPI_ASM_HWPROBE_H
+
+#include <linux/types.h>
+
+/*
+ * Interface for probing hardware capabilities from userspace, see
+ * Documentation/riscv/hwprobe.rst for more information.
+ */
+struct riscv_hwprobe {
+	__s64 key;
+	__u64 value;
+};
+
+#define RISCV_HWPROBE_KEY_MVENDORID	0
+#define RISCV_HWPROBE_KEY_MARCHID	1
+#define RISCV_HWPROBE_KEY_MIMPID	2
+/* Increase RISCV_HWPROBE_MAX_KEY when adding items. */
+
+#endif
diff --git a/arch/riscv/include/uapi/asm/unistd.h b/arch/riscv/include/uapi/asm/unistd.h
index 73d7cdd2ec49..37d47302322a 100644
--- a/arch/riscv/include/uapi/asm/unistd.h
+++ b/arch/riscv/include/uapi/asm/unistd.h
@@ -43,3 +43,11 @@ 
 #define __NR_riscv_flush_icache (__NR_arch_specific_syscall + 15)
 #endif
 __SYSCALL(__NR_riscv_flush_icache, sys_riscv_flush_icache)
+
+/*
+ * Allows userspace to probe
+ */
+#ifndef __NR_riscv_hwprobe
+#define __NR_riscv_hwprobe (__NR_arch_specific_syscall + 14)
+#endif
+__SYSCALL(__NR_riscv_hwprobe, sys_riscv_hwprobe)
diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index 684e5419d37d..d0fb3567cc3d 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -4,15 +4,16 @@ 
  */
 
 #include <linux/cpu.h>
+#include <linux/cpuhotplug.h>
 #include <linux/init.h>
 #include <linux/seq_file.h>
 #include <linux/of.h>
 #include <asm/cpufeature.h>
 #include <asm/csr.h>
 #include <asm/hwcap.h>
+#include <asm/pgtable.h>
 #include <asm/sbi.h>
 #include <asm/smp.h>
-#include <asm/pgtable.h>
 
 /*
  * Returns the hart ID of the given device tree node, or -ENODEV if the node
diff --git a/arch/riscv/kernel/sys_riscv.c b/arch/riscv/kernel/sys_riscv.c
index 5d3f2fbeb33c..868a12384f5a 100644
--- a/arch/riscv/kernel/sys_riscv.c
+++ b/arch/riscv/kernel/sys_riscv.c
@@ -6,8 +6,11 @@ 
  */
 
 #include <linux/syscalls.h>
-#include <asm/unistd.h>
 #include <asm/cacheflush.h>
+#include <asm/cpufeature.h>
+#include <asm/hwprobe.h>
+#include <asm/uaccess.h>
+#include <asm/unistd.h>
 #include <asm-generic/mman-common.h>
 
 static long riscv_sys_mmap(unsigned long addr, unsigned long len,
@@ -69,3 +72,144 @@  SYSCALL_DEFINE3(riscv_flush_icache, uintptr_t, start, uintptr_t, end,
 
 	return 0;
 }
+
+/*
+ * The hwprobe interface, for allowing userspace to probe to see which features
+ * are supported by the hardware.  See Documentation/riscv/hwprobe.rst for more
+ * details.
+ */
+static int set_hwprobe(struct riscv_hwprobe __user *pair, u64 val)
+{
+	long ret;
+
+	ret = put_user(val, &pair->value);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
+static long hwprobe_mid(struct riscv_hwprobe __user *pair, size_t key,
+			cpumask_t *cpus)
+{
+	long cpu, id;
+	bool first, valid;
+
+	first = true;
+	valid = false;
+	for_each_cpu(cpu, cpus) {
+		struct riscv_cpuinfo *ci = per_cpu_ptr(&riscv_cpuinfo, cpu);
+		long cpu_id;
+
+		switch (key) {
+		case RISCV_HWPROBE_KEY_MVENDORID:
+			cpu_id = ci->mvendorid;
+			break;
+		case RISCV_HWPROBE_KEY_MIMPID:
+			cpu_id = ci->mimpid;
+			break;
+		case RISCV_HWPROBE_KEY_MARCHID:
+			cpu_id = ci->marchid;
+			break;
+		}
+
+		if (first) {
+			id = cpu_id;
+			valid = true;
+		}
+
+		if (id != cpu_id)
+			valid = false;
+	}
+
+	/*
+	 * put_user() returns 0 on success, so use 1 to indicate it wasn't
+	 * called and we should skip having incremented the output.
+	 */
+	if (!valid)
+		return 1;
+
+	return set_hwprobe(pair, id);
+}
+
+static
+long do_riscv_hwprobe(struct riscv_hwprobe __user *pairs, long pair_count,
+		      long cpu_count, unsigned long __user *cpus_user,
+		      unsigned long flags)
+{
+	size_t out;
+	s64 key;
+	long ret;
+	struct cpumask cpus;
+
+	/* Check the reserved flags. */
+	if (flags != 0)
+		return -EINVAL;
+
+	/*
+	 * The only supported values must be the same on all CPUs. Allow
+	 * userspace to specify NULL and 0 as a shortcut to all online CPUs.
+	 */
+	cpumask_clear(&cpus);
+	if ((cpu_count == 0) && (cpus_user == NULL)) {
+		cpumask_copy(&cpus, cpu_online_mask);
+	} else {
+		if (cpu_count > cpumask_size())
+			cpu_count = cpumask_size();
+		ret = copy_from_user(&cpus, cpus_user, cpu_count);
+		if (!ret)
+			return -EFAULT;
+
+		/*
+		 * Userspace must provide at least one online CPU, without that there's
+		 * no way to define what is supported.
+		 */
+		cpumask_and(&cpus, &cpus, cpu_online_mask);
+		if (cpumask_empty(&cpus))
+			return -EINVAL;
+	}
+
+	for (out = 0; out < pair_count; out++, pairs++) {
+		long ret;
+
+		if (get_user(key, &pairs->key))
+			return -EFAULT;
+
+		switch (key) {
+		case RISCV_HWPROBE_KEY_MVENDORID:
+		case RISCV_HWPROBE_KEY_MARCHID:
+		case RISCV_HWPROBE_KEY_MIMPID:
+			ret = hwprobe_mid(pairs, key, &cpus);
+			break;
+
+		/*
+		 * For forward compatibility, unknown keys don't fail the whole
+		 * call, but get their element key set to -1 and value set to 0
+		 * indicating they're unrecognized.
+		 */
+		default:
+			ret = put_user(-1, &pairs->key);
+			if (ret < 0)
+				return ret;
+
+			ret = set_hwprobe(pairs, 0);
+			if (ret)
+				return ret;
+
+			break;
+		}
+
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+
+}
+
+SYSCALL_DEFINE5(riscv_hwprobe, uintptr_t, pairs, uintptr_t, pair_count,
+		uintptr_t, cpu_count, uintptr_t, cpus, uintptr_t, flags)
+{
+	return do_riscv_hwprobe((void __user *)pairs, pair_count, cpu_count,
+				(void __user *)cpus, flags);
+}