mbox series

[v3,00/11] Introduce In Field Scan driver

Message ID 20220419163859.2228874-1-tony.luck@intel.com (mailing list archive)
Headers show
Series Introduce In Field Scan driver | expand

Message

Luck, Tony April 19, 2022, 4:38 p.m. UTC
Longer description of what this does, and why it is useful in the v2 cover
letter here:
  https://lore.kernel.org/all/20220407191347.9681-1-jithu.joseph@intel.com/

But the TL;DR version is this driver loads scan test files that can
check whether silicon in a CPU core is still running correctly. It
is expected that these tests would be run several times per day to
catch problems as silicon ages.

I'm posting this update because I missed many major issues when I added
my review tag. So I have a moral obligation to fix up the things that
I missed.

Changes since V2:

Dan Williams (offline):
----------------------
1) Provided the clue to split into a tiny driver that enumerates and
   registers the device. Then the IFS driver can attach to that an behave
   much more like a normal driver (original idea from Andy Lutomirski,
   used for pmem/nvdimms)

2 .. many) Lots more pointers, tips, and general good guidance to make both
   the code and commit comments better and easier to understand.

Boris:
-----
1) Add "intel_" prefixes to the two functions moving to wider scope.

Done.

2) Move the declarations from <asm/microcode_intel.h> to <asm/cpu.h>

Done.

3) intel_cpu_signatures_match() is small enough to be "inline".

Done.

Greg:
----
1) Is the firmware already submitted to the linux-firmware project for
   inclusion there?
   If not, where should a user get it from?

The scan files will be distributed by Intel on Github in much the
same way that microcode is distributed today.

2) > +struct ifs_binary ifs_binary;

   Please no static memory.  Use the driver model properly which does not
   want you to do this at all.

   You should not need this at all.  If you do, something is wrong as you
   are tying the lifecycle of the memory to the code, not to the device.

Moved this (and ifs_test) to dynamic allocation using devm_kzalloc()
and attaching the resulting pointer to the device with dev_set_drvdata().

3) > +static ssize_t details_show(struct device *dev,
   > +			    struct device_attribute *attr,
   > +			    char *buf)
   > +{
   > +	int ret;
   > +
   > +	if (down_trylock(&ifs_sem))
   > +		return -EBUSY;

   Why do you care about locking here at all?

   > +
   > +	ret = sysfs_emit(buf, "%#llx\n", ifs_test.scan_details);
   > +	up(&ifs_sem);

   What are you protecting?  The value can change right after the lock is
   released, so who cares?

Removed locking from status and details show() functions. Running a test
is synchronous. So:
  # echo 3 > run_test
  # cat status
  # cat details
will give the results of the core 3 test as expected. It is up to the user
to not do dumb things like reading status/details from another process in
parallel with running tests.

4) > +	if (!ifs_binary.loaded) {
   > +		dev_info(&ifs_pdev->dev, "Load scan binary using driver bind interface\n");

   Do not allow userspace to spam kernel logs for no reason :(

   sysfs files are not "help files" in the kernel.

Spam removed.

5) > +void ifs_sysfs_add(void)
   > +{
   > +	ifs_pdev->dev.groups = plat_ifs_groups;

   Why do you have a single global structure?

All instances of the driver for different tests can use the same files
and functions. They use "struct ifs_data *ifsd = dev_get_drvdata(dev);"
to operate on the correct driver instance.

6) > +KernelVersion:	5.19.0

   No need for ".0"

Removed.

7) > +		For e.g to test cpu5 do echo 5 > /sys/devices/platform/intel_ifs/run_test

   So core numbers are different than cpu numbers here?  How are users
   going to map them?

Added some extra text here to say that tests are per core, but any thread
on the core can be used to run the test. Should I also point people at
/sys/devices/system/cpu/cpu#/topology/thread_siblings_list? It seems
easy for users to get a list of cores with a script like:
$ cores=$(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sed -e 's/,.*//' | sort -n | uniq)

8) > +Description:	Version of loaded IFS binary image.

   In what format?

Added "(hexadecimal)". Also added code (and Docs) to print "none" if the load
of the scan file failed.

9) > +Description:	echo "intel_ifs" to reload IFS image.

   Huh?  Why are you using a common sysfs file for this type of attribute?
   Please do not do so, make it "reload" or something like that.

Ok. Added a "reload" file like microcode. (Though using driver bind/unbind
also works).

10) > +Description:	IFS tunable parameter that user can modify before
   > +		the scan run if they wish to override default value.

   And where are those parameters documented?  What are valid values here?

Dropped both the "noirq" and "retry" parameters. I think they now have sane
defaults. If Jithu/Ashok have a good use case, they can send a patch to add
them back.

-Tony


Jithu Joseph (7):
  x86/microcode/intel: Expose collect_cpu_info_early() for IFS
  platform/x86/intel/ifs: Read IFS firmware image
  platform/x86/intel/ifs: Check IFS Image sanity
  platform/x86/intel/ifs: Authenticate and copy to secured memory
  platform/x86/intel/ifs: Add scan test support
  platform/x86/intel/ifs: Add IFS sysfs interface
  platform/x86/intel/ifs: add ABI documentation for IFS

Tony Luck (4):
  Documentation: In-Field Scan
  platform/x86/intel/ifs: Create device for Intel IFS (In Field Scan)
  platform/x86/intel/ifs: Add stub driver for In-Field Scan
  trace: platform/x86/intel/ifs: Add trace point to track Intel IFS
    operations

 .../ABI/testing/sysfs-platform-intel-ifs      |  39 ++
 Documentation/x86/ifs.rst                     | 101 ++++++
 Documentation/x86/index.rst                   |   1 +
 MAINTAINERS                                   |   8 +
 arch/x86/include/asm/cpu.h                    |  18 +
 arch/x86/kernel/cpu/intel.c                   |  32 ++
 arch/x86/kernel/cpu/microcode/intel.c         |  59 +---
 drivers/platform/x86/intel/Kconfig            |   1 +
 drivers/platform/x86/intel/Makefile           |   1 +
 drivers/platform/x86/intel/ifs/Kconfig        |  16 +
 drivers/platform/x86/intel/ifs/Makefile       |   5 +
 drivers/platform/x86/intel/ifs/core.c         |  74 ++++
 drivers/platform/x86/intel/ifs/ifs.h          | 103 ++++++
 .../platform/x86/intel/ifs/intel_ifs_device.c |  50 +++
 drivers/platform/x86/intel/ifs/load.c         | 265 ++++++++++++++
 drivers/platform/x86/intel/ifs/runtest.c      | 333 ++++++++++++++++++
 drivers/platform/x86/intel/ifs/sysfs.c        | 151 ++++++++
 include/trace/events/intel_ifs.h              |  38 ++
 18 files changed, 1243 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-platform-intel-ifs
 create mode 100644 Documentation/x86/ifs.rst
 create mode 100644 drivers/platform/x86/intel/ifs/Kconfig
 create mode 100644 drivers/platform/x86/intel/ifs/Makefile
 create mode 100644 drivers/platform/x86/intel/ifs/core.c
 create mode 100644 drivers/platform/x86/intel/ifs/ifs.h
 create mode 100644 drivers/platform/x86/intel/ifs/intel_ifs_device.c
 create mode 100644 drivers/platform/x86/intel/ifs/load.c
 create mode 100644 drivers/platform/x86/intel/ifs/runtest.c
 create mode 100644 drivers/platform/x86/intel/ifs/sysfs.c
 create mode 100644 include/trace/events/intel_ifs.h


base-commit: b2d229d4ddb17db541098b83524d901257e93845