[2/7] node: Add heterogeneous memory performance

Message ID: 20181114224921.12123-3-keith.busch@intel.com
State: Changes Requested, archived
Series: ACPI HMAT memory sysfs representation

Commit Message

Keith Busch Nov. 14, 2018, 10:49 p.m. UTC
Heterogeneous memory systems provide memory nodes with latency
and bandwidth performance attributes that are different from other
nodes. Create an interface for the kernel to register these attributes
under the node that provides the memory. If the system provides this
information, applications can query the node attributes when deciding
which node to request memory from.

When multiple memory initiators exist, accessing the same memory target
from each may not perform the same. The highest performing
initiator to a given target is considered to be a local initiator for
that target. The kernel provides performance attributes only for the
local initiators.

The memory's compute node should be symlinked in sysfs as one of the
node's initiators.

The following example shows the new sysfs hierarchy for a node exporting
performance attributes:

  # tree /sys/devices/system/node/nodeY/initiator_access
  /sys/devices/system/node/nodeY/initiator_access
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency

Bandwidth is exported in MB/s, and latency is reported in nanoseconds.
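
As an illustration, an application could read these attributes with a
sketch along the following lines; the helper name is hypothetical, only
the sysfs layout shown above comes from this patch:

  /* Read one initiator_access attribute for a given node. */
  #include <stdio.h>

  static int read_node_attr(int nid, const char *attr, unsigned int *val)
  {
          char path[128];
          FILE *f;
          int ret = -1;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/node/node%d/initiator_access/%s",
                   nid, attr);
          f = fopen(path, "r");
          if (!f)
                  return ret;
          if (fscanf(f, "%u", val) == 1)
                  ret = 0;
          fclose(f);
          return ret;
  }

An application could, for example, compare read_latency across candidate
nodes before binding its allocations with set_mempolicy() or mbind().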

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/Kconfig |  8 ++++++++
 drivers/base/node.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h | 22 ++++++++++++++++++++++
 3 files changed, 74 insertions(+)

Comments

Anshuman Khandual Nov. 19, 2018, 3:35 a.m. UTC | #1
On 11/15/2018 04:19 AM, Keith Busch wrote:
> Heterogeneous memory systems provide memory nodes with latency
> and bandwidth performance attributes that are different from other
> nodes. Create an interface for the kernel to register these attributes

There are other properties, like power consumption and reliability, which
can be associated with a particular PA range. Also, the set of properties has
to be extensible for the future.

> under the node that provides the memory. If the system provides this
> information, applications can query the node attributes when deciding
> which node to request memory from.

Right, but each (memory initiator, memory target) pair should have the
above-mentioned properties enumerated, giving 'property as seen from'
semantics.

> 
> When multiple memory initiators exist, accessing the same memory target
> from each may not perform the same. The highest performing
> initiator to a given target is considered to be a local initiator for
> that target. The kernel provides performance attributes only for the
> local initiators.

As mentioned above, the interface must enumerate a future-extensible set
of properties for each (memory initiator, memory target) pair available
on the system.

> 
> The memory's compute node should be symlinked in sysfs as one of the
> node's initiators.

Right. IIUC the first patch skips the linking process for two nodes A and
B if (A == B), preventing association with the local memory initiator.
Keith Busch Nov. 19, 2018, 3:46 p.m. UTC | #2
On Mon, Nov 19, 2018 at 09:05:07AM +0530, Anshuman Khandual wrote:
> On 11/15/2018 04:19 AM, Keith Busch wrote:
> > Heterogeneous memory systems provide memory nodes with latency
> > and bandwidth performance attributes that are different from other
> > nodes. Create an interface for the kernel to register these attributes
> 
> There are other properties, like power consumption and reliability, which
> can be associated with a particular PA range. Also, the set of properties has
> to be extensible for the future.

Sure, I'm just starting with the attributes available from HMAT. If there
are additional attributes that make sense to add, I don't see why we
can't continue appending them if this patch is okay.
 
> > under the node that provides the memory. If the system provides this
> > information, applications can query the node attributes when deciding
> > which node to request memory from.
> 
> Right, but each (memory initiator, memory target) pair should have the
> above-mentioned properties enumerated, giving 'property as seen from'
> semantics.
> 
> > 
> > When multiple memory initiators exist, accessing the same memory target
> > from each may not perform the same. The highest performing
> > initiator to a given target is considered to be a local initiator for
> > that target. The kernel provides performance attributes only for the
> > local initiators.
> 
> As mentioned above, the interface must enumerate a future-extensible set
> of properties for each (memory initiator, memory target) pair available
> on the system.

That seems less friendly to use if it forces the application to figure
out which CPU is best for a given memory node rather than just providing
that answer directly.

> > The memory's compute node should be symlinked in sysfs as one of the
> > node's initiators.
> 
> Right. IIUC the first patch skips the linking process for two nodes A and
> B if (A == B), preventing association with the local memory initiator.

Right, CPUs and memory sharing a proximity domain are assumed to be
local to each other, so we don't set up links from a node to itself.
Anshuman Khandual Nov. 22, 2018, 1:22 p.m. UTC | #3
On 11/19/2018 09:16 PM, Keith Busch wrote:
> On Mon, Nov 19, 2018 at 09:05:07AM +0530, Anshuman Khandual wrote:
>> On 11/15/2018 04:19 AM, Keith Busch wrote:
>>> Heterogeneous memory systems provide memory nodes with latency
>>> and bandwidth performance attributes that are different from other
>>> nodes. Create an interface for the kernel to register these attributes
>>
>> There are other properties, like power consumption and reliability, which
>> can be associated with a particular PA range. Also, the set of properties has
>> to be extensible for the future.
> 
> Sure, I'm just starting with the attributes available from HMAT. If there
> are additional attributes that make sense to add, I don't see why we
> can't continue appending them if this patch is okay.

As I mentioned on the other thread:

1) The interface needs to be compact to avoid a large number of files
2) A single u64 will be able to handle 8 attributes with 8-bit values
3) The 8-bit values need to be arch independent and abstracted out

I guess 8 attributes should be good enough for all types of memory in the
foreseeable future.
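
For illustration only, a packed encoding along these lines might look
like the sketch below; the helper names and field indexing are
hypothetical, not something proposed in this series:

  /* Hypothetical: pack eight 8-bit attribute values into one u64. */
  #include <linux/types.h>

  #define HMEM_ATTR_SHIFT(i)  ((i) * 8)

  static inline u64 hmem_attr_pack(u64 attrs, unsigned int i, u8 val)
  {
          /* Clear byte i, then store the new value there. */
          attrs &= ~(0xffULL << HMEM_ATTR_SHIFT(i));
          return attrs | ((u64)val << HMEM_ATTR_SHIFT(i));
  }

  static inline u8 hmem_attr_unpack(u64 attrs, unsigned int i)
  {
          return (attrs >> HMEM_ATTR_SHIFT(i)) & 0xff;
  }

The hard part of such an encoding would be defining the abstract,
arch-independent scale for the 8-bit values.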

>  
>>> under the node that provides the memory. If the system provides this
>>> information, applications can query the node attributes when deciding
>>> which node to request memory from.
>>
>> Right, but each (memory initiator, memory target) pair should have the
>> above-mentioned properties enumerated, giving 'property as seen from'
>> semantics.
>>
>>>
>>> When multiple memory initiators exist, accessing the same memory target
>>> from each may not perform the same. The highest performing
>>> initiator to a given target is considered to be a local initiator for
>>> that target. The kernel provides performance attributes only for the
>>> local initiators.
>>
>> As mentioned above, the interface must enumerate a future-extensible set
>> of properties for each (memory initiator, memory target) pair available
>> on the system.
> 
> That seems less friendly to use if it forces the application to figure
> out which CPU is best for a given memory node rather than just providing
> that answer directly.

Why? The application would just have to scan all possible values and
decide for itself. A complete set of attribute values for each pair
makes the sysfs more comprehensive and gives the application more
control over its choices.

> 
>>> The memory's compute node should be symlinked in sysfs as one of the
>>> node's initiators.
>>
>> Right. IIUC the first patch skips the linking process for two nodes A and
>> B if (A == B), preventing association with the local memory initiator.
> 
> Right, CPUs and memory sharing a proximity domain are assumed to be
> local to each other, so we don't set up links from a node to itself.

But this will be required for applications to compare correctly across
possible values from all node pairs.
Dan Williams Nov. 27, 2018, 7 a.m. UTC | #4
On Wed, Nov 14, 2018 at 2:53 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Heterogeneous memory systems provide memory nodes with latency
> and bandwidth performance attributes that are different from other
> nodes. Create an interface for the kernel to register these attributes
> under the node that provides the memory. If the system provides this
> information, applications can query the node attributes when deciding
> which node to request memory from.
>
> When multiple memory initiators exist, accessing the same memory target
> from each may not perform the same. The highest performing
> initiator to a given target is considered to be a local initiator for
> that target. The kernel provides performance attributes only for the
> local initiators.
>
> The memory's compute node should be symlinked in sysfs as one of the
> node's initiators.
>
> The following example shows the new sysfs hierarchy for a node exporting
> performance attributes:
>
>   # tree /sys/devices/system/node/nodeY/initiator_access
>   /sys/devices/system/node/nodeY/initiator_access
>   |-- read_bandwidth
>   |-- read_latency
>   |-- write_bandwidth
>   `-- write_latency

With the expectation that there will be nodes that are initiator-only,
target-only, or both, I think this interface should indicate that. The
1:1 "local" designation of HMAT should not be directly encoded in the
interface; it's just a shortcut for finding at least one initiator in
the set that can realize the advertised performance. At least if the
interface can enumerate the set of initiators, then it becomes clear
whether sysfs can answer a performance enumeration question or if the
application needs to consult an interface with specific knowledge of a
given initiator-target pairing.

It seems a precursor to these patches is arranging for offline node
devices to be created for the ACPI proximity domains that are
offline by default for reserved memory ranges.
Dan Williams Nov. 27, 2018, 5:42 p.m. UTC | #5
On Mon, Nov 26, 2018 at 11:00 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Nov 14, 2018 at 2:53 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > Heterogeneous memory systems provide memory nodes with latency
> > and bandwidth performance attributes that are different from other
> > nodes. Create an interface for the kernel to register these attributes
> > under the node that provides the memory. If the system provides this
> > information, applications can query the node attributes when deciding
> > which node to request memory from.
> >
> > When multiple memory initiators exist, accessing the same memory target
> > from each may not perform the same. The highest performing
> > initiator to a given target is considered to be a local initiator for
> > that target. The kernel provides performance attributes only for the
> > local initiators.
> >
> > The memory's compute node should be symlinked in sysfs as one of the
> > node's initiators.
> >
> > The following example shows the new sysfs hierarchy for a node exporting
> > performance attributes:
> >
> >   # tree /sys/devices/system/node/nodeY/initiator_access
> >   /sys/devices/system/node/nodeY/initiator_access
> >   |-- read_bandwidth
> >   |-- read_latency
> >   |-- write_bandwidth
> >   `-- write_latency
>
> With the expectation that there will be nodes that are initiator-only,
> target-only, or both, I think this interface should indicate that. The
> 1:1 "local" designation of HMAT should not be directly encoded in the
> interface; it's just a shortcut for finding at least one initiator in
> the set that can realize the advertised performance. At least if the
> interface can enumerate the set of initiators, then it becomes clear
> whether sysfs can answer a performance enumeration question or if the
> application needs to consult an interface with specific knowledge of a
> given initiator-target pairing.

Sorry, I misread patch1, this series does allow publishing the
multi-initiator case that shares the same performance profile to a
given target.

> It seems a precursor to these patches is arranging for offline node
> devices to be created for the ACPI proximity domains that are
> offline by default for reserved memory ranges.

Likely still need this though, because node devices don't tend to show
up until they have a CPU or online memory.
Keith Busch Nov. 27, 2018, 5:44 p.m. UTC | #6
On Mon, Nov 26, 2018 at 11:00:09PM -0800, Dan Williams wrote:
> On Wed, Nov 14, 2018 at 2:53 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > Heterogeneous memory systems provide memory nodes with latency
> > and bandwidth performance attributes that are different from other
> > nodes. Create an interface for the kernel to register these attributes
> > under the node that provides the memory. If the system provides this
> > information, applications can query the node attributes when deciding
> > which node to request memory from.
> >
> > When multiple memory initiators exist, accessing the same memory target
> > from each may not perform the same. The highest performing
> > initiator to a given target is considered to be a local initiator for
> > that target. The kernel provides performance attributes only for the
> > local initiators.
> >
> > The memory's compute node should be symlinked in sysfs as one of the
> > node's initiators.
> >
> > The following example shows the new sysfs hierarchy for a node exporting
> > performance attributes:
> >
> >   # tree /sys/devices/system/node/nodeY/initiator_access
> >   /sys/devices/system/node/nodeY/initiator_access
> >   |-- read_bandwidth
> >   |-- read_latency
> >   |-- write_bandwidth
> >   `-- write_latency
> 
> With the expectation that there will be nodes that are initiator-only,
> target-only, or both, I think this interface should indicate that. The
> 1:1 "local" designation of HMAT should not be directly encoded in the
> interface; it's just a shortcut for finding at least one initiator in
> the set that can realize the advertised performance. At least if the
> interface can enumerate the set of initiators, then it becomes clear
> whether sysfs can answer a performance enumeration question or if the
> application needs to consult an interface with specific knowledge of a
> given initiator-target pairing.
> 
> It seems a precursor to these patches is arranging for offline node
> devices to be created for the ACPI proximity domains that are
> offline by default for reserved memory ranges.

The intention is that all initiators symlinked to the memory node share
the initiator_access attributes, and the node is also its own initiator.
There's no limit to how many initiators the new kernel interface in
patch 1/7 allows you to register, so it's not really a 1:1 relationship.

Either instead of or in addition to the symlinks, we can export a
node_mask in the initiator_access directory indicating which nodes these
access attributes apply to, if that makes the intention clearer.
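
Such a node_mask would presumably sit alongside the existing attribute
files, e.g. (hypothetical, not added by this version of the series):

  /sys/devices/system/node/nodeY/initiator_access
  |-- node_mask
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency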

Patch

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index ae213ed2a7c8..2cf67c80046d 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -149,6 +149,14 @@  config DEBUG_TEST_DRIVER_REMOVE
 	  unusable. You should say N here unless you are explicitly looking to
 	  test this functionality.
 
+config HMEM
+	bool
+	default y
+	depends on NUMA
+	help
+	  Enable reporting for heterogeneous memory access attributes under
+	  their non-uniform memory nodes.
+
 source "drivers/base/test/Kconfig"
 
 config SYS_HYPERVISOR
diff --git a/drivers/base/node.c b/drivers/base/node.c
index a9b7512a9502..232535761998 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -59,6 +59,50 @@  static inline ssize_t node_read_cpulist(struct device *dev,
 static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
+#ifdef CONFIG_HMEM
+const struct attribute_group node_access_attrs_group;
+
+#define ACCESS_ATTR(name) 						\
+static ssize_t name##_show(struct device *dev,				\
+			   struct device_attribute *attr,		\
+			   char *buf)					\
+{									\
+	return sprintf(buf, "%u\n", to_node(dev)->hmem_attrs.name);	\
+}									\
+static DEVICE_ATTR_RO(name);
+
+ACCESS_ATTR(read_bandwidth)
+ACCESS_ATTR(read_latency)
+ACCESS_ATTR(write_bandwidth)
+ACCESS_ATTR(write_latency)
+
+static struct attribute *access_attrs[] = {
+	&dev_attr_read_bandwidth.attr,
+	&dev_attr_read_latency.attr,
+	&dev_attr_write_bandwidth.attr,
+	&dev_attr_write_latency.attr,
+	NULL,
+};
+
+const struct attribute_group node_access_attrs_group = {
+	.name		= "initiator_access",
+	.attrs		= access_attrs,
+};
+
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs)
+{
+	struct node *node;
+
+	if (WARN_ON_ONCE(!node_online(nid)))
+		return;
+	node = node_devices[nid];
+	node->hmem_attrs = *hmem_attrs;
+	if (sysfs_create_group(&node->dev.kobj, &node_access_attrs_group))
+		pr_info("failed to add performance attribute group to node %d\n",
+			nid);
+}
+#endif
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
diff --git a/include/linux/node.h b/include/linux/node.h
index 1fd734a3fb3f..6a1aa6a153f8 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -17,14 +17,36 @@ 
 
 #include <linux/device.h>
 #include <linux/cpumask.h>
+#include <linux/list.h>
 #include <linux/workqueue.h>
 
+#ifdef CONFIG_HMEM
+/**
+ * struct node_hmem_attrs - heterogeneous memory performance attributes
+ *
+ * @read_bandwidth:	Read bandwidth in MB/s
+ * @write_bandwidth:	Write bandwidth in MB/s
+ * @read_latency:	Read latency in nanoseconds
+ * @write_latency:	Write latency in nanoseconds
+ */
+struct node_hmem_attrs {
+	unsigned int read_bandwidth;
+	unsigned int write_bandwidth;
+	unsigned int read_latency;
+	unsigned int write_latency;
+};
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs);
+#endif
+
 struct node {
 	struct device	dev;
 
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
+#ifdef CONFIG_HMEM
+	struct node_hmem_attrs hmem_attrs;
+#endif
 };
 
 struct memory_block;
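
For context, a minimal sketch of a caller of the new interface (for
example, the ACPI HMAT parsing code added later in this series); the
function name and the attribute values below are placeholders, not taken
from this patch set:

  #include <linux/node.h>

  static void example_register_node_perf(unsigned int nid)
  {
          /* Placeholder numbers; real values would come from firmware. */
          struct node_hmem_attrs attrs = {
                  .read_bandwidth  = 1200, /* MB/s */
                  .write_bandwidth = 1000, /* MB/s */
                  .read_latency    = 120,  /* nanoseconds */
                  .write_latency   = 150,  /* nanoseconds */
          };

          node_set_perf_attrs(nid, &attrs);
  }

Note the target node must be online when this is called, since
node_set_perf_attrs() warns and returns early for offline nodes.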