[0/7] ACPI HMAT memory sysfs representation

Message ID 20181114224902.12082-1-keith.busch@intel.com (mailing list archive)

Message

Keith Busch Nov. 14, 2018, 10:49 p.m. UTC
This series provides a new sysfs representation for heterogeneous
system memory.

The previous series, which was specific to HMAT and on which this series
is based, was last posted here: https://lkml.org/lkml/2017/12/13/968

Platforms may provide multiple types of CPU-attached system memory. The
memory ranges for each type may have different characteristics that
applications may wish to know about when considering what node they want
their memory allocated from. 

It had previously been difficult to describe these setups as memory
ranges were generally lumped into the NUMA node of the CPUs. New
platform attributes have been created and are in use today to describe
the more complex memory hierarchies that can be created.

This series first creates new generic APIs under the kernel's node
representation. These new APIs can be used to create links among local
memory and compute nodes and export characteristics about the memory
nodes. Documentation describing the new representation is provided.

Finally the series adds a kernel user for these new APIs by parsing
the ACPI HMAT.
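
For illustration only, using hypothetical file names (the actual names are
defined by the patches themselves, and the read/write bandwidth and latency
attributes come from the HMAT as discussed below), the shape of such a
representation might be:

  /sys/devices/system/node/nodeX/
      <link to the local compute node(s) for this memory>
      read_bandwidth      (hypothetical attribute file)
      write_bandwidth     (hypothetical attribute file)
      read_latency        (hypothetical attribute file)
      write_latency       (hypothetical attribute file)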

Keith Busch (7):
  node: Link memory nodes to their compute nodes
  node: Add heterogenous memory performance
  doc/vm: New documentation for memory performance
  node: Add memory caching attributes
  doc/vm: New documentation for memory cache
  acpi: Create subtable parsing infrastructure
  acpi/hmat: Parse and report heterogeneous memory

 Documentation/vm/numacache.rst |  76 ++++++++
 Documentation/vm/numaperf.rst  |  71 ++++++++
 drivers/acpi/Kconfig           |   9 +
 drivers/acpi/Makefile          |   1 +
 drivers/acpi/hmat.c            | 384 +++++++++++++++++++++++++++++++++++++++++
 drivers/acpi/tables.c          |  85 +++++++--
 drivers/base/Kconfig           |   8 +
 drivers/base/node.c            | 193 +++++++++++++++++++++
 include/linux/node.h           |  47 +++++
 9 files changed, 864 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/vm/numacache.rst
 create mode 100644 Documentation/vm/numaperf.rst
 create mode 100644 drivers/acpi/hmat.c

Comments

Anshuman Khandual Nov. 16, 2018, 6:27 a.m. UTC | #1
On 11/15/2018 04:19 AM, Keith Busch wrote:
> This series provides a new sysfs representation for heterogeneous
> system memory.
> 
> The previous series, which was specific to HMAT and on which this series
> is based, was last posted here: https://lkml.org/lkml/2017/12/13/968
> 
> Platforms may provide multiple types of CPU-attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and are in use today to describe
> the more complex memory hierarchies that can be created.
> 
> This series first creates new generic APIs under the kernel's node
> representation. These new APIs can be used to create links among local
> memory and compute nodes and export characteristics about the memory
> nodes. Documentation describing the new representation is provided.
> 
> Finally the series adds a kernel user for these new APIs by parsing
> the ACPI HMAT.

I am not able to see the patches from this series either on the list or
in the archive (https://lkml.org/lkml/2018/11/15/331). IIRC, last time we
discussed this, the concern I raised was that in the absence of a broader
NUMA rework for multi-attribute memory it might not be a good idea to
settle down and freeze a sysfs interface for user space.
Keith Busch Nov. 16, 2018, 3:51 p.m. UTC | #2
On Fri, Nov 16, 2018 at 11:57:58AM +0530, Anshuman Khandual wrote:
> On 11/15/2018 04:19 AM, Keith Busch wrote:
> > This series provides a new sysfs representation for heterogeneous
> > system memory.
> > 
> > The previous series, which was specific to HMAT and on which this series
> > is based, was last posted here: https://lkml.org/lkml/2017/12/13/968
> > 
> > Platforms may provide multiple types of CPU-attached system memory. The
> > memory ranges for each type may have different characteristics that
> > applications may wish to know about when considering what node they want
> > their memory allocated from. 
> > 
> > It had previously been difficult to describe these setups as memory
> > ranges were generally lumped into the NUMA node of the CPUs. New
> > platform attributes have been created and are in use today to describe
> > the more complex memory hierarchies that can be created.
> > 
> > This series first creates new generic APIs under the kernel's node
> > representation. These new APIs can be used to create links among local
> > memory and compute nodes and export characteristics about the memory
> > nodes. Documentation describing the new representation is provided.
> > 
> > Finally the series adds a kernel user for these new APIs by parsing
> > the ACPI HMAT.
> 
> I am not able to see the patches from this series either on the list or
> in the archive (https://lkml.org/lkml/2018/11/15/331).

The send-email run split the cover letter from the series, probably
something I did. The series followed immediately after:

  https://lkml.org/lkml/2018/11/15/332

> IIRC, last time we discussed this, the concern I raised was that in the
> absence of a broader NUMA rework for multi-attribute memory it might not
> be a good idea to settle down and freeze a sysfs interface for user space.
Dave Hansen Nov. 16, 2018, 4:55 p.m. UTC | #3
On 11/15/18 10:27 PM, Anshuman Khandual wrote:
> I am not able to see the patches from this series either on the list or
> in the archive (https://lkml.org/lkml/2018/11/15/331). IIRC, last time we
> discussed this, the concern I raised was that in the absence of a broader
> NUMA rework for multi-attribute memory it might not be a good idea to
> settle down and freeze a sysfs interface for user space.

This *is* the broader NUMA rework.  I think it's just a bit more
incremental than what you originally had in mind.

Did you have an alternative for how you wanted this to look?
Anshuman Khandual Nov. 19, 2018, 1:52 a.m. UTC | #4
On 11/16/2018 09:21 PM, Keith Busch wrote:
> On Fri, Nov 16, 2018 at 11:57:58AM +0530, Anshuman Khandual wrote:
>> On 11/15/2018 04:19 AM, Keith Busch wrote:
>>> This series provides a new sysfs representation for heterogeneous
>>> system memory.
>>>
>>> The previous series, which was specific to HMAT and on which this series
>>> is based, was last posted here: https://lkml.org/lkml/2017/12/13/968
>>>
>>> Platforms may provide multiple types of CPU-attached system memory. The
>>> memory ranges for each type may have different characteristics that
>>> applications may wish to know about when considering what node they want
>>> their memory allocated from. 
>>>
>>> It had previously been difficult to describe these setups as memory
>>> ranges were generally lumped into the NUMA node of the CPUs. New
>>> platform attributes have been created and are in use today to describe
>>> the more complex memory hierarchies that can be created.
>>>
>>> This series first creates new generic APIs under the kernel's node
>>> representation. These new APIs can be used to create links among local
>>> memory and compute nodes and export characteristics about the memory
>>> nodes. Documentation describing the new representation is provided.
>>>
>>> Finally the series adds a kernel user for these new APIs by parsing
>>> the ACPI HMAT.
>>
>> I am not able to see the patches from this series either on the list or
>> in the archive (https://lkml.org/lkml/2018/11/15/331).
> 
> The send-email run split the cover letter from the series, probably
> something I did. The series followed immediately after:
> 
>   https://lkml.org/lkml/2018/11/15/332

Yeah, got it. I can see the series on the list. Thanks for pointing it out.

> 
>> IIRC, last time we discussed this, the concern I raised was that in the
>> absence of a broader NUMA rework for multi-attribute memory it might not
>> be a good idea to settle down and freeze a sysfs interface for user space.
>
Anshuman Khandual Nov. 19, 2018, 5:44 a.m. UTC | #5
On 11/16/2018 10:25 PM, Dave Hansen wrote:
> On 11/15/18 10:27 PM, Anshuman Khandual wrote:
>> I am not able to see the patches from this series either on the list or
>> in the archive (https://lkml.org/lkml/2018/11/15/331). IIRC, last time we
>> discussed this, the concern I raised was that in the absence of a broader
>> NUMA rework for multi-attribute memory it might not be a good idea to
>> settle down and freeze a sysfs interface for user space.
> 

Hello Dave,

> This *is* the broader NUMA rework.  I think it's just a bit more
> incremental than what you originally had in mind.

IIUC NUMA re-work in principle involves these functional changes

1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
3. Changing core MM to accommodate multi attribute memory (long term)

The first two sets of changes can get user space applications
moving by identifying the right nodes and their attributes through
the sysfs interface.

> 
> Did you have an alternative for how you wanted this to look?
> 

No. I did not get enough time this year to rework the original
proposal I had. But I will be able to help here to make this interface
more generic and to abstract out these properties so that it is
extensible in the future.

- Anshuman
Dave Hansen Nov. 19, 2018, 5:37 p.m. UTC | #6
On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> IIUC NUMA re-work in principle involves these functional changes
> 
> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)

This patch set _does_ that, though.

> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)

It does that as well (a subset at least).

It sounds like the subset that's being exposed is insufficient for you.
We did that because we think doing anything but a subset in sysfs will
just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
attributes, that's at _least_ 1024*1024*4 files if we expose *all*
combinations.

Do we agree that sysfs is unsuitable for exposing attributes in this manner?
Anshuman Khandual Nov. 22, 2018, 11:52 a.m. UTC | #7
On 11/19/2018 11:07 PM, Dave Hansen wrote:
> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
>> IIUC NUMA re-work in principle involves these functional changes
>>
>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
> 
> This patch set _does_ that, though.
> 
>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
> 
> It does that as well (a subset at least).
> 
> It sounds like the subset that's being exposed is insufficient for you.
> We did that because we think doing anything but a subset in sysfs will
> just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> combinations.
Each permutation need not be a separate file inside all possible node X
(/sys/devices/system/node/nodeX) directories. It can be a top-level file
enumerating the various attribute values for a given (X, Y) node pair,
based on an offset, something like /proc/pid/pagemap.

> 
> Do we agree that sysfs is unsuitable for exposing attributes in this manner?
> 

Yes, for individual files. But this can be worked around with offset-based
access from a top-level global attributes file as mentioned above. Is there
any particular advantage to using individual files for each given attribute?
I was wondering whether a single unsigned long (u64) would be able to pack
8 different attributes, where each individual attribute value can be
abstracted out in 8 bits.
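
For illustration, a minimal userspace sketch of this scheme, assuming a
hypothetical top-level binary file laid out as a MAX_NODES x MAX_NODES
matrix of u64 entries, each packing eight 8-bit attribute values (neither
the file name nor the layout exists; both are illustrative):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_NODES 1024	/* matches the MAX_NUMNODES figure discussed above */

/* Read the packed attribute word for one (initiator, target) node pair. */
static int read_pair_attrs(int fd, int initiator, int target, uint64_t *out)
{
	off_t offset = ((off_t)initiator * MAX_NODES + target) * sizeof(*out);

	if (pread(fd, out, sizeof(*out), offset) != sizeof(*out))
		return -1;
	return 0;
}

int main(void)
{
	/* Hypothetical file; no such file exists today. */
	int fd = open("/sys/devices/system/node/attributes", O_RDONLY);
	uint64_t word;

	if (fd < 0 || read_pair_attrs(fd, 4, 8, &word))
		return 1;

	/* Unpack the eight 8-bit attribute fields from the u64. */
	for (int i = 0; i < 8; i++)
		printf("attr %d = %u\n", i, (unsigned)((word >> (i * 8)) & 0xff));

	close(fd);
	return 0;
}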
Dave Hansen Nov. 22, 2018, 6:01 p.m. UTC | #8
On 11/22/18 3:52 AM, Anshuman Khandual wrote:
>>
>> It sounds like the subset that's being exposed is insufficient for you.
>> We did that because we think doing anything but a subset in sysfs will
>> just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
>> combinations.
> Each permutation need not be a separate file inside all possible node X
> (/sys/devices/system/node/nodeX) directories. It can be a top-level file
> enumerating the various attribute values for a given (X, Y) node pair,
> based on an offset, something like /proc/pid/pagemap.

My assumption has been that this kind of thing is too fancy for sysfs:

Documentation/filesystems/sysfs.txt:
> Attributes should be ASCII text files, preferably with only one value
> per file. It is noted that it may not be efficient to contain only one
> value per file, so it is socially acceptable to express an array of
> values of the same type. 
> 
> Mixing types, expressing multiple lines of data, and doing fancy
> formatting of data is heavily frowned upon. Doing these things may get
> you publicly humiliated and your code rewritten without notice. 

/proc/pid/pagemap is binary, not one-value-per-file and relatively
complicated to parse.

Do you really think following something like pagemap is the right model
for sysfs?

BTW, I'm not saying we don't need *some* interface like you propose.  We
almost certainly will at some point.  I just don't think it will be in
sysfs.
Dan Williams Nov. 22, 2018, 6:08 p.m. UTC | #9
On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 11/19/2018 11:07 PM, Dave Hansen wrote:
> > On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> >> IIUC NUMA re-work in principle involves these functional changes
> >>
> >> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
> >
> > This patch set _does_ that, though.
> >
> >> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
> >
> > It does that as well (a subset at least).
> >
> > It sounds like the subset that's being exposed is insufficient for you.
> > We did that because we think doing anything but a subset in sysfs will
> > just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
> > attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> > combinations.
> Each permutation need not be a separate file inside all possible node X
> (/sys/devices/system/node/nodeX) directories. It can be a top-level file
> enumerating the various attribute values for a given (X, Y) node pair,
> based on an offset, something like /proc/pid/pagemap.
>
> >
> > Do we agree that sysfs is unsuitable for exposing attributes in this manner?
> >
>
> Yes, for individual files. But this can be worked around with offset-based
> access from a top-level global attributes file as mentioned above. Is there
> any particular advantage to using individual files for each given attribute?
> I was wondering whether a single unsigned long (u64) would be able to pack
> 8 different attributes, where each individual attribute value can be
> abstracted out in 8 bits.

sysfs has a 4K limit, and in general I don't think there is much
incremental value to go describe the entirety of the system from sysfs
or anywhere else in the kernel for that matter. It's simply too much
information to reasonably consume. Instead the kernel can describe the
coarse boundaries and some semblance of "best" access initiator for a
given target. That should cover the "80%" case of what applications
want to discover, for the other "20%" we likely need some userspace
library that can go parse these platform specific information sources
and supplement the kernel view. I also think a simpler kernel starting
point gives us room to go pull in more commonly used attributes if it
turns out they are useful, and avoid going down the path of exporting
attributes that have questionable value in practice.
Anshuman Khandual Nov. 23, 2018, 6:42 a.m. UTC | #10
On 11/22/2018 11:31 PM, Dave Hansen wrote:
> On 11/22/18 3:52 AM, Anshuman Khandual wrote:
>>>
>>> It sounds like the subset that's being exposed is insufficient for you.
>>> We did that because we think doing anything but a subset in sysfs will
>>> just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
>>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
>>> combinations.
>> Each permutation need not be a separate file inside all possible node X
>> (/sys/devices/system/node/nodeX) directories. It can be a top-level file
>> enumerating the various attribute values for a given (X, Y) node pair,
>> based on an offset, something like /proc/pid/pagemap.
> 
> My assumption has been that this kind of thing is too fancy for sysfs:

Applications need to know the matrix of multi-attribute properties as
seen from various memory accessors/initiators to be able to bind
themselves to the desired CPUs and memory. That gives applications a true
view of a heterogeneous system. While I understand your concern here
about sysfs (which can probably be worked around with multiple global
files, if the size is a problem, etc.), an insufficient interface is
definitely problematic in the longer term. This is going to be an ABI
which is locked in for good. Hence, even if it might appear to be
over-engineering at the moment, IMHO it is the right thing to do.

> 
> Documentation/filesystems/sysfs.txt:
>> Attributes should be ASCII text files, preferably with only one value
>> per file. It is noted that it may not be efficient to contain only one
>> value per file, so it is socially acceptable to express an array of
>> values of the same type. 
>>
>> Mixing types, expressing multiple lines of data, and doing fancy
>> formatting of data is heavily frowned upon. Doing these things may get
>> you publicly humiliated and your code rewritten without notice. 
> 
> /proc/pid/pagemap is binary, not one-value-per-file and relatively
> complicated to parse.

I agree, but it does provide user space really valuable information about
the faulted pages in its VA space. Was there any better way of getting
it? Maybe, but at this point in time it is essential.

> 
> Do you really think following something like pagemap is the right model
> for sysfs?
> 
> BTW, I'm not saying we don't need *some* interface like you propose.  We
> almost certainly will at some point.  I just don't think it will be in
> sysfs.

I am not saying that doing this in sysfs is very elegant. I would rather
have a syscall read back a (MAX_NODES * MAX_NODES * u64) attribute matrix
from the kernel. Probably a subset of that information can appear in sysfs
to speed up queries for various optimizations, as Keith mentioned before.
But we will have to first evaluate and come to an agreement on what
constitutes a comprehensive set of multi-attribute properties. Are we
willing to go in the direction of including a new system call, with a
subset of it appearing in sysfs, etc.? My primary concern is not how the
attribute information appears in sysfs but the lack of its completeness.
Anshuman Khandual Nov. 23, 2018, 7:10 a.m. UTC | #11
On 11/22/2018 11:38 PM, Dan Williams wrote:
> On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 11/19/2018 11:07 PM, Dave Hansen wrote:
>>> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
>>>> IIUC NUMA re-work in principle involves these functional changes
>>>>
>>>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
>>>
>>> This patch set _does_ that, though.
>>>
>>>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
>>>
>>> It does that as well (a subset at least).
>>>
>>> It sounds like the subset that's being exposed is insufficient for you.
>>> We did that because we think doing anything but a subset in sysfs will
>>> just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
>>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
>>> combinations.
>> Each permutation need not be a separate file inside all possible node X
>> (/sys/devices/system/node/nodeX) directories. It can be a top-level file
>> enumerating the various attribute values for a given (X, Y) node pair,
>> based on an offset, something like /proc/pid/pagemap.
>>
>>>
>>> Do we agree that sysfs is unsuitable for exposing attributes in this manner?
>>>
>>
> >> Yes, for individual files. But this can be worked around with offset-based
> >> access from a top-level global attributes file as mentioned above. Is there
> >> any particular advantage to using individual files for each given attribute?
> >> I was wondering whether a single unsigned long (u64) would be able to pack
> >> 8 different attributes, where each individual attribute value can be
> >> abstracted out in 8 bits.
> 
> sysfs has a 4K limit, and in general I don't think there is much
> incremental value to go describe the entirety of the system from sysfs
> or anywhere else in the kernel for that matter. It's simply too much
> information to reasonably consume. Instead the kernel can describe the

I agree that it may be a lot of information to parse, but it is crucial
for any task on a heterogeneous system to evaluate (and probably
re-evaluate, if the task moves around) its memory and CPU binding at
runtime to make sure it has got the right one.

> coarse boundaries and some semblance of "best" access initiator for a
> given target. That should cover the "80%" case of what applications

The current proposal just assumes that the best one is the nearest one.
This may be true for bandwidth and latency but may not be true for some
other properties. This assumption should not be baked in while defining
a new ABI.

> want to discover, for the other "20%" we likely need some userspace
> library that can go parse these platform specific information sources
> and supplement the kernel view. I also think a simpler kernel starting
> point gives us room to go pull in more commonly used attributes if it
> turns out they are useful, and avoid going down the path of exporting
> attributes that have questionable value in practice.
> 

Applications can just query platform information right now and use it
for mbind() without requiring this new interface. We are not even
changing any core MM yet. So if it's just about identifying a node's
memory properties, they can be scanned from the platform itself. I agree
we would like the kernel to start adding interfaces for multi-attribute
memory, but all I am saying is that it has to be comprehensive. Some of
the attributes have more usefulness now and some have less, but the new
ABI interface has to accommodate exporting all of them.
Dan Williams Nov. 23, 2018, 5:15 p.m. UTC | #12
On Thu, Nov 22, 2018 at 11:11 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 11/22/2018 11:38 PM, Dan Williams wrote:
> > On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >>
> >>
> >> On 11/19/2018 11:07 PM, Dave Hansen wrote:
> >>> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> >>>> IIUC NUMA re-work in principle involves these functional changes
> >>>>
> >>>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
> >>>
> >>> This patch set _does_ that, though.
> >>>
> >>>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
> >>>
> >>> It does that as well (a subset at least).
> >>>
> >>> It sounds like the subset that's being exposed is insufficient for you.
> >>> We did that because we think doing anything but a subset in sysfs will
> >>> just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
> >>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> >>> combinations.
> >> Each permutation need not be a separate file inside all possible node X
> >> (/sys/devices/system/node/nodeX) directories. It can be a top-level file
> >> enumerating the various attribute values for a given (X, Y) node pair,
> >> based on an offset, something like /proc/pid/pagemap.
> >>
> >>>
> >>> Do we agree that sysfs is unsuitable for exposing attributes in this manner?
> >>>
> >>
> >> Yes, for individual files. But this can be worked around with offset-based
> >> access from a top-level global attributes file as mentioned above. Is there
> >> any particular advantage to using individual files for each given attribute?
> >> I was wondering whether a single unsigned long (u64) would be able to pack
> >> 8 different attributes, where each individual attribute value can be
> >> abstracted out in 8 bits.
> >
> > sysfs has a 4K limit, and in general I don't think there is much
> > incremental value to go describe the entirety of the system from sysfs
> > or anywhere else in the kernel for that matter. It's simply too much
> > information to reasonably consume. Instead the kernel can describe the
>
> I agree that it may be a lot of information to parse, but it is crucial
> for any task on a heterogeneous system to evaluate (and probably
> re-evaluate, if the task moves around) its memory and CPU binding at
> runtime to make sure it has got the right one.

Can you provide some more evidence for this statement? It seems that
not many applications even care about basic NUMA, let alone specific
memory targeting, at least according to libnumactl users.

     dnf repoquery --whatrequires numactl-libs

The kernel is the arbiter of memory, something is broken if
applications *need* to take on this responsibility. Yes, there will be
applications that want to tune and override the default kernel
behavior, but this is the exception, not the rule. The applications
that tend to care about specific memories also tend to be purpose
built for a given platform, and that lessens their reliance on the
kernel to enumerate all properties.

> > coarse boundaries and some semblance of "best" access initiator for a
> > given target. That should cover the "80%" case of what applications
>
> The current proposal just assumes that the best one is the nearest one.
> This may be true for bandwidth and latency but may not be true for some
> other properties. This assumption should not be baked in while defining
> a new ABI.

In fact, I tend to agree with you, but in my opinion that's an
argument to expose even less, not more. If we start with something
minimal that can be extended over time we lessen the risk of over
exposing details that don't matter in practice.

We're in the middle of a bit of a disaster with the VmFlags export in
/proc/$pid/smaps precisely because the implementation was too
comprehensive and applications started depending on details that the
kernel does not want to guarantee going forward. So there is a real
risk of being too descriptive in an interface design.

> > want to discover, for the other "20%" we likely need some userspace
> > library that can go parse these platform specific information sources
> > and supplement the kernel view. I also think a simpler kernel starting
> > point gives us room to go pull in more commonly used attributes if it
> > turns out they are useful, and avoid going down the path of exporting
> > attributes that have questionable value in practice.
> >
>
> Applications can just query platform information right now and use it
> for mbind() without requiring this new interface.

No, they can't today, at least not for the topology details that HMAT
is describing. The platform-firmware to numa-node translation is
currently not complete. At a minimum we need a listing of initiator
ids and target ids. For an ACPI platform that is the proximity-domain
to numa-node-id translation information. Once that translation is in
place then a userspace library can consult the platform-specific
information sources to translate the platform-firmware view to the
Linux handles for those memories. Am I missing the library that does
this today?

> We are not even
> changing any core MM yet. So if it's just about identifying a node's
> memory properties, they can be scanned from the platform itself. I agree
> we would like the kernel to start adding interfaces for multi-attribute
> memory, but all I am saying is that it has to be comprehensive. Some of
> the attributes have more usefulness now and some have less, but the new
> ABI interface has to accommodate exporting all of them.

I get the sense we are talking past each other, can you give the next
level of detail on that "has to be comprehensive" statement?
Dave Hansen Nov. 23, 2018, 7:21 p.m. UTC | #13
On 11/22/18 10:42 PM, Anshuman Khandual wrote:
> Are we willing to go in the direction of including a new system
> call, with a subset of it appearing in sysfs, etc.? My primary concern
> is not how the attribute information appears in sysfs but the lack of
> its completeness.

A new system call makes total sense to me.  I have the same concern
about the completeness of what's exposed in sysfs, I just don't see a
_route_ to completeness with sysfs itself.  Thus, the minimalist
approach as a first step.
Dan Williams Nov. 23, 2018, 9:13 p.m. UTC | #14
On Fri, Nov 23, 2018 at 11:21 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
> > Are we willing to go in the direction of including a new system
> > call, with a subset of it appearing in sysfs, etc.? My primary concern
> > is not how the attribute information appears in sysfs but the lack of
> > its completeness.
>
> A new system call makes total sense to me.  I have the same concern
> about the completeness of what's exposed in sysfs, I just don't see a
> _route_ to completeness with sysfs itself.  Thus, the minimalist
> approach as a first step.

Outside of platform-firmware-id to Linux-numa-node-id what other
userspace API infrastructure does the kernel need to provide? It seems
userspace enumeration of memory attributes is fully enabled once the
firmware-to-Linux identification is established.
Anshuman Khandual Nov. 26, 2018, 3:38 p.m. UTC | #15
On 11/24/2018 12:51 AM, Dave Hansen wrote:
> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
>> Are we willing to go in the direction of including a new system
>> call, with a subset of it appearing in sysfs, etc.? My primary concern
>> is not how the attribute information appears in sysfs but the lack of
>> its completeness.
> 
> A new system call makes total sense to me.  I have the same concern
> about the completeness of what's exposed in sysfs, I just don't see a
> _route_ to completeness with sysfs itself.  Thus, the minimalist
> approach as a first step.

Okay. If we agree on the need for a new specific system call extracting
the superset of attribute information, MAX_NUMNODES * MAX_NUMNODES * u64
(where the u64 packs 8-bit values for 8 attributes, or something like
that) as we had discussed before, it makes sense to export a subset of
it which can be faster yet still useful for user space without going
through a system call. Do you agree on a (system call + sysfs) approach
in principle? Also, the sysfs-exported information has to be derived from
what is available through the system call, not the other way round. Hence
the starting point has to be the system call definition.
Anshuman Khandual Nov. 26, 2018, 3:52 p.m. UTC | #16
On 11/24/2018 02:43 AM, Dan Williams wrote:
> On Fri, Nov 23, 2018 at 11:21 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
>>> Are we willing to go in the direction of including a new system
>>> call, with a subset of it appearing in sysfs, etc.? My primary concern
>>> is not how the attribute information appears in sysfs but the lack of
>>> its completeness.
>>
>> A new system call makes total sense to me.  I have the same concern
>> about the completeness of what's exposed in sysfs, I just don't see a
>> _route_ to completeness with sysfs itself.  Thus, the minimalist
>> approach as a first step.
> 
> Outside of platform-firmware-id to Linux-numa-node-id what other
> userspace API infrastructure does the kernel need to provide? It seems
> userspace enumeration of memory attributes is fully enabled once the
> firmware-to-Linux identification is established.

That is true if user space is required to probe the memory attribute
values for the platform-firmware-id from the platform and then request
the required memory from the corresponding Linux-numa-node-id via standard
mm interfaces like mbind(). But in this patch series we are not mapping
platform-firmware-id to Linux-numa-node-id. We are exporting properties
applicable to Linux nodes (Linux-numa-node-id).

Even if a platform-firmware-id to Linux-numa-node-id mapping is required,
it can be done through a new file like the following. Applications can
just take the node's platform_id and query the platform about its
properties.

/sys/devices/system/node/nodeX/platform_id

The above interface would have been okay, as it's just an extension of
the existing node information in sysfs. But that's not the case with
this proposal.
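
For what it's worth, consuming such a file from userspace would be
trivial. A minimal sketch, assuming the hypothetical platform_id file
proposed above actually existed:

#include <stdio.h>

/* Read the proposed (hypothetical) platform_id file for one node. */
static int print_platform_id(int node)
{
	char path[64];
	unsigned long id;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/platform_id", node);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* node absent, or file not implemented */
	if (fscanf(f, "%lu", &id) == 1)
		printf("node %d -> platform id %lu\n", node, id);
	fclose(f);
	return 0;
}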
Dave Hansen Nov. 26, 2018, 4:42 p.m. UTC | #17
On 11/23/18 1:13 PM, Dan Williams wrote:
>> A new system call makes total sense to me.  I have the same concern
>> about the completeness of what's exposed in sysfs, I just don't see a
>> _route_ to completeness with sysfs itself.  Thus, the minimalist
>> approach as a first step.
> Outside of platform-firmware-id to Linux-numa-node-id what other
> userspace API infrastructure does the kernel need to provide? It seems
> userspace enumeration of memory attributes is fully enabled once the
> firmware-to-Linux identification is established.

It would be nice not to have each app need to know about each specific
platform's firmware.
Dave Hansen Nov. 26, 2018, 5:20 p.m. UTC | #18
On 11/26/18 7:38 AM, Anshuman Khandual wrote:
> On 11/24/2018 12:51 AM, Dave Hansen wrote:
>> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
>>> Are we willing to go in the direction of including a new system
>>> call, with a subset of it appearing in sysfs, etc.? My primary concern
>>> is not how the attribute information appears in sysfs but the lack of
>>> its completeness.
>>
>> A new system call makes total sense to me.  I have the same concern
>> about the completeness of what's exposed in sysfs, I just don't see a
>> _route_ to completeness with sysfs itself.  Thus, the minimalist
>> approach as a first step.
> 
> Okay. If we agree on the need for a new specific system call extracting
> the superset of attribute information, MAX_NUMNODES * MAX_NUMNODES * u64
> (where the u64 packs 8-bit values for 8 attributes, or something like
> that) as we had discussed before, it makes sense to export a subset of
> it which can be faster yet still useful for user space without going
> through a system call.

The information that needs to be exported is a bit more than that.  It's
not just a binary attribute.

The information we have from the new ACPI table, for instance, is the
read and write bandwidth and latency between two nodes.  They are, IIRC,
two-byte values in the ACPI table[1], each.  That's 8 bytes worth of
data right there, which wouldn't fit *anything* else.

The list of things we want to export will certainly grow.  That means we
need a syscall something like this:

int get_mem_attribute(unsigned long attribute_nr,
		      unsigned long __user * initiator_nmask,
		      unsigned long __user * target_nmask,
		      unsigned long maxnode,
		      unsigned long *attributes_out);

#define MEM_ATTR_READ_BANDWIDTH		1
#define MEM_ATTR_WRITE_BANDWIDTH	2
#define MEM_ATTR_READ_LATENCY		3
#define MEM_ATTR_WRITE_LATENCY		4
#define MEM_ATTR_ENCRYPTION		5

If you want to know the read latency between nodes 4 and 8, you do:

	ret = get_mem_attr(MEM_ATTR_READ_LATENCY,
			   (1<<4), (1<<8), max, &array);

And the answer shows up at array[0] in this example.  If you had more
than one bit set in the two nmasks, you would have a longer array.

The length of the array is the number of bits set in initiator_nmask *
the number of bits set in target_nmask * sizeof(ulong).

This has the advantage of supporting ULONG_MAX attributes, and scales
from asking for one attribute at a time all the way up to dumping the
entire system worth of data for a single attribute.  The only downside
is that it's one syscall per attribute instead of packing them all
together.  But, if we have a small enough number to pack them in one
ulong, then I think we can make 64 syscalls without too much trouble.
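
To make the calling convention concrete, here is a userspace sketch with
the nodemasks spelled out as bitmap arrays rather than the (1<<4)/(1<<8)
shorthand above. The syscall itself is only a proposal, so the call is
shown in a comment:

#define MAX_NODE	1024
#define BITS_PER_LONG	(8 * (int)sizeof(unsigned long))
#define LONGS(bits)	(((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

#define MEM_ATTR_READ_LATENCY	3	/* from the list above */

static void node_set(unsigned long *mask, int node)
{
	mask[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
}

int main(void)
{
	unsigned long initiators[LONGS(MAX_NODE)] = { 0 };
	unsigned long targets[LONGS(MAX_NODE)] = { 0 };
	unsigned long out[1];	/* 1 initiator x 1 target -> 1 entry */

	node_set(initiators, 4);	/* ask about initiator node 4 ... */
	node_set(targets, 8);		/* ... accessing target node 8 */

	/*
	 * Hypothetical; get_mem_attribute() does not exist yet:
	 *
	 * long ret = get_mem_attribute(MEM_ATTR_READ_LATENCY, initiators,
	 *			        targets, MAX_NODE, out);
	 *
	 * On success, out[0] would hold the node-4 -> node-8 read latency.
	 */
	(void)out;
	return 0;
}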

>> Do you agree on a (system call + sysfs) approach in principle?
>> Also, the sysfs-exported information has to be derived from what is
>> available through the system call, not the other way round. Hence the
>> starting point has to be the system call definition.

Both the sysfs information *and* what will be exported in any future
interfaces are derived from platform-specific information.  They are not
derived from one _interface_ or the other.

They obviously need to be consistent, though.

1. See "Table 5-142 System Locality Latency and Bandwidth Information
Structure" here:
http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
Dan Williams Nov. 26, 2018, 6:08 p.m. UTC | #19
On Mon, Nov 26, 2018 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/23/18 1:13 PM, Dan Williams wrote:
> >> A new system call makes total sense to me.  I have the same concern
> >> about the completeness of what's exposed in sysfs, I just don't see a
> >> _route_ to completeness with sysfs itself.  Thus, the minimalist
> >> approach as a first step.
> > Outside of platform-firmware-id to Linux-numa-node-id what other
> > userspace API infrastructure does the kernel need to provide? It seems
> > userspace enumeration of memory attributes is fully enabled once the
> > firmware-to-Linux identification is established.
>
> It would be nice not to have each app need to know about each specific
> platform's firmware.

The app wouldn't need to know if it uses a common library. Whether the
library calls into the kernel or not is an implementation detail. If
it is information that only the app cares about and the kernel does
not consume, why have a syscall?
Anshuman Khandual Nov. 27, 2018, 9:32 a.m. UTC | #20
On 11/26/2018 10:50 PM, Dave Hansen wrote:
> On 11/26/18 7:38 AM, Anshuman Khandual wrote:
>> On 11/24/2018 12:51 AM, Dave Hansen wrote:
>>> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
>>>> Are we willing to go in the direction of including a new system
>>>> call, with a subset of it appearing in sysfs, etc.? My primary concern
>>>> is not how the attribute information appears in sysfs but the lack of
>>>> its completeness.
>>>
>>> A new system call makes total sense to me.  I have the same concern
>>> about the completeness of what's exposed in sysfs, I just don't see a
>>> _route_ to completeness with sysfs itself.  Thus, the minimalist
>>> approach as a first step.
>>
>> Okay. If we agree on the need for a new specific system call extracting
>> the superset of attribute information, MAX_NUMNODES * MAX_NUMNODES * u64
>> (where the u64 packs 8-bit values for 8 attributes, or something like
>> that) as we had discussed before, it makes sense to export a subset of
>> it which can be faster yet still useful for user space without going
>> through a system call.
> 
> The information that needs to be exported is a bit more than that.  It's
> not just a binary attribute.

Right, it won't be binary because it would contain a value for an attribute.

> 
> The information we have from the new ACPI table, for instance, is the
> read and write bandwidth and latency between two nodes.  They are, IIRC,
> two-byte values in the ACPI table[1], each.  That's 8 bytes worth of
> data right there, which wouldn't fit *anything* else.

Hmm, I get your point. We would need both interfaces, system call and
sysfs, where the number of attributes and the bit field containing the
value for any attribute can grow in the future while preserving backward
compatibility.

> 
> The list of things we want to export will certainly grow.  That means we
> need a syscall something like this:
> 
> int get_mem_attribute(unsigned long attribute_nr,
> 		      unsigned long __user * initiator_nmask,
> 		      unsigned long __user * target_nmask,
> 		      unsigned long maxnode,
> 		      unsigned long *attributes_out);

Agreed. I was also thinking that something like the above syscall
interface works, where attribute_nr can grow as an enum with
MAX_MEM_ATTRIBUTES increasing while still keeping the previous order
intact for backward compatibility. But I guess we would need to pass the
size of an attribute structure (UAPI like perf_event_attr) so that it can
grow further while the structure packing order is maintained for backward
compatibility.

int get_mem_attribute(unsigned long attribute_nr,
		      unsigned long __user * initiator_nmask,
		      unsigned long __user * target_nmask,
		      unsigned long maxnode,
 		      unsigned long *attributes_out,
		      size_t attribute_size);
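
For reference, the perf_event_attr pattern alluded to here looks roughly
like this (a hypothetical sketch, not proposed UAPI): userspace records the
structure size it was compiled against, so the kernel can accept older,
shorter layouts and reject or zero-extend newer ones:

#include <stdint.h>
#include <string.h>

/* Hypothetical UAPI struct, versioned by size like perf_event_attr. */
struct mem_attr {
	uint32_t size;		/* sizeof(struct mem_attr) userspace built with */
	uint32_t attribute_nr;	/* which attribute is being requested */
	uint64_t flags;
	/* New fields are only ever appended, keeping old offsets stable. */
};

static void mem_attr_init(struct mem_attr *attr, uint32_t attribute_nr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);	/* lets the kernel detect the ABI version */
	attr->attribute_nr = attribute_nr;
}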

 
> 
> #define MEM_ATTR_READ_BANDWIDTH		1
> #define MEM_ATTR_WRITE_BANDWIDTH	2
> #define MEM_ATTR_READ_LATENCY		3
> #define MEM_ATTR_WRITE_LATENCY	2
> #define MEM_ATTR_ENCRYPTION		5
> 
> If you want to know the read latency between nodes 4 and 8, you do:
> 
> 	ret = get_mem_attr(MEM_ATTR_READ_LATENCY,
> 			   (1<<4), (1<<8), max, &array);
> 
> And the answer shows up at array[0] in this example.  If you had more
> than one bit set in the two nmasks, you would have a longer array.
> 
> The length of the array is the number of bits set in initiator_nmask *
> the number of bits set in target_nmask * sizeof(ulong).

Right. Hmm, I guess now that the interface requests a single attribute,
it does not have to worry about a structure for the attribute field. A
single unsigned long should be enough to hold the value for any given
attribute, and it also does not have to worry much about compatibility.
This is better.

> 
> This has the advantage of supporting ULONG_MAX attributes, and scales

Right.

> from asking for one attribute at a time all the way up to dumping the
> entire system worth of data for a single attribute.  The only downside

Right.

> is that it's one syscall per attribute instead of packing them all
> together.  But, if we have a small enough number to pack them in one
> ulong, then I think we can make 64 syscalls without too much trouble.

I agree. It also enables a single attribute to carry a value up to
ULONG_MAX and avoids compatibility issues with the packing order when
multiple attributes are requested together. This is definitely a cleaner
interface.

> 
>> Do you agree on a (system call + sysfs) approach in principle?
>> Also, the sysfs-exported information has to be derived from what is
>> available through the system call, not the other way round. Hence the
>> starting point has to be the system call definition.
> 
> Both the sysfs information *and* what will be exported in any future
> interfaces are derived from platform-specific information.  They are not
> derived from one _interface_ or the other.
> 
> They obviously need to be consistent, though.

What I meant was that the most comprehensive set of information should be
available to be fetched from the system call. Any other interface like
sysfs (or something else) will have to be a subset of what is available
through the system call. It should never be the case that information is
available via sysfs but not through the system call route. What is
exported through either the syscall or sysfs will always be derived from
platform-specific information.

> 
> 1. See "Table 5-142 System Locality Latency and Bandwidth Information
> Structure" here:
> http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> 

In conclusion, something like this sort of system call interface really
makes sense and can represent the superset of memory attribute
information.
Anshuman Khandual Nov. 27, 2018, 10:15 a.m. UTC | #21
On 11/26/2018 11:38 PM, Dan Williams wrote:
> On Mon, Nov 26, 2018 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 11/23/18 1:13 PM, Dan Williams wrote:
>>>> A new system call makes total sense to me.  I have the same concern
>>>> about the completeness of what's exposed in sysfs, I just don't see a
>>>> _route_ to completeness with sysfs itself.  Thus, the minimalist
>>>> approach as a first step.
>>> Outside of platform-firmware-id to Linux-numa-node-id what other
>>> userspace API infrastructure does the kernel need to provide? It seems
>>> userspace enumeration of memory attributes is fully enabled once the
>>> firmware-to-Linux identification is established.
>>
>> It would be nice not to have each app need to know about each specific
>> platform's firmware.
> 
> The app wouldn't need to know if it uses a common library. Whether the
> library calls into the kernel or not is an implementation detail. If
> it is information that only the app cares about and the kernel does
> not consume, why have a syscall?

If we just care about the platform-firmware-id <--> Linux-numa-node-id
mapping and fetching memory attributes from the platform (and hiding the
implementation details in a library), then the following interface should
be sufficient.

/sys/devices/system/node/nodeX/platform_id

But as the series proposes (and rightly so), the kernel needs to start
providing ABI interfaces for memory attributes instead of hiding them in
libraries.
Anshuman Khandual Nov. 27, 2018, 2:05 p.m. UTC | #22
On 11/23/2018 10:45 PM, Dan Williams wrote:
> On Thu, Nov 22, 2018 at 11:11 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 11/22/2018 11:38 PM, Dan Williams wrote:
>>> On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
>>> <anshuman.khandual@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 11/19/2018 11:07 PM, Dave Hansen wrote:
>>>>> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
>>>>>> IIUC NUMA re-work in principle involves these functional changes
>>>>>>
>>>>>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
>>>>>
>>>>> This patch set _does_ that, though.
>>>>>
>>>>>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
>>>>>
>>>>> It does that as well (a subset at least).
>>>>>
>>>>> It sounds like the subset that's being exposed is insufficient for you.
>>>>> We did that because we think doing anything but a subset in sysfs will
>>>>> just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
>>>>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
>>>>> combinations.
>>>> Each permutation need not be a separate file inside all possible node X
>>>> (/sys/devices/system/node/nodeX) directories. It can be a top-level file
>>>> enumerating the various attribute values for a given (X, Y) node pair,
>>>> based on an offset, something like /proc/pid/pagemap.
>>>>
>>>>>
>>>>> Do we agree that sysfs is unsuitable for exposing attributes in this manner?
>>>>>
>>>>
>>>> Yes, for individual files. But this can be worked around with offset-based
>>>> access from a top-level global attributes file as mentioned above. Is there
>>>> any particular advantage to using individual files for each given attribute?
>>>> I was wondering whether a single unsigned long (u64) would be able to pack
>>>> 8 different attributes, where each individual attribute value can be
>>>> abstracted out in 8 bits.
>>>
>>> sysfs has a 4K limit, and in general I don't think there is much
>>> incremental value to go describe the entirety of the system from sysfs
>>> or anywhere else in the kernel for that matter. It's simply too much
>>> information to reasonably consume. Instead the kernel can describe the
>>
>> I agree that it may be a lot of information to parse, but it is crucial
>> for any task on a heterogeneous system to evaluate (and probably
>> re-evaluate, if the task moves around) its memory and CPU binding at
>> runtime to make sure it has got the right one.
> 
> Can you provide some more evidence for this statement? It seems that

Evidence that a moving application can really benefit from a complete set
of effective N1 ----> N2 memory attributes? Right now mbind() binds the
memory locally and sched_setaffinity() binds the task locally. Even if the
task gets to the required memory through mbind(), a complete set of memory
attributes as seen from the various nodes will help it create an
appropriate cpumask_t for sched_setaffinity(). Local node effective
attributes alone (as proposed by this series) won't be able to give enough
information to create the required CPU mask.
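
For illustration, the two existing interfaces mentioned above can already
be combined today. A minimal sketch (error handling elided; buf must be
page-aligned, e.g. from mmap(), and mbind() comes from libnuma's headers,
so link with -lnuma):

#define _GNU_SOURCE
#include <numaif.h>	/* mbind(), MPOL_BIND */
#include <sched.h>	/* sched_setaffinity(), CPU_SET */
#include <stddef.h>

int bind_memory_and_cpus(void *buf, size_t len, int node, int cpu)
{
	unsigned long nodemask = 1UL << node;	/* nodes 0..63 only, for brevity */
	cpu_set_t cpus;

	/* Place buf's pages on the chosen memory node... */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		return -1;

	/* ...and run this thread on a CPU picked from the attribute data. */
	CPU_ZERO(&cpus);
	CPU_SET(cpu, &cpus);
	return sched_setaffinity(0, sizeof(cpus), &cpus);
}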

> not many applications even care about basic NUMA, let alone specific
> memory targeting, at least according to libnumactl users.

But we hope that will change going forward as more and more heterogeneous
systems come in. We should not be locking down an ABI without considering
future system possibilities and tasks' requirements.

> 
>      dnf repoquery --whatrequires numactl-libs
> 
> The kernel is the arbiter of memory, something is broken if
> applications *need* to take on this responsibility. Yes, there will be
> applications that want to tune and override the default kernel
> behavior, but this is the exception, not the rule. The applications
> that tend to care about specific memories also tend to be purpose
> built for a given platform, and that lessens their reliance on the
> kernel to enumerate all properties.

I would agree with you completely if we just wanted to move ahead in the
direction of preserving the existing status quo. But this series does
propose an application interface (rightly so, IMHO) for memory attributes,
and hence we are just trying to figure out whether that is the correct
direction or not.

> 
>>> coarse boundaries and some semblance of "best" access initiator for a
>>> given target. That should cover the "80%" case of what applications
>>
>> The current proposal just assumes that the best one is the nearest one.
>> This may be true for bandwidth and latency but may not be true for some
>> other properties. This assumption should not be baked in while defining
>> a new ABI.
> 
> In fact, I tend to agree with you, but in my opinion that's an
> argument to expose even less, not more. If we start with something
> minimal that can be extended over time we lessen the risk of over
> exposing details that don't matter in practice.

As we have been discussing on the other thread, what has to be exported
should be derived from everything that is needed. Cherry-picking a minimal
set of attributes and exposing them through sysfs in the short term does
not seem to be the right way either.

> 
> We're in the middle of a bit of a disaster with the VmFlags export in
> /proc/$pid/smaps precisely because the implementation was too
> comprehensive and applications started depending on details that the
> kernel does not want to guarantee going forward. So there is a real
> risk of being too descriptive in an interface design.

There is also a risk with an insufficient and inextensible sysfs interface
that is going to be there forever.
 
> 
>>> want to discover, for the other "20%" we likely need some userspace
>>> library that can go parse these platform specific information sources
>>> and supplement the kernel view. I also think a simpler kernel starting
>>> point gives us room to go pull in more commonly used attributes if it
>>> turns out they are useful, and avoid going down the path of exporting
>>> attributes that have questionable value in practice.
>>>
>>
>> Applications can just query platform information right now and use it
>> for mbind() without requiring this new interface.
> 
> No, they can't today, at least not for the topology details that HMAT
> is describing. The platform-firmware to numa-node translation is
> currently not complete. At a minimum we need a listing of initiator
> ids and target ids. For an ACPI platform that is the proximity-domain

All of the memory visible to the kernel is accessible from each individual
CPU on the system irrespective of which node it belongs to. Unless there
is another type of memory accessor, like GPU cores or something, each node
with CPUs should be an initiator for each node with memory.

> to numa-node-id translation information. Once that translation is in
> place then a userspace library can consult the platform-specific
> information sources to translate the platform-firmware view to the
> Linux handles for those memories. Am I missing the library that does
> this today?

Once coherency enumeration (through initiator <---> memory target) is out
of the way, would not the following simple new interface be sufficient to
query the platform about its properties?

/sys/devices/system/node/nodeX/platform_id

> 
>> We are not even
>> changing any core MM yet. So if it's just about identifying a node's
>> memory properties, they can be scanned from the platform itself. I agree
>> we would like the kernel to start adding interfaces for multi-attribute
>> memory, but all I am saying is that it has to be comprehensive. Some of
>> the attributes have more usefulness now and some have less, but the new
>> ABI interface has to accommodate exporting all of them.
> 
> I get the sense we are talking past each other, can you give the next
> level of detail on that "has to be comprehensive" statement?

The discussion on the other thread has some new details from last week.
Dan Williams Nov. 27, 2018, 4:56 p.m. UTC | #23
On Tue, Nov 27, 2018 at 2:15 AM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 11/26/2018 11:38 PM, Dan Williams wrote:
> > On Mon, Nov 26, 2018 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >> On 11/23/18 1:13 PM, Dan Williams wrote:
> >>>> A new system call makes total sense to me.  I have the same concern
> >>>> about the completeness of what's exposed in sysfs, I just don't see a
> >>>> _route_ to completeness with sysfs itself.  Thus, the minimalist
> >>>> approach as a first step.
> >>> Outside of platform-firmware-id to Linux-numa-node-id what other
> >>> userspace API infrastructure does the kernel need to provide? It seems
> >>> userspace enumeration of memory attributes is fully enabled once the
> >>> firmware-to-Linux identification is established.
> >>
> >> It would be nice not to have each app need to know about each specific
> >> platform's firmware.
> >
> > The app wouldn't need to know if it uses a common library. Whether the
> > library calls into the kernel or not is an implementation detail. If
> > it is information that only the app cares about and the kernel does
> > not consume, why have a syscall?
>
> If we just care about the platform-firmware-id <--> Linux-numa-node-id
> mapping and fetching memory attributes from the platform (and hiding the
> implementation details in a library), then the following interface should
> be sufficient.
>
> /sys/devices/system/node/nodeX/platform_id
>
> But as the series proposes (and rightly so), the kernel needs to start
> providing ABI interfaces for memory attributes instead of hiding them in
> libraries.

Yes, I can get on board with sysfs providing a subset of the
performance description for administrators to discover the common case
via scripting and leave the exhaustive attribute description to a
separate interface. I was pushing back on the notion that sysfs must
be that exhaustive interface... we're making progress.

I still think we need /sys/devices/system/node/nodeX/platform_id to
enable higher order platform enumeration tooling, but that need not be
the end of the kernel interface description.