Message ID: 20181114224902.12082-1-keith.busch@intel.com (mailing list archive)
Series: ACPI HMAT memory sysfs representation
On 11/15/2018 04:19 AM, Keith Busch wrote:
> This series provides a new sysfs representation for heterogeneous
> system memory.
>
> The previous series that was specific to HMAT that this series was based
> on was last posted here: https://lkml.org/lkml/2017/12/13/968
>
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from.
>
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and are in use today that describe
> the more complex memory hierarchies that can be created.
>
> This series first creates new generic APIs under the kernel's node
> representation. These new APIs can be used to create links among local
> memory and compute nodes and export characteristics about the memory
> nodes. Documentation describing the new representation is provided.
>
> Finally the series adds a kernel user for these new APIs by parsing
> the ACPI HMAT.

Not able to see the patches from this series either on the list or on the
archive (https://lkml.org/lkml/2018/11/15/331). IIRC last time we discussed
this, and the concern I raised was that in the absence of a broader NUMA
rework for multi attribute memory it might not be a good idea to settle on
and freeze a sysfs interface for the user space.
On Fri, Nov 16, 2018 at 11:57:58AM +0530, Anshuman Khandual wrote:
> On 11/15/2018 04:19 AM, Keith Busch wrote:
> > This series provides a new sysfs representation for heterogeneous
> > system memory.
> >
> > The previous series that was specific to HMAT that this series was based
> > on was last posted here: https://lkml.org/lkml/2017/12/13/968
> >
> > Platforms may provide multiple types of cpu attached system memory. The
> > memory ranges for each type may have different characteristics that
> > applications may wish to know about when considering what node they want
> > their memory allocated from.
> >
> > It had previously been difficult to describe these setups as memory
> > ranges were generally lumped into the NUMA node of the CPUs. New
> > platform attributes have been created and are in use today that describe
> > the more complex memory hierarchies that can be created.
> >
> > This series first creates new generic APIs under the kernel's node
> > representation. These new APIs can be used to create links among local
> > memory and compute nodes and export characteristics about the memory
> > nodes. Documentation describing the new representation is provided.
> >
> > Finally the series adds a kernel user for these new APIs by parsing
> > the ACPI HMAT.
>
> Not able to see the patches from this series either on the list or on the
> archive (https://lkml.org/lkml/2018/11/15/331).

The send-email split the cover-letter from the series, probably
something I did. Series followed immediately after:

https://lkml.org/lkml/2018/11/15/332

> IIRC last time we discussed this, and the concern I raised was that in
> the absence of a broader NUMA rework for multi attribute memory it might
> not be a good idea to settle on and freeze a sysfs interface for the
> user space.
On 11/15/18 10:27 PM, Anshuman Khandual wrote:
> Not able to see the patches from this series either on the list or on the
> archive (https://lkml.org/lkml/2018/11/15/331). IIRC last time we discussed
> this, and the concern I raised was that in the absence of a broader NUMA
> rework for multi attribute memory it might not be a good idea to settle on
> and freeze a sysfs interface for the user space.

This *is* the broader NUMA rework. I think it's just a bit more
incremental than what you originally had in mind.

Did you have an alternative for how you wanted this to look?
On 11/16/2018 09:21 PM, Keith Busch wrote:
> On Fri, Nov 16, 2018 at 11:57:58AM +0530, Anshuman Khandual wrote:
>> On 11/15/2018 04:19 AM, Keith Busch wrote:
>>> This series provides a new sysfs representation for heterogeneous
>>> system memory.
>>>
>>> The previous series that was specific to HMAT that this series was based
>>> on was last posted here: https://lkml.org/lkml/2017/12/13/968
>>>
>>> Platforms may provide multiple types of cpu attached system memory. The
>>> memory ranges for each type may have different characteristics that
>>> applications may wish to know about when considering what node they want
>>> their memory allocated from.
>>>
>>> It had previously been difficult to describe these setups as memory
>>> ranges were generally lumped into the NUMA node of the CPUs. New
>>> platform attributes have been created and are in use today that describe
>>> the more complex memory hierarchies that can be created.
>>>
>>> This series first creates new generic APIs under the kernel's node
>>> representation. These new APIs can be used to create links among local
>>> memory and compute nodes and export characteristics about the memory
>>> nodes. Documentation describing the new representation is provided.
>>>
>>> Finally the series adds a kernel user for these new APIs by parsing
>>> the ACPI HMAT.
>>
>> Not able to see the patches from this series either on the list or on the
>> archive (https://lkml.org/lkml/2018/11/15/331).
>
> The send-email split the cover-letter from the series, probably
> something I did. Series followed immediately after:
>
> https://lkml.org/lkml/2018/11/15/332

Yeah, got it. I can see the series on the list. Thanks for pointing it out.

>> IIRC last time we discussed this, and the concern I raised was that in
>> the absence of a broader NUMA rework for multi attribute memory it might
>> not be a good idea to settle on and freeze a sysfs interface for the
>> user space.
On 11/16/2018 10:25 PM, Dave Hansen wrote:
> On 11/15/18 10:27 PM, Anshuman Khandual wrote:
>> Not able to see the patches from this series either on the list or on the
>> archive (https://lkml.org/lkml/2018/11/15/331). IIRC last time we discussed
>> this, and the concern I raised was that in the absence of a broader NUMA
>> rework for multi attribute memory it might not be a good idea to settle on
>> and freeze a sysfs interface for the user space.

Hello Dave,

> This *is* the broader NUMA rework. I think it's just a bit more
> incremental than what you originally had in mind.

IIUC NUMA re-work in principle involves these functional changes:

1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
3. Changing core MM to accommodate multi attribute memory (long term)

The first two sets of changes can get user space applications moving by
identifying the right nodes and their attributes through a sysfs
interface.

> Did you have an alternative for how you wanted this to look?

No. I did not get enough time this year to rework the original proposal
I had. But I will be able to help here to make this interface more
generic, abstracting out these properties in a way that is extensible in
the future.

- Anshuman
On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> IIUC NUMA re-work in principle involves these functional changes
>
> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)

This patch set _does_ that, though.

> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)

It does that as well (a subset at least).

It sounds like the subset that's being exposed is insufficient for you.
We did that because we think doing anything but a subset in sysfs will
just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
attributes, that's at _least_ 1024*1024*4 files if we expose *all*
combinations.

Do we agree that sysfs is unsuitable for exposing attributes in this manner?
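[For scale, the worst case described above works out to:

	1024 initiator nodes * 1024 target nodes * 4 attributes = 4,194,304 sysfs files]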
On 11/19/2018 11:07 PM, Dave Hansen wrote:
> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
>> IIUC NUMA re-work in principle involves these functional changes
>>
>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
>
> This patch set _does_ that, though.
>
>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
>
> It does that as well (a subset at least).
>
> It sounds like the subset that's being exposed is insufficient for you.
> We did that because we think doing anything but a subset in sysfs will
> just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> combinations.

Each permutation need not be a separate file inside all possible NODE X
(/sys/devices/system/node/nodeX) directories. It can be a top level file
enumerating various attribute values for a given (X, Y) node pair based
on an offset, something like /proc/pid/pagemap.

> Do we agree that sysfs is unsuitable for exposing attributes in this manner?

Yes, for individual files. But this can be worked around with an
offset-based access from a top level global attributes file as
mentioned above. Is there any particular advantage to using individual
files for each given attribute? I was wondering whether a single
unsigned long (u64) would be able to pack 8 different attributes,
where each individual attribute value can be abstracted out in 8 bits.
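[As a concrete illustration of the u64 packing idea above, here is a
minimal sketch in C. The encoding, slot names, and slot order are
invented for illustration and are not part of any proposed ABI:

	#include <stdint.h>

	/* Hypothetical 8-bit attribute slots packed into one u64. */
	enum mem_attr_slot {
		SLOT_READ_BW,		/* slot 0: bits  0-7  */
		SLOT_WRITE_BW,		/* slot 1: bits  8-15 */
		SLOT_READ_LAT,		/* slot 2: bits 16-23 */
		SLOT_WRITE_LAT,		/* slot 3: bits 24-31 */
		/* four more 8-bit slots remain for future attributes */
	};

	static inline uint64_t mem_attr_set(uint64_t packed, int slot,
					    uint8_t val)
	{
		packed &= ~(0xffULL << (slot * 8));	/* clear old field */
		return packed | ((uint64_t)val << (slot * 8));
	}

	static inline uint8_t mem_attr_get(uint64_t packed, int slot)
	{
		return (packed >> (slot * 8)) & 0xff;
	}

The obvious limitation, which comes up later in the thread, is that each
attribute value is capped at 8 bits.]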
On 11/22/18 3:52 AM, Anshuman Khandual wrote:
>> It sounds like the subset that's being exposed is insufficient for you.
>> We did that because we think doing anything but a subset in sysfs will
>> just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
>> combinations.
> Each permutation need not be a separate file inside all possible NODE X
> (/sys/devices/system/node/nodeX) directories. It can be a top level file
> enumerating various attribute values for a given (X, Y) node pair based
> on an offset, something like /proc/pid/pagemap.

My assumption has been that this kind of thing is too fancy for sysfs.

Documentation/filesystems/sysfs.txt:

> Attributes should be ASCII text files, preferably with only one value
> per file. It is noted that it may not be efficient to contain only one
> value per file, so it is socially acceptable to express an array of
> values of the same type.
>
> Mixing types, expressing multiple lines of data, and doing fancy
> formatting of data is heavily frowned upon. Doing these things may get
> you publicly humiliated and your code rewritten without notice.

/proc/pid/pagemap is binary, not one-value-per-file, and relatively
complicated to parse. Do you really think following something like
pagemap is the right model for sysfs?

BTW, I'm not saying we don't need *some* interface like you propose. We
almost certainly will at some point. I just don't think it will be in
sysfs.
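[For reference, this is roughly what offset-indexed access to the
existing pagemap interface looks like from userspace, shown only to
illustrate the binary access model under discussion. Note that on
recent kernels the PFN field reads back as zero without CAP_SYS_ADMIN:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Look up the pagemap entry for one virtual address. */
	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		uintptr_t vaddr = (uintptr_t)&page_size; /* any mapped address */
		uint64_t entry;

		int fd = open("/proc/self/pagemap", O_RDONLY);
		if (fd < 0)
			return 1;

		/* One 64-bit entry per virtual page, indexed by page number. */
		off_t offset = (vaddr / page_size) * sizeof(entry);
		if (pread(fd, &entry, sizeof(entry), offset) !=
		    (ssize_t)sizeof(entry))
			return 1;

		printf("present: %d, pfn: 0x%llx\n",
		       (int)(entry >> 63),			  /* bit 63 */
		       (unsigned long long)(entry & ((1ULL << 55) - 1)));
		close(fd);
		return 0;
	}]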
On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> On 11/19/2018 11:07 PM, Dave Hansen wrote:
> > On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> >> IIUC NUMA re-work in principle involves these functional changes
> >>
> >> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
> >
> > This patch set _does_ that, though.
> >
> >> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
> >
> > It does that as well (a subset at least).
> >
> > It sounds like the subset that's being exposed is insufficient for you.
> > We did that because we think doing anything but a subset in sysfs will
> > just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
> > attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> > combinations.
> Each permutation need not be a separate file inside all possible NODE X
> (/sys/devices/system/node/nodeX) directories. It can be a top level file
> enumerating various attribute values for a given (X, Y) node pair based
> on an offset, something like /proc/pid/pagemap.
>
> > Do we agree that sysfs is unsuitable for exposing attributes in this manner?
>
> Yes, for individual files. But this can be worked around with an
> offset-based access from a top level global attributes file as
> mentioned above. Is there any particular advantage to using individual
> files for each given attribute? I was wondering whether a single
> unsigned long (u64) would be able to pack 8 different attributes,
> where each individual attribute value can be abstracted out in 8 bits.

sysfs has a 4K limit, and in general I don't think there is much
incremental value in describing the entirety of the system from sysfs,
or anywhere else in the kernel for that matter. It's simply too much
information to reasonably consume. Instead the kernel can describe the
coarse boundaries and some semblance of "best" access initiator for a
given target. That should cover the "80%" case of what applications
want to discover; for the other "20%" we likely need some userspace
library that can go parse these platform specific information sources
and supplement the kernel view. I also think a simpler kernel starting
point gives us room to go pull in more commonly used attributes if it
turns out they are useful, and avoid going down the path of exporting
attributes that have questionable value in practice.
On 11/22/2018 11:31 PM, Dave Hansen wrote:
> On 11/22/18 3:52 AM, Anshuman Khandual wrote:
>>> It sounds like the subset that's being exposed is insufficient for you.
>>> We did that because we think doing anything but a subset in sysfs will
>>> just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
>>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
>>> combinations.
>> Each permutation need not be a separate file inside all possible NODE X
>> (/sys/devices/system/node/nodeX) directories. It can be a top level file
>> enumerating various attribute values for a given (X, Y) node pair based
>> on an offset, something like /proc/pid/pagemap.
>
> My assumption has been that this kind of thing is too fancy for sysfs.

Applications need to know the matrix of multi attribute properties as
seen from the various memory accessors/initiators to be able to bind
themselves to the desired CPUs and memory. That gives applications a
true view of a heterogeneous system. While I understand your concern
here about sysfs (which can be worked around, probably with multiple
global files if size is a problem), an insufficient interface is
definitely problematic in the longer term. This is going to be an ABI
which is locked in for good. Hence even if it might appear to be over
engineering at the moment, IMHO it is the right thing to do.

> Documentation/filesystems/sysfs.txt:
>> Attributes should be ASCII text files, preferably with only one value
>> per file. It is noted that it may not be efficient to contain only one
>> value per file, so it is socially acceptable to express an array of
>> values of the same type.
>>
>> Mixing types, expressing multiple lines of data, and doing fancy
>> formatting of data is heavily frowned upon. Doing these things may get
>> you publicly humiliated and your code rewritten without notice.
>
> /proc/pid/pagemap is binary, not one-value-per-file, and relatively
> complicated to parse.

I agree, but it does provide user space really valuable information
about the faulted pages of its VA space. Was there any better way of
getting it? Maybe, but at this point in time it is essential.

> Do you really think following something like pagemap is the right model
> for sysfs?
>
> BTW, I'm not saying we don't need *some* interface like you propose. We
> almost certainly will at some point. I just don't think it will be in
> sysfs.

I am not saying doing this in sysfs is very elegant. I would rather have
a syscall read back a (MAX_NUMNODES * MAX_NUMNODES * u64) attribute
matrix from the kernel. Probably a subset of that information can appear
on sysfs to speed up queries for various optimizations, as Keith
mentioned before. But we will have to first evaluate and come to an
agreement on what constitutes a comprehensive set of multi attribute
properties.

Are we willing to go in the direction of inclusion of a new system call,
with a subset of it appearing on sysfs, etc.? My primary concern is not
how the attribute information appears on sysfs but the lack of its
completeness.
On 11/22/2018 11:38 PM, Dan Williams wrote:
> On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>> On 11/19/2018 11:07 PM, Dave Hansen wrote:
>>> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
>>>> IIUC NUMA re-work in principle involves these functional changes
>>>>
>>>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
>>>
>>> This patch set _does_ that, though.
>>>
>>>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
>>>
>>> It does that as well (a subset at least).
>>>
>>> It sounds like the subset that's being exposed is insufficient for you.
>>> We did that because we think doing anything but a subset in sysfs will
>>> just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
>>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
>>> combinations.
>> Each permutation need not be a separate file inside all possible NODE X
>> (/sys/devices/system/node/nodeX) directories. It can be a top level file
>> enumerating various attribute values for a given (X, Y) node pair based
>> on an offset, something like /proc/pid/pagemap.
>>
>>> Do we agree that sysfs is unsuitable for exposing attributes in this manner?
>>
>> Yes, for individual files. But this can be worked around with an
>> offset-based access from a top level global attributes file as
>> mentioned above. Is there any particular advantage to using individual
>> files for each given attribute? I was wondering whether a single
>> unsigned long (u64) would be able to pack 8 different attributes,
>> where each individual attribute value can be abstracted out in 8 bits.
>
> sysfs has a 4K limit, and in general I don't think there is much
> incremental value in describing the entirety of the system from sysfs,
> or anywhere else in the kernel for that matter. It's simply too much
> information to reasonably consume. Instead the kernel can describe the

I agree that it may be some amount of information to parse, but it is
crucial for any task on a heterogeneous system to evaluate (probably
re-evaluate if the task moves around) its memory and CPU binding at
runtime to make sure it has got the right one.

> coarse boundaries and some semblance of "best" access initiator for a
> given target. That should cover the "80%" case of what applications

The current proposal just assumes that the best one is the nearest one.
This may be true for bandwidth and latency but may not be true for some
other properties. This assumption should not be there while defining a
new ABI.

> want to discover; for the other "20%" we likely need some userspace
> library that can go parse these platform specific information sources
> and supplement the kernel view. I also think a simpler kernel starting
> point gives us room to go pull in more commonly used attributes if it
> turns out they are useful, and avoid going down the path of exporting
> attributes that have questionable value in practice.

Applications can just query platform information right now and use it
for mbind() without requiring this new interface. We are not even
changing any core MM yet. So if it's just about identifying a node's
memory properties, they can be scanned from the platform itself. But I
agree we would like the kernel to start adding interfaces for multi
attribute memory; all I am saying is that it has to be comprehensive.
Some of the attributes have more usefulness now and some have less, but
the new ABI interface has to accommodate exporting all of them.
On Thu, Nov 22, 2018 at 11:11 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> On 11/22/2018 11:38 PM, Dan Williams wrote:
> > On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >> On 11/19/2018 11:07 PM, Dave Hansen wrote:
> >>> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> >>>> IIUC NUMA re-work in principle involves these functional changes
> >>>>
> >>>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
> >>>
> >>> This patch set _does_ that, though.
> >>>
> >>>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
> >>>
> >>> It does that as well (a subset at least).
> >>>
> >>> It sounds like the subset that's being exposed is insufficient for you.
> >>> We did that because we think doing anything but a subset in sysfs will
> >>> just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
> >>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> >>> combinations.
> >> Each permutation need not be a separate file inside all possible NODE X
> >> (/sys/devices/system/node/nodeX) directories. It can be a top level file
> >> enumerating various attribute values for a given (X, Y) node pair based
> >> on an offset, something like /proc/pid/pagemap.
> >>
> >>> Do we agree that sysfs is unsuitable for exposing attributes in this manner?
> >>
> >> Yes, for individual files. But this can be worked around with an
> >> offset-based access from a top level global attributes file as
> >> mentioned above. Is there any particular advantage to using individual
> >> files for each given attribute? I was wondering whether a single
> >> unsigned long (u64) would be able to pack 8 different attributes,
> >> where each individual attribute value can be abstracted out in 8 bits.
> >
> > sysfs has a 4K limit, and in general I don't think there is much
> > incremental value in describing the entirety of the system from sysfs,
> > or anywhere else in the kernel for that matter. It's simply too much
> > information to reasonably consume. Instead the kernel can describe the
>
> I agree that it may be some amount of information to parse, but it is
> crucial for any task on a heterogeneous system to evaluate (probably
> re-evaluate if the task moves around) its memory and CPU binding at
> runtime to make sure it has got the right one.

Can you provide some more evidence for this statement? It seems that
not many applications even care about basic numa, let alone specific
memory targeting, at least according to libnumactl users:

	dnf repoquery --whatrequires numactl-libs

The kernel is the arbiter of memory; something is broken if
applications *need* to take on this responsibility. Yes, there will be
applications that want to tune and override the default kernel
behavior, but this is the exception, not the rule. The applications
that tend to care about specific memories also tend to be purpose
built for a given platform, and that lessens their reliance on the
kernel to enumerate all properties.

> > coarse boundaries and some semblance of "best" access initiator for a
> > given target. That should cover the "80%" case of what applications
>
> The current proposal just assumes that the best one is the nearest one.
> This may be true for bandwidth and latency but may not be true for some
> other properties. This assumption should not be there while defining a
> new ABI.

In fact, I tend to agree with you, but in my opinion that's an
argument to expose even less, not more. If we start with something
minimal that can be extended over time we lessen the risk of over
exposing details that don't matter in practice.

We're in the middle of a bit of a disaster with the VmFlags export in
/proc/$pid/smaps precisely because the implementation was too
comprehensive and applications started depending on details that the
kernel does not want to guarantee going forward. So there is a real
risk of being too descriptive in an interface design.

> > want to discover; for the other "20%" we likely need some userspace
> > library that can go parse these platform specific information sources
> > and supplement the kernel view. I also think a simpler kernel starting
> > point gives us room to go pull in more commonly used attributes if it
> > turns out they are useful, and avoid going down the path of exporting
> > attributes that have questionable value in practice.
>
> Applications can just query platform information right now and use it
> for mbind() without requiring this new interface.

No, they can't today, at least not for the topology details that HMAT
is describing. The platform-firmware to numa-node translation is
currently not complete. At a minimum we need a listing of initiator
ids and target ids. For an ACPI platform that is the proximity-domain
to numa-node-id translation information. Once that translation is in
place then a userspace library can consult the platform-specific
information sources to translate the platform-firmware view to the
Linux handles for those memories. Am I missing the library that does
this today?

> We are not even
> changing any core MM yet. So if it's just about identifying a node's
> memory properties, they can be scanned from the platform itself. But I
> agree we would like the kernel to start adding interfaces for multi
> attribute memory; all I am saying is that it has to be comprehensive.
> Some of the attributes have more usefulness now and some have less, but
> the new ABI interface has to accommodate exporting all of them.

I get the sense we are talking past each other; can you give the next
level of detail on that "has to be comprehensive" statement?
On 11/22/18 10:42 PM, Anshuman Khandual wrote:
> Are we willing to go in the direction of inclusion of a new system call,
> with a subset of it appearing on sysfs, etc.? My primary concern is not
> how the attribute information appears on sysfs but the lack of its
> completeness.

A new system call makes total sense to me. I have the same concern
about the completeness of what's exposed in sysfs; I just don't see a
_route_ to completeness with sysfs itself. Thus, the minimalist
approach as a first step.
On Fri, Nov 23, 2018 at 11:21 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
> > Are we willing to go in the direction of inclusion of a new system call,
> > with a subset of it appearing on sysfs, etc.? My primary concern is not
> > how the attribute information appears on sysfs but the lack of its
> > completeness.
>
> A new system call makes total sense to me. I have the same concern
> about the completeness of what's exposed in sysfs; I just don't see a
> _route_ to completeness with sysfs itself. Thus, the minimalist
> approach as a first step.

Outside of platform-firmware-id to Linux-numa-node-id, what other
userspace API infrastructure does the kernel need to provide? It seems
userspace enumeration of memory attributes is fully enabled once the
firmware-to-Linux identification is established.
On 11/24/2018 12:51 AM, Dave Hansen wrote:
> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
>> Are we willing to go in the direction of inclusion of a new system call,
>> with a subset of it appearing on sysfs, etc.? My primary concern is not
>> how the attribute information appears on sysfs but the lack of its
>> completeness.
>
> A new system call makes total sense to me. I have the same concern
> about the completeness of what's exposed in sysfs; I just don't see a
> _route_ to completeness with sysfs itself. Thus, the minimalist
> approach as a first step.

Okay. If we agree on the need for a new system call extracting the
superset attribute information, MAX_NUMNODES * MAX_NUMNODES * u64 (a u64
packing 8-bit values for 8 attributes, or something like that) as we had
discussed before, then it makes sense to export a subset of it which can
be faster and still useful for user space without going through a system
call. Do you agree on a (system call + sysfs) approach in principle?

Also, the sysfs exported information has to be derived from what is
available through the system call, not the other way round. Hence the
starting point has to be the system call definition.
On 11/24/2018 02:43 AM, Dan Williams wrote:
> On Fri, Nov 23, 2018 at 11:21 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
>>> Are we willing to go in the direction of inclusion of a new system call,
>>> with a subset of it appearing on sysfs, etc.? My primary concern is not
>>> how the attribute information appears on sysfs but the lack of its
>>> completeness.
>>
>> A new system call makes total sense to me. I have the same concern
>> about the completeness of what's exposed in sysfs; I just don't see a
>> _route_ to completeness with sysfs itself. Thus, the minimalist
>> approach as a first step.
>
> Outside of platform-firmware-id to Linux-numa-node-id, what other
> userspace API infrastructure does the kernel need to provide? It seems
> userspace enumeration of memory attributes is fully enabled once the
> firmware-to-Linux identification is established.

Which is true if user space is required to probe the memory attribute
values for the platform-firmware-id from the platform and then request
the required memory from the corresponding Linux-numa-node-id via
standard mm interfaces like mbind(). But in this patch series we are not
mapping platform-firmware-id to Linux-numa-node-id. We are exporting
properties applicable to Linux nodes (Linux-numa-node-id). Even if a
platform-firmware-id to Linux-numa-node-id mapping is required, it can
be done through a new file like the following. Applications can just
take the platform_id of a node and query the platform about its
properties.

/sys/devices/system/node/nodeX/platform_id

This above interface would have been okay, as it's just an extension of
the existing node information on sysfs. But that's not the case with
this proposal.
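[A minimal sketch of how an application might consume the platform_id
attribute proposed above. The file is hypothetical, since the series
does not implement it; on an ACPI platform the value would presumably
be the proximity domain:

	#include <stdio.h>

	/* Read the hypothetical per-node platform_id attribute. */
	static int node_platform_id(int nid, unsigned long *platform_id)
	{
		char path[64];
		FILE *f;
		int ret = -1;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/platform_id", nid);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%lu", platform_id) == 1)
			ret = 0;	/* caller now holds the firmware handle */
		fclose(f);
		return ret;
	}]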
On 11/23/18 1:13 PM, Dan Williams wrote:
>> A new system call makes total sense to me. I have the same concern
>> about the completeness of what's exposed in sysfs; I just don't see a
>> _route_ to completeness with sysfs itself. Thus, the minimalist
>> approach as a first step.
> Outside of platform-firmware-id to Linux-numa-node-id, what other
> userspace API infrastructure does the kernel need to provide? It seems
> userspace enumeration of memory attributes is fully enabled once the
> firmware-to-Linux identification is established.

It would be nice not to have each app need to know about each specific
platform's firmware.
On 11/26/18 7:38 AM, Anshuman Khandual wrote:
> On 11/24/2018 12:51 AM, Dave Hansen wrote:
>> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
>>> Are we willing to go in the direction of inclusion of a new system call,
>>> with a subset of it appearing on sysfs, etc.? My primary concern is not
>>> how the attribute information appears on sysfs but the lack of its
>>> completeness.
>>
>> A new system call makes total sense to me. I have the same concern
>> about the completeness of what's exposed in sysfs; I just don't see a
>> _route_ to completeness with sysfs itself. Thus, the minimalist
>> approach as a first step.
>
> Okay. If we agree on the need for a new system call extracting the
> superset attribute information, MAX_NUMNODES * MAX_NUMNODES * u64 (a u64
> packing 8-bit values for 8 attributes, or something like that) as we had
> discussed before, then it makes sense to export a subset of it which can
> be faster and still useful for user space without going through a system
> call.

The information that needs to be exported is a bit more than that. It's
not just a binary attribute.

The information we have from the new ACPI table, for instance, is the
read and write bandwidth and latency between two nodes. They are, IIRC,
two-byte values in the ACPI table[1], each. That's 8 bytes worth of
data right there, which wouldn't fit *anything* else.

The list of things we want to export will certainly grow. That means we
need a syscall something like this:

	int get_mem_attribute(unsigned long attribute_nr,
			      unsigned long __user *initiator_nmask,
			      unsigned long __user *target_nmask,
			      unsigned long maxnode,
			      unsigned long *attributes_out);

	#define MEM_ATTR_READ_BANDWIDTH		1
	#define MEM_ATTR_WRITE_BANDWIDTH	2
	#define MEM_ATTR_READ_LATENCY		3
	#define MEM_ATTR_WRITE_LATENCY		4
	#define MEM_ATTR_ENCRYPTION		5

If you want to know the read latency between nodes 4 and 8, you do:

	ret = get_mem_attribute(MEM_ATTR_READ_LATENCY,
				(1<<4), (1<<8), max, &array);

And the answer shows up at array[0] in this example. If you had more
than one bit set in the two nmasks, you would have a longer array. The
length of the array is the number of bits set in initiator_nmask * the
number of bits set in target_nmask * sizeof(ulong).

This has the advantage of supporting ULONG_MAX attributes, and scales
from asking for one attribute at a time all the way up to dumping the
entire system's worth of data for a single attribute. The only downside
is that it's one syscall per attribute instead of packing them all
together. But, if we have a small enough number to pack them in one
ulong, then I think we can make 64 syscalls without too much trouble.

> Do you agree on a (system call + sysfs) approach in principle?
>
> Also, the sysfs exported information has to be derived from what is
> available through the system call, not the other way round. Hence the
> starting point has to be the system call definition.

Both the sysfs information *and* what will be exported in any future
interfaces are derived from platform-specific information. They are not
derived from one _interface_ or the other.

They obviously need to be consistent, though.

1. See "Table 5-142 System Locality Latency and Bandwidth Information
   Structure" here:
   http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
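[A hedged sketch of how userspace might walk the result array under this
proposal. It assumes a row-major layout over ascending set bits of the
two node masks, which is my reading of the description above rather than
something the proposal spells out:

	#include <limits.h>
	#include <stdio.h>

	#define BITS_PER_ULONG	(sizeof(unsigned long) * CHAR_BIT)

	static int node_isset(unsigned long node, const unsigned long *mask)
	{
		return (mask[node / BITS_PER_ULONG] >>
			(node % BITS_PER_ULONG)) & 1;
	}

	/* Walk attributes_out in the assumed (initiator, target) order. */
	static void print_attr(const unsigned long *initiators,
			       const unsigned long *targets,
			       unsigned long maxnode,
			       const unsigned long *attributes_out)
	{
		unsigned long idx = 0;

		for (unsigned long i = 0; i < maxnode; i++) {
			if (!node_isset(i, initiators))
				continue;
			for (unsigned long t = 0; t < maxnode; t++) {
				if (!node_isset(t, targets))
					continue;
				printf("node %lu -> node %lu: %lu\n",
				       i, t, attributes_out[idx++]);
			}
		}
	}]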
On Mon, Nov 26, 2018 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/23/18 1:13 PM, Dan Williams wrote:
> >> A new system call makes total sense to me. I have the same concern
> >> about the completeness of what's exposed in sysfs; I just don't see a
> >> _route_ to completeness with sysfs itself. Thus, the minimalist
> >> approach as a first step.
> > Outside of platform-firmware-id to Linux-numa-node-id, what other
> > userspace API infrastructure does the kernel need to provide? It seems
> > userspace enumeration of memory attributes is fully enabled once the
> > firmware-to-Linux identification is established.
>
> It would be nice not to have each app need to know about each specific
> platform's firmware.

The app wouldn't need to know if it uses a common library. Whether the
library calls into the kernel or not is an implementation detail. If it
is information that only the app cares about and the kernel does not
consume, why have a syscall?
On 11/26/2018 10:50 PM, Dave Hansen wrote:
> On 11/26/18 7:38 AM, Anshuman Khandual wrote:
>> Okay. If we agree on the need for a new system call extracting the
>> superset attribute information, MAX_NUMNODES * MAX_NUMNODES * u64 (a u64
>> packing 8-bit values for 8 attributes, or something like that) as we had
>> discussed before, then it makes sense to export a subset of it which can
>> be faster and still useful for user space without going through a system
>> call.
>
> The information that needs to be exported is a bit more than that. It's
> not just a binary attribute.

Right, it won't be binary, because it would contain a value for an
attribute.

> The information we have from the new ACPI table, for instance, is the
> read and write bandwidth and latency between two nodes. They are, IIRC,
> two-byte values in the ACPI table[1], each. That's 8 bytes worth of
> data right there, which wouldn't fit *anything* else.

Hmm, I get your point. We would need interfaces, both system call and
sysfs, where the number of attributes and the bit field containing the
value of any attribute can grow in the future while preserving backward
compatibility.

> The list of things we want to export will certainly grow. That means we
> need a syscall something like this:
>
> 	int get_mem_attribute(unsigned long attribute_nr,
> 			      unsigned long __user *initiator_nmask,
> 			      unsigned long __user *target_nmask,
> 			      unsigned long maxnode,
> 			      unsigned long *attributes_out);

Agreed. I was also thinking that something like the above syscall
interface works, where attribute_nr can grow as an enum (with
MAX_MEM_ATTRIBUTES increasing) while still keeping the previous order
intact for backward compatibility. But then I guess we would need to
pass the size of an attribute structure (a UAPI structure like
perf_event_attr) so that it can grow further while the structure packing
order is maintained for backward compatibility:

	int get_mem_attribute(unsigned long attribute_nr,
			      unsigned long __user *initiator_nmask,
			      unsigned long __user *target_nmask,
			      unsigned long maxnode,
			      unsigned long *attributes_out,
			      size_t attribute_size);

> 	#define MEM_ATTR_READ_BANDWIDTH		1
> 	#define MEM_ATTR_WRITE_BANDWIDTH	2
> 	#define MEM_ATTR_READ_LATENCY		3
> 	#define MEM_ATTR_WRITE_LATENCY		4
> 	#define MEM_ATTR_ENCRYPTION		5
>
> If you want to know the read latency between nodes 4 and 8, you do:
>
> 	ret = get_mem_attribute(MEM_ATTR_READ_LATENCY,
> 				(1<<4), (1<<8), max, &array);
>
> And the answer shows up at array[0] in this example. If you had more
> than one bit set in the two nmasks, you would have a longer array. The
> length of the array is the number of bits set in initiator_nmask * the
> number of bits set in target_nmask * sizeof(ulong).

Right. Hmm, I guess now that the interface requests a single attribute
at a time, it does not have to worry about a structure for the attribute
field. A single unsigned long should be enough to hold the value of any
given attribute, and it also does not have to worry much about
compatibility. This is better.

> This has the advantage of supporting ULONG_MAX attributes, and scales

Right.

> from asking for one attribute at a time all the way up to dumping the
> entire system's worth of data for a single attribute. The only downside

Right.

> is that it's one syscall per attribute instead of packing them all
> together. But, if we have a small enough number to pack them in one
> ulong, then I think we can make 64 syscalls without too much trouble.

I agree. It also enables a single attribute to have a value up to
ULONG_MAX and avoids the compatibility issues of packing order when
multiple attributes are requested together. This is definitely a cleaner
interface.

>> Do you agree on a (system call + sysfs) approach in principle?
>>
>> Also, the sysfs exported information has to be derived from what is
>> available through the system call, not the other way round. Hence the
>> starting point has to be the system call definition.
>
> Both the sysfs information *and* what will be exported in any future
> interfaces are derived from platform-specific information. They are not
> derived from one _interface_ or the other.
>
> They obviously need to be consistent, though.

What I meant was that the most comprehensive set of information should
be available to be fetched from the system call. Any other interface
like sysfs (or something else) will have to be a subset of what is
available through the system call. It should never be the case that
there is information available via sysfs but not through the system call
route. What is exported through either syscall or sysfs will always be
derived from platform specific information.

> 1. See "Table 5-142 System Locality Latency and Bandwidth Information
>    Structure" here:
>    http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

In conclusion, this sort of system call interface really makes sense and
can represent a superset of memory attribute information.
On 11/26/2018 11:38 PM, Dan Williams wrote:
> On Mon, Nov 26, 2018 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 11/23/18 1:13 PM, Dan Williams wrote:
>>>> A new system call makes total sense to me. I have the same concern
>>>> about the completeness of what's exposed in sysfs; I just don't see a
>>>> _route_ to completeness with sysfs itself. Thus, the minimalist
>>>> approach as a first step.
>>> Outside of platform-firmware-id to Linux-numa-node-id, what other
>>> userspace API infrastructure does the kernel need to provide? It seems
>>> userspace enumeration of memory attributes is fully enabled once the
>>> firmware-to-Linux identification is established.
>>
>> It would be nice not to have each app need to know about each specific
>> platform's firmware.
>
> The app wouldn't need to know if it uses a common library. Whether the
> library calls into the kernel or not is an implementation detail. If it
> is information that only the app cares about and the kernel does not
> consume, why have a syscall?

If we just care about the platform-firmware-id <--> Linux-numa-node-id
mapping and fetching memory attributes from the platform (hiding the
implementation details in a library), then the following interface
should be sufficient:

/sys/devices/system/node/nodeX/platform_id

But as the series proposes (and rightly so), the kernel needs to start
providing ABI interfaces for memory attributes instead of hiding them in
libraries.
On 11/23/2018 10:45 PM, Dan Williams wrote:
> On Thu, Nov 22, 2018 at 11:11 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>> On 11/22/2018 11:38 PM, Dan Williams wrote:
>>> On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
>>> <anshuman.khandual@arm.com> wrote:
>>>>
>>>> On 11/19/2018 11:07 PM, Dave Hansen wrote:
>>>>> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
>>>>>> IIUC NUMA re-work in principle involves these functional changes
>>>>>>
>>>>>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
>>>>>
>>>>> This patch set _does_ that, though.
>>>>>
>>>>>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
>>>>>
>>>>> It does that as well (a subset at least).
>>>>>
>>>>> It sounds like the subset that's being exposed is insufficient for you.
>>>>> We did that because we think doing anything but a subset in sysfs will
>>>>> just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
>>>>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
>>>>> combinations.
>>>> Each permutation need not be a separate file inside all possible NODE X
>>>> (/sys/devices/system/node/nodeX) directories. It can be a top level file
>>>> enumerating various attribute values for a given (X, Y) node pair based
>>>> on an offset, something like /proc/pid/pagemap.
>>>>
>>>>> Do we agree that sysfs is unsuitable for exposing attributes in this manner?
>>>>
>>>> Yes, for individual files. But this can be worked around with an
>>>> offset-based access from a top level global attributes file as
>>>> mentioned above. Is there any particular advantage to using individual
>>>> files for each given attribute? I was wondering whether a single
>>>> unsigned long (u64) would be able to pack 8 different attributes,
>>>> where each individual attribute value can be abstracted out in 8 bits.
>>>
>>> sysfs has a 4K limit, and in general I don't think there is much
>>> incremental value in describing the entirety of the system from sysfs,
>>> or anywhere else in the kernel for that matter. It's simply too much
>>> information to reasonably consume. Instead the kernel can describe the
>>
>> I agree that it may be some amount of information to parse, but it is
>> crucial for any task on a heterogeneous system to evaluate (probably
>> re-evaluate if the task moves around) its memory and CPU binding at
>> runtime to make sure it has got the right one.
>
> Can you provide some more evidence for this statement? It seems that

That a moving application can really benefit from a complete set of
effective N1 ----> N2 memory attributes? Right now mbind() binds the
memory and sched_setaffinity() binds the task. Even if the task gets to
the required memory through mbind(), a complete set of memory attributes
as seen from various nodes will help it create an appropriate cpumask_t
for sched_setaffinity(). Local node effective attributes alone (as
proposed by this series) won't give enough information to create the
required cpu mask. (A sketch of this workflow follows after this
message.)

> not many applications even care about basic numa, let alone specific
> memory targeting, at least according to libnumactl users.

But we hope that will change going forward as more and more
heterogeneous systems come in. We should not be locking down an ABI
without considering future system possibilities and tasks' requirements.

> 	dnf repoquery --whatrequires numactl-libs
>
> The kernel is the arbiter of memory; something is broken if
> applications *need* to take on this responsibility. Yes, there will be
> applications that want to tune and override the default kernel
> behavior, but this is the exception, not the rule. The applications
> that tend to care about specific memories also tend to be purpose
> built for a given platform, and that lessens their reliance on the
> kernel to enumerate all properties.

I would agree with you completely if we just wanted to preserve the
existing status quo. But this series does propose an application
interface (rightly so, IMHO) for memory attributes, and hence we are
just trying to figure out whether that is the correct direction or not.

>>> coarse boundaries and some semblance of "best" access initiator for a
>>> given target. That should cover the "80%" case of what applications
>>
>> The current proposal just assumes that the best one is the nearest one.
>> This may be true for bandwidth and latency but may not be true for some
>> other properties. This assumption should not be there while defining a
>> new ABI.
>
> In fact, I tend to agree with you, but in my opinion that's an
> argument to expose even less, not more. If we start with something
> minimal that can be extended over time we lessen the risk of over
> exposing details that don't matter in practice.

As we have been discussing on the other thread, what has to be exported
should be derived from what all is needed. Cherry picking a minimal set
of attributes and exposing them through sysfs in the short term does not
seem to be the right way either.

> We're in the middle of a bit of a disaster with the VmFlags export in
> /proc/$pid/smaps precisely because the implementation was too
> comprehensive and applications started depending on details that the
> kernel does not want to guarantee going forward. So there is a real
> risk of being too descriptive in an interface design.

There is also a risk with an insufficient and inextensible sysfs
interface which is going to be there forever.

>>> want to discover; for the other "20%" we likely need some userspace
>>> library that can go parse these platform specific information sources
>>> and supplement the kernel view. I also think a simpler kernel starting
>>> point gives us room to go pull in more commonly used attributes if it
>>> turns out they are useful, and avoid going down the path of exporting
>>> attributes that have questionable value in practice.
>>
>> Applications can just query platform information right now and use it
>> for mbind() without requiring this new interface.
>
> No, they can't today, at least not for the topology details that HMAT
> is describing. The platform-firmware to numa-node translation is
> currently not complete. At a minimum we need a listing of initiator
> ids and target ids. For an ACPI platform that is the proximity-domain

All of the memory visible to the kernel is accessible from each
individual CPU on the system, irrespective of which node it belongs to.
Unless there are other types of memory accessors, like GPU cores, each
node with CPUs should be an initiator for each node with memory.

> to numa-node-id translation information. Once that translation is in
> place then a userspace library can consult the platform-specific
> information sources to translate the platform-firmware view to the
> Linux handles for those memories. Am I missing the library that does
> this today?

Once coherency enumeration (through initiator <---> memory target) is
out of the way, would not the following simple new interface be
sufficient to query the platform about its properties?

/sys/devices/system/node/nodeX/platform_id

>> We are not even
>> changing any core MM yet. So if it's just about identifying a node's
>> memory properties, they can be scanned from the platform itself. But I
>> agree we would like the kernel to start adding interfaces for multi
>> attribute memory; all I am saying is that it has to be comprehensive.
>> Some of the attributes have more usefulness now and some have less, but
>> the new ABI interface has to accommodate exporting all of them.
>
> I get the sense we are talking past each other; can you give the next
> level of detail on that "has to be comprehensive" statement?

The discussion on the other thread has some new details from last week.
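[As referenced in the message above, a minimal sketch of the re-binding
workflow being described, using the existing mbind(2) and
sched_setaffinity(2) interfaces (mbind via libnuma's numaif.h wrapper;
link with -lnuma). How the application derives mem_node and
cpus_near_mem is exactly the attribute-discovery question this thread is
debating:

	#define _GNU_SOURCE
	#include <numaif.h>	/* mbind(), MPOL_BIND */
	#include <sched.h>	/* sched_setaffinity(), cpu_set_t */

	/* Bind a (page-aligned) buffer to one memory node, then bind
	 * the calling task to the CPUs considered "near" that memory. */
	static int bind_near(void *buf, unsigned long len, int mem_node,
			     const cpu_set_t *cpus_near_mem)
	{
		unsigned long nodemask = 1UL << mem_node; /* nodes 0..63 only */

		if (mbind(buf, len, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, 0))
			return -1;

		/* pid 0 means the calling thread */
		return sched_setaffinity(0, sizeof(*cpus_near_mem),
					 cpus_near_mem);
	}]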
On Tue, Nov 27, 2018 at 2:15 AM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> On 11/26/2018 11:38 PM, Dan Williams wrote:
> > On Mon, Nov 26, 2018 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >> On 11/23/18 1:13 PM, Dan Williams wrote:
> >>>> A new system call makes total sense to me. I have the same concern
> >>>> about the completeness of what's exposed in sysfs; I just don't see a
> >>>> _route_ to completeness with sysfs itself. Thus, the minimalist
> >>>> approach as a first step.
> >>> Outside of platform-firmware-id to Linux-numa-node-id, what other
> >>> userspace API infrastructure does the kernel need to provide? It seems
> >>> userspace enumeration of memory attributes is fully enabled once the
> >>> firmware-to-Linux identification is established.
> >>
> >> It would be nice not to have each app need to know about each specific
> >> platform's firmware.
> >
> > The app wouldn't need to know if it uses a common library. Whether the
> > library calls into the kernel or not is an implementation detail. If it
> > is information that only the app cares about and the kernel does not
> > consume, why have a syscall?
>
> If we just care about the platform-firmware-id <--> Linux-numa-node-id
> mapping and fetching memory attributes from the platform (hiding the
> implementation details in a library), then the following interface
> should be sufficient:
>
> /sys/devices/system/node/nodeX/platform_id
>
> But as the series proposes (and rightly so), the kernel needs to start
> providing ABI interfaces for memory attributes instead of hiding them in
> libraries.

Yes, I can get on board with sysfs providing a subset of the
performance description for administrators to discover the common case
via scripting, and leave the exhaustive attribute description to a
separate interface. I was pushing back on the notion that sysfs must be
that exhaustive interface... we're making progress.

I still think we need /sys/devices/system/node/nodeX/platform_id to
enable higher order platform enumeration tooling, but that need not be
the end of the kernel interface description.