Message ID | 20250307063534.540-1-rakie.kim@sk.com (mailing list archive)
Series | mm/mempolicy: Add memory hotplug support in weighted interleave
On Fri, Mar 07, 2025 at 03:35:29PM +0900, Rakie Kim wrote:
> Unnecessary sysfs entries: Nodes without memory were included in sysfs
> at boot.
> Missing hotplug support: Nodes that became online after initialization
> were not recognized, causing incomplete interleave configurations.

This comment is misleading. Nodes can "come online", but they are
absolutely detected during init - nodes cannot be "hotplugged"
themselves. Resources can be added *to* nodes, but nodes themselves
cannot be added or removed.

I think what you're trying to say here is:

1) The current system creates one entry per possible node (explicitly)
2) Not all nodes may have memory at all times (memory can be hotplugged)
3) It would be nice to let sysfs and weighted interleave omit memoryless
   nodes until those nodes had memory hotplugged into them.

> Dynamic sysfs updates for hotplugged nodes New memory nodes are
> recognized and integrated via the memory hotplug mechanism.
> Subsequent patches refine this functionality:

Just going to reiterate that there's no such thing as a hotplug node
or "new nodes" - only nodes that have their attributes changed (i.e.
!N_MEMORY -> N_MEMORY). The node exists; it may just not have anything
associated with it.

Maybe these are semantic nits, but they matter. The nodes are present and
can be operated on before memory comes online, and that has implications
for users. Depending on how that hardware comes online, it may or may not
report its performance data prior to memory hotplug.

If it doesn't report its performance data, then hiding the node before
it hotplugs memory means a user can't pre-configure the system for when
the memory is added (which could be used immediately).

Hiding the node until hotplug also means we have hidden state. We need
to capture pre-hotplug reported performance data so that if the node comes
online the auto-calculation of weights is correct. But if the user has
already switched from auto to manual mode, then a node suddenly
appearing will have an unknown state.

This is why I initially chose to just expose N_POSSIBLE entries in
sysfs: the transition state causes hidden information, and that felt
worse than extra entries. I suppose I should add some documentation
somewhere that discusses this issue.

I think the underlying issue you're dealing with is that the system is
creating more nodes for you than it should.

~Gregory
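For readers following the thread, here is a minimal sketch of the tradeoff described above: populating the weighted-interleave sysfs directory from N_POSSIBLE versus from N_MEMORY. The wi_add_sysfs_entry() helper is a hypothetical stand-in for whatever the series actually uses; the node iterators and node states are the standard kernel ones.

```c
#include <linux/nodemask.h>
#include <linux/printk.h>

/* Hypothetical stand-in for creating
 * .../mempolicy/weighted_interleave/nodeN */
static int wi_add_sysfs_entry(int nid)
{
	pr_info("weighted_interleave: exposing node%d\n", nid);
	return 0;
}

/* Option A (current behaviour): one entry per possible node, even if
 * the node has no memory yet.  No hidden state, but extra entries. */
static void wi_populate_possible(void)
{
	int nid;

	for_each_node(nid)		/* iterates N_POSSIBLE */
		wi_add_sysfs_entry(nid);
}

/* Option B (proposed): only nodes that currently have memory.  Entries
 * for memoryless nodes must be added later, when memory is hotplugged. */
static void wi_populate_memory_only(void)
{
	int nid;

	for_each_node_state(nid, N_MEMORY)
		wi_add_sysfs_entry(nid);
}
```

Option A never has hidden state but exposes entries for nodes that may never gain memory; Option B keeps sysfs tidy but requires the hotplug integration discussed later in the thread.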
On Fri, Mar 07, 2025 at 10:56:04AM -0500, Gregory Price wrote:
>
> I think the underlying issue you're dealing with is that the system is
> creating more nodes for you than it should.
>

Looking into this for other reasons, I think you are right that multiple
NUMA nodes can exist that cover the same memory - just different regions.

I can see why you would want to hide the nodes that don't actively have
memory online, but I still have concerns about nodes that may come and
go, and about hiding this configuration from the user until memory arrives.

An example would be a DCD device, where a node could add or remove memory
at any time. If you removed the last block of memory, the node would
disappear - but the block could come back at any time. That seems
problematic, as you might want to manage that node while no memory is
present.

~Gregory
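As a rough illustration of the hotplug path the series relies on (a sketch under assumptions, not the actual patch): a memory-hotplug notifier that acts only when an online/offline event changes a node's N_MEMORY state. register_memory_notifier() and struct memory_notify are the standard kernel interfaces; wi_add_sysfs_entry() and wi_remove_sysfs_entry() are the same hypothetical helpers as in the earlier sketch.

```c
#include <linux/init.h>
#include <linux/memory.h>
#include <linux/notifier.h>

/* Hypothetical helpers managing weighted_interleave/nodeN entries. */
static int wi_add_sysfs_entry(int nid);
static void wi_remove_sysfs_entry(int nid);

static int wi_memory_callback(struct notifier_block *nb,
			      unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;

	/* status_change_nid is -1 unless this event changed the node's
	 * N_MEMORY state (first block added / last block removed). */
	if (mn->status_change_nid < 0)
		return NOTIFY_OK;

	switch (action) {
	case MEM_ONLINE:
		wi_add_sysfs_entry(mn->status_change_nid);
		break;
	case MEM_OFFLINE:
		/* The DCD corner case: removing the last block hides the
		 * node (and any configured weight) until memory returns. */
		wi_remove_sysfs_entry(mn->status_change_nid);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block wi_memory_nb = {
	.notifier_call = wi_memory_callback,
};

static int __init wi_hotplug_init(void)
{
	return register_memory_notifier(&wi_memory_nb);
}
```

The MEM_OFFLINE branch is exactly the concern raised above: once the last block of a node goes away, the entry and any weight configured through it vanish until memory comes back.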
On Fri, 7 Mar 2025 10:56:04 -0500 Gregory Price <gourry@gourry.net> wrote:

Hi Gregory,

Thank you for your response regarding this patch.

> On Fri, Mar 07, 2025 at 03:35:29PM +0900, Rakie Kim wrote:
> > Unnecessary sysfs entries: Nodes without memory were included in sysfs
> > at boot.
> > Missing hotplug support: Nodes that became online after initialization
> > were not recognized, causing incomplete interleave configurations.
>
> This comment is misleading. Nodes can "come online", but they are
> absolutely detected during init - nodes cannot be "hotplugged"
> themselves. Resources can be added *to* nodes, but nodes themselves
> cannot be added or removed.
>
> I think what you're trying to say here is:
>
> 1) The current system creates one entry per possible node (explicitly)
> 2) Not all nodes may have memory at all times (memory can be hotplugged)
> 3) It would be nice to let sysfs and weighted interleave omit memoryless
>    nodes until those nodes had memory hotplugged into them.
>
> > Dynamic sysfs updates for hotplugged nodes New memory nodes are
> > recognized and integrated via the memory hotplug mechanism.
> > Subsequent patches refine this functionality:
>
> Just going to reiterate that there's no such thing as a hotplug node
> or "new nodes" - only nodes that have their attributes changed (i.e.
> !N_MEMORY -> N_MEMORY). The node exists; it may just not have anything
> associated with it.
>
> Maybe these are semantic nits, but they matter. The nodes are present and
> can be operated on before memory comes online, and that has implications
> for users. Depending on how that hardware comes online, it may or may not
> report its performance data prior to memory hotplug.

I agree with your assessment. The existing comments, as you pointed out,
might indeed be confusing or misleading. I'll make sure this issue is
addressed in version 2.

> If it doesn't report its performance data, then hiding the node before
> it hotplugs memory means a user can't pre-configure the system for when
> the memory is added (which could be used immediately).
>
> Hiding the node until hotplug also means we have hidden state. We need
> to capture pre-hotplug reported performance data so that if the node comes
> online the auto-calculation of weights is correct. But if the user has
> already switched from auto to manual mode, then a node suddenly
> appearing will have an unknown state.
>
> This is why I initially chose to just expose N_POSSIBLE entries in
> sysfs: the transition state causes hidden information, and that felt
> worse than extra entries. I suppose I should add some documentation
> somewhere that discusses this issue.
>
> I think the underlying issue you're dealing with is that the system is
> creating more nodes for you than it should.

I will reply to your next comment on this issue soon.

> ~Gregory

Rakie
On Fri, 7 Mar 2025 16:55:40 -0500 Gregory Price <gourry@gourry.net> wrote:

> On Fri, Mar 07, 2025 at 10:56:04AM -0500, Gregory Price wrote:
> >
> > I think the underlying issue you're dealing with is that the system is
> > creating more nodes for you than it should.
> >
>
> Looking into this for other reasons, I think you are right that multiple
> NUMA nodes can exist that cover the same memory - just different regions.

I understand your concerns, and I agree that the most critical issue at
the moment is that the system is generating more nodes than necessary.
We need to conduct a more thorough analysis of this problem, but a
detailed investigation will require a significant amount of time. In this
context, these patches might offer a quick solution to address the issue.

Additionally, it's important to note that not many CXL devices have been
developed yet, and their operations are not entirely optimized. Therefore,
we might encounter behaviors from CXL devices and servers that differ from
our expectations. I hope these patches can serve as a solution for
unforeseen issues.

> I can see why you would want to hide the nodes that don't actively have
> memory online, but I still have concerns about nodes that may come and
> go, and about hiding this configuration from the user until memory arrives.
>
> An example would be a DCD device, where a node could add or remove memory
> at any time. If you removed the last block of memory, the node would
> disappear - but the block could come back at any time. That seems
> problematic, as you might want to manage that node while no memory is
> present.
>
> ~Gregory

Of course, the patches may need further refinement. Therefore, I plan to
simplify the patches and remove any unnecessary modifications in the
upcoming version 2 update. Once it is ready, I would be very grateful if
you could take the time to review version 2 and share any further feedback
you might have.

Rakie
On Mon, Mar 10, 2025 at 06:03:59PM +0900, Rakie Kim wrote:
> On Fri, 7 Mar 2025 16:55:40 -0500 Gregory Price <gourry@gourry.net> wrote:
> > On Fri, Mar 07, 2025 at 10:56:04AM -0500, Gregory Price wrote:
> > >
> > > I think the underlying issue you're dealing with is that the system is
> > > creating more nodes for you than it should.
> > >
> >
> > Looking into this for other reasons, I think you are right that multiple
> > NUMA nodes can exist that cover the same memory - just different regions.
>
> I understand your concerns, and I agree that the most critical issue at
> the moment is that the system is generating more nodes than necessary.
> We need to conduct a more thorough analysis of this problem, but a
> detailed investigation will require a significant amount of time. In this
> context, these patches might offer a quick solution to address the issue.

I dug into the expected CEDT / CFMWS behaviors and had some discussions
with Dan and Jonathan - assuming your CEDT has multiple CFMWS entries
covering the same set of devices, this is the expected behavior.

https://lore.kernel.org/linux-mm/Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F/T/#m2780e47df7f0962a79182502afc99843bb046205

Basically, your BIOS is likely creating one per device and likely one
per host bridge (to allow intra-host-bridge interleave).

This puts us in an awkward state, and I need some time to consider
whether we should expose N_POSSIBLE nodes or N_MEMORY nodes.

It probably makes sense to expose N_MEMORY nodes and allow for hidden
state, as the annoying corner condition of a DCD coming and going most
likely means a user wouldn't be using weighted interleave anyway.

So if you can confirm what your CEDT says compared to the notes above, I
think we can move forward with this.

~Gregory
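To see the CEDT point concretely: each CFMWS window whose HPA range is not already described by SRAT gets its own NUMA node, which is how a BIOS that publishes one window per device plus one per host bridge ends up producing several memoryless nodes for the same hardware. The sketch below simply counts and prints the published windows; it is illustrative only and not part of the series. acpi_table_parse_cedt() and struct acpi_cedt_cfmws are standard kernel interfaces, while the module itself is hypothetical.

```c
#include <linux/acpi.h>
#include <linux/init.h>
#include <linux/printk.h>

/* Walk the CEDT and report every CXL Fixed Memory Window (CFMWS). */
static int count_cfmws(union acpi_subtable_headers *header, void *arg,
		       const unsigned long end)
{
	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
	int *count = arg;

	pr_info("CFMWS: base %#llx size %#llx\n",
		cfmws->base_hpa, cfmws->window_size);
	(*count)++;
	return 0;
}

static int __init cfmws_dump_init(void)
{
	int count = 0;

	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, count_cfmws, &count);
	pr_info("CEDT publishes %d CFMWS windows\n", count);
	return 0;
}
```

The same information can also be read from userspace by decoding an acpidump of the CEDT, which is the quickest way to confirm what the BIOS is publishing.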
On Mon, 10 Mar 2025 10:13:58 -0400 Gregory Price <gourry@gourry.net> wrote:

Hi Gregory,

I have updated version 2 of the patch series, incorporating the feedback
from you and Joshua. However, this version does not yet include updates to
the commit messages regarding the points you previously mentioned.

Your detailed explanations have been incredibly valuable in helping us
analyze the system, and I sincerely appreciate your insights.

> 2) We need to clearly define what the weight of a node will be when
>    in manual mode and a node goes (memory -> no memory -> memory)

Additionally, I will soon provide an updated document addressing this and
other points you raised in your emails.

Thank you again for your guidance and support.

Rakie

> On Mon, Mar 10, 2025 at 06:03:59PM +0900, Rakie Kim wrote:
> > On Fri, 7 Mar 2025 16:55:40 -0500 Gregory Price <gourry@gourry.net> wrote:
> > > On Fri, Mar 07, 2025 at 10:56:04AM -0500, Gregory Price wrote:
> > > >
> > > > I think the underlying issue you're dealing with is that the system is
> > > > creating more nodes for you than it should.
> > > >
> > >
> > > Looking into this for other reasons, I think you are right that multiple
> > > NUMA nodes can exist that cover the same memory - just different regions.
> >
> > I understand your concerns, and I agree that the most critical issue at
> > the moment is that the system is generating more nodes than necessary.
> > We need to conduct a more thorough analysis of this problem, but a
> > detailed investigation will require a significant amount of time. In this
> > context, these patches might offer a quick solution to address the issue.
>
> I dug into the expected CEDT / CFMWS behaviors and had some discussions
> with Dan and Jonathan - assuming your CEDT has multiple CFMWS entries
> covering the same set of devices, this is the expected behavior.
>
> https://lore.kernel.org/linux-mm/Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F/T/#m2780e47df7f0962a79182502afc99843bb046205
>
> Basically, your BIOS is likely creating one per device and likely one
> per host bridge (to allow intra-host-bridge interleave).
>
> This puts us in an awkward state, and I need some time to consider
> whether we should expose N_POSSIBLE nodes or N_MEMORY nodes.
>
> It probably makes sense to expose N_MEMORY nodes and allow for hidden
> state, as the annoying corner condition of a DCD coming and going most
> likely means a user wouldn't be using weighted interleave anyway.
>
> So if you can confirm what your CEDT says compared to the notes above, I
> think we can move forward with this.
>
> ~Gregory