diff mbox series

[2/2,v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

Message ID 20250226213518.767670-2-joshua.hahnjy@gmail.com (mailing list archive)
State New
Headers show
Series [1/2,v6] mm/mempolicy: Weighted Interleave Auto-tuning | expand

Commit Message

Joshua Hahn Feb. 26, 2025, 9:35 p.m. UTC
We should never try to allocate memory from a memoryless node. Creating a
sysfs knob to control its weighted interleave weight does not make sense,
and can be unsafe.

Only create weighted interleave weight knobs for nodes with memory.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/mempolicy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Honggyu Kim Feb. 27, 2025, 2:32 a.m. UTC | #1
Hi Joshua,

On 2/27/2025 6:35 AM, Joshua Hahn wrote:
> We should never try to allocate memory from a memoryless node. Creating a
> sysfs knob to control its weighted interleave weight does not make sense,
> and can be unsafe.
> 
> Only create weighted interleave weight knobs for nodes with memory.
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
>   mm/mempolicy.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4cc04ff8f12c..50cbb7c047fa 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>   		return err;
>   	}
>   
> -	for_each_node_state(nid, N_POSSIBLE) {

Actually, we're aware of this issue and currently trying to fix this.
In our system, we've attached 4ch of CXL memory for each socket as
follows.

         node0             node1
       +-------+   UPI   +-------+
       | CPU 0 |-+-----+-| CPU 1 |
       +-------+         +-------+
       | DRAM0 |         | DRAM1 |
       +---+---+         +---+---+
           |                 |
       +---+---+         +---+---+
       | CXL 0 |         | CXL 4 |
       +---+---+         +---+---+
       | CXL 1 |         | CXL 5 |
       +---+---+         +---+---+
       | CXL 2 |         | CXL 6 |
       +---+---+         +---+---+
       | CXL 3 |         | CXL 7 |
       +---+---+         +---+---+
         node2             node3

The 4ch of CXL memory are detected as a single NUMA node in each socket,
but it shows as follows with the current N_POSSIBLE loop.

$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
node0 node1 node2 node3 node4 node5
node6 node7 node8 node9 node10 node11

> +	for_each_node_state(nid, N_MEMORY) {

But using N_MEMORY doesn't fix this problem and it hides the entire CXL
memory nodes in our system because the CXL memory isn't detected at this
point of creating node*.  Maybe there is some difference when multiple
CXL memory is detected as a single node.

We have to create more nodes when CXL memory is detected later.  In 
addition, this part can be changed to "for_each_online_node(nid)"
although N_MEMORY is also fine here.

We've internally fixed it using a memory hotpluging callback so we can
upload another working version later.

Do you mind if we continue fixing this work?

Thanks,
Honggyu

>   		err = add_weight_node(nid, wi_kobj);
>   		if (err) {
>   			pr_err("failed to add sysfs [node%d]\n", nid);
Honggyu Kim Feb. 27, 2025, 3:20 a.m. UTC | #2
On 2/27/2025 11:32 AM, Honggyu Kim wrote:
> Hi Joshua,
> 
> On 2/27/2025 6:35 AM, Joshua Hahn wrote:
>> We should never try to allocate memory from a memoryless node. Creating a
>> sysfs knob to control its weighted interleave weight does not make sense,
>> and can be unsafe.
>>
>> Only create weighted interleave weight knobs for nodes with memory.
>>
>> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>> ---
>>   mm/mempolicy.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 4cc04ff8f12c..50cbb7c047fa 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct 
>> kobject *root_kobj)
>>           return err;
>>       }
>> -    for_each_node_state(nid, N_POSSIBLE) {
> 
> Actually, we're aware of this issue and currently trying to fix this.
> In our system, we've attached 4ch of CXL memory for each socket as
> follows.
> 
>          node0             node1
>        +-------+   UPI   +-------+
>        | CPU 0 |-+-----+-| CPU 1 |
>        +-------+         +-------+
>        | DRAM0 |         | DRAM1 |
>        +---+---+         +---+---+
>            |                 |
>        +---+---+         +---+---+
>        | CXL 0 |         | CXL 4 |
>        +---+---+         +---+---+
>        | CXL 1 |         | CXL 5 |
>        +---+---+         +---+---+
>        | CXL 2 |         | CXL 6 |
>        +---+---+         +---+---+
>        | CXL 3 |         | CXL 7 |
>        +---+---+         +---+---+
>          node2             node3
> 
> The 4ch of CXL memory are detected as a single NUMA node in each socket,
> but it shows as follows with the current N_POSSIBLE loop.
> 
> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
> node0 node1 node2 node3 node4 node5
> node6 node7 node8 node9 node10 node11
> 
>> +    for_each_node_state(nid, N_MEMORY) {

Thinking it again, we can leave it as a separate patch but add our patch 
on top of it.

The only concern I have is having only N_MEMORY patch hides weight
setting knobs for CXL memory and it makes there is no way to set weight 
values to CXL memory in my system.

IMHO, this and our patch is better to be submitted together.

Thanks,
Honggyu

> 
> But using N_MEMORY doesn't fix this problem and it hides the entire CXL
> memory nodes in our system because the CXL memory isn't detected at this
> point of creating node*.  Maybe there is some difference when multiple
> CXL memory is detected as a single node.
> 
> We have to create more nodes when CXL memory is detected later.  In 
> addition, this part can be changed to "for_each_online_node(nid)"
> although N_MEMORY is also fine here.
> 
> We've internally fixed it using a memory hotpluging callback so we can
> upload another working version later.
> 
> Do you mind if we continue fixing this work?
> 
> Thanks,
> Honggyu
> 
>>           err = add_weight_node(nid, wi_kobj);
>>           if (err) {
>>               pr_err("failed to add sysfs [node%d]\n", nid);
>
diff mbox series

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4cc04ff8f12c..50cbb7c047fa 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3721,7 +3721,7 @@  static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
-	for_each_node_state(nid, N_POSSIBLE) {
+	for_each_node_state(nid, N_MEMORY) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {
 			pr_err("failed to add sysfs [node%d]\n", nid);