Message ID: 1585420282-25630-1-git-send-email-Hoan@os.amperecomputing.com (mailing list archive)
Series: mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
On 03/28/20 at 11:31am, Hoan Tran wrote: > In NUMA layout which nodes have memory ranges that span across other nodes, > the mm driver can detect the memory node id incorrectly. > > For example, with layout below > Node 0 address: 0000 xxxx 0000 xxxx > Node 1 address: xxxx 1111 xxxx 1111 Sorry, I read this example several times, but still don't get what it means. Can it be given with real hex number address as an exmaple? I mean just using the memory layout you have seen from some systems. The change looks interesting though. > > Note: > - Memory from low to high > - 0/1: Node id > - x: Invalid memory of a node > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > config, mm only checks the memory validity but not the node id. > Because of that, Node 1 also detects the memory from node 0 as below > when it scans from the start address to the end address of node 1. > > Node 0 address: 0000 xxxx xxxx xxxx > Node 1 address: xxxx 1111 1111 1111 > > This layout could occur on any architecture. Most of them enables > this config by default with CONFIG_NUMA. This patch, by default, enables > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > v3: > * Revise the patch description > > V2: > * Revise the patch description > > Hoan Tran (5): > mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA > powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > arch/powerpc/Kconfig | 9 --------- > arch/s390/Kconfig | 8 -------- > arch/sparc/Kconfig | 9 --------- > arch/x86/Kconfig | 9 --------- > mm/page_alloc.c | 2 +- > 5 files changed, 1 insertion(+), 36 deletions(-) > > -- > 1.8.3.1 > >
On Sat 28-03-20 11:31:17, Hoan Tran wrote: > In NUMA layout which nodes have memory ranges that span across other nodes, > the mm driver can detect the memory node id incorrectly. > > For example, with layout below > Node 0 address: 0000 xxxx 0000 xxxx > Node 1 address: xxxx 1111 xxxx 1111 > > Note: > - Memory from low to high > - 0/1: Node id > - x: Invalid memory of a node > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > config, mm only checks the memory validity but not the node id. > Because of that, Node 1 also detects the memory from node 0 as below > when it scans from the start address to the end address of node 1. > > Node 0 address: 0000 xxxx xxxx xxxx > Node 1 address: xxxx 1111 1111 1111 > > This layout could occur on any architecture. Most of them enables > this config by default with CONFIG_NUMA. This patch, by default, enables > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. I am not opposed to this at all. It reduces the config space and that is a good thing on its own. The history has shown that meory layout might be really wild wrt NUMA. The config is only used for early_pfn_in_nid which is clearly an overkill. Your description doesn't really explain why this is safe though. The history of this config is somehow messy, though. Mike has tried to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any reasoning what so ever. This doesn't make it really easy see whether reasons for reintroduction are still there. Maybe there are some subtle dependencies. I do not see any TBH but that might be burried deep in an arch specific code. > v3: > * Revise the patch description > > V2: > * Revise the patch description > > Hoan Tran (5): > mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA > powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > arch/powerpc/Kconfig | 9 --------- > arch/s390/Kconfig | 8 -------- > arch/sparc/Kconfig | 9 --------- > arch/x86/Kconfig | 9 --------- > mm/page_alloc.c | 2 +- > 5 files changed, 1 insertion(+), 36 deletions(-) > > -- > 1.8.3.1 >
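For reference, the check that CONFIG_NODES_SPAN_OTHER_NODES gates is roughly the following (a paraphrased sketch of mm/page_alloc.c from around this time, not a verbatim quote):

	#ifdef CONFIG_NODES_SPAN_OTHER_NODES
	/* Only safe early in boot, while initialisation is single threaded. */
	static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int nid)
	{
		return early_pfn_to_nid(pfn) == nid;	/* real node-id lookup */
	}
	#else
	static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int nid)
	{
		return true;	/* constant: every pfn is assumed to belong to nid */
	}
	#endif

The series effectively makes the first variant the one built whenever CONFIG_NUMA is set, instead of leaving the choice to per-architecture Kconfig.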
On Sun 29-03-20 08:19:24, Baoquan He wrote:
> On 03/28/20 at 11:31am, Hoan Tran wrote:
> > In NUMA layout which nodes have memory ranges that span across other nodes,
> > the mm driver can detect the memory node id incorrectly.
> >
> > For example, with layout below
> > Node 0 address: 0000 xxxx 0000 xxxx
> > Node 1 address: xxxx 1111 xxxx 1111
>
> Sorry, I read this example several times, but still don't get what it
> means. Can it be given with real hex number address as an exmaple? I
> mean just using the memory layout you have seen from some systems. The
> change looks interesting though.

Does this make it more clear?

physical address range and its node association
[0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1]
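To put made-up, purely illustrative numbers on the same picture (not taken from a real machine):

	Node 0: 0x0000000000-0x003fffffff and 0x0080000000-0x00bfffffff
	Node 1: 0x0040000000-0x007fffffff and 0x00c0000000-0x00ffffffff

Node 1's start-to-end span, 0x0040000000-0x00ffffffff, then covers node 0's second range, which is what the 0/1/x diagram in the cover letter is describing.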
On 03/30/20 at 09:44am, Michal Hocko wrote: > On Sun 29-03-20 08:19:24, Baoquan He wrote: > > On 03/28/20 at 11:31am, Hoan Tran wrote: > > > In NUMA layout which nodes have memory ranges that span across other nodes, > > > the mm driver can detect the memory node id incorrectly. > > > > > > For example, with layout below > > > Node 0 address: 0000 xxxx 0000 xxxx > > > Node 1 address: xxxx 1111 xxxx 1111 > > > > Sorry, I read this example several times, but still don't get what it > > means. Can it be given with real hex number address as an exmaple? I > > mean just using the memory layout you have seen from some systems. The > > change looks interesting though. > > Does this make it more clear? > physical address range and its node associaion > [0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1] I later read it again, have got what Hoan is trying to say, thanks. I think the change in this patchset makes sense, still have some concern though, let me add comment in other thread.
On 03/30/20 at 09:42am, Michal Hocko wrote: > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > In NUMA layout which nodes have memory ranges that span across other nodes, > > the mm driver can detect the memory node id incorrectly. > > > > For example, with layout below > > Node 0 address: 0000 xxxx 0000 xxxx > > Node 1 address: xxxx 1111 xxxx 1111 > > > > Note: > > - Memory from low to high > > - 0/1: Node id > > - x: Invalid memory of a node > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > config, mm only checks the memory validity but not the node id. > > Because of that, Node 1 also detects the memory from node 0 as below > > when it scans from the start address to the end address of node 1. > > > > Node 0 address: 0000 xxxx xxxx xxxx > > Node 1 address: xxxx 1111 1111 1111 > > > > This layout could occur on any architecture. Most of them enables > > this config by default with CONFIG_NUMA. This patch, by default, enables > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > I am not opposed to this at all. It reduces the config space and that is > a good thing on its own. The history has shown that meory layout might > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > which is clearly an overkill. > > Your description doesn't really explain why this is safe though. The > history of this config is somehow messy, though. Mike has tried > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > reasoning what so ever. This doesn't make it really easy see whether > reasons for reintroduction are still there. Maybe there are some subtle > dependencies. I do not see any TBH but that might be burried deep in an > arch specific code. Yeah, since early_pfnnid_cache was added, we do not need worry about the performance. But when I read the mem init code on x86 again, I do see there are codes to handle the node overlapping, e.g in numa_cleanup_meminfo(), when store node id into memblock. But the thing is if we have encountered the node overlapping, we just return ahead of time, leave something uninitialized. I am wondering if the system with node overlapping can still run heathily. > > > v3: > > * Revise the patch description > > > > V2: > > * Revise the patch description > > > > Hoan Tran (5): > > mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA > > powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > > > arch/powerpc/Kconfig | 9 --------- > > arch/s390/Kconfig | 8 -------- > > arch/sparc/Kconfig | 9 --------- > > arch/x86/Kconfig | 9 --------- > > mm/page_alloc.c | 2 +- > > 5 files changed, 1 insertion(+), 36 deletions(-) > > > > -- > > 1.8.3.1 > > > > -- > Michal Hocko > SUSE Labs >
On 03/30/20 at 04:16pm, Baoquan He wrote: > On 03/30/20 at 09:42am, Michal Hocko wrote: > > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > > In NUMA layout which nodes have memory ranges that span across other nodes, > > > the mm driver can detect the memory node id incorrectly. > > > > > > For example, with layout below > > > Node 0 address: 0000 xxxx 0000 xxxx > > > Node 1 address: xxxx 1111 xxxx 1111 > > > > > > Note: > > > - Memory from low to high > > > - 0/1: Node id > > > - x: Invalid memory of a node > > > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > > config, mm only checks the memory validity but not the node id. > > > Because of that, Node 1 also detects the memory from node 0 as below > > > when it scans from the start address to the end address of node 1. > > > > > > Node 0 address: 0000 xxxx xxxx xxxx > > > Node 1 address: xxxx 1111 1111 1111 > > > > > > This layout could occur on any architecture. Most of them enables > > > this config by default with CONFIG_NUMA. This patch, by default, enables > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > > > I am not opposed to this at all. It reduces the config space and that is > > a good thing on its own. The history has shown that meory layout might > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > > which is clearly an overkill. > > > > Your description doesn't really explain why this is safe though. The > > history of this config is somehow messy, though. Mike has tried > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > > reasoning what so ever. This doesn't make it really easy see whether > > reasons for reintroduction are still there. Maybe there are some subtle > > dependencies. I do not see any TBH but that might be burried deep in an > > arch specific code. > > Yeah, since early_pfnnid_cache was added, we do not need worry about the > performance. But when I read the mem init code on x86 again, I do see there > are codes to handle the node overlapping, e.g in numa_cleanup_meminfo(), > when store node id into memblock. But the thing is if we have > encountered the node overlapping, we just return ahead of time, leave > something uninitialized. I am wondering if the system with node > overlapping can still run heathily. Ok, I didn't read code carefully. That is handling case where memblock with different node id overlap, it needs return. In the example Hoan gave, it has no problem, system can run well. Please ignore above comment.
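For context, the x86 handling referred to above is roughly this check in numa_cleanup_meminfo() (paraphrased from arch/x86/mm/numa.c; the exact form and message text may differ):

	/* blocks bi and bj overlap */
	if (bi->end > bj->start && bi->start < bj->end) {
		if (bi->nid != bj->nid) {
			/* cross-node overlap: give up on the node map entirely */
			pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
			       bi->nid, bi->start, bi->end - 1,
			       bj->nid, bj->start, bj->end - 1);
			return -EINVAL;
		}
		/* same-node overlaps are only warned about and then merged */
	}

So the early return only happens when two entries claiming different nodes overlap, not for the interleaved-but-disjoint layout from the cover letter.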
On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > In NUMA layout which nodes have memory ranges that span across other nodes, > > the mm driver can detect the memory node id incorrectly. > > > > For example, with layout below > > Node 0 address: 0000 xxxx 0000 xxxx > > Node 1 address: xxxx 1111 xxxx 1111 > > > > Note: > > - Memory from low to high > > - 0/1: Node id > > - x: Invalid memory of a node > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > config, mm only checks the memory validity but not the node id. > > Because of that, Node 1 also detects the memory from node 0 as below > > when it scans from the start address to the end address of node 1. > > > > Node 0 address: 0000 xxxx xxxx xxxx > > Node 1 address: xxxx 1111 1111 1111 > > > > This layout could occur on any architecture. Most of them enables > > this config by default with CONFIG_NUMA. This patch, by default, enables > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > I am not opposed to this at all. It reduces the config space and that is > a good thing on its own. The history has shown that meory layout might > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > which is clearly an overkill. > > Your description doesn't really explain why this is safe though. The > history of this config is somehow messy, though. Mike has tried > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > reasoning what so ever. This doesn't make it really easy see whether > reasons for reintroduction are still there. Maybe there are some subtle > dependencies. I do not see any TBH but that might be burried deep in an > arch specific code. Well, back then early_pfn_in_nid() was arch-dependant, today everyone except ia64 rely on HAVE_MEMBLOCK_NODE_MAP. So, if the memblock node map is correct, that using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES would only mean that early_pfn_in_nid() will cost several cycles more on architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64 and sh). Agian, ia64 is an exception here. > > v3: > > * Revise the patch description > > > > V2: > > * Revise the patch description > > > > Hoan Tran (5): > > mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA > > powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > > > arch/powerpc/Kconfig | 9 --------- > > arch/s390/Kconfig | 8 -------- > > arch/sparc/Kconfig | 9 --------- > > arch/x86/Kconfig | 9 --------- > > mm/page_alloc.c | 2 +- > > 5 files changed, 1 insertion(+), 36 deletions(-) > > > > -- > > 1.8.3.1 > > > > -- > Michal Hocko > SUSE Labs
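Concretely, the lookup that would now also run on arm64 and sh is roughly the following (a sketch of the helper in mm/page_alloc.c from around this time, not a verbatim quote):

	int __meminit early_pfn_to_nid(unsigned long pfn)
	{
		static DEFINE_SPINLOCK(early_pfn_lock);
		int nid;

		spin_lock(&early_pfn_lock);
		/*
		 * __early_pfn_to_nid() first checks the cached
		 * [last_start, last_end) -> last_nid range and only falls back
		 * to a binary search of the memblock node map on a cache miss.
		 */
		nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
		if (nid < 0)
			nid = first_online_node;
		spin_unlock(&early_pfn_lock);

		return nid;
	}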
On 03/30/20 at 09:42am, Michal Hocko wrote: > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > In NUMA layout which nodes have memory ranges that span across other nodes, > > the mm driver can detect the memory node id incorrectly. > > > > For example, with layout below > > Node 0 address: 0000 xxxx 0000 xxxx > > Node 1 address: xxxx 1111 xxxx 1111 > > > > Note: > > - Memory from low to high > > - 0/1: Node id > > - x: Invalid memory of a node > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > config, mm only checks the memory validity but not the node id. > > Because of that, Node 1 also detects the memory from node 0 as below > > when it scans from the start address to the end address of node 1. > > > > Node 0 address: 0000 xxxx xxxx xxxx > > Node 1 address: xxxx 1111 1111 1111 > > > > This layout could occur on any architecture. Most of them enables > > this config by default with CONFIG_NUMA. This patch, by default, enables > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > I am not opposed to this at all. It reduces the config space and that is > a good thing on its own. The history has shown that meory layout might > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > which is clearly an overkill. > > Your description doesn't really explain why this is safe though. The > history of this config is somehow messy, though. Mike has tried > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > reasoning what so ever. This doesn't make it really easy see whether > reasons for reintroduction are still there. Maybe there are some subtle > dependencies. I do not see any TBH but that might be burried deep in an > arch specific code. Since on all ARCHes NODES_SPAN_OTHER_NODES has dependency on NUMA, replacing it with CONFIG_NUMA seems no risk. Just for those ARCHes which don't have CONFIG_NODES_SPAN_OTHER_NODES before, it involves a tiny performance degradation. Besides, s390 has removed support of NODES_SPAN_OTHER_NODES already. commit 701dc81e7412daaf3c5bf4bc55d35c8b1525112a Author: Heiko Carstens <heiko.carstens@de.ibm.com> Date: Wed Feb 19 13:29:15 2020 +0100 s390/mm: remove fake numa support > > > v3: > > * Revise the patch description > > > > V2: > > * Revise the patch description > > > > Hoan Tran (5): > > mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA > > powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES > > > > arch/powerpc/Kconfig | 9 --------- > > arch/s390/Kconfig | 8 -------- > > arch/sparc/Kconfig | 9 --------- > > arch/x86/Kconfig | 9 --------- > > mm/page_alloc.c | 2 +- > > 5 files changed, 1 insertion(+), 36 deletions(-) > > > > -- > > 1.8.3.1 > > > > -- > Michal Hocko > SUSE Labs >
On Mon 30-03-20 12:21:27, Mike Rapoport wrote: > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > > In NUMA layout which nodes have memory ranges that span across other nodes, > > > the mm driver can detect the memory node id incorrectly. > > > > > > For example, with layout below > > > Node 0 address: 0000 xxxx 0000 xxxx > > > Node 1 address: xxxx 1111 xxxx 1111 > > > > > > Note: > > > - Memory from low to high > > > - 0/1: Node id > > > - x: Invalid memory of a node > > > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > > config, mm only checks the memory validity but not the node id. > > > Because of that, Node 1 also detects the memory from node 0 as below > > > when it scans from the start address to the end address of node 1. > > > > > > Node 0 address: 0000 xxxx xxxx xxxx > > > Node 1 address: xxxx 1111 1111 1111 > > > > > > This layout could occur on any architecture. Most of them enables > > > this config by default with CONFIG_NUMA. This patch, by default, enables > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > > > I am not opposed to this at all. It reduces the config space and that is > > a good thing on its own. The history has shown that meory layout might > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > > which is clearly an overkill. > > > > Your description doesn't really explain why this is safe though. The > > history of this config is somehow messy, though. Mike has tried > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > > reasoning what so ever. This doesn't make it really easy see whether > > reasons for reintroduction are still there. Maybe there are some subtle > > dependencies. I do not see any TBH but that might be burried deep in an > > arch specific code. > > Well, back then early_pfn_in_nid() was arch-dependant, today everyone > except ia64 rely on HAVE_MEMBLOCK_NODE_MAP. What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would really love to see that thing go away. It is causing problems when people try to use memblock api. > So, if the memblock node map > is correct, that using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES > would only mean that early_pfn_in_nid() will cost several cycles more on > architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64 > and sh). Do we have any idea on how much of an overhead that is? Because this is per each pfn so it can accumulate a lot! > Agian, ia64 is an exception here. Thanks for the clarification!
On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote: > On Mon 30-03-20 12:21:27, Mike Rapoport wrote: > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > > > In NUMA layout which nodes have memory ranges that span across other nodes, > > > > the mm driver can detect the memory node id incorrectly. > > > > > > > > For example, with layout below > > > > Node 0 address: 0000 xxxx 0000 xxxx > > > > Node 1 address: xxxx 1111 xxxx 1111 > > > > > > > > Note: > > > > - Memory from low to high > > > > - 0/1: Node id > > > > - x: Invalid memory of a node > > > > > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > > > config, mm only checks the memory validity but not the node id. > > > > Because of that, Node 1 also detects the memory from node 0 as below > > > > when it scans from the start address to the end address of node 1. > > > > > > > > Node 0 address: 0000 xxxx xxxx xxxx > > > > Node 1 address: xxxx 1111 1111 1111 > > > > > > > > This layout could occur on any architecture. Most of them enables > > > > this config by default with CONFIG_NUMA. This patch, by default, enables > > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > > > > > I am not opposed to this at all. It reduces the config space and that is > > > a good thing on its own. The history has shown that meory layout might > > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > > > which is clearly an overkill. > > > > > > Your description doesn't really explain why this is safe though. The > > > history of this config is somehow messy, though. Mike has tried > > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > > > reasoning what so ever. This doesn't make it really easy see whether > > > reasons for reintroduction are still there. Maybe there are some subtle > > > dependencies. I do not see any TBH but that might be burried deep in an > > > arch specific code. > > > > Well, back then early_pfn_in_nid() was arch-dependant, today everyone > > except ia64 rely on HAVE_MEMBLOCK_NODE_MAP. > > What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would > really love to see that thing go away. It is causing problems when > people try to use memblock api. Sorry, my bad, ia64 does not have NODES_SPAN_OTHER_NODES, but it does have HAVE_MEMBLOCK_NODE_MAP. I remember I've tried killing HAVE_MEMBLOCK_NODE_MAP, but I've run into some problems and then I've got distracted. I too would like to have HAVE_MEMBLOCK_NODE_MAP go away, maybe I'll take another look at it. > > So, if the memblock node map > > is correct, that using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES > > would only mean that early_pfn_in_nid() will cost several cycles more on > > architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64 > > and sh). > > Do we have any idea on how much of an overhead that is? Because this is > per each pfn so it can accumulate a lot! It's O(log(N)) where N is the amount of the memory banks (ie. memblock.memory.cnt) > > Agian, ia64 is an exception here. > > Thanks for the clarification! > -- > Michal Hocko > SUSE Labs
On 03/30/20 at 01:26pm, Mike Rapoport wrote: > On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote: > > On Mon 30-03-20 12:21:27, Mike Rapoport wrote: > > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > > > > In NUMA layout which nodes have memory ranges that span across other nodes, > > > > > the mm driver can detect the memory node id incorrectly. > > > > > > > > > > For example, with layout below > > > > > Node 0 address: 0000 xxxx 0000 xxxx > > > > > Node 1 address: xxxx 1111 xxxx 1111 > > > > > > > > > > Note: > > > > > - Memory from low to high > > > > > - 0/1: Node id > > > > > - x: Invalid memory of a node > > > > > > > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > > > > config, mm only checks the memory validity but not the node id. > > > > > Because of that, Node 1 also detects the memory from node 0 as below > > > > > when it scans from the start address to the end address of node 1. > > > > > > > > > > Node 0 address: 0000 xxxx xxxx xxxx > > > > > Node 1 address: xxxx 1111 1111 1111 > > > > > > > > > > This layout could occur on any architecture. Most of them enables > > > > > this config by default with CONFIG_NUMA. This patch, by default, enables > > > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > > > > > > > I am not opposed to this at all. It reduces the config space and that is > > > > a good thing on its own. The history has shown that meory layout might > > > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > > > > which is clearly an overkill. > > > > > > > > Your description doesn't really explain why this is safe though. The > > > > history of this config is somehow messy, though. Mike has tried > > > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > > > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > > > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > > > > reasoning what so ever. This doesn't make it really easy see whether > > > > reasons for reintroduction are still there. Maybe there are some subtle > > > > dependencies. I do not see any TBH but that might be burried deep in an > > > > arch specific code. > > > > > > Well, back then early_pfn_in_nid() was arch-dependant, today everyone > > > except ia64 rely on HAVE_MEMBLOCK_NODE_MAP. > > > > What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would > > really love to see that thing go away. It is causing problems when > > people try to use memblock api. > > Sorry, my bad, ia64 does not have NODES_SPAN_OTHER_NODES, but it does have > HAVE_MEMBLOCK_NODE_MAP. > > I remember I've tried killing HAVE_MEMBLOCK_NODE_MAP, but I've run into > some problems and then I've got distracted. I too would like to have > HAVE_MEMBLOCK_NODE_MAP go away, maybe I'll take another look at it. > > > > So, if the memblock node map > > > is correct, that using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES > > > would only mean that early_pfn_in_nid() will cost several cycles more on > > > architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64 > > > and sh). > > > > Do we have any idea on how much of an overhead that is? Because this is > > per each pfn so it can accumulate a lot! > > It's O(log(N)) where N is the amount of the memory banks (ie. memblock.memory.cnt) This is for the Node id searching. But early_pfn_in_nid() is calling for each pfn, this is the big one, I think. 
Otherwise, it may be optimized as no-op. > > > > Agian, ia64 is an exception here. > > > > Thanks for the clarification! > > -- > > Michal Hocko > > SUSE Labs > > -- > Sincerely yours, > Mike. > >
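As a rough worked example of the cost being discussed: with the early_pfnnid_cache, the binary search only runs on a cache miss, i.e. roughly once per contiguous memory range rather than once per pfn. On a hypothetical machine with 32 memblock ranges that is at most about log2(32) = 5 comparisons per miss, while the sequential pfn walk in memmap_init_zone() keeps hitting the cached range for all the pfns in between. Without CONFIG_NODES_SPAN_OTHER_NODES the check compiles down to a constant, which is the "optimized as no-op" case mentioned above.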
On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > In NUMA layout which nodes have memory ranges that span across other nodes, > > the mm driver can detect the memory node id incorrectly. > > > > For example, with layout below > > Node 0 address: 0000 xxxx 0000 xxxx > > Node 1 address: xxxx 1111 xxxx 1111 > > > > Note: > > - Memory from low to high > > - 0/1: Node id > > - x: Invalid memory of a node > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > config, mm only checks the memory validity but not the node id. > > Because of that, Node 1 also detects the memory from node 0 as below > > when it scans from the start address to the end address of node 1. > > > > Node 0 address: 0000 xxxx xxxx xxxx > > Node 1 address: xxxx 1111 1111 1111 > > > > This layout could occur on any architecture. Most of them enables > > this config by default with CONFIG_NUMA. This patch, by default, enables > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > I am not opposed to this at all. It reduces the config space and that is > a good thing on its own. The history has shown that meory layout might > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > which is clearly an overkill. > > Your description doesn't really explain why this is safe though. The > history of this config is somehow messy, though. Mike has tried > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > reasoning what so ever. This doesn't make it really easy see whether > reasons for reintroduction are still there. Maybe there are some subtle > dependencies. I do not see any TBH but that might be burried deep in an > arch specific code. I've looked at this a bit more and it seems that the check for early_pfn_in_nid() in memmap_init_zone() can be simply removed. The commits you've mentioned were way before the addition of HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone sizes and boundaries based on the memblock node map. So, the memmap_init_zone() is called when zone boundaries are already within a node. I don't have access to machines with memory layout that required this check at the first place, so if anybody who does could test the change below on such machine it would be great. 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..6d3eb0901864 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5908,10 +5908,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			pfn = next_pfn(pfn);
 			continue;
 		}
-		if (!early_pfn_in_nid(pfn, nid)) {
-			pfn++;
-			continue;
-		}
 		if (overlap_memmap_init(zone, &pfn))
 			continue;
 		if (defer_init(nid, pfn, end_pfn))

> > v3:
> > * Revise the patch description
> >
> > V2:
> > * Revise the patch description
> >
> > Hoan Tran (5):
> >   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
> >   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >
> >  arch/powerpc/Kconfig | 9 ---------
> >  arch/s390/Kconfig    | 8 --------
> >  arch/sparc/Kconfig   | 9 ---------
> >  arch/x86/Kconfig     | 9 ---------
> >  mm/page_alloc.c      | 2 +-
> >  5 files changed, 1 insertion(+), 36 deletions(-)
> >
> > --
> > 1.8.3.1
> >
> --
> Michal Hocko
> SUSE Labs
On Mon 30-03-20 20:51:00, Mike Rapoport wrote: > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > > In NUMA layout which nodes have memory ranges that span across other nodes, > > > the mm driver can detect the memory node id incorrectly. > > > > > > For example, with layout below > > > Node 0 address: 0000 xxxx 0000 xxxx > > > Node 1 address: xxxx 1111 xxxx 1111 > > > > > > Note: > > > - Memory from low to high > > > - 0/1: Node id > > > - x: Invalid memory of a node > > > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > > config, mm only checks the memory validity but not the node id. > > > Because of that, Node 1 also detects the memory from node 0 as below > > > when it scans from the start address to the end address of node 1. > > > > > > Node 0 address: 0000 xxxx xxxx xxxx > > > Node 1 address: xxxx 1111 1111 1111 > > > > > > This layout could occur on any architecture. Most of them enables > > > this config by default with CONFIG_NUMA. This patch, by default, enables > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > > > I am not opposed to this at all. It reduces the config space and that is > > a good thing on its own. The history has shown that meory layout might > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > > which is clearly an overkill. > > > > Your description doesn't really explain why this is safe though. The > > history of this config is somehow messy, though. Mike has tried > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > > reasoning what so ever. This doesn't make it really easy see whether > > reasons for reintroduction are still there. Maybe there are some subtle > > dependencies. I do not see any TBH but that might be burried deep in an > > arch specific code. > > I've looked at this a bit more and it seems that the check for > early_pfn_in_nid() in memmap_init_zone() can be simply removed. > > The commits you've mentioned were way before the addition of > HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone > sizes and boundaries based on the memblock node map. > So, the memmap_init_zone() is called when zone boundaries are already > within a node. But zones from different nodes might overlap in the pfn range. And this check is there to skip over those overlapping areas. The only way to skip over this check I can see is to do a different pfn walk and go through memblock ranges which are guaranteed to belong to a single node.
On Mon, Mar 30, 2020 at 08:23:01PM +0200, Michal Hocko wrote: > On Mon 30-03-20 20:51:00, Mike Rapoport wrote: > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > > > In NUMA layout which nodes have memory ranges that span across other nodes, > > > > the mm driver can detect the memory node id incorrectly. > > > > > > > > For example, with layout below > > > > Node 0 address: 0000 xxxx 0000 xxxx > > > > Node 1 address: xxxx 1111 xxxx 1111 > > > > > > > > Note: > > > > - Memory from low to high > > > > - 0/1: Node id > > > > - x: Invalid memory of a node > > > > > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > > > config, mm only checks the memory validity but not the node id. > > > > Because of that, Node 1 also detects the memory from node 0 as below > > > > when it scans from the start address to the end address of node 1. > > > > > > > > Node 0 address: 0000 xxxx xxxx xxxx > > > > Node 1 address: xxxx 1111 1111 1111 > > > > > > > > This layout could occur on any architecture. Most of them enables > > > > this config by default with CONFIG_NUMA. This patch, by default, enables > > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > > > > > I am not opposed to this at all. It reduces the config space and that is > > > a good thing on its own. The history has shown that meory layout might > > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > > > which is clearly an overkill. > > > > > > Your description doesn't really explain why this is safe though. The > > > history of this config is somehow messy, though. Mike has tried > > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > > > reasoning what so ever. This doesn't make it really easy see whether > > > reasons for reintroduction are still there. Maybe there are some subtle > > > dependencies. I do not see any TBH but that might be burried deep in an > > > arch specific code. > > > > I've looked at this a bit more and it seems that the check for > > early_pfn_in_nid() in memmap_init_zone() can be simply removed. > > > > The commits you've mentioned were way before the addition of > > HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone > > sizes and boundaries based on the memblock node map. > > So, the memmap_init_zone() is called when zone boundaries are already > > within a node. > > But zones from different nodes might overlap in the pfn range. And this > check is there to skip over those overlapping areas. Maybe I mis-read the code, but I don't see how this could happen. In the HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls calculate_node_totalpages() that ensures that node->node_zones are entirely within the node because this is checked in zone_spanned_pages_in_node(). So, for zones from different nodes to overlap in the pfn range the nodes themself should overlap. Is this even possible? > The only way to skip over this check I can see is to do a different pfn > walk and go through memblock ranges which are guaranteed to belong to a > single node. > -- > Michal Hocko > SUSE Labs
On Tue 31-03-20 11:14:23, Mike Rapoport wrote: > On Mon, Mar 30, 2020 at 08:23:01PM +0200, Michal Hocko wrote: > > On Mon 30-03-20 20:51:00, Mike Rapoport wrote: > > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote: > > > > > In NUMA layout which nodes have memory ranges that span across other nodes, > > > > > the mm driver can detect the memory node id incorrectly. > > > > > > > > > > For example, with layout below > > > > > Node 0 address: 0000 xxxx 0000 xxxx > > > > > Node 1 address: xxxx 1111 xxxx 1111 > > > > > > > > > > Note: > > > > > - Memory from low to high > > > > > - 0/1: Node id > > > > > - x: Invalid memory of a node > > > > > > > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES > > > > > config, mm only checks the memory validity but not the node id. > > > > > Because of that, Node 1 also detects the memory from node 0 as below > > > > > when it scans from the start address to the end address of node 1. > > > > > > > > > > Node 0 address: 0000 xxxx xxxx xxxx > > > > > Node 1 address: xxxx 1111 1111 1111 > > > > > > > > > > This layout could occur on any architecture. Most of them enables > > > > > this config by default with CONFIG_NUMA. This patch, by default, enables > > > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA. > > > > > > > > I am not opposed to this at all. It reduces the config space and that is > > > > a good thing on its own. The history has shown that meory layout might > > > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid > > > > which is clearly an overkill. > > > > > > > > Your description doesn't really explain why this is safe though. The > > > > history of this config is somehow messy, though. Mike has tried > > > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent > > > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd > > > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any > > > > reasoning what so ever. This doesn't make it really easy see whether > > > > reasons for reintroduction are still there. Maybe there are some subtle > > > > dependencies. I do not see any TBH but that might be burried deep in an > > > > arch specific code. > > > > > > I've looked at this a bit more and it seems that the check for > > > early_pfn_in_nid() in memmap_init_zone() can be simply removed. > > > > > > The commits you've mentioned were way before the addition of > > > HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone > > > sizes and boundaries based on the memblock node map. > > > So, the memmap_init_zone() is called when zone boundaries are already > > > within a node. > > > > But zones from different nodes might overlap in the pfn range. And this > > check is there to skip over those overlapping areas. > > Maybe I mis-read the code, but I don't see how this could happen. In the > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls > calculate_node_totalpages() that ensures that node->node_zones are entirely > within the node because this is checked in zone_spanned_pages_in_node(). zone_spanned_pages_in_node does chech the zone boundaries are within the node boundaries. But that doesn't really tell anything about other potential zones interleaving with the physical memory range. zone->spanned_pages simply gives the physical range for the zone including holes. 
Interleaving nodes are essentially a hole (__absent_pages_in_range is going to skip those).

That means that free_area_init_core simply goes over the whole physical zone range including holes and that is why we need to check both for physical and logical holes (aka other nodes).

The life would be so much easier if the whole thing would simply iterate over memblocks...
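A small made-up example of the interleaving-as-a-hole point: if node 0 owns pfns 0x0000-0x3fff and 0x8000-0xbfff while node 1 owns 0x4000-0x7fff, then node 0's zone has spanned_pages = 0xc000 (node 1's range counts as a logical hole inside the span) but present_pages = 0x8000 (absent pages, including the interleaving node, are subtracted). memmap_init_zone() walks the whole spanned range, so without early_pfn_in_nid() it would also initialise pfns 0x4000-0x7fff on behalf of node 0 even though they belong to node 1.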
Hi Michal,

On 03/31/20 at 10:55am, Michal Hocko wrote:
> On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > Maybe I mis-read the code, but I don't see how this could happen. In the
> > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > calculate_node_totalpages() that ensures that node->node_zones are entirely
> > within the node because this is checked in zone_spanned_pages_in_node().
>
> zone_spanned_pages_in_node does chech the zone boundaries are within the
> node boundaries. But that doesn't really tell anything about other
> potential zones interleaving with the physical memory range.
> zone->spanned_pages simply gives the physical range for the zone
> including holes. Interleaving nodes are essentially a hole
> (__absent_pages_in_range is going to skip those).
>
> That means that free_area_init_core simply goes over the whole
> physical zone range including holes and that is why we need to check
> both for physical and logical holes (aka other nodes).
>
> The life would be so much easier if the whole thing would simply iterate
> over memblocks...

The memblock iterating sounds a great idea. I tried with putting the
memblock iterating in the upper layer, memmap_init(), which is used for
boot mem only anyway. Do you think it's doable and OK? If yes, I can
work out a formal patch to make this simpler as you said. The draft code
is as below. Like this it uses the existing code and involves little change.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 138a56c0f48f..558d421f294b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		 * function. They do not exist on hotplugged memory.
 		 */
 		if (context == MEMMAP_EARLY) {
-			if (!early_pfn_valid(pfn)) {
-				pfn = next_pfn(pfn);
-				continue;
-			}
-			if (!early_pfn_in_nid(pfn, nid)) {
-				pfn++;
-				continue;
-			}
 			if (overlap_memmap_init(zone, &pfn))
 				continue;
 			if (defer_init(nid, pfn, end_pfn))
@@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 }
 
 void __meminit __weak memmap_init(unsigned long size, int nid,
-				  unsigned long zone, unsigned long start_pfn)
+				  unsigned long zone, unsigned long range_start_pfn)
 {
-	memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
+	unsigned long start_pfn, end_pfn;
+	unsigned long range_end_pfn = range_start_pfn + size;
+	int i;
+	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
+		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+		if (end_pfn > start_pfn)
+			memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
+	}
 }
 
 static int zone_batchsize(struct zone *zone)
On Tue 31-03-20 22:03:32, Baoquan He wrote: > Hi Michal, > > On 03/31/20 at 10:55am, Michal Hocko wrote: > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote: > > > Maybe I mis-read the code, but I don't see how this could happen. In the > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls > > > calculate_node_totalpages() that ensures that node->node_zones are entirely > > > within the node because this is checked in zone_spanned_pages_in_node(). > > > > zone_spanned_pages_in_node does chech the zone boundaries are within the > > node boundaries. But that doesn't really tell anything about other > > potential zones interleaving with the physical memory range. > > zone->spanned_pages simply gives the physical range for the zone > > including holes. Interleaving nodes are essentially a hole > > (__absent_pages_in_range is going to skip those). > > > > That means that when free_area_init_core simply goes over the whole > > physical zone range including holes and that is why we need to check > > both for physical and logical holes (aka other nodes). > > > > The life would be so much easier if the whole thing would simply iterate > > over memblocks... > > The memblock iterating sounds a great idea. I tried with putting the > memblock iterating in the upper layer, memmap_init(), which is used for > boot mem only anyway. Do you think it's doable and OK? It yes, I can > work out a formal patch to make this simpler as you said. The draft code > is as below. Like this it uses the existing code and involves little change. Doing this would be a step in the right direction! I haven't checked the code very closely though. The below sounds way too simple to be truth I am afraid. First for_each_mem_pfn_range is available only for CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep saying that I really hate that being conditional). Also I haven't really checked the deferred initialization path - I have a very vague recollection that it has been converted to the memblock api but I have happilly dropped all that memory. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 138a56c0f48f..558d421f294b 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > * function. They do not exist on hotplugged memory. > */ > if (context == MEMMAP_EARLY) { > - if (!early_pfn_valid(pfn)) { > - pfn = next_pfn(pfn); > - continue; > - } > - if (!early_pfn_in_nid(pfn, nid)) { > - pfn++; > - continue; > - } > if (overlap_memmap_init(zone, &pfn)) > continue; > if (defer_init(nid, pfn, end_pfn)) > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone) > } > > void __meminit __weak memmap_init(unsigned long size, int nid, > - unsigned long zone, unsigned long start_pfn) > + unsigned long zone, unsigned long range_start_pfn) > { > - memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL); > + unsigned long start_pfn, end_pfn; > + unsigned long range_end_pfn = range_start_pfn + size; > + int i; > + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { > + start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn); > + end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn); > + if (end_pfn > start_pfn) > + memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL); > + } > } > > static int zone_batchsize(struct zone *zone)
On 03/31/20 at 04:21pm, Michal Hocko wrote: > On Tue 31-03-20 22:03:32, Baoquan He wrote: > > Hi Michal, > > > > On 03/31/20 at 10:55am, Michal Hocko wrote: > > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote: > > > > Maybe I mis-read the code, but I don't see how this could happen. In the > > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls > > > > calculate_node_totalpages() that ensures that node->node_zones are entirely > > > > within the node because this is checked in zone_spanned_pages_in_node(). > > > > > > zone_spanned_pages_in_node does chech the zone boundaries are within the > > > node boundaries. But that doesn't really tell anything about other > > > potential zones interleaving with the physical memory range. > > > zone->spanned_pages simply gives the physical range for the zone > > > including holes. Interleaving nodes are essentially a hole > > > (__absent_pages_in_range is going to skip those). > > > > > > That means that when free_area_init_core simply goes over the whole > > > physical zone range including holes and that is why we need to check > > > both for physical and logical holes (aka other nodes). > > > > > > The life would be so much easier if the whole thing would simply iterate > > > over memblocks... > > > > The memblock iterating sounds a great idea. I tried with putting the > > memblock iterating in the upper layer, memmap_init(), which is used for > > boot mem only anyway. Do you think it's doable and OK? It yes, I can > > work out a formal patch to make this simpler as you said. The draft code > > is as below. Like this it uses the existing code and involves little change. > > Doing this would be a step in the right direction! I haven't checked the > code very closely though. The below sounds way too simple to be truth I > am afraid. First for_each_mem_pfn_range is available only for > CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep > saying that I really hate that being conditional). Also I haven't really > checked the deferred initialization path - I have a very vague > recollection that it has been converted to the memblock api but I have > happilly dropped all that memory. Thanks for your quick response and pointing out the rest suspect aspects, I will investigate what you mentioned, see if they impact. > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 138a56c0f48f..558d421f294b 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > * function. They do not exist on hotplugged memory. 
> > */ > > if (context == MEMMAP_EARLY) { > > - if (!early_pfn_valid(pfn)) { > > - pfn = next_pfn(pfn); > > - continue; > > - } > > - if (!early_pfn_in_nid(pfn, nid)) { > > - pfn++; > > - continue; > > - } > > if (overlap_memmap_init(zone, &pfn)) > > continue; > > if (defer_init(nid, pfn, end_pfn)) > > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone) > > } > > > > void __meminit __weak memmap_init(unsigned long size, int nid, > > - unsigned long zone, unsigned long start_pfn) > > + unsigned long zone, unsigned long range_start_pfn) > > { > > - memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL); > > + unsigned long start_pfn, end_pfn; > > + unsigned long range_end_pfn = range_start_pfn + size; > > + int i; > > + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { > > + start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn); > > + end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn); > > + if (end_pfn > start_pfn) > > + memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL); > > + } > > } > > > > static int zone_batchsize(struct zone *zone) > > -- > Michal Hocko > SUSE Labs >
Hi All, On 3/31/20 7:31 AM, Baoquan He wrote: > On 03/31/20 at 04:21pm, Michal Hocko wrote: >> On Tue 31-03-20 22:03:32, Baoquan He wrote: >>> Hi Michal, >>> >>> On 03/31/20 at 10:55am, Michal Hocko wrote: >>>> On Tue 31-03-20 11:14:23, Mike Rapoport wrote: >>>>> Maybe I mis-read the code, but I don't see how this could happen. In the >>>>> HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls >>>>> calculate_node_totalpages() that ensures that node->node_zones are entirely >>>>> within the node because this is checked in zone_spanned_pages_in_node(). >>>> >>>> zone_spanned_pages_in_node does chech the zone boundaries are within the >>>> node boundaries. But that doesn't really tell anything about other >>>> potential zones interleaving with the physical memory range. >>>> zone->spanned_pages simply gives the physical range for the zone >>>> including holes. Interleaving nodes are essentially a hole >>>> (__absent_pages_in_range is going to skip those). >>>> >>>> That means that when free_area_init_core simply goes over the whole >>>> physical zone range including holes and that is why we need to check >>>> both for physical and logical holes (aka other nodes). >>>> >>>> The life would be so much easier if the whole thing would simply iterate >>>> over memblocks... >>> >>> The memblock iterating sounds a great idea. I tried with putting the >>> memblock iterating in the upper layer, memmap_init(), which is used for >>> boot mem only anyway. Do you think it's doable and OK? It yes, I can >>> work out a formal patch to make this simpler as you said. The draft code >>> is as below. Like this it uses the existing code and involves little change. >> >> Doing this would be a step in the right direction! I haven't checked the >> code very closely though. The below sounds way too simple to be truth I >> am afraid. First for_each_mem_pfn_range is available only for >> CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep >> saying that I really hate that being conditional). Also I haven't really >> checked the deferred initialization path - I have a very vague >> recollection that it has been converted to the memblock api but I have >> happilly dropped all that memory. > > Thanks for your quick response and pointing out the rest suspect aspects, > I will investigate what you mentioned, see if they impact. I would like to check if we still move on with my patch to remove CONFIG_NODES_SPAN_OTHER_NODES and have another patch on top it? Thanks Hoan > >> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index 138a56c0f48f..558d421f294b 100644 >>> --- a/mm/page_alloc.c >>> +++ b/mm/page_alloc.c >>> @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, >>> * function. They do not exist on hotplugged memory. 
>>> */ >>> if (context == MEMMAP_EARLY) { >>> - if (!early_pfn_valid(pfn)) { >>> - pfn = next_pfn(pfn); >>> - continue; >>> - } >>> - if (!early_pfn_in_nid(pfn, nid)) { >>> - pfn++; >>> - continue; >>> - } >>> if (overlap_memmap_init(zone, &pfn)) >>> continue; >>> if (defer_init(nid, pfn, end_pfn)) >>> @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone) >>> } >>> >>> void __meminit __weak memmap_init(unsigned long size, int nid, >>> - unsigned long zone, unsigned long start_pfn) >>> + unsigned long zone, unsigned long range_start_pfn) >>> { >>> - memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL); >>> + unsigned long start_pfn, end_pfn; >>> + unsigned long range_end_pfn = range_start_pfn + size; >>> + int i; >>> + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { >>> + start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn); >>> + end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn); >>> + if (end_pfn > start_pfn) >>> + memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL); >>> + } >>> } >>> >>> static int zone_batchsize(struct zone *zone) >> >> -- >> Michal Hocko >> SUSE Labs >> >
On 04/02/20 at 09:46pm, Hoan Tran wrote: > Hi All, > > On 3/31/20 7:31 AM, Baoquan He wrote: > > On 03/31/20 at 04:21pm, Michal Hocko wrote: > > > On Tue 31-03-20 22:03:32, Baoquan He wrote: > > > > Hi Michal, > > > > > > > > On 03/31/20 at 10:55am, Michal Hocko wrote: > > > > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote: > > > > > > Maybe I mis-read the code, but I don't see how this could happen. In the > > > > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls > > > > > > calculate_node_totalpages() that ensures that node->node_zones are entirely > > > > > > within the node because this is checked in zone_spanned_pages_in_node(). > > > > > > > > > > zone_spanned_pages_in_node does chech the zone boundaries are within the > > > > > node boundaries. But that doesn't really tell anything about other > > > > > potential zones interleaving with the physical memory range. > > > > > zone->spanned_pages simply gives the physical range for the zone > > > > > including holes. Interleaving nodes are essentially a hole > > > > > (__absent_pages_in_range is going to skip those). > > > > > > > > > > That means that when free_area_init_core simply goes over the whole > > > > > physical zone range including holes and that is why we need to check > > > > > both for physical and logical holes (aka other nodes). > > > > > > > > > > The life would be so much easier if the whole thing would simply iterate > > > > > over memblocks... > > > > > > > > The memblock iterating sounds a great idea. I tried with putting the > > > > memblock iterating in the upper layer, memmap_init(), which is used for > > > > boot mem only anyway. Do you think it's doable and OK? It yes, I can > > > > work out a formal patch to make this simpler as you said. The draft code > > > > is as below. Like this it uses the existing code and involves little change. > > > > > > Doing this would be a step in the right direction! I haven't checked the > > > code very closely though. The below sounds way too simple to be truth I > > > am afraid. First for_each_mem_pfn_range is available only for > > > CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep > > > saying that I really hate that being conditional). Also I haven't really > > > checked the deferred initialization path - I have a very vague > > > recollection that it has been converted to the memblock api but I have > > > happilly dropped all that memory. > > > > Thanks for your quick response and pointing out the rest suspect aspects, > > I will investigate what you mentioned, see if they impact. > > I would like to check if we still move on with my patch to remove > CONFIG_NODES_SPAN_OTHER_NODES and have another patch on top it? I think we would like to replace CONFIG_NODES_SPAN_OTHER_NODES with CONFIG_NUMA, and just let UMA return 0 as node id, as Michal replied in another mail. Anyway, your patch 2~5 are still needed to sit on top of the change of this new plan.
Hi,

On 4/3/20 12:09 AM, Baoquan He wrote:
> On 04/02/20 at 09:46pm, Hoan Tran wrote:
>> Hi All,
>>
>> On 3/31/20 7:31 AM, Baoquan He wrote:
>>> On 03/31/20 at 04:21pm, Michal Hocko wrote:
>>>> On Tue 31-03-20 22:03:32, Baoquan He wrote:
>>>>> Hi Michal,
>>>>>
>>>>> On 03/31/20 at 10:55am, Michal Hocko wrote:
>>>>>> On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
>>>>>>> Maybe I mis-read the code, but I don't see how this could happen. In the HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls calculate_node_totalpages(), which ensures that node->node_zones are entirely within the node because this is checked in zone_spanned_pages_in_node().
>>>>>>
>>>>>> zone_spanned_pages_in_node does check that the zone boundaries are within the node boundaries. But that doesn't really tell anything about other potential zones interleaving with the physical memory range. zone->spanned_pages simply gives the physical range for the zone including holes. Interleaving nodes are essentially a hole (__absent_pages_in_range is going to skip those).
>>>>>>
>>>>>> That means that free_area_init_core simply goes over the whole physical zone range, including holes, and that is why we need to check both for physical and logical holes (aka other nodes).
>>>>>>
>>>>>> Life would be so much easier if the whole thing would simply iterate over memblocks...
>>>>>
>>>>> The memblock iterating sounds like a great idea. I tried putting the memblock iteration in the upper layer, memmap_init(), which is used for boot memory only anyway. Do you think it's doable and OK? If yes, I can work out a formal patch to make this simpler, as you said. The draft code is below; it uses the existing code and involves little change.
>>>>
>>>> Doing this would be a step in the right direction! I haven't checked the code very closely though. The below sounds way too simple to be true, I am afraid. First, for_each_mem_pfn_range is available only for CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep saying that I really hate that being conditional). Also I haven't really checked the deferred initialization path - I have a very vague recollection that it has been converted to the memblock API but I have happily dropped all that memory.
>>>
>>> Thanks for your quick response and for pointing out the remaining suspect aspects; I will investigate what you mentioned and see if they have an impact.
>>
>> I would like to check whether we are still moving forward with my patch to remove CONFIG_NODES_SPAN_OTHER_NODES, with another patch on top of it?
>
> I think we would like to replace CONFIG_NODES_SPAN_OTHER_NODES with CONFIG_NUMA, and just let UMA return 0 as the node id, as Michal replied in another mail. Anyway, your patches 2~5 are still needed on top of this new plan.

Got it. Thanks for the quick response.

Regards
Hoan

>
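For illustration, here is a minimal sketch of the direction mentioned above: keying the early node-id check off CONFIG_NUMA and letting UMA builds treat everything as node 0. The guards and helper bodies below are assumptions made for this example, not the posted patch; only early_pfn_to_nid() and NUMA_NO_NODE are taken from the existing mm code.

/*
 * Sketch only, not the posted patch: gate the early node-id check on
 * CONFIG_NUMA instead of the per-arch CONFIG_NODES_SPAN_OTHER_NODES.
 */
#ifdef CONFIG_NUMA
/* NUMA: a pfn inside the zone span may still belong to another node. */
static inline bool early_pfn_in_nid(unsigned long pfn, int node)
{
        int nid = early_pfn_to_nid(pfn);

        return nid == NUMA_NO_NODE || nid == node;
}
#else
/* UMA: there is only node 0, so every valid pfn trivially matches. */
static inline bool early_pfn_in_nid(unsigned long pfn, int node)
{
        return true;
}
#endif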
On Tue, Mar 31, 2020 at 04:21:38PM +0200, Michal Hocko wrote:
> On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > Hi Michal,
> >
> > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > Maybe I mis-read the code, but I don't see how this could happen. In the HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls calculate_node_totalpages(), which ensures that node->node_zones are entirely within the node because this is checked in zone_spanned_pages_in_node().
> > >
> > > zone_spanned_pages_in_node does check that the zone boundaries are within the node boundaries. But that doesn't really tell anything about other potential zones interleaving with the physical memory range. zone->spanned_pages simply gives the physical range for the zone including holes. Interleaving nodes are essentially a hole (__absent_pages_in_range is going to skip those).
> > >
> > > That means that free_area_init_core simply goes over the whole physical zone range, including holes, and that is why we need to check both for physical and logical holes (aka other nodes).
> > >
> > > Life would be so much easier if the whole thing would simply iterate over memblocks...
> >
> > The memblock iterating sounds like a great idea. I tried putting the memblock iteration in the upper layer, memmap_init(), which is used for boot memory only anyway. Do you think it's doable and OK? If yes, I can work out a formal patch to make this simpler, as you said. The draft code is below; it uses the existing code and involves little change.
>
> Doing this would be a step in the right direction! I haven't checked the code very closely though. The below sounds way too simple to be true, I am afraid. First, for_each_mem_pfn_range is available only for CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep saying that I really hate that being conditional). Also I haven't really checked the deferred initialization path - I have a very vague recollection that it has been converted to the memblock API but I have happily dropped all that memory.

Baoquan's patch almost did it, at least for the simple case of qemu with 2 nodes. It's only missing the adjustment of the size passed to memmap_init_zone(), as it may change because of the clamping.

I've drafted something that removes HAVE_MEMBLOCK_NODE_MAP and added this patch there [1]. For the several memory configurations I could emulate with qemu it worked. I'm going to wait a bit to see if kbuild is happy and then I'll send the patches.

Baoquan, I took the liberty of adding your SoB, hope you don't mind.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=memblock/all-have-node-map

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 138a56c0f48f..558d421f294b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >      * function. They do not exist on hotplugged memory.
> >      */
> >     if (context == MEMMAP_EARLY) {
> > -           if (!early_pfn_valid(pfn)) {
> > -                   pfn = next_pfn(pfn);
> > -                   continue;
> > -           }
> > -           if (!early_pfn_in_nid(pfn, nid)) {
> > -                   pfn++;
> > -                   continue;
> > -           }
> >             if (overlap_memmap_init(zone, &pfn))
> >                     continue;
> >             if (defer_init(nid, pfn, end_pfn))
> > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> >  }
> >
> >  void __meminit __weak memmap_init(unsigned long size, int nid,
> > -                                 unsigned long zone, unsigned long start_pfn)
> > +                                 unsigned long zone, unsigned long range_start_pfn)
> >  {
> > -   memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > +   unsigned long start_pfn, end_pfn;
> > +   unsigned long range_end_pfn = range_start_pfn + size;
> > +   int i;
> > +   for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +           start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > +           end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > +           if (end_pfn > start_pfn)
> > +                   memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > +   }
> >  }
> >
> >  static int zone_batchsize(struct zone *zone)
>
> --
> Michal Hocko
> SUSE Labs
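To make the missing size adjustment concrete, here is a hedged sketch of how the draft above could pass the clamped per-range length to memmap_init_zone() instead of the whole zone size. It only rearranges the calls already present in the draft; it is an illustration of the fix being discussed, not necessarily the patch that eventually landed.

void __meminit __weak memmap_init(unsigned long size, int nid,
                                  unsigned long zone,
                                  unsigned long range_start_pfn)
{
        unsigned long start_pfn, end_pfn;
        unsigned long range_end_pfn = range_start_pfn + size;
        int i;

        for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
                /* keep only the part of this memblock range inside the zone */
                start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
                end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);

                if (end_pfn > start_pfn)
                        /* pass the clamped length, not the full zone size */
                        memmap_init_zone(end_pfn - start_pfn, nid, zone,
                                         start_pfn, MEMMAP_EARLY, NULL);
        }
}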
On 04/09/20 at 07:27pm, Mike Rapoport wrote:
> On Tue, Mar 31, 2020 at 04:21:38PM +0200, Michal Hocko wrote:
> > On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > > Hi Michal,
> > >
> > > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > > Maybe I mis-read the code, but I don't see how this could happen. In the HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls calculate_node_totalpages(), which ensures that node->node_zones are entirely within the node because this is checked in zone_spanned_pages_in_node().
> > > >
> > > > zone_spanned_pages_in_node does check that the zone boundaries are within the node boundaries. But that doesn't really tell anything about other potential zones interleaving with the physical memory range. zone->spanned_pages simply gives the physical range for the zone including holes. Interleaving nodes are essentially a hole (__absent_pages_in_range is going to skip those).
> > > >
> > > > That means that free_area_init_core simply goes over the whole physical zone range, including holes, and that is why we need to check both for physical and logical holes (aka other nodes).
> > > >
> > > > Life would be so much easier if the whole thing would simply iterate over memblocks...
> > >
> > > The memblock iterating sounds like a great idea. I tried putting the memblock iteration in the upper layer, memmap_init(), which is used for boot memory only anyway. Do you think it's doable and OK? If yes, I can work out a formal patch to make this simpler, as you said. The draft code is below; it uses the existing code and involves little change.
> >
> > Doing this would be a step in the right direction! I haven't checked the code very closely though. The below sounds way too simple to be true, I am afraid. First, for_each_mem_pfn_range is available only for CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep saying that I really hate that being conditional). Also I haven't really checked the deferred initialization path - I have a very vague recollection that it has been converted to the memblock API but I have happily dropped all that memory.
>
> Baoquan's patch almost did it, at least for the simple case of qemu with 2 nodes. It's only missing the adjustment of the size passed to memmap_init_zone(), as it may change because of the clamping.

Right, the size needs to be adjusted after the start and end clamping.

>
> I've drafted something that removes HAVE_MEMBLOCK_NODE_MAP and added this patch there [1]. For the several memory configurations I could emulate with qemu it worked. I'm going to wait a bit to see if kbuild is happy and then I'll send the patches.
>
> Baoquan, I took the liberty of adding your SoB, hope you don't mind.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=memblock/all-have-node-map

Of course not. Thanks for doing this; I look forward to seeing your formal patchset when it's ready.

>
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 138a56c0f48f..558d421f294b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> > >      * function. They do not exist on hotplugged memory.
> > >      */
> > >     if (context == MEMMAP_EARLY) {
> > > -           if (!early_pfn_valid(pfn)) {
> > > -                   pfn = next_pfn(pfn);
> > > -                   continue;
> > > -           }
> > > -           if (!early_pfn_in_nid(pfn, nid)) {
> > > -                   pfn++;
> > > -                   continue;
> > > -           }
> > >             if (overlap_memmap_init(zone, &pfn))
> > >                     continue;
> > >             if (defer_init(nid, pfn, end_pfn))
> > > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> > >  }
> > >
> > >  void __meminit __weak memmap_init(unsigned long size, int nid,
> > > -                                 unsigned long zone, unsigned long start_pfn)
> > > +                                 unsigned long zone, unsigned long range_start_pfn)
> > >  {
> > > -   memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > > +   unsigned long start_pfn, end_pfn;
> > > +   unsigned long range_end_pfn = range_start_pfn + size;
> > > +   int i;
> > > +   for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > +           start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > > +           end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > > +           if (end_pfn > start_pfn)
> > > +                   memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > > +   }
> > >  }
> > >
> > >  static int zone_batchsize(struct zone *zone)
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
> --
> Sincerely yours,
> Mike.
>
>
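For anyone who wants to see the clamping arithmetic in isolation, below is a small userspace C program using made-up pfn values for a node whose span is interleaved with another node's memory. The layout and numbers are hypothetical; the point is that only the node's own memblock ranges are visited, and each call receives the clamped per-range length, so no per-pfn node-id check is needed.

#include <stdio.h>

/* userspace stand-in for the kernel's clamp() used in the draft above */
#define clamp(val, lo, hi) ((val) < (lo) ? (lo) : ((val) > (hi) ? (hi) : (val)))

int main(void)
{
        /* hypothetical zone span for node 1 (pfn values are made up) */
        unsigned long range_start_pfn = 0x80000, range_end_pfn = 0x180000;

        /* hypothetical memblock ranges owned by node 1; the gap between
         * them belongs to node 0 and is simply never visited */
        unsigned long ranges[][2] = {
                { 0x80000, 0xc0000 },
                { 0x140000, 0x180000 },
        };

        for (int i = 0; i < 2; i++) {
                unsigned long start = clamp(ranges[i][0], range_start_pfn, range_end_pfn);
                unsigned long end = clamp(ranges[i][1], range_start_pfn, range_end_pfn);

                if (end > start)
                        printf("memmap_init_zone(size=%#lx, start_pfn=%#lx)\n",
                               end - start, start);
        }
        return 0;
}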