Message ID | 20150123141817.GA22926@phnom.home.cmpxchg.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, 23 Jan 2015, Johannes Weiner wrote: > Is the assumption of this patch wrong? Does the specified node have > to be online for the fallback to work? Nodes that are offline have no control structures allocated and thus allocations will likely segfault when the address of the controls structure for the node is accessed. If we wanted to prevent that then every allocation would have to add a check to see if the nodes are online which would impact performance. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/23/2015 06:18 AM, Johannes Weiner wrote: > Hi Guenter, > > CC'ing Christoph for slub-stuff: > > On Thu, Jan 22, 2015 at 09:08:02PM -0800, Guenter Roeck wrote: >> On Thu, Jan 22, 2015 at 03:05:17PM -0800, akpm@linux-foundation.org wrote: >>> The mm-of-the-moment snapshot 2015-01-22-15-04 has been uploaded to >>> >>> http://www.ozlabs.org/~akpm/mmotm/ >>> >> qemu test for ppc64 fails with >> >> Unable to handle kernel paging request for data at address 0x0000af50 >> Faulting instruction address: 0xc00000000089d5d4 >> Oops: Kernel access of bad area, sig: 11 [#1] >> >> with the following call stack: >> >> Call Trace: >> [c00000003d32f920] [c00000000089d588] .__slab_alloc.isra.44+0x7c/0x6f4 >> (unreliable) >> [c00000003d32fa90] [c00000000020cf8c] .kmem_cache_alloc_node_trace+0x12c/0x3b0 >> [c00000003d32fb60] [c000000000bceeb4] .mem_cgroup_init+0x128/0x1b0 >> [c00000003d32fbf0] [c00000000000a2b4] .do_one_initcall+0xd4/0x260 >> [c00000003d32fce0] [c000000000ba26a8] .kernel_init_freeable+0x244/0x32c >> [c00000003d32fdb0] [c00000000000ac24] .kernel_init+0x24/0x140 >> [c00000003d32fe30] [c000000000009564] .ret_from_kernel_thread+0x58/0x74 >> >> bisect log: > > [...] > >> # first bad commit: [a40d0d2cf21e2714e9a6c842085148c938bf36ab] mm: memcontrol: remove unnecessary soft limit tree node test > > The change in question is this: > > mm: memcontrol: remove unnecessary soft limit tree node test > > kzalloc_node() automatically falls back to nodes with suitable memory. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > Acked-by: Michal Hocko <mhocko@suse.cz> > Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index fb9788af4a3e..10db4a654d68 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4539,13 +4539,10 @@ static void __init mem_cgroup_soft_limit_tree_init(void) > { > struct mem_cgroup_tree_per_node *rtpn; > struct mem_cgroup_tree_per_zone *rtpz; > - int tmp, node, zone; > + int node, zone; > > for_each_node(node) { > - tmp = node; > - if (!node_state(node, N_NORMAL_MEMORY)) > - tmp = -1; > - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); > + rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node); > BUG_ON(!rtpn); > > soft_limit_tree.rb_tree_per_node[node] = rtpn; > > -- > > Is the assumption of this patch wrong? Does the specified node have > to be online for the fallback to work? > I added some debugging. First, the problem is only seen with SMP disabled. Second, there is only one online node. Without your patch: Node 0 online 1 high 1 memory 1 cpu 0 normal 1 tmp 0 rtpn c00000003d240600 Node 1 online 0 high 0 memory 0 cpu 0 normal 0 tmp -1 rtpn c00000003d240640 Node 2 online 0 high 0 memory 0 cpu 0 normal 0 tmp -1 rtpn c00000003d240680 [ and so on up to node 255 ] With your patch: Node 0 online 1 high 1 memory 1 cpu 0 normal 1 rtpn c00000003d240600 Unable to handle kernel paging request for data at address 0x0000af50 Faulting instruction address: 0xc000000000895a3c Oops: Kernel access of bad area, sig: 11 [#1] The log message is after the call to kzalloc_node. So it doesn't look like the fallback is working, at least not with ppc64 in non-SMP mode. Guenter -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 23, 2015 at 09:17:44AM -0600, Christoph Lameter wrote: > On Fri, 23 Jan 2015, Johannes Weiner wrote: > > > Is the assumption of this patch wrong? Does the specified node have > > to be online for the fallback to work? > > Nodes that are offline have no control structures allocated and thus > allocations will likely segfault when the address of the controls > structure for the node is accessed. > > If we wanted to prevent that then every allocation would have to add a > check to see if the nodes are online which would impact performance. Okay, that makes sense, thank you. Andrew, can you please drop this patch? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 23, 2015 at 07:46:43AM -0800, Guenter Roeck wrote: > I added some debugging. First, the problem is only seen with SMP disabled. > Second, there is only one online node. > > Without your patch: > > Node 0 online 1 high 1 memory 1 cpu 0 normal 1 tmp 0 rtpn c00000003d240600 > Node 1 online 0 high 0 memory 0 cpu 0 normal 0 tmp -1 rtpn c00000003d240640 > Node 2 online 0 high 0 memory 0 cpu 0 normal 0 tmp -1 rtpn c00000003d240680 > > [ and so on up to node 255 ] > > With your patch: > > Node 0 online 1 high 1 memory 1 cpu 0 normal 1 rtpn c00000003d240600 > Unable to handle kernel paging request for data at address 0x0000af50 > Faulting instruction address: 0xc000000000895a3c > Oops: Kernel access of bad area, sig: 11 [#1] > > The log message is after the call to kzalloc_node. > > So it doesn't look like the fallback is working, at least not with ppc64 > in non-SMP mode. Yep, and Christoph confirmed that it's not meant to work like that. The patch is flawed. Thanks for testing and sorry for breaking your setup. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/23/2015 08:02 AM, Johannes Weiner wrote: > On Fri, Jan 23, 2015 at 09:17:44AM -0600, Christoph Lameter wrote: >> On Fri, 23 Jan 2015, Johannes Weiner wrote: >> >>> Is the assumption of this patch wrong? Does the specified node have >>> to be online for the fallback to work? >> >> Nodes that are offline have no control structures allocated and thus >> allocations will likely segfault when the address of the controls >> structure for the node is accessed. >> >> If we wanted to prevent that then every allocation would have to add a >> check to see if the nodes are online which would impact performance. > > Okay, that makes sense, thank you. > > Andrew, can you please drop this patch? > Problem is that there are three patches. 2537ffb mm: memcontrol: consolidate swap controller code 2f9b346 mm: memcontrol: consolidate memory controller initialization a40d0d2 mm: memcontrol: remove unnecessary soft limit tree node test Reverting (or dropping) a40d0d2 alone is not possible since it modifies mem_cgroup_soft_limit_tree_init which is removed by 2f9b346. Guenter -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 23 Jan 2015, Johannes Weiner wrote: > struct mem_cgroup_tree_per_node *rtpn; > struct mem_cgroup_tree_per_zone *rtpz; > - int tmp, node, zone; > + int node, zone; > > for_each_node(node) { Do for_each_online_node(node) { instead? > - tmp = node; > - if (!node_state(node, N_NORMAL_MEMORY)) > - tmp = -1; > - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); > + rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node); > BUG_ON(!rtpn); > > soft_limit_tree.rb_tree_per_node[node] = rtpn; > > -- > > Is the assumption of this patch wrong? Does the specified node have > to be online for the fallback to work? > > Thanks > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/23/2015 12:20 PM, Christoph Lameter wrote: > On Fri, 23 Jan 2015, Johannes Weiner wrote: > >> struct mem_cgroup_tree_per_node *rtpn; >> struct mem_cgroup_tree_per_zone *rtpz; >> - int tmp, node, zone; >> + int node, zone; >> >> for_each_node(node) { > > Do for_each_online_node(node) { > > instead? > Wouldn't that have unintended consequences ? So far rb tree nodes are allocated even if a node not online; the above would change that. Are you saying it is unnecessary to initialize rb tree nodes if the node is not online ? Not that I have any idea what is correct, it just seems odd that the existing code would do all this allocation if it is not necessary. Guenter -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 23 Jan 2015, Guenter Roeck wrote: > Wouldn't that have unintended consequences ? So far > rb tree nodes are allocated even if a node not online; > the above would change that. Are you saying it is > unnecessary to initialize rb tree nodes if the node > is not online ? It is not advisable to allocate since an offline node means that the structure cannot be allocated on the node where it would be most beneficial. Typically subsystems allocate the per node data structures when the node is brought online. > Not that I have any idea what is correct, it just seems odd > that the existing code would do all this allocation if it is not > necessary. Not sure how the code there works just guessing from other subsystems. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 23, 2015 at 03:09:20PM -0600, Christoph Lameter wrote: > On Fri, 23 Jan 2015, Guenter Roeck wrote: > > > Wouldn't that have unintended consequences ? So far > > rb tree nodes are allocated even if a node not online; > > the above would change that. Are you saying it is > > unnecessary to initialize rb tree nodes if the node > > is not online ? > > It is not advisable to allocate since an offline node means that the > structure cannot be allocated on the node where it would be most > beneficial. Typically subsystems allocate the per node data structures > when the node is brought online. I would generally agree, but this code, which implements a userspace interface, is already grotesquely inefficient and heavyhanded. It's also superseded in the next release, so we can just keep this simple at this point. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 24 Jan 2015 02:16:23 -0500, Johannes Weiner said: > I would generally agree, but this code, which implements a userspace > interface, is already grotesquely inefficient and heavyhanded. It's > also superseded in the next release, so we can just keep this simple > at this point. Wait, what? Userspace interface that's superceded in the next release? I *hope* this was intended as "Yes,it's ugly in v4 of the patch, v5 is a lot cleaner..."
On Sun, Jan 25, 2015 at 04:36:28PM -0500, Valdis.Kletnieks@vt.edu wrote: > On Sat, 24 Jan 2015 02:16:23 -0500, Johannes Weiner said: > > > I would generally agree, but this code, which implements a userspace > > interface, is already grotesquely inefficient and heavyhanded. It's > > also superseded in the next release, so we can just keep this simple > > at this point. > > Wait, what? Userspace interface that's superceded in the next release? The existing interface and its implementation are going to remain in place, obviously, we can't break userspace. But the semantics are ill-defined and the implementation bad to a point where we decided to fix both by adding a second interface and encouraging users to switch. Now if a user were to report that these off-node allocations are actually creating problems in real life I would fix it. But I'm fairly certain that remote access costs are overshadowed by the reclaim stalls this mechanism creates. So what I was trying to say above is that I don't see a point in complicating the v1 implementation for a questionable minor optimization when v2 is already being added to address much more severe shortcomings in v1. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri 23-01-15 09:17:44, Christoph Lameter wrote: > On Fri, 23 Jan 2015, Johannes Weiner wrote: > > > Is the assumption of this patch wrong? Does the specified node have > > to be online for the fallback to work? Admittedly, I was checking only SLAB allocator when reviewing and assuming SLUB would behave in the same way :/ But maybe I have misinterpreted the slab code as well and get_node(struct kmem_cache *, int node) returns non-NULL for !online nodes. > Nodes that are offline have no control structures allocated and thus > allocations will likely segfault when the address of the controls > structure for the node is accessed. > > If we wanted to prevent that then every allocation would have to add a > check to see if the nodes are online which would impact performance. I have briefly checked the code and it seems that many users are aware of this and use the same construct Johannes used in the end or they use cpu_to_node. But then there are other users doing: net/openvswitch/flow_table.c: /* Initialize the default stat node. */ stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0); and this can blow up if Node0 is not online. I haven't checked other callers but are we sure they all are aware of !online nodes? E.g. dev_to_node() will return a node which is assigned to a device. I do not see where exactly this is set to anything else than -1 (I got quickly lost in set_dev_node callers). E.g. PCI bus sets its affinity from bus->sysdata which seems to be initialized in pci_acpi_scan_root and that is checking for an online node. Is it possible that some devices will get the node from BIOS or by other means? That being said I have no problem with checking node_online in the memcg code which was reported to blow up here. I am just thinking whether it is safe to simply blow up like that.
On Tue, 27 Jan 2015, Michal Hocko wrote: > Admittedly, I was checking only SLAB allocator when reviewing and > assuming SLUB would behave in the same way :/ > But maybe I have misinterpreted the slab code as well and > get_node(struct kmem_cache *, int node) returns non-NULL for !online > nodes. Oh. Just allocate from node 12345 in SLAB and you will also have a strange failure. > I have briefly checked the code and it seems that many users are aware > of this and use the same construct Johannes used in the end or they use > cpu_to_node. But then there are other users doing: > net/openvswitch/flow_table.c: > /* Initialize the default stat node. */ > stats = kmem_cache_alloc_node(flow_stats_cache, > GFP_KERNEL | __GFP_ZERO, 0); > > and this can blow up if Node0 is not online. I haven't checked other Node 0 is special in many architectures and is guaranteed to exist. PowerPC is a notable exception which causes frequent issues with NUMA changes. > That being said I have no problem with checking node_online in the memcg > code which was reported to blow up here. I am just thinking whether it > is safe to simply blow up like that. Node numbers must be legitimate in order to be used. Same thing with processor numbers. We cannot always check if they are online. The numbers in use must be sane. We have notifier subsystems that do callbacks to allow subsystems to add and remove new nodes and processors. Those should be used to ensure that only legitimate node and processor numbers are used. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fb9788af4a3e..10db4a654d68 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4539,13 +4539,10 @@ static void __init mem_cgroup_soft_limit_tree_init(void) { struct mem_cgroup_tree_per_node *rtpn; struct mem_cgroup_tree_per_zone *rtpz; - int tmp, node, zone; + int node, zone; for_each_node(node) { - tmp = node; - if (!node_state(node, N_NORMAL_MEMORY)) - tmp = -1; - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); + rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node); BUG_ON(!rtpn); soft_limit_tree.rb_tree_per_node[node] = rtpn;