[v2,2/9] x86: numa: check the node id consistently for x86

Message ID	1567231103-13237-3-git-send-email-linyunsheng@huawei.com (mailing list archive)
State	Not Applicable
Headers	show Return-Path: <SRS0=Nkoz=W3=vger.kernel.org=linux-mips-owner@kernel.org> Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2DF211398 for <patchwork-linux-mips@patchwork.kernel.org>; Sat, 31 Aug 2019 06:00:47 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 09DB723774 for <patchwork-linux-mips@patchwork.kernel.org>; Sat, 31 Aug 2019 06:00:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725981AbfHaGAq (ORCPT <rfc822;patchwork-linux-mips@patchwork.kernel.org>); Sat, 31 Aug 2019 02:00:46 -0400 Received: from szxga07-in.huawei.com ([45.249.212.35]:49298 "EHLO huawei.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1725899AbfHaGAq (ORCPT <rfc822;linux-mips@vger.kernel.org>); Sat, 31 Aug 2019 02:00:46 -0400 Received: from DGGEMS404-HUB.china.huawei.com (unknown [172.30.72.58]) by Forcepoint Email with ESMTP id CCA8FFE050212F500B61; Sat, 31 Aug 2019 14:00:41 +0800 (CST) Received: from localhost.localdomain (10.67.212.75) by DGGEMS404-HUB.china.huawei.com (10.3.19.204) with Microsoft SMTP Server id 14.3.439.0; Sat, 31 Aug 2019 14:00:34 +0800 From: Yunsheng Lin <linyunsheng@huawei.com> To: <catalin.marinas@arm.com>, <will@kernel.org>, <mingo@redhat.com>, <bp@alien8.de>, <rth@twiddle.net>, <ink@jurassic.park.msu.ru>, <mattst88@gmail.com>, <benh@kernel.crashing.org>, <paulus@samba.org>, <mpe@ellerman.id.au>, <heiko.carstens@de.ibm.com>, <gor@linux.ibm.com>, <borntraeger@de.ibm.com>, <ysato@users.sourceforge.jp>, <dalias@libc.org>, <davem@davemloft.net>, <ralf@linux-mips.org>, <paul.burton@mips.com>, <jhogan@kernel.org>, <jiaxun.yang@flygoat.com>, <chenhc@lemote.com> CC: <akpm@linux-foundation.org>, <rppt@linux.ibm.com>, <anshuman.khandual@arm.com>, <tglx@linutronix.de>, <cai@lca.pw>, <robin.murphy@arm.com>, <linux-arm-kernel@lists.infradead.org>, <linux-kernel@vger.kernel.org>, <hpa@zytor.com>, <x86@kernel.org>, <dave.hansen@linux.intel.com>, <luto@kernel.org>, <peterz@infradead.org>, <len.brown@intel.com>, <axboe@kernel.dk>, <dledford@redhat.com>, <jeffrey.t.kirsher@intel.com>, <linux-alpha@vger.kernel.org>, <nfont@linux.vnet.ibm.com>, <naveen.n.rao@linux.vnet.ibm.com>, <mwb@linux.vnet.ibm.com>, <linuxppc-dev@lists.ozlabs.org>, <linux-s390@vger.kernel.org>, <linux-sh@vger.kernel.org>, <sparclinux@vger.kernel.org>, <tbogendoerfer@suse.de>, <linux-mips@vger.kernel.org>, <linuxarm@huawei.com> Subject: [PATCH v2 2/9] x86: numa: check the node id consistently for x86 Date: Sat, 31 Aug 2019 13:58:16 +0800 Message-ID: <1567231103-13237-3-git-send-email-linyunsheng@huawei.com> X-Mailer: git-send-email 2.8.1 In-Reply-To: <1567231103-13237-1-git-send-email-linyunsheng@huawei.com> References: <1567231103-13237-1-git-send-email-linyunsheng@huawei.com> MIME-Version: 1.0 Content-Type: text/plain X-Originating-IP: [10.67.212.75] X-CFilter-Loop: Reflected Sender: linux-mips-owner@vger.kernel.org Precedence: bulk List-ID: <linux-mips.vger.kernel.org> X-Mailing-List: linux-mips@vger.kernel.org
Series	check the node id consistently across different arches \| expand [v2,0/9] check the node id consistently across different arches [v2,1/9] arm64: numa: check the node id consistently for arm64 [v2,2/9] x86: numa: check the node id consistently for x86 [v2,3/9] alpha: numa: check the node id consistently for alpha [v2,4/9] powerpc: numa: check the node id consistently for powerpc [v2,5/9] s390: numa: check the node id consistently for s390 [v2,6/9] sh: numa: check the node id consistently for sh [v2,7/9] sparc64: numa: check the node id consistently for sparc64 [v2,8/9] mips: numa: check the node id consistently for mips ip27 [v2,9/9] mips: numa: check the node id consistently for mips loongson64

Yunsheng Lin Aug. 31, 2019, 5:58 a.m. UTC

According to Section 6.2.14 from ACPI spec 6.3 [1], the setting
of proximity domain is optional, as below:

This optional object is used to describe proximity domain
associations within a machine. _PXM evaluates to an integer
that identifies a device as belonging to a Proximity Domain
defined in the System Resource Affinity Table (SRAT).

This patch checks node id with the below case before returning
node_to_cpumask_map[node]:
1. if node_id >= nr_node_ids, return cpu_none_mask
2. if node_id < 0, return cpu_online_mask
3. if node_to_cpumask_map[node_id] is NULL, return cpu_online_mask

[1] https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 arch/x86/include/asm/topology.h | 6 ++++++
 arch/x86/mm/numa.c              | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

Peter Zijlstra Aug. 31, 2019, 8:55 a.m. UTC | #1

On Sat, Aug 31, 2019 at 01:58:16PM +0800, Yunsheng Lin wrote:
> According to Section 6.2.14 from ACPI spec 6.3 [1], the setting
> of proximity domain is optional, as below:
> 
> This optional object is used to describe proximity domain
> associations within a machine. _PXM evaluates to an integer
> that identifies a device as belonging to a Proximity Domain
> defined in the System Resource Affinity Table (SRAT).

That's just words.. what does it actually mean?

> This patch checks node id with the below case before returning
> node_to_cpumask_map[node]:
> 1. if node_id >= nr_node_ids, return cpu_none_mask
> 2. if node_id < 0, return cpu_online_mask
> 3. if node_to_cpumask_map[node_id] is NULL, return cpu_online_mask
> 
> [1] https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  arch/x86/include/asm/topology.h | 6 ++++++
>  arch/x86/mm/numa.c              | 2 +-
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 4b14d23..f36e9c8 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -69,6 +69,12 @@ extern const struct cpumask *cpumask_of_node(int node);
>  /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
>  static inline const struct cpumask *cpumask_of_node(int node)
>  {
> +	if (node >= nr_node_ids)
> +		return cpu_none_mask;
> +
> +	if (node < 0 || !node_to_cpumask_map[node])
> +		return cpu_online_mask;
> +
>  	return node_to_cpumask_map[node];
>  }
>  #endif

I _reallly_ hate this. Users are expected to use valid numa ids. Now
we're adding all this checking to all users. Why do we want to do that?

Using '(unsigned)node >= nr_nods_ids' is an error.

> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index e6dad60..5e393d2 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -868,7 +868,7 @@ const struct cpumask *cpumask_of_node(int node)
>  		dump_stack();
>  		return cpu_none_mask;
>  	}
> -	if (node_to_cpumask_map[node] == NULL) {
> +	if (node < 0 || !node_to_cpumask_map[node]) {
>  		printk(KERN_WARNING
>  			"cpumask_of_node(%d): no node_to_cpumask_map!\n",
>  			node);
> -- 
> 2.8.1
>

Yunsheng Lin Aug. 31, 2019, 10:09 a.m. UTC | #2

On 2019/8/31 16:55, Peter Zijlstra wrote:
> On Sat, Aug 31, 2019 at 01:58:16PM +0800, Yunsheng Lin wrote:
>> According to Section 6.2.14 from ACPI spec 6.3 [1], the setting
>> of proximity domain is optional, as below:
>>
>> This optional object is used to describe proximity domain
>> associations within a machine. _PXM evaluates to an integer
>> that identifies a device as belonging to a Proximity Domain
>> defined in the System Resource Affinity Table (SRAT).
> 
> That's just words.. what does it actually mean?

It means the dev_to_node(dev) may return -1 if the bios does not
implement the proximity domain feature, user may use that value
to call cpumask_of_node and cpumask_of_node does not protect itself
from node id being -1, which causes out of bound access.

> 
>> This patch checks node id with the below case before returning
>> node_to_cpumask_map[node]:
>> 1. if node_id >= nr_node_ids, return cpu_none_mask
>> 2. if node_id < 0, return cpu_online_mask
>> 3. if node_to_cpumask_map[node_id] is NULL, return cpu_online_mask
>>
>> [1] https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>  arch/x86/include/asm/topology.h | 6 ++++++
>>  arch/x86/mm/numa.c              | 2 +-
>>  2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
>> index 4b14d23..f36e9c8 100644
>> --- a/arch/x86/include/asm/topology.h
>> +++ b/arch/x86/include/asm/topology.h
>> @@ -69,6 +69,12 @@ extern const struct cpumask *cpumask_of_node(int node);
>>  /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
>>  static inline const struct cpumask *cpumask_of_node(int node)
>>  {
>> +	if (node >= nr_node_ids)
>> +		return cpu_none_mask;
>> +
>> +	if (node < 0 || !node_to_cpumask_map[node])
>> +		return cpu_online_mask;
>> +
>>  	return node_to_cpumask_map[node];
>>  }
>>  #endif
> 
> I _reallly_ hate this. Users are expected to use valid numa ids. Now
> we're adding all this checking to all users. Why do we want to do that?

As above, the dev_to_node(dev) may return -1.

> 
> Using '(unsigned)node >= nr_nods_ids' is an error.

'node >= nr_node_ids' can be dropped if all user is expected to not call
cpumask_of_node with node id greater or equal to nr_nods_ids.

From what I can see, the problem can be fixed in three place:
1. Make user dev_to_node return a valid node id even when proximity
   domain is not set by bios(or node id set by buggy bios is not valid),
   which may need info from the numa system to make sure it will return
   a valid node.

2. User that call cpumask_of_node should ensure the node id is valid
   before calling cpumask_of_node, and user also need some info to
   make ensure node id is valid.

3. Make sure cpumask_of_node deal with invalid node id as this patchset.

Which one do you prefer to make sure node id is valid, or do you
have any better idea?

Any detail advice and suggestion will be very helpful, thanks.

> 
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index e6dad60..5e393d2 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -868,7 +868,7 @@ const struct cpumask *cpumask_of_node(int node)
>>  		dump_stack();
>>  		return cpu_none_mask;
>>  	}
>> -	if (node_to_cpumask_map[node] == NULL) {
>> +	if (node < 0 || !node_to_cpumask_map[node]) {
>>  		printk(KERN_WARNING
>>  			"cpumask_of_node(%d): no node_to_cpumask_map!\n",
>>  			node);
>> -- 
>> 2.8.1
>>
> 
> .
>

Peter Zijlstra Aug. 31, 2019, 4:12 p.m. UTC | #3

On Sat, Aug 31, 2019 at 06:09:39PM +0800, Yunsheng Lin wrote:
> 
> 
> On 2019/8/31 16:55, Peter Zijlstra wrote:
> > On Sat, Aug 31, 2019 at 01:58:16PM +0800, Yunsheng Lin wrote:
> >> According to Section 6.2.14 from ACPI spec 6.3 [1], the setting
> >> of proximity domain is optional, as below:
> >>
> >> This optional object is used to describe proximity domain
> >> associations within a machine. _PXM evaluates to an integer
> >> that identifies a device as belonging to a Proximity Domain
> >> defined in the System Resource Affinity Table (SRAT).
> > 
> > That's just words.. what does it actually mean?
> 
> It means the dev_to_node(dev) may return -1 if the bios does not
> implement the proximity domain feature, user may use that value
> to call cpumask_of_node and cpumask_of_node does not protect itself
> from node id being -1, which causes out of bound access.

> >> @@ -69,6 +69,12 @@ extern const struct cpumask *cpumask_of_node(int node);
> >>  /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
> >>  static inline const struct cpumask *cpumask_of_node(int node)
> >>  {
> >> +	if (node >= nr_node_ids)
> >> +		return cpu_none_mask;
> >> +
> >> +	if (node < 0 || !node_to_cpumask_map[node])
> >> +		return cpu_online_mask;
> >> +
> >>  	return node_to_cpumask_map[node];
> >>  }
> >>  #endif
> > 
> > I _reallly_ hate this. Users are expected to use valid numa ids. Now
> > we're adding all this checking to all users. Why do we want to do that?
> 
> As above, the dev_to_node(dev) may return -1.
> 
> > 
> > Using '(unsigned)node >= nr_nods_ids' is an error.
> 
> 'node >= nr_node_ids' can be dropped if all user is expected to not call
> cpumask_of_node with node id greater or equal to nr_nods_ids.

you copied my typo :-)

> From what I can see, the problem can be fixed in three place:
> 1. Make user dev_to_node return a valid node id even when proximity
>    domain is not set by bios(or node id set by buggy bios is not valid),
>    which may need info from the numa system to make sure it will return
>    a valid node.
> 
> 2. User that call cpumask_of_node should ensure the node id is valid
>    before calling cpumask_of_node, and user also need some info to
>    make ensure node id is valid.
> 
> 3. Make sure cpumask_of_node deal with invalid node id as this patchset.
> 
> Which one do you prefer to make sure node id is valid, or do you
> have any better idea?
> 
> Any detail advice and suggestion will be very helpful, thanks.

1) because even it is not set, the device really does belong to a node.
It is impossible a device will have magic uniform access to memory when
CPUs cannot.

2) is already true today, cpumask_of_node() requires a valid node_id.

3) is just wrong and increases overhead for everyone.

Huacai Chen Sept. 1, 2019, 4:45 a.m. UTC | #4

Hi, Peter,

I found that this patch has been merged but I haven't received the e-mail for some unknown reasons.
https://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git/commit/?id=1c6c1ca318585f1096d4d04bc722297c85e9fb8a

Firstly, your comments are correct, so the barrier.h part is perfect.

Secondly, most of the rest is useless, because smp_mb__before_llsc, loongson_llsc_mb and other memory barriers are the same thing on Loongson-3. We don't need to add loongson_llsc_mb if there is already a smp_mb__before_llsc.

Thirdly, maybe the only exception is syscall.c, but mips_atomic_set is not used on Loongson-3. And if in some cases we use it, I think the user-to-kernel context switch has the same effect of a memory barrier.

Lastly, I think the real exception is test_and_set_bit_lock, it need a loongson_llsc_mb.

Huacai

Yunsheng Lin Sept. 2, 2019, 5:46 a.m. UTC | #5

On 2019/9/1 0:12, Peter Zijlstra wrote:
> On Sat, Aug 31, 2019 at 06:09:39PM +0800, Yunsheng Lin wrote:
>>
>>
>> On 2019/8/31 16:55, Peter Zijlstra wrote:
>>> On Sat, Aug 31, 2019 at 01:58:16PM +0800, Yunsheng Lin wrote:
>>>> According to Section 6.2.14 from ACPI spec 6.3 [1], the setting
>>>> of proximity domain is optional, as below:
>>>>
>>>> This optional object is used to describe proximity domain
>>>> associations within a machine. _PXM evaluates to an integer
>>>> that identifies a device as belonging to a Proximity Domain
>>>> defined in the System Resource Affinity Table (SRAT).
>>>
>>> That's just words.. what does it actually mean?
>>
>> It means the dev_to_node(dev) may return -1 if the bios does not
>> implement the proximity domain feature, user may use that value
>> to call cpumask_of_node and cpumask_of_node does not protect itself
>> from node id being -1, which causes out of bound access.
> 
>>>> @@ -69,6 +69,12 @@ extern const struct cpumask *cpumask_of_node(int node);
>>>>  /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
>>>>  static inline const struct cpumask *cpumask_of_node(int node)
>>>>  {
>>>> +	if (node >= nr_node_ids)
>>>> +		return cpu_none_mask;
>>>> +
>>>> +	if (node < 0 || !node_to_cpumask_map[node])
>>>> +		return cpu_online_mask;
>>>> +
>>>>  	return node_to_cpumask_map[node];
>>>>  }
>>>>  #endif
>>>
>>> I _reallly_ hate this. Users are expected to use valid numa ids. Now
>>> we're adding all this checking to all users. Why do we want to do that?
>>
>> As above, the dev_to_node(dev) may return -1.
>>
>>>
>>> Using '(unsigned)node >= nr_nods_ids' is an error.
>>
>> 'node >= nr_node_ids' can be dropped if all user is expected to not call
>> cpumask_of_node with node id greater or equal to nr_nods_ids.
> 
> you copied my typo :-)

I did note the typo, corrected the first one, but missed the second one :)

> 
>> From what I can see, the problem can be fixed in three place:
>> 1. Make user dev_to_node return a valid node id even when proximity
>>    domain is not set by bios(or node id set by buggy bios is not valid),
>>    which may need info from the numa system to make sure it will return
>>    a valid node.
>>
>> 2. User that call cpumask_of_node should ensure the node id is valid
>>    before calling cpumask_of_node, and user also need some info to
>>    make ensure node id is valid.
>>
>> 3. Make sure cpumask_of_node deal with invalid node id as this patchset.
>>
>> Which one do you prefer to make sure node id is valid, or do you
>> have any better idea?
>>
>> Any detail advice and suggestion will be very helpful, thanks.
> 
> 1) because even it is not set, the device really does belong to a node.
> It is impossible a device will have magic uniform access to memory when
> CPUs cannot.

So it means dev_to_node() will return either NUMA_NO_NODE or a
valid node id?

> 
> 2) is already true today, cpumask_of_node() requires a valid node_id.

Ok, most of the user does check node_id before calling
cpumask_of_node(), but does a little different type of checking:

1) some does " < 0" check;
2) some does "== NUMA_NO_NODE" check;
3) some does ">= MAX_NUMNODES" check;
4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.


> 
> 3) is just wrong and increases overhead for everyone.

Ok, cpumask_of_node() is also used in some critical path such
as scheduling, which may not need those checking, the overhead
is unnecessary.

But for non-critical path such as setup or configuration path,
it better to have consistent checking, and also simplify the
user code that calls cpumask_of_node().

Do you think it is worth the trouble to add a new function
such as cpumask_of_node_check(maybe some other name) to do
consistent checking?

Or caller just simply check if dev_to_node()'s return value is
NUMA_NO_NODE before calling cpumask_of_node()?


> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 
> .
>

Peter Zijlstra Sept. 2, 2019, 7:25 a.m. UTC | #6

On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
> On 2019/9/1 0:12, Peter Zijlstra wrote:

> > 1) because even it is not set, the device really does belong to a node.
> > It is impossible a device will have magic uniform access to memory when
> > CPUs cannot.
> 
> So it means dev_to_node() will return either NUMA_NO_NODE or a
> valid node id?

NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
said, not a valid device location on a NUMA system.

Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
node association. It just means we don't know and might have to guess.

> > 2) is already true today, cpumask_of_node() requires a valid node_id.
> 
> Ok, most of the user does check node_id before calling
> cpumask_of_node(), but does a little different type of checking:
> 
> 1) some does " < 0" check;
> 2) some does "== NUMA_NO_NODE" check;
> 3) some does ">= MAX_NUMNODES" check;
> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.

The one true way is:

	'(unsigned)node_id >= nr_node_ids'

> > 3) is just wrong and increases overhead for everyone.
> 
> Ok, cpumask_of_node() is also used in some critical path such
> as scheduling, which may not need those checking, the overhead
> is unnecessary.
> 
> But for non-critical path such as setup or configuration path,
> it better to have consistent checking, and also simplify the
> user code that calls cpumask_of_node().
> 
> Do you think it is worth the trouble to add a new function
> such as cpumask_of_node_check(maybe some other name) to do
> consistent checking?
> 
> Or caller just simply check if dev_to_node()'s return value is
> NUMA_NO_NODE before calling cpumask_of_node()?

It is not a matter of convenience. The function is called
cpumask_of_node(), when node < 0 || node >= nr_node_ids, it is not a
valid node, therefore the function shouldn't return anything except an
error.

Also note that the CONFIG_DEBUG_PER_CPU_MAPS version of
cpumask_of_node() already does this (although it wants the below fix).

---
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad600614c..5f49c10201c7 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -861,7 +861,7 @@ void numa_remove_cpu(int cpu)
  */
 const struct cpumask *cpumask_of_node(int node)
 {
-	if (node >= nr_node_ids) {
+	if ((unsigned)node >= nr_node_ids) {
 		printk(KERN_WARNING
 			"cpumask_of_node(%d): node > nr_node_ids(%u)\n",
 			node, nr_node_ids);

Yunsheng Lin Sept. 2, 2019, 12:25 p.m. UTC | #7

On 2019/9/2 15:25, Peter Zijlstra wrote:
> On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
>> On 2019/9/1 0:12, Peter Zijlstra wrote:
> 
>>> 1) because even it is not set, the device really does belong to a node.
>>> It is impossible a device will have magic uniform access to memory when
>>> CPUs cannot.
>>
>> So it means dev_to_node() will return either NUMA_NO_NODE or a
>> valid node id?
> 
> NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
> said, not a valid device location on a NUMA system.
> 
> Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
> node association. It just means we don't know and might have to guess.

How do we guess the device's location when ACPI/BIOS does not set it?

It seems dev_to_node() does not do anything about that and leave the
job to the caller or whatever function that get called with its return
value, such as cpumask_of_node().

> 
>>> 2) is already true today, cpumask_of_node() requires a valid node_id.
>>
>> Ok, most of the user does check node_id before calling
>> cpumask_of_node(), but does a little different type of checking:
>>
>> 1) some does " < 0" check;
>> 2) some does "== NUMA_NO_NODE" check;
>> 3) some does ">= MAX_NUMNODES" check;
>> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.
> 
> The one true way is:
> 
> 	'(unsigned)node_id >= nr_node_ids'

I missed the magic of the "unsigned" in your previous reply.

> 
>>> 3) is just wrong and increases overhead for everyone.
>>
>> Ok, cpumask_of_node() is also used in some critical path such
>> as scheduling, which may not need those checking, the overhead
>> is unnecessary.
>>
>> But for non-critical path such as setup or configuration path,
>> it better to have consistent checking, and also simplify the
>> user code that calls cpumask_of_node().
>>
>> Do you think it is worth the trouble to add a new function
>> such as cpumask_of_node_check(maybe some other name) to do
>> consistent checking?
>>
>> Or caller just simply check if dev_to_node()'s return value is
>> NUMA_NO_NODE before calling cpumask_of_node()?
> 
> It is not a matter of convenience. The function is called
> cpumask_of_node(), when node < 0 || node >= nr_node_ids, it is not a
> valid node, therefore the function shouldn't return anything except an
> error.
what do you mean by error? What I can think is three type of errors:
1) return NULL, this way it seems cpumask_of_node() also leave the
   job to the function that calls it.
2) cpu_none_mask, I am not sure what this means, maybe it means there
   is no cpu on the same node with the device?
3) give a warning, stack dump, or even a BUG_ON?

I would prefer the second one, and implement the third one when the
CONFIG_DEBUG_PER_CPU_MAPS is selected.

Any suggestion?

> 
> Also note that the CONFIG_DEBUG_PER_CPU_MAPS version of
> cpumask_of_node() already does this (although it wants the below fix).

Thanks for the note and example.

Peter Zijlstra Sept. 2, 2019, 12:56 p.m. UTC | #8

On Mon, Sep 02, 2019 at 08:25:24PM +0800, Yunsheng Lin wrote:
> On 2019/9/2 15:25, Peter Zijlstra wrote:
> > On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
> >> On 2019/9/1 0:12, Peter Zijlstra wrote:
> > 
> >>> 1) because even it is not set, the device really does belong to a node.
> >>> It is impossible a device will have magic uniform access to memory when
> >>> CPUs cannot.
> >>
> >> So it means dev_to_node() will return either NUMA_NO_NODE or a
> >> valid node id?
> > 
> > NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
> > said, not a valid device location on a NUMA system.
> > 
> > Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
> > node association. It just means we don't know and might have to guess.
> 
> How do we guess the device's location when ACPI/BIOS does not set it?

See device_add(), it looks to the device's parent and on NO_NODE, puts
it there.

Lacking any hints, just stick it to node0 and print a FW_BUG or
something.

> It seems dev_to_node() does not do anything about that and leave the
> job to the caller or whatever function that get called with its return
> value, such as cpumask_of_node().

Well, dev_to_node() doesn't do anything; nor should it. It are the
callers of set_dev_node() that should be taking care.

Also note how device_add() sets the device node to the parent device's
node on NUMA_NO_NODE. Arguably we should change it to complain when it
finds NUMA_NO_NODE and !parent.

---
 drivers/base/core.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index f0dd8e38fee3..2caf204966a0 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2120,8 +2120,16 @@ int device_add(struct device *dev)
 		dev->kobj.parent = kobj;
 
 	/* use parent numa_node */
-	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
-		set_dev_node(dev, dev_to_node(parent));
+	if (dev_to_node(dev) == NUMA_NO_NODE) {
+		if (parent)
+			set_dev_node(dev, dev_to_node(parent));
+#ifdef CONFIG_NUMA
+		else {
+			pr_err("device: '%s': has no assigned NUMA node\n", dev_name(dev));
+			set_dev_node(dev, 0);
+		}
+#endif
+	}
 
 	/* first, register with generic layer. */
 	/* we require the name to be set before, and pass NULL */

Ingo Molnar Sept. 2, 2019, 6:17 p.m. UTC | #9

* Peter Zijlstra <peterz@infradead.org> wrote:

> Also note that the CONFIG_DEBUG_PER_CPU_MAPS version of
> cpumask_of_node() already does this (although it wants the below fix).
> 
> ---
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index e6dad600614c..5f49c10201c7 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -861,7 +861,7 @@ void numa_remove_cpu(int cpu)
>   */
>  const struct cpumask *cpumask_of_node(int node)
>  {
> -	if (node >= nr_node_ids) {
> +	if ((unsigned)node >= nr_node_ids) {
>  		printk(KERN_WARNING
>  			"cpumask_of_node(%d): node > nr_node_ids(%u)\n",
>  			node, nr_node_ids);

Nitpicking: please also fix the kernel message to say ">=".

With that:

Acked-by: Ingo Molnar <mingo@kernel.org>

Note that:

- arch/arm64/mm/numa.c copied the same sign bug (or unrobustness if we 
  don't want to call it a bug).

- Kudos to the mm/memcontrol.c and kernel/bpf/syscall.c teams for not 
  copying that bug. Booo for none of them fixing the buggy pattern 
  elsewhere in the kernel ;-)

Thanks,

	Ingo

Ingo Molnar Sept. 2, 2019, 6:22 p.m. UTC | #10

* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 02, 2019 at 08:25:24PM +0800, Yunsheng Lin wrote:
> > On 2019/9/2 15:25, Peter Zijlstra wrote:
> > > On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
> > >> On 2019/9/1 0:12, Peter Zijlstra wrote:
> > > 
> > >>> 1) because even it is not set, the device really does belong to a node.
> > >>> It is impossible a device will have magic uniform access to memory when
> > >>> CPUs cannot.
> > >>
> > >> So it means dev_to_node() will return either NUMA_NO_NODE or a
> > >> valid node id?
> > > 
> > > NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
> > > said, not a valid device location on a NUMA system.
> > > 
> > > Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
> > > node association. It just means we don't know and might have to guess.
> > 
> > How do we guess the device's location when ACPI/BIOS does not set it?
> 
> See device_add(), it looks to the device's parent and on NO_NODE, puts
> it there.
> 
> Lacking any hints, just stick it to node0 and print a FW_BUG or
> something.
> 
> > It seems dev_to_node() does not do anything about that and leave the
> > job to the caller or whatever function that get called with its return
> > value, such as cpumask_of_node().
> 
> Well, dev_to_node() doesn't do anything; nor should it. It are the
> callers of set_dev_node() that should be taking care.
> 
> Also note how device_add() sets the device node to the parent device's
> node on NUMA_NO_NODE. Arguably we should change it to complain when it
> finds NUMA_NO_NODE and !parent.
> 
> ---
>  drivers/base/core.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index f0dd8e38fee3..2caf204966a0 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2120,8 +2120,16 @@ int device_add(struct device *dev)
>  		dev->kobj.parent = kobj;
>  
>  	/* use parent numa_node */
> -	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> -		set_dev_node(dev, dev_to_node(parent));
> +	if (dev_to_node(dev) == NUMA_NO_NODE) {
> +		if (parent)
> +			set_dev_node(dev, dev_to_node(parent));
> +#ifdef CONFIG_NUMA
> +		else {
> +			pr_err("device: '%s': has no assigned NUMA node\n", dev_name(dev));
> +			set_dev_node(dev, 0);
> +		}
> +#endif

BTW., is firmware required to always provide a NUMA node on NUMA systems?

I.e. do we really want this warning on non-NUMA systems that don't assign 
NUMA nodes?

Also, even on NUMA systems, is firmware required to provide a NUMA node - 
i.e. is it in principle invalid to offer no NUMA binding?

Thanks,

	Ingo

Peter Zijlstra Sept. 2, 2019, 7:14 p.m. UTC | #11

On Mon, Sep 02, 2019 at 08:22:52PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:

> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index f0dd8e38fee3..2caf204966a0 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -2120,8 +2120,16 @@ int device_add(struct device *dev)
> >  		dev->kobj.parent = kobj;
> >  
> >  	/* use parent numa_node */
> > -	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> > -		set_dev_node(dev, dev_to_node(parent));
> > +	if (dev_to_node(dev) == NUMA_NO_NODE) {
> > +		if (parent)
> > +			set_dev_node(dev, dev_to_node(parent));
> > +#ifdef CONFIG_NUMA
> > +		else {
> > +			pr_err("device: '%s': has no assigned NUMA node\n", dev_name(dev));
> > +			set_dev_node(dev, 0);
> > +		}
> > +#endif
> 
> BTW., is firmware required to always provide a NUMA node on NUMA systems?
> 
> I.e. do we really want this warning on non-NUMA systems that don't assign 
> NUMA nodes?

Good point; we might have to exclude nr_node_ids==1 systems from
warning.

> Also, even on NUMA systems, is firmware required to provide a NUMA node - 
> i.e. is it in principle invalid to offer no NUMA binding?

I think so; a device needs to be _somewhere_, right? Typically though;
devices are on a PCI bus, and the PCI bridge itself will have a NUMA
binding and then the above parent rule will make everything just work.

But I don't see how you can be outside of the NUMA topology.

Yunsheng Lin Sept. 3, 2019, 6:19 a.m. UTC | #12

On 2019/9/2 20:56, Peter Zijlstra wrote:
> On Mon, Sep 02, 2019 at 08:25:24PM +0800, Yunsheng Lin wrote:
>> On 2019/9/2 15:25, Peter Zijlstra wrote:
>>> On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
>>>> On 2019/9/1 0:12, Peter Zijlstra wrote:
>>>
>>>>> 1) because even it is not set, the device really does belong to a node.
>>>>> It is impossible a device will have magic uniform access to memory when
>>>>> CPUs cannot.
>>>>
>>>> So it means dev_to_node() will return either NUMA_NO_NODE or a
>>>> valid node id?
>>>
>>> NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
>>> said, not a valid device location on a NUMA system.
>>>
>>> Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
>>> node association. It just means we don't know and might have to guess.
>>
>> How do we guess the device's location when ACPI/BIOS does not set it?
> 
> See device_add(), it looks to the device's parent and on NO_NODE, puts
> it there.
> 
> Lacking any hints, just stick it to node0 and print a FW_BUG or
> something.
> 
>> It seems dev_to_node() does not do anything about that and leave the
>> job to the caller or whatever function that get called with its return
>> value, such as cpumask_of_node().
> 
> Well, dev_to_node() doesn't do anything; nor should it. It are the
> callers of set_dev_node() that should be taking care.
> 
> Also note how device_add() sets the device node to the parent device's
> node on NUMA_NO_NODE. Arguably we should change it to complain when it
> finds NUMA_NO_NODE and !parent.

Is it possible that the node id set by device_add() become invalid
if the node is offlined, then dev_to_node() may return a invalid
node id.

From the comment in select_fallback_rq(), it seems that a node can
be offlined, not sure if node offline process has taken cared of that?

	/*
         * If the node that the CPU is on has been offlined, cpu_to_node()
         * will return -1. There is no CPU on the node, and we should
         * select the CPU on the other node.
         */


With the above assumption that a device is always on a valid node,
the node id returned from dev_to_node() can be safely passed to
cpumask_of_node() without any checking?

> 
> ---
>  drivers/base/core.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index f0dd8e38fee3..2caf204966a0 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2120,8 +2120,16 @@ int device_add(struct device *dev)
>  		dev->kobj.parent = kobj;
>  
>  	/* use parent numa_node */
> -	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> -		set_dev_node(dev, dev_to_node(parent));
> +	if (dev_to_node(dev) == NUMA_NO_NODE) {
> +		if (parent)
> +			set_dev_node(dev, dev_to_node(parent));
> +#ifdef CONFIG_NUMA
> +		else {
> +			pr_err("device: '%s': has no assigned NUMA node\n", dev_name(dev));
> +			set_dev_node(dev, 0);
> +		}
> +#endif
> +	}
>  
>  	/* first, register with generic layer. */
>  	/* we require the name to be set before, and pass NULL */
> 
> .
>

Peter Zijlstra Sept. 3, 2019, 7:11 a.m. UTC | #13

On Tue, Sep 03, 2019 at 02:19:04PM +0800, Yunsheng Lin wrote:
> On 2019/9/2 20:56, Peter Zijlstra wrote:
> > On Mon, Sep 02, 2019 at 08:25:24PM +0800, Yunsheng Lin wrote:
> >> On 2019/9/2 15:25, Peter Zijlstra wrote:
> >>> On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
> >>>> On 2019/9/1 0:12, Peter Zijlstra wrote:
> >>>
> >>>>> 1) because even it is not set, the device really does belong to a node.
> >>>>> It is impossible a device will have magic uniform access to memory when
> >>>>> CPUs cannot.
> >>>>
> >>>> So it means dev_to_node() will return either NUMA_NO_NODE or a
> >>>> valid node id?
> >>>
> >>> NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
> >>> said, not a valid device location on a NUMA system.
> >>>
> >>> Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
> >>> node association. It just means we don't know and might have to guess.
> >>
> >> How do we guess the device's location when ACPI/BIOS does not set it?
> > 
> > See device_add(), it looks to the device's parent and on NO_NODE, puts
> > it there.
> > 
> > Lacking any hints, just stick it to node0 and print a FW_BUG or
> > something.
> > 
> >> It seems dev_to_node() does not do anything about that and leave the
> >> job to the caller or whatever function that get called with its return
> >> value, such as cpumask_of_node().
> > 
> > Well, dev_to_node() doesn't do anything; nor should it. It are the
> > callers of set_dev_node() that should be taking care.
> > 
> > Also note how device_add() sets the device node to the parent device's
> > node on NUMA_NO_NODE. Arguably we should change it to complain when it
> > finds NUMA_NO_NODE and !parent.
> 
> Is it possible that the node id set by device_add() become invalid
> if the node is offlined, then dev_to_node() may return a invalid
> node id.

In that case I would expect the device to go away too. Once the memory
controller goes away, the PCI bus connected to it cannot continue to
function.

> From the comment in select_fallback_rq(), it seems that a node can
> be offlined, not sure if node offline process has taken cared of that?
> 
> 	/*
>          * If the node that the CPU is on has been offlined, cpu_to_node()
>          * will return -1. There is no CPU on the node, and we should
>          * select the CPU on the other node.
>          */

Ugh, so I disagree with that notion. cpu_to_node() mapping should be
fixed, you simply cannot change it after boot, too much stuff relies on
it.

Setting cpu_to_node to -1 on node offline is just wrong. But alas, it
seems this is already so.

> With the above assumption that a device is always on a valid node,
> the node id returned from dev_to_node() can be safely passed to
> cpumask_of_node() without any checking?

Yunsheng Lin Sept. 3, 2019, 8:31 a.m. UTC | #14

On 2019/9/3 15:11, Peter Zijlstra wrote:
> On Tue, Sep 03, 2019 at 02:19:04PM +0800, Yunsheng Lin wrote:
>> On 2019/9/2 20:56, Peter Zijlstra wrote:
>>> On Mon, Sep 02, 2019 at 08:25:24PM +0800, Yunsheng Lin wrote:
>>>> On 2019/9/2 15:25, Peter Zijlstra wrote:
>>>>> On Mon, Sep 02, 2019 at 01:46:51PM +0800, Yunsheng Lin wrote:
>>>>>> On 2019/9/1 0:12, Peter Zijlstra wrote:
>>>>>
>>>>>>> 1) because even it is not set, the device really does belong to a node.
>>>>>>> It is impossible a device will have magic uniform access to memory when
>>>>>>> CPUs cannot.
>>>>>>
>>>>>> So it means dev_to_node() will return either NUMA_NO_NODE or a
>>>>>> valid node id?
>>>>>
>>>>> NUMA_NO_NODE := -1, which is not a valid node number. It is also, like I
>>>>> said, not a valid device location on a NUMA system.
>>>>>
>>>>> Just because ACPI/BIOS is shit, doesn't mean the device doesn't have a
>>>>> node association. It just means we don't know and might have to guess.
>>>>
>>>> How do we guess the device's location when ACPI/BIOS does not set it?
>>>
>>> See device_add(), it looks to the device's parent and on NO_NODE, puts
>>> it there.
>>>
>>> Lacking any hints, just stick it to node0 and print a FW_BUG or
>>> something.
>>>
>>>> It seems dev_to_node() does not do anything about that and leave the
>>>> job to the caller or whatever function that get called with its return
>>>> value, such as cpumask_of_node().
>>>
>>> Well, dev_to_node() doesn't do anything; nor should it. It are the
>>> callers of set_dev_node() that should be taking care.
>>>
>>> Also note how device_add() sets the device node to the parent device's
>>> node on NUMA_NO_NODE. Arguably we should change it to complain when it
>>> finds NUMA_NO_NODE and !parent.
>>
>> Is it possible that the node id set by device_add() become invalid
>> if the node is offlined, then dev_to_node() may return a invalid
>> node id.
> 
> In that case I would expect the device to go away too. Once the memory
> controller goes away, the PCI bus connected to it cannot continue to
> function.

Ok. To summarize the discussion in order to for me to understand it
correctly:

1) Make sure device_add() set to default node0 to a device if
   ACPI/BIOS does not set the node id and it has not no parent device.

2) Use '(unsigned)node_id >= nr_node_ids' to fix the
   CONFIG_DEBUG_PER_CPU_MAPS version of cpumask_of_node() for x86
   and arm64, x86 just has a fix from you now, a patch for arm64 is
   also needed.

3) Maybe fix some other the sign bug for node id checking through the
   kernel using the '(unsigned)node_id >= nr_node_ids'.

Please see if I understand it correctly or miss something.
Maybe I can begin by sending a patch about item one to see if everyone
is ok with the idea?


> 
>> From the comment in select_fallback_rq(), it seems that a node can
>> be offlined, not sure if node offline process has taken cared of that?
>>
>> 	/*
>>          * If the node that the CPU is on has been offlined, cpu_to_node()
>>          * will return -1. There is no CPU on the node, and we should
>>          * select the CPU on the other node.
>>          */
> 
> Ugh, so I disagree with that notion. cpu_to_node() mapping should be
> fixed, you simply cannot change it after boot, too much stuff relies on
> it.
> 
> Setting cpu_to_node to -1 on node offline is just wrong. But alas, it
> seems this is already so.

Peter Zijlstra Sept. 4, 2019, 9:21 a.m. UTC | #15

*why* are you replying to some random unrelated thread?

Also, please use a sane MUA and wrap your lines <80 chars.

On Wed, Sep 04, 2019 at 10:03:31AM +0800, huangpei@loongson.cn wrote:
> >Hi, Peter,
> >
> >I found that this patch has been merged but I haven't received the e-mail for some unknown reasons.
> >https://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git/commit/?id=1c6c1ca318585f1096d4d04bc722297c85e9fb8a
> >
> >Firstly, your comments are correct, so the barrier.h part is perfect.
> >
> >Secondly, most of the rest is useless, because smp_mb__before_llsc, loongson_llsc_mb and other memory barriers are the same thing on Loongson-3. We don't need to add loongson_llsc_mb if there is already a smp_mb__before_llsc.

There wasn't. Take for example set_bit(), that didn't have
smp_mb__before_llsc on.

Also; MIPS should probably convert to asm-generic/bitops/atomic.h.

> >Thirdly, maybe the only exception is syscall.c, but mips_atomic_set is not used on Loongson-3. And if in some cases we use it, I think the user-to-kernel context switch has the same effect of a memory barrier.

And how is some random person trying to make sense of MIPS to know that?

You all created a badly documented inconsitent trainwreck. You're
'lucky' the MIPS maintainers accepted that mess in the first place.

Anyway, yes there are too many barrers now in some cases, in a previous
version I had:

  https://lkml.kernel.org/r/20190424124421.693353463@infradead.org

But because I dropped changes to local.h that might not be true anymore;
it needs careful consideration. Please audit carefully and if you find
all smp_mb__before_llsc() usage is now superfluous for this 'funny' chip
of yours, then re-submit the above patch.

> +. per-cpu like local_t *should only* be written by local cpu, and may be read by remote cpu sometimes
> 
> +. if and only if local cpu can write per-cpu, then Loongson3's llsc bug would not be triggerd.
> 
> same as this_cpu_cmpxchg_double
> 
> If so, then no need to add sync before and after cmpxchg_local

Correct, we already dropped the change for other local.h stuff.

Peter Zijlstra Sept. 4, 2019, 10:04 a.m. UTC | #16

On Wed, Sep 04, 2019 at 11:21:54AM +0200, Peter Zijlstra wrote:
> 
> *why* are you replying to some random unrelated thread?
> 
> Also, please use a sane MUA and wrap your lines <80 chars.
> 
> On Wed, Sep 04, 2019 at 10:03:31AM +0800, huangpei@loongson.cn wrote:
> > >Hi, Peter,
> > >
> > >I found that this patch has been merged but I haven't received the e-mail for some unknown reasons.
> > >https://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git/commit/?id=1c6c1ca318585f1096d4d04bc722297c85e9fb8a
> > >
> > >Firstly, your comments are correct, so the barrier.h part is perfect.
> > >
> > >Secondly, most of the rest is useless, because smp_mb__before_llsc, loongson_llsc_mb and other memory barriers are the same thing on Loongson-3. We don't need to add loongson_llsc_mb if there is already a smp_mb__before_llsc.
> 
> There wasn't. Take for example set_bit(), that didn't have
> smp_mb__before_llsc on.
> 
> Also; MIPS should probably convert to asm-generic/bitops/atomic.h.
> 
> > >Thirdly, maybe the only exception is syscall.c, but mips_atomic_set is not used on Loongson-3. And if in some cases we use it, I think the user-to-kernel context switch has the same effect of a memory barrier.
> 
> And how is some random person trying to make sense of MIPS to know that?
> 
> You all created a badly documented inconsitent trainwreck. You're
> 'lucky' the MIPS maintainers accepted that mess in the first place.
> 
> Anyway, yes there are too many barrers now in some cases, in a previous
> version I had:
> 
>   https://lkml.kernel.org/r/20190424124421.693353463@infradead.org
> 
> But because I dropped changes to local.h that might not be true anymore;
> it needs careful consideration. Please audit carefully and if you find
> all smp_mb__before_llsc() usage is now superfluous for this 'funny' chip
> of yours, then re-submit the above patch.

I think we're good, smp_mb__{before,after}_atomic() already doesn't work
with local_t.

Huang Pei Sept. 4, 2019, 12:57 p.m. UTC | #17

On 09/04/2019 05:21 PM, Peter Zijlstra wrote:
> 
> *why* are you replying to some random unrelated thread?

Chen ask me if whether your patch has more sync than needed, but I'm not sure 
whether sync before and after *cmpxchg_local* is.
> 
> Also, please use a sane MUA and wrap your lines <80 chars.

Sorry, I finally got thunderbird in plain text mode with < 80 chars. It wont 
happen again.

> 
> On Wed, Sep 04, 2019 at 10:03:31AM +0800, huangpei@loongson.cn wrote:
>>> Hi, Peter,
>>>
>>> I found that this patch has been merged but I haven't received the e-mail for some unknown reasons.
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git/commit/?id=1c6c1ca318585f1096d4d04bc722297c85e9fb8a
>>>
>>> Firstly, your comments are correct, so the barrier.h part is perfect.
>>>
>>> Secondly, most of the rest is useless, because smp_mb__before_llsc, loongson_llsc_mb and other memory barriers are the same thing on Loongson-3. We don't need to add loongson_llsc_mb if there is already a smp_mb__before_llsc.
> 
> There wasn't. Take for example set_bit(), that didn't have
> smp_mb__before_llsc on.
> 
> Also; MIPS should probably convert to asm-generic/bitops/atomic.h.
> 
>>> Thirdly, maybe the only exception is syscall.c, but mips_atomic_set is not used on Loongson-3. And if in some cases we use it, I think the user-to-kernel context switch has the same effect of a memory barrier.
> 
> And how is some random person trying to make sense of MIPS to know that?
> 
> You all created a badly documented inconsitent trainwreck. You're
> 'lucky' the MIPS maintainers accepted that mess in the first place.
> 
> Anyway, yes there are too many barrers now in some cases, in a previous
> version I had:
> 
>    https://lkml.kernel.org/r/20190424124421.693353463@infradead.org
> 
> But because I dropped changes to local.h that might not be true anymore;
> it needs careful consideration. Please audit carefully and if you find
> all smp_mb__before_llsc() usage is now superfluous for this 'funny' chip
> of yours, then re-submit the above patch.
> 
>> +. per-cpu like local_t *should only* be written by local cpu, and may be read by remote cpu sometimes
>>
>> +. if and only if local cpu can write per-cpu, then Loongson3's llsc bug would not be triggerd.
>>
>> same as this_cpu_cmpxchg_double
>>
>> If so, then no need to add sync before and after cmpxchg_local
> 
> Correct, we already dropped the change for other local.h stuff.
> 

What about cmpxchg_local? Your patch add sync before and after ll/sc in 
__cmpxchg, so *cmpxchg_local* has sync around it. But cmpxchg_local operate on 
per-cpu, which *shall not* trigger loongson's LLSC bug, since only *this* cpu 
write, other cpus read.

[v2,2/9] x86: numa: check the node id consistently for x86

Commit Message

Comments

Patch