
[02/26] rewrite READ_ONCE/WRITE_ONCE

Message ID: 20170302163834.2273519-3-arnd@arndb.de (mailing list archive)
State: Not Applicable
Delegated to: Kalle Valo

Commit Message

Arnd Bergmann March 2, 2017, 4:38 p.m. UTC
When CONFIG_KASAN is enabled, the READ_ONCE/WRITE_ONCE macros cause
rather large kernel stacks, e.g.:

mm/vmscan.c: In function 'shrink_page_list':
mm/vmscan.c:1333:1: error: the frame size of 3456 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
block/cfq-iosched.c: In function 'cfqg_stats_add_aux':
block/cfq-iosched.c:750:1: error: the frame size of 4048 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
fs/btrfs/disk-io.c: In function 'open_ctree':
fs/btrfs/disk-io.c:3314:1: error: the frame size of 3136 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
fs/btrfs/relocation.c: In function 'build_backref_tree':
fs/btrfs/relocation.c:1193:1: error: the frame size of 4336 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
fs/fscache/stats.c: In function 'fscache_stats_show':
fs/fscache/stats.c:287:1: error: the frame size of 6512 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
fs/jbd2/commit.c: In function 'jbd2_journal_commit_transaction':
fs/jbd2/commit.c:1139:1: error: the frame size of 3760 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]

This attempts a rewrite of the two macros, using a simpler implementation
for the most common case of having a naturally aligned 1, 2, 4, or (on
64-bit architectures) 8-byte object that can be accessed with a single
instruction.  For these, we go back to a volatile pointer dereference
that we had with the ACCESS_ONCE macro.
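
For reference, the classic definition of ACCESS_ONCE() was just such a
dereference:

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))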

READ_ONCE/WRITE_ONCE also try to handle unaligned objects and objects
of other sizes by forcing either a word-size access (which may trap
on some architectures) or doing a non-atomic memcpy. I could not figure
out what these are actually used for, but they appear to be done
intentionally, so I'm leaving that code untouched.

I had to fix up a couple of files that either use WRITE_ONCE() as an
implicit typecast, or ignore the result of READ_ONCE(). In all cases,
the modified code seems no worse to me than the original.

Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/include/asm/switch_to.h |  2 +-
 fs/overlayfs/util.c              |  6 ++---
 include/linux/compiler.h         | 47 ++++++++++++++++++++++++++++++++--------
 3 files changed, 42 insertions(+), 13 deletions(-)

Comments

Christian Borntraeger March 2, 2017, 4:51 p.m. UTC | #1
On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
> When CONFIG_KASAN is enabled, the READ_ONCE/WRITE_ONCE macros cause
> rather large kernel stacks, e.g.:
> 
> mm/vmscan.c: In function 'shrink_page_list':
> mm/vmscan.c:1333:1: error: the frame size of 3456 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
> block/cfq-iosched.c: In function 'cfqg_stats_add_aux':
> block/cfq-iosched.c:750:1: error: the frame size of 4048 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
> fs/btrfs/disk-io.c: In function 'open_ctree':
> fs/btrfs/disk-io.c:3314:1: error: the frame size of 3136 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
> fs/btrfs/relocation.c: In function 'build_backref_tree':
> fs/btrfs/relocation.c:1193:1: error: the frame size of 4336 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
> fs/fscache/stats.c: In function 'fscache_stats_show':
> fs/fscache/stats.c:287:1: error: the frame size of 6512 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
> fs/jbd2/commit.c: In function 'jbd2_journal_commit_transaction':
> fs/jbd2/commit.c:1139:1: error: the frame size of 3760 bytes is larger than 3072 bytes [-Werror=frame-larger-than=]
> 
> This attempts a rewrite of the two macros, using a simpler implementation
> for the most common case of having a naturally aligned 1, 2, 4, or (on
> 64-bit architectures) 8  byte object that can be accessed with a single
> instruction.  For these, we go back to a volatile pointer dereference
> that we had with the ACCESS_ONCE macro.

We had changed that back then because gcc 4.6 and 4.7 had a bug that could
drop the volatile qualifier on accesses to aggregate types like the following one

union ipte_control {
        unsigned long val;
        struct {
                unsigned long k  : 1;
                unsigned long kh : 31;
                unsigned long kg : 32;
        };
};

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145

If I see that right, your __ALIGNED_WORD(x)
macro would say that for the above structure sizeof(x) == sizeof(long) is true,
so it would fall back to the old volatile cast and might reintroduce the
old compiler bug?

Could you maybe fence your simple macro for anything older than 4.9? After
all, there was no kasan support anyway on these older gcc versions.
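
Something along these lines maybe, just as a sketch (using the GCC_VERSION
helper from compiler-gcc.h, pick whatever cutoff you prefer):

#if defined(GCC_VERSION) && GCC_VERSION < 40900
/* gcc 4.6/4.7 could lose volatile on aggregates, see gcc PR58145 */
# define __ALIGNED_WORD(x)	0
#else
# define __ALIGNED_WORD(x)						\
	((sizeof(x) == 1 || sizeof(x) == 2 || sizeof(x) == 4 ||	\
	  sizeof(x) == sizeof(long)) && (sizeof(x) == __alignof__(x)))
#endif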

Christian
Arnd Bergmann March 2, 2017, 5:55 p.m. UTC | #2
On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
>>
>> This attempts a rewrite of the two macros, using a simpler implementation
>> for the most common case of having a naturally aligned 1, 2, 4, or (on
>> 64-bit architectures) 8  byte object that can be accessed with a single
>> instruction.  For these, we go back to a volatile pointer dereference
>> that we had with the ACCESS_ONCE macro.
>
> We had changed that back then because gcc 4.6 and 4.7 had a bug that could
> drop the volatile qualifier on accesses to aggregate types like the following one
>
> union ipte_control {
>         unsigned long val;
>         struct {
>                 unsigned long k  : 1;
>                 unsigned long kh : 31;
>                 unsigned long kg : 32;
>         };
> };
>
> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
>
> If I see that right, your __ALIGNED_WORD(x)
> macro would say that for the above structure sizeof(x) == sizeof(long) is true,
> so it would fall back to the old volatile cast and might reintroduce the
> old compiler bug?

Ah, right, that's the missing piece. For some reason I didn't find
the reference in the source or the git log.

> Could you maybe fence your simple macro for anything older than 4.9? After
> all, there was no kasan support anyway on these older gcc versions.

Yes, that should work, thanks!

     Arnd
Christian Borntraeger March 2, 2017, 7 p.m. UTC | #3
On 03/02/2017 06:55 PM, Arnd Bergmann wrote:
> On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger
> <borntraeger@de.ibm.com> wrote:
>> On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
>>>
>>> This attempts a rewrite of the two macros, using a simpler implementation
>>> for the most common case of having a naturally aligned 1, 2, 4, or (on
>>> 64-bit architectures) 8  byte object that can be accessed with a single
>>> instruction.  For these, we go back to a volatile pointer dereference
>>> that we had with the ACCESS_ONCE macro.
>>
>> We had changed that back then because gcc 4.6 and 4.7 had a bug that could
>> drop the volatile qualifier on accesses to aggregate types like the following one
>>
>> union ipte_control {
>>         unsigned long val;
>>         struct {
>>                 unsigned long k  : 1;
>>                 unsigned long kh : 31;
>>                 unsigned long kg : 32;
>>         };
>> };
>>
>> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
>>
>> If I see that right, your __ALIGNED_WORD(x)
>> macro would say that for the above structure sizeof(x) == sizeof(long) is true,
>> so it would fall back to the old volatile cast and might reintroduce the
>> old compiler bug?

Oh dear, I should double check my sentences in emails before sending...anyway
the full story is referenced in 

commit 60815cf2e05057db5b78e398d9734c493560b11e
    Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
which has a pointer to
http://marc.info/?i=54611D86.4040306%40de.ibm.com
which contains the full story.

> 
> Ah, right, that's the missing piece. For some reason I didn't find
> the reference in the source or the git log.
> 
>> Could you maybe fence your simple macro for anything older than 4.9? After
>> all, there was no kasan support anyway on these older gcc versions.
> 
> Yes, that should work, thanks!
Arnd Bergmann March 2, 2017, 9:45 p.m. UTC | #4
On Thu, Mar 2, 2017 at 8:00 PM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> On 03/02/2017 06:55 PM, Arnd Bergmann wrote:
>> On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger
>> <borntraeger@de.ibm.com> wrote:
>>> On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
>>>>
>>>> This attempts a rewrite of the two macros, using a simpler implementation
>>>> for the most common case of having a naturally aligned 1, 2, 4, or (on
>>>> 64-bit architectures) 8  byte object that can be accessed with a single
>>>> instruction.  For these, we go back to a volatile pointer dereference
>>>> that we had with the ACCESS_ONCE macro.
>>>
>>> We had changed that back then because gcc 4.6 and 4.7 had a bug that could
>>> drop the volatile qualifier on accesses to aggregate types like the following one
>>>
>>> union ipte_control {
>>>         unsigned long val;
>>>         struct {
>>>                 unsigned long k  : 1;
>>>                 unsigned long kh : 31;
>>>                 unsigned long kg : 32;
>>>         };
>>> };
>>>
>>> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
>>>
>>> If I see that right, your __ALIGNED_WORD(x)
>>> macro would say that for the above structure sizeof(x) == sizeof(long) is true,
>>> so it would fall back to the old volatile cast and might reintroduce the
>>> old compiler bug?
>
> Oh dear, I should double check my sentences in emails before sending...anyway
> the full story is referenced in
>
> commit 60815cf2e05057db5b78e398d9734c493560b11e
>     Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
> which has a pointer to
> http://marc.info/?i=54611D86.4040306%40de.ibm.com
> which contains the full story.

Ok, got it. So I guess the behavior of forcing aligned accesses on aligned
data is accidental, and allowing non-power-of-two arguments is also not
the main purpose. Maybe we could just bail out on new compilers if we get
either of those? That might catch code that accidentally does something
that is inherently non-atomic or that causes a trap when the intention was
to have a simple atomic access.
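
Roughly what I have in mind, untested and just to illustrate the "bail out":

	compiletime_assert(__ALIGNED_WORD(x),
			   "READ_ONCE()/WRITE_ONCE() on unaligned or oddly sized object");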

     Arnd
Christian Borntraeger March 3, 2017, 8:26 a.m. UTC | #5
On 03/02/2017 10:45 PM, Arnd Bergmann wrote:
> On Thu, Mar 2, 2017 at 8:00 PM, Christian Borntraeger
> <borntraeger@de.ibm.com> wrote:
>> On 03/02/2017 06:55 PM, Arnd Bergmann wrote:
>>> On Thu, Mar 2, 2017 at 5:51 PM, Christian Borntraeger
>>> <borntraeger@de.ibm.com> wrote:
>>>> On 03/02/2017 05:38 PM, Arnd Bergmann wrote:
>>>>>
>>>>> This attempts a rewrite of the two macros, using a simpler implementation
>>>>> for the most common case of having a naturally aligned 1, 2, 4, or (on
>>>>> 64-bit architectures) 8  byte object that can be accessed with a single
>>>>> instruction.  For these, we go back to a volatile pointer dereference
>>>>> that we had with the ACCESS_ONCE macro.
>>>>
>>>> We had changed that back then because gcc 4.6 and 4.7 had a bug that could
>>>> drop the volatile qualifier on accesses to aggregate types like the following one
>>>>
>>>> union ipte_control {
>>>>         unsigned long val;
>>>>         struct {
>>>>                 unsigned long k  : 1;
>>>>                 unsigned long kh : 31;
>>>>                 unsigned long kg : 32;
>>>>         };
>>>> };
>>>>
>>>> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
>>>>
>>>> If I see that right, your __ALIGNED_WORD(x)
>>>> macro would say that for the above structure sizeof(x) == sizeof(long) is true,
>>>> so it would fall back to the old volatile cast and might reintroduce the
>>>> old compiler bug?
>>
>> Oh dear, I should double check my sentences in emails before sending...anyway
>> the full story is referenced in
>>
>> commit 60815cf2e05057db5b78e398d9734c493560b11e
>>     Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
>> which has a pointer to
>> http://marc.info/?i=54611D86.4040306%40de.ibm.com
>> which contains the full story.
> 
> Ok, got it. So I guess the behavior of forcing aligned accesses on aligned
> data is accidental, and allowing non-power-of-two arguments is also not
> the main purpose.


Right. The main purpose is to read/write _ONCE_. You can assume a somewhat
atomic access for sizes <= word size. And there are certainly places that
rely on that. But the *ONCE thing is mostly used for things where we used
barrier() 10 years ago.
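
I.e. the kind of loop that was written as

	while (!flag)
		barrier();

ten years ago is now

	while (!READ_ONCE(flag))
		;

where the point is only that the value is re-read on every iteration,
not that the access is atomic.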


> Maybe we could just bail out on new compilers if we get
> either of those? That might catch code that accidentally does something
> that is inherently non-atomic or that causes a trap when the intention was
> to have a simple atomic access.

I think Linus stated that it's ok to assume that the compiler is smart enough
to use a single instruction to access aligned and properly sized scalar types
for *ONCE.

Back then when I changed ACCESS_ONCE there were many places that did use it
for non-atomic, > word size accesses. For example, on some architectures a pmd_t
is a typedef to an array, which there is no way to read atomically.
So the focus must be on the "ONCE" part.
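
(Purely to illustrate the shape I mean, not any particular architecture's
real definition:

	typedef struct {
		unsigned long pmd[2];	/* two words, no single-copy read */
	} pmd_t;

for something like this READ_ONCE() can only give you the "once" part.)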

If some code uses a properly aligned, word sized object we can also assume 
atomic access. If the access is not properly sized/aligned we do not get
atomicity, but we do get the "ONCE".
But adding a check for alignment/size would break the compilation of some
code.
Arnd Bergmann March 3, 2017, 9:54 a.m. UTC | #6
On Fri, Mar 3, 2017 at 9:26 AM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> On 03/02/2017 10:45 PM, Arnd Bergmann wrote:
>> Ok, got it. So I guess the behavior of forcing aligned accesses on aligned
>> data is accidental, and allowing non-power-of-two arguments is also not
>> the main purpose.
>
>
> Right. The main purpose is to read/write _ONCE_. You can assume a somewhat
> atomic access for sizes <= word size. And there are certainly places that
> rely on that. But the *ONCE thing is mostly used for things where we used
> barrier() 10 years ago.

Ok

>
>> Maybe we could just bail out on new compilers if we get
>> either of those? That might catch code that accidentally does something
>> that is inherently non-atomic or that causes a trap when the intention was
>> to have a simple atomic access.
>
> I think Linus stated that it's ok to assume that the compiler is smart enough
> to use a single instruction to access aligned and properly sized scalar types
> for *ONCE.
>
> Back then when I changed ACCESS_ONCE there were many places that did use it
> for non-atomic, > word size accesses. For example, on some architectures a pmd_t
> is a typedef to an array, which there is no way to read atomically.
> So the focus must be on the "ONCE" part.
>
> If some code uses a properly aligned, word sized object we can also assume
> atomic access. If the access is not properly sized/aligned we do not get
> atomicity, but we do get the "ONCE".
> But adding a check for alignment/size would break the compilation of some
> code.

So what should be the expected behavior for objects that have a smaller
alignment? E.g. this structure

struct fourbytes {
   char bytes[4];
} __packed;

when passed into the current READ_ONCE() will be accessed with
a 32-bit load, while reading it with

struct fourbytes local = *(volatile struct fourbytes *)voidpointer;

on architectures like ARMv5 or lower will turn into four single-byte
reads to avoid an alignment trap when the pointer is actually
unaligned.
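
(The 32-bit load comes from the size-4 case in __read_once_size(), which
boils down to

	case 4: *(__u32 *)res = *(volatile __u32 *)p; break;

i.e. a single volatile u32 load no matter how p is actually aligned.)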

I can see arguments for and against either behavior, but what should
I do when modifying it for newer compilers? The possible options
that I see are

- keep assuming that the pointer will be aligned at runtime
  and doesn't trap
- use the regular gcc behavior and do byte-accesses on those
  architectures that otherwise might trap
- add a runtime alignment check to do atomic accesses whenever
  possible, but never trap (sketched below)
- fail the build
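
For the third option I am thinking of something along these lines
(hand-waving, 32-bit case only, helper name made up):

	static __always_inline u32 __read_once_u32(const volatile void *p)
	{
		u32 val;

		if (likely(!((unsigned long)p & 3)))
			return *(volatile u32 *)p;	/* aligned: single load */

		/* unaligned: still read once, never atomic, never traps */
		barrier();
		__builtin_memcpy(&val, (const void *)p, sizeof(val));
		barrier();
		return val;
	}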

     Arnd
Peter Zijlstra March 3, 2017, 2:49 p.m. UTC | #7
On Fri, Mar 03, 2017 at 09:26:50AM +0100, Christian Borntraeger wrote:
> Right. The main purpose is to read/write _ONCE_. You can assume a somewhat
> atomic access for sizes <= word size. And there are certainly places that
> rely on that. But the *ONCE thing is mostly used for things where we used
> barrier() 10 years ago.

A lot of code relies on READ/WRITE_ONCE() to generate single
instructions for naturally aligned machine-word-sized loads/stores
(something GCC used to guarantee, but no longer does, IIRC).

So much so that I would say it's a bug if READ/WRITE_ONCE() doesn't
generate a single instruction under those conditions.

However, every time I've tried to introduce stricter
semantics/primitives to verify things Linus hated it.
Peter Zijlstra March 3, 2017, 2:57 p.m. UTC | #8
On Fri, Mar 03, 2017 at 03:49:38PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 03, 2017 at 09:26:50AM +0100, Christian Borntraeger wrote:
> > Right. The main purpose is to read/write _ONCE_. You can assume a somewhat
> > atomic access for sizes <= word size. And there are certainly places that
> > rely on that. But the *ONCE thing is mostly used for things where we used
> > barrier() 10 years ago.
> 
> A lot of code relies on READ/WRITE_ONCE() to generate single
> instructions for naturally aligned machine-word-sized loads/stores
> (something GCC used to guarantee, but no longer does, IIRC).
> 
> So much so that I would say it's a bug if READ/WRITE_ONCE() doesn't
> generate a single instruction under those conditions.
> 
> However, every time I've tried to introduce stricter
> semantics/primitives to verify things Linus hated it.

See here for the last attempt:

  https://marc.info/?l=linux-virtualization&m=148007765918101&w=2

Patch

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index fcc5cd387fd1..0c243dd569fe 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -30,7 +30,7 @@  static inline void prepare_switch_to(struct task_struct *prev,
 	 *
 	 * To minimize cache pollution, just follow the stack pointer.
 	 */
-	READ_ONCE(*(unsigned char *)next->thread.sp);
+	(void)READ_ONCE(*(unsigned char *)next->thread.sp);
 #endif
 }
 
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 952286f4826c..1c10632a48bb 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -222,8 +222,8 @@  void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
 
 void ovl_inode_init(struct inode *inode, struct inode *realinode, bool is_upper)
 {
-	WRITE_ONCE(inode->i_private, (unsigned long) realinode |
-		   (is_upper ? OVL_ISUPPER_MASK : 0));
+	WRITE_ONCE(inode->i_private, (void *)((unsigned long) realinode |
+		   (is_upper ? OVL_ISUPPER_MASK : 0)));
 }
 
 void ovl_inode_update(struct inode *inode, struct inode *upperinode)
@@ -231,7 +231,7 @@  void ovl_inode_update(struct inode *inode, struct inode *upperinode)
 	WARN_ON(!upperinode);
 	WARN_ON(!inode_unhashed(inode));
 	WRITE_ONCE(inode->i_private,
-		   (unsigned long) upperinode | OVL_ISUPPER_MASK);
+		   (void *)((unsigned long) upperinode | OVL_ISUPPER_MASK));
 	if (!S_ISDIR(upperinode->i_mode))
 		__insert_inode_hash(inode, (unsigned long) upperinode);
 }
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 56b90897a459..b619f5853af8 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -288,6 +288,10 @@  static __always_inline void __write_once_size(volatile void *p, void *res, int s
 	}
 }
 
+#define __ALIGNED_WORD(x)						\
+	((sizeof(x) == 1 || sizeof(x) == 2 || sizeof(x) == 4 ||		\
+	  sizeof(x) == sizeof(long)) && (sizeof(x) == __alignof__(x)))	\
+
 /*
  * Prevent the compiler from merging or refetching reads or writes. The
  * compiler is also forbidden from reordering successive instances of
@@ -309,8 +313,13 @@  static __always_inline void __write_once_size(volatile void *p, void *res, int s
  * mutilate accesses that either do not require ordering or that interact
  * with an explicit memory barrier or atomic instruction that provides the
  * required ordering.
+ *
+ * Unaligned data is particularly tricky here: if the type that gets
+ * passed in is not naturally aligned, we cast to a type of higher
+ * alignment, which is not well-defined in C. This is fine as long
+ * as the actual data is aligned, but otherwise might require a trap
+ * to satisfy the load.
  */
-
 #define __READ_ONCE(x, check)						\
 ({									\
 	union { typeof(x) __val; char __c[1]; } __u;			\
@@ -320,7 +329,32 @@  static __always_inline void __write_once_size(volatile void *p, void *res, int s
 		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
 	__u.__val;							\
 })
-#define READ_ONCE(x) __READ_ONCE(x, 1)
+
+#define __WRITE_ONCE(x, val)						\
+({									\
+	union { typeof(x) __val; char __c[1]; } __u =			\
+		{ .__val = (__force typeof(x)) (val) };			\
+	__write_once_size(&(x), __u.__c, sizeof(x));			\
+	__u.__val;							\
+})
+
+
+/*
+ * the common case is simple: x is naturally aligned, not an array,
+ * and accessible with a single load, avoiding the need for local
+ * variables. With KASAN, this is important as any call to
+ *__write_once_size(),__read_once_size_nocheck() or __read_once_size()
+ * uses significant amounts of stack space for checking that we don't
+ * overflow the union.
+ */
+#define __READ_ONCE_SIMPLE(x)						\
+	(typeof(x))(*(volatile typeof(&(x)))&(x))
+
+#define __WRITE_ONCE_SIMPLE(x, val)					\
+	({*(volatile typeof(&(x)))&(x) = (val); })
+
+#define READ_ONCE(x) __builtin_choose_expr(__ALIGNED_WORD(x), 		\
+	__READ_ONCE_SIMPLE(x), __READ_ONCE(x, 1))
 
 /*
  * Use READ_ONCE_NOCHECK() instead of READ_ONCE() if you need
@@ -328,13 +362,8 @@  static __always_inline void __write_once_size(volatile void *p, void *res, int s
  */
 #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
 
-#define WRITE_ONCE(x, val) \
-({							\
-	union { typeof(x) __val; char __c[1]; } __u =	\
-		{ .__val = (__force typeof(x)) (val) }; \
-	__write_once_size(&(x), __u.__c, sizeof(x));	\
-	__u.__val;					\
-})
+#define WRITE_ONCE(x, val) do { __builtin_choose_expr(__ALIGNED_WORD(x), \
+	__WRITE_ONCE_SIMPLE(x, val), __WRITE_ONCE(x, val)); } while (0)
 
 #endif /* __KERNEL__ */