
btrfs: fix data race when accessing the last_trans field of a root

Message ID 5152cead4acef28ac0dff3db80692a6e8852ddc4.1719828039.git.fdmanana@suse.com (mailing list archive)
State New, archived
Series btrfs: fix data race when accessing the last_trans field of a root

Commit Message

Filipe Manana July 1, 2024, 10:01 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

KCSAN complains about a data race when accessing the last_trans field of a
root:

  [  199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]

  [  199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
  [  199.555210]  btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
  [  199.555999]  start_transaction+0x154/0xcd8 [btrfs]
  [  199.556780]  btrfs_join_transaction+0x44/0x60 [btrfs]
  [  199.557559]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
  [  199.558339]  btrfs_update_time+0x8c/0xb0 [btrfs]
  [  199.559123]  touch_atime+0x16c/0x1e0
  [  199.559151]  pipe_read+0x6a8/0x7d0
  [  199.559179]  vfs_read+0x466/0x498
  [  199.559204]  ksys_read+0x108/0x150
  [  199.559230]  __s390x_sys_read+0x68/0x88
  [  199.559257]  do_syscall+0x1c6/0x210
  [  199.559286]  __do_syscall+0xc8/0xf0
  [  199.559318]  system_call+0x70/0x98

  [  199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
  [  199.559464]  record_root_in_trans+0x196/0x228 [btrfs]
  [  199.560236]  btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
  [  199.561097]  start_transaction+0x154/0xcd8 [btrfs]
  [  199.561927]  btrfs_join_transaction+0x44/0x60 [btrfs]
  [  199.562700]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
  [  199.563493]  btrfs_update_time+0x8c/0xb0 [btrfs]
  [  199.564277]  file_update_time+0xb8/0xf0
  [  199.564301]  pipe_write+0x8ac/0xab8
  [  199.564326]  vfs_write+0x33c/0x588
  [  199.564349]  ksys_write+0x108/0x150
  [  199.564372]  __s390x_sys_write+0x68/0x88
  [  199.564397]  do_syscall+0x1c6/0x210
  [  199.564424]  __do_syscall+0xc8/0xf0
  [  199.564452]  system_call+0x70/0x98

This is because we update and read last_trans concurrently without any
type of synchronization. This should be generally harmless and in the
worst case it can make us do extra locking (btrfs_record_root_in_trans()),
trigger some warnings at ctree.c or do extra work during relocation - this
would probably only happen in case of load or store tearing.

So fix this by always reading and updating the field using READ_ONCE()
and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.c       |  4 ++--
 fs/btrfs/ctree.h       | 10 ++++++++++
 fs/btrfs/defrag.c      |  2 +-
 fs/btrfs/disk-io.c     |  4 ++--
 fs/btrfs/relocation.c  |  8 ++++----
 fs/btrfs/transaction.c |  8 ++++----
 6 files changed, 23 insertions(+), 13 deletions(-)

Comments

Josef Bacik July 1, 2024, 2:16 p.m. UTC | #1
On Mon, Jul 01, 2024 at 11:01:53AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> KCSAN complains about a data race when accessing the last_trans field of a
> root:
> 
>   [  199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
> 
>   [  199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
>   [  199.555210]  btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
>   [  199.555999]  start_transaction+0x154/0xcd8 [btrfs]
>   [  199.556780]  btrfs_join_transaction+0x44/0x60 [btrfs]
>   [  199.557559]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
>   [  199.558339]  btrfs_update_time+0x8c/0xb0 [btrfs]
>   [  199.559123]  touch_atime+0x16c/0x1e0
>   [  199.559151]  pipe_read+0x6a8/0x7d0
>   [  199.559179]  vfs_read+0x466/0x498
>   [  199.559204]  ksys_read+0x108/0x150
>   [  199.559230]  __s390x_sys_read+0x68/0x88
>   [  199.559257]  do_syscall+0x1c6/0x210
>   [  199.559286]  __do_syscall+0xc8/0xf0
>   [  199.559318]  system_call+0x70/0x98
> 
>   [  199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
>   [  199.559464]  record_root_in_trans+0x196/0x228 [btrfs]
>   [  199.560236]  btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
>   [  199.561097]  start_transaction+0x154/0xcd8 [btrfs]
>   [  199.561927]  btrfs_join_transaction+0x44/0x60 [btrfs]
>   [  199.562700]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
>   [  199.563493]  btrfs_update_time+0x8c/0xb0 [btrfs]
>   [  199.564277]  file_update_time+0xb8/0xf0
>   [  199.564301]  pipe_write+0x8ac/0xab8
>   [  199.564326]  vfs_write+0x33c/0x588
>   [  199.564349]  ksys_write+0x108/0x150
>   [  199.564372]  __s390x_sys_write+0x68/0x88
>   [  199.564397]  do_syscall+0x1c6/0x210
>   [  199.564424]  __do_syscall+0xc8/0xf0
>   [  199.564452]  system_call+0x70/0x98
> 
> This is because we update and read last_trans concurrently without any
> type of synchronization. This should be generally harmless and in the
> worst case it can make us do extra locking (btrfs_record_root_in_trans()),
> trigger some warnings at ctree.c or do extra work during relocation - this
> would probably only happen in case of load or store tearing.
> 
> So fix this by always reading and updating the field using READ_ONCE()
> and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
David Sterba July 2, 2024, 2:52 p.m. UTC | #2
On Mon, Jul 01, 2024 at 11:01:53AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> KCSAN complains about a data race when accessing the last_trans field of a
> root:
> 
>   [  199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
> 
>   [  199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
>   [  199.555210]  btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
>   [  199.555999]  start_transaction+0x154/0xcd8 [btrfs]
>   [  199.556780]  btrfs_join_transaction+0x44/0x60 [btrfs]
>   [  199.557559]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
>   [  199.558339]  btrfs_update_time+0x8c/0xb0 [btrfs]
>   [  199.559123]  touch_atime+0x16c/0x1e0
>   [  199.559151]  pipe_read+0x6a8/0x7d0
>   [  199.559179]  vfs_read+0x466/0x498
>   [  199.559204]  ksys_read+0x108/0x150
>   [  199.559230]  __s390x_sys_read+0x68/0x88
>   [  199.559257]  do_syscall+0x1c6/0x210
>   [  199.559286]  __do_syscall+0xc8/0xf0
>   [  199.559318]  system_call+0x70/0x98
> 
>   [  199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
>   [  199.559464]  record_root_in_trans+0x196/0x228 [btrfs]
>   [  199.560236]  btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
>   [  199.561097]  start_transaction+0x154/0xcd8 [btrfs]
>   [  199.561927]  btrfs_join_transaction+0x44/0x60 [btrfs]
>   [  199.562700]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
>   [  199.563493]  btrfs_update_time+0x8c/0xb0 [btrfs]
>   [  199.564277]  file_update_time+0xb8/0xf0
>   [  199.564301]  pipe_write+0x8ac/0xab8
>   [  199.564326]  vfs_write+0x33c/0x588
>   [  199.564349]  ksys_write+0x108/0x150
>   [  199.564372]  __s390x_sys_write+0x68/0x88
>   [  199.564397]  do_syscall+0x1c6/0x210
>   [  199.564424]  __do_syscall+0xc8/0xf0
>   [  199.564452]  system_call+0x70/0x98
> 
> This is because we update and read last_trans concurrently without any
> type of synchronization. This should be generally harmless and in the
> worst case it can make us do extra locking (btrfs_record_root_in_trans()),
> trigger some warnings at ctree.c or do extra work during relocation - this
> would probably only happen in case of load or store tearing.
> 
> So fix this by always reading and updating the field using READ_ONCE()
> and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.

I'm curious why you mention the load/store tearing. As we discussed this
last time under some READ_ONCE/WRITE_ONCE change, it's not happening on
aligned addresses for any integer type - I provided links to the Intel manuals.

I suggest using data_race() as it is more suitable in this case; it's more
specific than READ_ONCE/WRITE_ONCE, which is for preventing certain
compiler optimizations.
Filipe Manana July 2, 2024, 3:09 p.m. UTC | #3
On Tue, Jul 2, 2024 at 3:52 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Jul 01, 2024 at 11:01:53AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > KCSAN complains about a data race when accessing the last_trans field of a
> > root:
> >
> >   [  199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
> >
> >   [  199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
> >   [  199.555210]  btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
> >   [  199.555999]  start_transaction+0x154/0xcd8 [btrfs]
> >   [  199.556780]  btrfs_join_transaction+0x44/0x60 [btrfs]
> >   [  199.557559]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
> >   [  199.558339]  btrfs_update_time+0x8c/0xb0 [btrfs]
> >   [  199.559123]  touch_atime+0x16c/0x1e0
> >   [  199.559151]  pipe_read+0x6a8/0x7d0
> >   [  199.559179]  vfs_read+0x466/0x498
> >   [  199.559204]  ksys_read+0x108/0x150
> >   [  199.559230]  __s390x_sys_read+0x68/0x88
> >   [  199.559257]  do_syscall+0x1c6/0x210
> >   [  199.559286]  __do_syscall+0xc8/0xf0
> >   [  199.559318]  system_call+0x70/0x98
> >
> >   [  199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
> >   [  199.559464]  record_root_in_trans+0x196/0x228 [btrfs]
> >   [  199.560236]  btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
> >   [  199.561097]  start_transaction+0x154/0xcd8 [btrfs]
> >   [  199.561927]  btrfs_join_transaction+0x44/0x60 [btrfs]
> >   [  199.562700]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
> >   [  199.563493]  btrfs_update_time+0x8c/0xb0 [btrfs]
> >   [  199.564277]  file_update_time+0xb8/0xf0
> >   [  199.564301]  pipe_write+0x8ac/0xab8
> >   [  199.564326]  vfs_write+0x33c/0x588
> >   [  199.564349]  ksys_write+0x108/0x150
> >   [  199.564372]  __s390x_sys_write+0x68/0x88
> >   [  199.564397]  do_syscall+0x1c6/0x210
> >   [  199.564424]  __do_syscall+0xc8/0xf0
> >   [  199.564452]  system_call+0x70/0x98
> >
> > This is because we update and read last_trans concurrently without any
> > type of synchronization. This should be generally harmless and in the
> > worst case it can make us do extra locking (btrfs_record_root_in_trans()),
> > trigger some warnings at ctree.c or do extra work during relocation - this
> > would probably only happen in case of load or store tearing.
> >
> > So fix this by always reading and updating the field using READ_ONCE()
> > and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.
>
> I'm curious why you mention the load/store tearing. As we discussed this
> last time under some READ_ONCE/WRITE_ONCE change, it's not happening on
> aligned addresses for any integer type - I provided links to the Intel manuals.

Yes, I do remember that.
But that was a different case; it was about a pointer type.

This is a u64. Can't the load/store tearing happen at the very least
on 32-bit systems?

I believe that's the reason we use WRITE_ONCE/READ_ONCE in several
places dealing with u64s.

>
> I suggest using data_race() as it is more suitable in this case; it's more
> specific than READ_ONCE/WRITE_ONCE, which is for preventing certain
> compiler optimizations.
David Sterba July 2, 2024, 3:46 p.m. UTC | #4
On Tue, Jul 02, 2024 at 04:09:42PM +0100, Filipe Manana wrote:
> On Tue, Jul 2, 2024 at 3:52 PM David Sterba <dsterba@suse.cz> wrote:
> > On Mon, Jul 01, 2024 at 11:01:53AM +0100, fdmanana@kernel.org wrote:
> > >   [  199.564372]  __s390x_sys_write+0x68/0x88
> > >   [  199.564397]  do_syscall+0x1c6/0x210
> > >   [  199.564424]  __do_syscall+0xc8/0xf0
> > >   [  199.564452]  system_call+0x70/0x98
> > >
> > > This is because we update and read last_trans concurrently without any
> > > type of synchronization. This should be generally harmless and in the
> > > worst case it can make us do extra locking (btrfs_record_root_in_trans()),
> > > trigger some warnings at ctree.c or do extra work during relocation - this
> > > would probably only happen in case of load or store tearing.
> > >
> > > So fix this by always reading and updating the field using READ_ONCE()
> > > and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.
> >
> > I'm curious why you mention the load/store tearing. As we discussed this
> > last time under some READ_ONCE/WRITE_ONCE change, it's not happening on
> > aligned addresses for any integer type - I provided links to the Intel manuals.
> 
> Yes, I do remember that.
> But that was a different case; it was about a pointer type.
> 
> This is a u64. Can't the load/store tearing happen at the very least
> on 32-bit systems?

Right, it was for a pointer type. I'll continue searching for a
definitive answer regarding 64bit types on 32bit architectures. The
tearing could likely happen when a 64bit type is split into two
cachelines, but I'd be very curious how this could happen within one
cacheline (assuming the compiler will align 64bit types to 8 bytes).

> I believe that's the reason we use WRITE_ONCE/READ_ONCE in several
> places dealing with u64s.

AFAIK we do READ_ONCE/WRITE_ONCE for unlocked access as an annotation,
e.g. for the sysfs configuration values used in code. Or when there's a
fast path that reads a value outside of a lock and then under the lock,
where it needs the fresh value that's enforced by READ_ONCE.

The KCSAN reports should be fixed by a data_race() annotation so it's not
confused by the above.

I don't see how READ_ONCE protects against load tearing on 32bit because
it's doing the same thing as on 64bit, and that's verifying that it's a basic
scalar type and then it's an ordinary access.

https://elixir.bootlin.com/linux/latest/source/include/asm-generic/rwonce.h#L47

#ifndef __READ_ONCE
#define __READ_ONCE(x)	(*(const volatile __unqual_scalar_typeof(x) *)&(x))
#endif

#define READ_ONCE(x)							\
({									\
	compiletime_assert_rwonce_type(x);				\
	__READ_ONCE(x);							\
})

__unqual_scalar_typeof() checks types from char to long long.

An x86_64 build and also an i386 build use the same file
asm-generic/rwonce.h, with no code that would prevent load tearing.
The only thing that comes to mind is that it's all hidden in the address
and pointer dereference, but that still says nothing about alignment or
cacheline-straddling.

There are arch-specific implementations of that header that work around
some architectural oddities, but that's on Alpha (we don't care) and
care) and ARM64 (64bit arch with 64bit pointers assumed).

Why am I so picky about that? For one, I want to understand it completely.
This has been bothering me for a long time, as the arguments were not
always solid and were more like cargo culting (it happened in the past) or
scattered in comments to articles or mail threads. If we're really
missing correct use of the _ONCE accessors then we have potential bugs
lurking somewhere.

I don't mind if we add data_race() annotations, as we do generally update
code to be able to use internal tools like that, or when we use _ONCE
for the fast path or as an annotation.

The article https://lwn.net/Articles/793253/ title says "Who's afraid of
a big bad optimizing compiler?" and in the first load/store tearing
example argues with a sample 16-bit architecture that could do 2-byte
loads of a pointer. That's good for a demonstration, but I want something
real and relevant for the Linux kernel.
David Sterba July 3, 2024, 11:05 p.m. UTC | #5
On Tue, Jul 02, 2024 at 05:46:06PM +0200, David Sterba wrote:
> On Tue, Jul 02, 2024 at 04:09:42PM +0100, Filipe Manana wrote:
> > On Tue, Jul 2, 2024 at 3:52 PM David Sterba <dsterba@suse.cz> wrote:
> > > On Mon, Jul 01, 2024 at 11:01:53AM +0100, fdmanana@kernel.org wrote:
> > > >   [  199.564372]  __s390x_sys_write+0x68/0x88
> > > >   [  199.564397]  do_syscall+0x1c6/0x210
> > > >   [  199.564424]  __do_syscall+0xc8/0xf0
> > > >   [  199.564452]  system_call+0x70/0x98
> > > >
> > > > This is because we update and read last_trans concurrently without any
> > > > type of synchronization. This should be generally harmless and in the
> > > > worst case it can make us do extra locking (btrfs_record_root_in_trans()),
> > > > trigger some warnings at ctree.c or do extra work during relocation - this
> > > > would probably only happen in case of load or store tearing.
> > > >
> > > > So fix this by always reading and updating the field using READ_ONCE()
> > > > and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.
> > >
> > > I'm curious why you mention the load/store tearing. As we discussed this
> > > last time under some READ_ONCE/WRITE_ONCE change, it's not happening on
> > > aligned addresses for any integer type - I provided links to the Intel manuals.
> > 
> > Yes, I do remember that.
> > But that was a different case; it was about a pointer type.
> > 
> > This is a u64. Can't the load/store tearing happen at the very least
> > on 32-bit systems?
> 
> Right, it was for a pointer type. I'll continue searching for a
> definitive answer regarding 64bit types on 32bit architectures.

I have a counterexample, but this is a weak "proof by compiler", yet at
least something tangible:

in space-info.c:
void btrfs_update_space_info_chunk_size(struct btrfs_space_info *space_info,
                                        u64 chunk_size)
{
        WRITE_ONCE(space_info->chunk_size, chunk_size);
}

This uses _ONCE for the sysfs use case but demonstrates that it does not
prevent store tearing on 32bit:

000012f4 <btrfs_update_space_info_chunk_size>:
    12f4:       /-- e8 fc ff ff ff                      calll  12f5 <btrfs_update_space_info_chunk_size+0x1>
                        12f5: R_386_PC32        __fentry__
    12f9:           55                                  pushl  %ebp
    12fa:           89 e5                               movl   %esp,%ebp
    12fc:           53                                  pushl  %ebx
    12fd:           5b                                  popl   %ebx

    12fe:           89 50 44                            movl   %edx,0x44(%eax)
    1301:           89 48 48                            movl   %ecx,0x48(%eax)

    1304:           5d                                  popl   %ebp
    1305:       /-- e9 fc ff ff ff                      jmpl   1306 <btrfs_update_space_info_chunk_size+0x12>
                        1306: R_386_PC32        __x86_return_thunk
    130a:           66 90                               xchgw  %ax,%ax


eax - first parameter, a pointer to the struct
edx - low half of the second parameter (the u64 chunk_size)
ecx - high half of the second parameter

I found this example as it is easy to identify; many other _ONCE uses are
in static inlines or mixed into other code.

I've also tried various compilers on godbolt.org, on 32bit architectures
like MIPS32, SPARC32 and maybe others; there are two (or more)
instructions needed.

The question that remains is whether the 2-instruction load/store matters
in case the value is in one cache line, as this depends on the MESI cache
synchronization protocol. The only case I see where it could tear is
when the process gets rescheduled exactly between the instructions. An
outer lock/mutex synchronization could prevent it, so it depends.
Filipe Manana July 8, 2024, 4:23 p.m. UTC | #6
On Thu, Jul 4, 2024 at 12:05 AM David Sterba <dsterba@suse.cz> wrote:
>
> On Tue, Jul 02, 2024 at 05:46:06PM +0200, David Sterba wrote:
> > On Tue, Jul 02, 2024 at 04:09:42PM +0100, Filipe Manana wrote:
> > > On Tue, Jul 2, 2024 at 3:52 PM David Sterba <dsterba@suse.cz> wrote:
> > > > On Mon, Jul 01, 2024 at 11:01:53AM +0100, fdmanana@kernel.org wrote:
> > > > >   [  199.564372]  __s390x_sys_write+0x68/0x88
> > > > >   [  199.564397]  do_syscall+0x1c6/0x210
> > > > >   [  199.564424]  __do_syscall+0xc8/0xf0
> > > > >   [  199.564452]  system_call+0x70/0x98
> > > > >
> > > > > This is because we update and read last_trans concurrently without any
> > > > > type of synchronization. This should be generally harmless and in the
> > > > > worst case it can make us do extra locking (btrfs_record_root_in_trans()),
> > > > > trigger some warnings at ctree.c or do extra work during relocation - this
> > > > > would probably only happen in case of load or store tearing.
> > > > >
> > > > > So fix this by always reading and updating the field using READ_ONCE()
> > > > > and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.
> > > >
> > > > I'm curious why you mention the load/store tearing. As we discussed this
> > > > last time under some READ_ONCE/WRITE_ONCE change, it's not happening on
> > > > aligned addresses for any integer type - I provided links to the Intel manuals.
> > >
> > > Yes, I do remember that.
> > > But that was a different case; it was about a pointer type.
> > >
> > > This is a u64. Can't the load/store tearing happen at the very least
> > > on 32-bit systems?
> >
> > Right, it was for a pointer type. I'll continue searching for a
> > definitive answer regarding 64bit types on 32bit architectures.
>
> I have a counterexample, but this is a weak "proof by compiler", yet at
> least something tangible:
>
> in space-info.c:
> void btrfs_update_space_info_chunk_size(struct btrfs_space_info *space_info,
>                                         u64 chunk_size)
> {
>         WRITE_ONCE(space_info->chunk_size, chunk_size);
> }
>
> This uses _ONCE for the sysfs use case but demonstrates that it does not
> prevent store tearing on 32bit:
>
> 000012f4 <btrfs_update_space_info_chunk_size>:
>     12f4:       /-- e8 fc ff ff ff                      calll  12f5 <btrfs_update_space_info_chunk_size+0x1>
>                         12f5: R_386_PC32        __fentry__
>     12f9:           55                                  pushl  %ebp
>     12fa:           89 e5                               movl   %esp,%ebp
>     12fc:           53                                  pushl  %ebx
>     12fd:           5b                                  popl   %ebx
>
>     12fe:           89 50 44                            movl   %edx,0x44(%eax)
>     1301:           89 48 48                            movl   %ecx,0x48(%eax)
>
>     1304:           5d                                  popl   %ebp
>     1305:       /-- e9 fc ff ff ff                      jmpl   1306 <btrfs_update_space_info_chunk_size+0x12>
>                         1306: R_386_PC32        __x86_return_thunk
>     130a:           66 90                               xchgw  %ax,%ax
>
>
> eax - first parameter, a pointer to the struct
> edx - low half of the second parameter (the u64 chunk_size)
> ecx - high half of the second parameter
>
> I found this example as it is easy to identify; many other _ONCE uses are
> in static inlines or mixed into other code.
>
> I've also tried various compilers on godbolt.org, on 32bit architectures
> like MIPS32, SPARC32 and maybe others; there are two (or more)
> instructions needed.

Ok, so that's actually mentioned as expected in at least one LWN
article, see below.

But I'm under the impression that accesses/updates to variables without
locking or using an atomic type should use the _ONCE macros, and
several places mention that explicitly.
For example, https://lwn.net/Articles/846700/ says:

"... it is highly recommended to use READ_ONCE() and WRITE_ONCE() to
load and store shared data outside a lock."

There's a similar recommendation in the Linux-Kernel Memory Model
document at https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html
which says:

"Loads from and stores to shared (but non-atomic) variables should be
protected with the READ_ONCE(), WRITE_ONCE(), and the now-obsolete
ACCESS_ONCE()..."

Going back to "Who's afraid of a big bad optimizing compiler?"
(https://lwn.net/Articles/793253/), there's a whole list of problems
when accessing shared data without a lock or an atomic type, from
load/store fusing to invented loads/stores, etc. - for all of them the
use of _ONCE() is recommended.

For the load/store tearing, leaving aside the case of a 64-bit read on
a 32-bit system, it mentions that reading a word-sized variable (a
pointer) can result in loading it byte by byte.
And it recommends _ONCE() to solve such problems, as long as we have
properly aligned and machine-word-sized accesses.

For 64-bit accesses on a 32-bit platform it says:

"Of course, the compiler simply has no choice but to tear some stores
in the general case, given the possibility of code using 64-bit
integers running on a 32-bit system."

Which matches what you observed by disassembling the code.

For some cases going with data_race() is probably just fine because
reading a stale or bogus value would be harmless, like reading a value
from a block reserve.
For others, which I'm not so sure about, using _ONCE() seems to be the safest.

And what do you want to do with the patch, which is already in for-next?

Thanks.

>
> The question that remains is whether the 2-instruction load/store matters
> in case the value is in one cache line, as this depends on the MESI cache
> synchronization protocol. The only case I see where it could tear is
> when the process gets rescheduled exactly between the instructions. An
> outer lock/mutex synchronization could prevent it, so it depends.

Patch

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index e33f9f5a228d..451203055bbf 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -321,7 +321,7 @@  int btrfs_copy_root(struct btrfs_trans_handle *trans,
 	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
 		trans->transid != fs_info->running_transaction->transid);
 	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
-		trans->transid != root->last_trans);
+		trans->transid != btrfs_get_root_last_trans(root));
 
 	level = btrfs_header_level(buf);
 	if (level == 0)
@@ -556,7 +556,7 @@  int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
 	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
 		trans->transid != fs_info->running_transaction->transid);
 	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
-		trans->transid != root->last_trans);
+		trans->transid != btrfs_get_root_last_trans(root));
 
 	level = btrfs_header_level(buf);
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1004cb934b4a..c8568b1a61c4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -356,6 +356,16 @@  static inline void btrfs_set_root_last_log_commit(struct btrfs_root *root, int c
 	WRITE_ONCE(root->last_log_commit, commit_id);
 }
 
+static inline u64 btrfs_get_root_last_trans(const struct btrfs_root *root)
+{
+	return READ_ONCE(root->last_trans);
+}
+
+static inline void btrfs_set_root_last_trans(struct btrfs_root *root, u64 transid)
+{
+	WRITE_ONCE(root->last_trans, transid);
+}
+
 /*
  * Structure that conveys information about an extent that is going to replace
  * all the extents in a file range.
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index e7a24f096cb6..f6dbda37a361 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -139,7 +139,7 @@  int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
 	if (trans)
 		transid = trans->transid;
 	else
-		transid = inode->root->last_trans;
+		transid = btrfs_get_root_last_trans(root);
 
 	defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
 	if (!defrag)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5870e76d20e2..4f5c1ff72a7d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -654,7 +654,7 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	root->state = 0;
 	RB_CLEAR_NODE(&root->rb_node);
 
-	root->last_trans = 0;
+	btrfs_set_root_last_trans(root, 0);
 	root->free_objectid = 0;
 	root->nr_delalloc_inodes = 0;
 	root->nr_ordered_extents = 0;
@@ -998,7 +998,7 @@  int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 		return ret;
 	}
 
-	log_root->last_trans = trans->transid;
+	btrfs_set_root_last_trans(log_root, trans->transid);
 	log_root->root_key.offset = btrfs_root_id(root);
 
 	inode_item = &log_root->root_item.inode;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 6ea407255a30..865e37b5354f 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -817,7 +817,7 @@  static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 		goto abort;
 	}
 	set_bit(BTRFS_ROOT_SHAREABLE, &reloc_root->state);
-	reloc_root->last_trans = trans->transid;
+	btrfs_set_root_last_trans(reloc_root, trans->transid);
 	return reloc_root;
 fail:
 	kfree(root_item);
@@ -864,7 +864,7 @@  int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
 	 */
 	if (root->reloc_root) {
 		reloc_root = root->reloc_root;
-		reloc_root->last_trans = trans->transid;
+		btrfs_set_root_last_trans(reloc_root, trans->transid);
 		return 0;
 	}
 
@@ -1739,7 +1739,7 @@  static noinline_for_stack int merge_reloc_root(struct reloc_control *rc,
 		 * btrfs_update_reloc_root() and update our root item
 		 * appropriately.
 		 */
-		reloc_root->last_trans = trans->transid;
+		btrfs_set_root_last_trans(reloc_root, trans->transid);
 		trans->block_rsv = rc->block_rsv;
 
 		replaced = 0;
@@ -2082,7 +2082,7 @@  static int record_reloc_root_in_trans(struct btrfs_trans_handle *trans,
 	struct btrfs_root *root;
 	int ret;
 
-	if (reloc_root->last_trans == trans->transid)
+	if (btrfs_get_root_last_trans(reloc_root) == trans->transid)
 		return 0;
 
 	root = btrfs_get_fs_root(fs_info, reloc_root->root_key.offset, false);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9590a1899b9d..e0490a6f068e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -405,7 +405,7 @@  static int record_root_in_trans(struct btrfs_trans_handle *trans,
 	int ret = 0;
 
 	if ((test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
-	    root->last_trans < trans->transid) || force) {
+	    btrfs_get_root_last_trans(root) < trans->transid) || force) {
 		WARN_ON(!force && root->commit_root != root->node);
 
 		/*
@@ -421,7 +421,7 @@  static int record_root_in_trans(struct btrfs_trans_handle *trans,
 		smp_wmb();
 
 		spin_lock(&fs_info->fs_roots_radix_lock);
-		if (root->last_trans == trans->transid && !force) {
+		if (btrfs_get_root_last_trans(root) == trans->transid && !force) {
 			spin_unlock(&fs_info->fs_roots_radix_lock);
 			return 0;
 		}
@@ -429,7 +429,7 @@  static int record_root_in_trans(struct btrfs_trans_handle *trans,
 				   (unsigned long)btrfs_root_id(root),
 				   BTRFS_ROOT_TRANS_TAG);
 		spin_unlock(&fs_info->fs_roots_radix_lock);
-		root->last_trans = trans->transid;
+		btrfs_set_root_last_trans(root, trans->transid);
 
 		/* this is pretty tricky.  We don't want to
 		 * take the relocation lock in btrfs_record_root_in_trans
@@ -491,7 +491,7 @@  int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 	 * and barriers
 	 */
 	smp_rmb();
-	if (root->last_trans == trans->transid &&
+	if (btrfs_get_root_last_trans(root) == trans->transid &&
 	    !test_bit(BTRFS_ROOT_IN_TRANS_SETUP, &root->state))
 		return 0;