Message ID | 20200219040426.1140-1-cai@lca.pw (mailing list archive)
---|---
State | Deferred, archived
Series | fs: fix a data race in i_size_write/i_size_read
On Tue, Feb 18, 2020 at 11:04:26PM -0500, Qian Cai wrote:
> inode::i_size could be accessed concurrently as noticed by KCSAN,
>
> BUG: KCSAN: data-race in iomap_do_writepage / iomap_write_end
>
> write to 0xffff8bf68fc0cac0 of 8 bytes by task 7484 on cpu 71:
>  iomap_write_end+0xea/0x530
>  i_size_write at include/linux/fs.h:888
>  (inlined by) iomap_write_end at fs/iomap/buffered-io.c:782
>  iomap_write_actor+0x132/0x200
>  iomap_apply+0x245/0x8a5
>  iomap_file_buffered_write+0xbd/0xf0
>  xfs_file_buffered_aio_write+0x1c2/0x790 [xfs]
>  xfs_file_write_iter+0x232/0x2d0 [xfs]
>  new_sync_write+0x29c/0x3b0
>  __vfs_write+0x92/0xa0
>  vfs_write+0x103/0x260
>  ksys_write+0x9d/0x130
>  __x64_sys_write+0x4c/0x60
>  do_syscall_64+0x91/0xb05
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> read to 0xffff8bf68fc0cac0 of 8 bytes by task 5901 on cpu 70:
>  iomap_do_writepage+0xf4/0x450
>  i_size_read at include/linux/fs.h:866
>  (inlined by) iomap_do_writepage at fs/iomap/buffered-io.c:1558
>  write_cache_pages+0x523/0xb20
>  iomap_writepages+0x47/0x80
>  xfs_vm_writepages+0xc7/0x100 [xfs]
>  do_writepages+0x5e/0x130
>  __writeback_single_inode+0xd5/0xb20
>  writeback_sb_inodes+0x429/0x910
>  __writeback_inodes_wb+0xc4/0x150
>  wb_writeback+0x47b/0x830
>  wb_workfn+0x688/0x930
>  process_one_work+0x54f/0xb90
>  worker_thread+0x80/0x5f0
>  kthread+0x1cd/0x1f0
>  ret_from_fork+0x27/0x50
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 70 PID: 5901 Comm: kworker/u257:2 Tainted: G L 5.6.0-rc2-next-20200218+ #2
> Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> Workqueue: writeback wb_workfn (flush-254:0)
>
> The write is protected by exclusive inode::i_rwsem (in
> xfs_file_buffered_aio_write()) but the read is not. A shattered value
> could introduce a logic bug. Fixed it using a pair of WRITE/READ_ONCE().

If aligned 64bit stores on a 64bit host (note the BITS_PER_LONG ifdefs) end up
being split, the kernel is FUBAR anyway. Details, please - how could that
end up happening?
> On Feb 18, 2020, at 11:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> If aligned 64bit stores on a 64bit host (note the BITS_PER_LONG ifdefs) end up
> being split, the kernel is FUBAR anyway. Details, please - how could that
> end up happening?

My understanding is that the compiler might decide to split the load into,
say, two 4-byte loads. Then we might have,

Load1
Store
Load2

where the loaded value could be garbage. Also, Marco (the KCSAN maintainer),
who knows more about compilers than me, mentioned that there is no guarantee
that the store will not be split either. Thus the WRITE_ONCE().
On Wed, Feb 19, 2020 at 12:08:40AM -0500, Qian Cai wrote:
>
> > On Feb 18, 2020, at 11:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > If aligned 64bit stores on a 64bit host (note the BITS_PER_LONG ifdefs) end up
> > being split, the kernel is FUBAR anyway. Details, please - how could that
> > end up happening?
>
> My understanding is that the compiler might decide to split the load into,
> say, two 4-byte loads. Then we might have,
>
> Load1
> Store
> Load2
>
> where the loaded value could be garbage. Also, Marco (the KCSAN maintainer),
> who knows more about compilers than me, mentioned that there is no guarantee
> that the store will not be split either. Thus the WRITE_ONCE().

I would suggest
 * if some compiler does that, ask the persons responsible for that
   "optimization" which flags should be used to disable it.
 * if they fail to provide such, educate them regarding the
   usefulness of their idea
 * if that does not help, don't use the bloody piece of garbage.

Again, is that pure theory (because I can't come up with
any reason why splitting a 32bit load would be any less legitimate
than doing the same to a 64bit one on a 64bit architecture),
or is there anything that really would pull that off?
On Wed, 19 Feb 2020 at 06:23, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Feb 19, 2020 at 12:08:40AM -0500, Qian Cai wrote:
> >
> > > On Feb 18, 2020, at 11:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > If aligned 64bit stores on a 64bit host (note the BITS_PER_LONG ifdefs) end up
> > > being split, the kernel is FUBAR anyway. Details, please - how could that
> > > end up happening?
> >
> > My understanding is that the compiler might decide to split the load into,
> > say, two 4-byte loads. Then we might have,
> >
> > Load1
> > Store
> > Load2
> >
> > where the loaded value could be garbage. Also, Marco (the KCSAN maintainer),
> > who knows more about compilers than me, mentioned that there is no guarantee
> > that the store will not be split either. Thus the WRITE_ONCE().

In theory, yes. But by now we're aware of the current convention of
assuming plain aligned writes up to word size are atomic (which KCSAN
respects with the default Kconfig).

> I would suggest
>  * if some compiler does that, ask the persons responsible for that
>    "optimization" which flags should be used to disable it.
>  * if they fail to provide such, educate them regarding the
>    usefulness of their idea
>  * if that does not help, don't use the bloody piece of garbage.
>
> Again, is that pure theory (because I can't come up with
> any reason why splitting a 32bit load would be any less legitimate
> than doing the same to a 64bit one on a 64bit architecture),
> or is there anything that really would pull that off?

Right. In reality, for mainstream architectures, it appears quite unlikely.

There may be other valid reasons, such as documenting the fact that the
write can happen concurrently with loads.

Let's assume the WRITE_ONCE can be dropped.

The load is a different story. While load tearing may not be an issue,
it's more likely that other optimizations can break the code. For
example, load fusing can break code that expects repeated loads in a
loop. E.g. I found these uses of i_size_read in loops:

git grep -E '(for|while) \(.*i_size_read'
fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)) {
fs/ocfs2/dir.c:		for (i = 0; i < i_size_read(inode) && i < offset; ) {
fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)) {
fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)
fs/squashfs/dir.c:	while (length < i_size_read(inode)) {
fs/squashfs/namei.c:	while (length < i_size_read(dir)) {

Can i_size writes happen concurrently, and if so, will these break if
the compiler decides to just do i_size_read's load once and keep the
result in a register?

Thanks,
-- Marco
On Tue, Feb 18, 2020 at 11:04:26PM -0500, Qian Cai wrote:
> inode::i_size could be accessed concurrently as noticed by KCSAN,
>
> BUG: KCSAN: data-race in iomap_do_writepage / iomap_write_end
>
> The write is protected by exclusive inode::i_rwsem (in
> xfs_file_buffered_aio_write()) but the read is not. A shattered value
> could introduce a logic bug. Fixed it using a pair of WRITE/READ_ONCE().

We had a different problem with the lack of READ_ONCE/WRITE_ONCE for
i_size, though the fix was the same. There was i_size_read(inode) used
in a max() macro, and the compiled code did two (unlocked) reads; this
led to a race where a different value was checked than the one then
used. The thread:

https://lore.kernel.org/linux-fsdevel/20191011202050.8656-1-josef@toxicpanda.com/

We had to apply a workaround to the btrfs code because the generic fix
was not merged, even with several reviews and fixing a real bug.

The report from KCSAN seems to require some sort of splitting of the
values. What we saw happened on a 64bit platform without that effect,
so I'd call it the more likely case, because the pattern
max(i_size_read(inode), ...) is not something we'd instinctively
consider unsafe.
On Wed, Feb 19, 2020 at 10:21:46AM +0100, Marco Elver wrote:
> Right. In reality, for mainstream architectures, it appears quite unlikely.
>
> There may be other valid reasons, such as documenting the fact that the
> write can happen concurrently with loads.
>
> Let's assume the WRITE_ONCE can be dropped.
>
> The load is a different story. While load tearing may not be an issue,
> it's more likely that other optimizations can break the code. For
> example, load fusing can break code that expects repeated loads in a
> loop. E.g. I found these uses of i_size_read in loops:
>
> git grep -E '(for|while) \(.*i_size_read'
> fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c:		for (i = 0; i < i_size_read(inode) && i < offset; ) {
> fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)
> fs/squashfs/dir.c:	while (length < i_size_read(inode)) {
> fs/squashfs/namei.c:	while (length < i_size_read(dir)) {
>
> Can i_size writes happen concurrently, and if so, will these break if
> the compiler decides to just do i_size_read's load once and keep the
> result in a register?

It depends on the semantics, and the behaviour when the value is not
cached in a register might be the wrong one.

A concrete example with assembly and analysis can be found in
d98da49977f6 ("btrfs: save i_size to avoid double evaluation of
i_size_read in compress_file_range"), which is the workaround mentioned
in my other mail.

C:

  actual_end = min_t(u64, i_size_read(inode), end + 1);

Asm:

  mov    0x20(%rsp),%rax
  cmp    %rax,0x48(%r15)		# read
  movl   $0x0,0x18(%rsp)
  mov    %rax,%r12
  mov    %r14,%rax
  cmovbe 0x48(%r15),%r12		# eval

Where %r15 is the inode and 0x48 is the offset of i_size.

The original fix was to revert 62b37622718c, which would do an
intermediate assignment and thereby also avoid the double evaluation,
but that is not future-proof, should the compiler merge the stores and
call i_size_read anyway.
There's a patch adding READ_ONCE to i_size_read, but that's not being
applied at the moment, and we needed to fix the bug. Instead, the
workaround emulates READ_ONCE with two barrier()s, which is what
effectively happens. The assembly confirms a single evaluation:

  mov    0x48(%rbp),%rax		# read once
  mov    0x20(%rsp),%rcx
  mov    $0x20,%edx
  cmp    %rax,%rcx
  cmovbe %rcx,%rax
  mov    %rax,(%rsp)
  mov    %rax,%rcx
  mov    %r14,%rax

Where 0x48(%rbp) is inode->i_size, loaded once into %rax.
> On Feb 19, 2020, at 4:21 AM, Marco Elver <elver@google.com> wrote:
>
> Let's assume the WRITE_ONCE can be dropped.
>
> The load is a different story. While load tearing may not be an issue,
> it's more likely that other optimizations can break the code. For
> example, load fusing can break code that expects repeated loads in a
> loop. E.g. I found these uses of i_size_read in loops:
>
> git grep -E '(for|while) \(.*i_size_read'
> fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c:		for (i = 0; i < i_size_read(inode) && i < offset; ) {
> fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c:		while (ctx->pos < i_size_read(inode)
> fs/squashfs/dir.c:	while (length < i_size_read(inode)) {
> fs/squashfs/namei.c:	while (length < i_size_read(dir)) {
>
> Can i_size writes happen concurrently, and if so, will these break if
> the compiler decides to just do i_size_read's load once and keep the
> result in a register?

Al, is it more acceptable to add only READ_ONCE() then?
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..25f98da90cf3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -863,7 +863,7 @@ static inline loff_t i_size_read(const struct inode *inode)
 	preempt_enable();
 	return i_size;
 #else
-	return inode->i_size;
+	return READ_ONCE(inode->i_size);
 #endif
 }
 
@@ -885,7 +885,7 @@ static inline void i_size_write(struct inode *inode, loff_t i_size)
 	inode->i_size = i_size;
 	preempt_enable();
 #else
-	inode->i_size = i_size;
+	WRITE_ONCE(inode->i_size, i_size);
 #endif
 }
inode::i_size could be accessed concurrently as noticed by KCSAN,

BUG: KCSAN: data-race in iomap_do_writepage / iomap_write_end

write to 0xffff8bf68fc0cac0 of 8 bytes by task 7484 on cpu 71:
 iomap_write_end+0xea/0x530
 i_size_write at include/linux/fs.h:888
 (inlined by) iomap_write_end at fs/iomap/buffered-io.c:782
 iomap_write_actor+0x132/0x200
 iomap_apply+0x245/0x8a5
 iomap_file_buffered_write+0xbd/0xf0
 xfs_file_buffered_aio_write+0x1c2/0x790 [xfs]
 xfs_file_write_iter+0x232/0x2d0 [xfs]
 new_sync_write+0x29c/0x3b0
 __vfs_write+0x92/0xa0
 vfs_write+0x103/0x260
 ksys_write+0x9d/0x130
 __x64_sys_write+0x4c/0x60
 do_syscall_64+0x91/0xb05
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

read to 0xffff8bf68fc0cac0 of 8 bytes by task 5901 on cpu 70:
 iomap_do_writepage+0xf4/0x450
 i_size_read at include/linux/fs.h:866
 (inlined by) iomap_do_writepage at fs/iomap/buffered-io.c:1558
 write_cache_pages+0x523/0xb20
 iomap_writepages+0x47/0x80
 xfs_vm_writepages+0xc7/0x100 [xfs]
 do_writepages+0x5e/0x130
 __writeback_single_inode+0xd5/0xb20
 writeback_sb_inodes+0x429/0x910
 __writeback_inodes_wb+0xc4/0x150
 wb_writeback+0x47b/0x830
 wb_workfn+0x688/0x930
 process_one_work+0x54f/0xb90
 worker_thread+0x80/0x5f0
 kthread+0x1cd/0x1f0
 ret_from_fork+0x27/0x50

Reported by Kernel Concurrency Sanitizer on:
CPU: 70 PID: 5901 Comm: kworker/u257:2 Tainted: G L 5.6.0-rc2-next-20200218+ #2
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
Workqueue: writeback wb_workfn (flush-254:0)

The write is protected by exclusive inode::i_rwsem (in
xfs_file_buffered_aio_write()) but the read is not. A shattered value
could introduce a logic bug. Fixed it using a pair of WRITE/READ_ONCE().

Signed-off-by: Qian Cai <cai@lca.pw>
---
 include/linux/fs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)