Message ID | 20180405215512.513-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Apr 05, 2018 at 10:55:12PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Currently if we allocate extents beyond an inode's i_size (through the > fallocate system call) and then fsync the file, we log the extents but > after a power failure we replay them and then immediately drop them. > This behaviour happens since about 2009, commit c71bf099abdd ("Btrfs: > Avoid orphan inodes cleanup while replaying log"), because it marks > the inode as an orphan instead of dropping any extents beyond i_size > before replaying logged extents, so after the log replay, and while > the mount operation is still ongoing, we find the inode marked as an > orphan and then perform a truncation (drop extents beyond the inode's > i_size). Because the processing of orphan inodes is still done > right after replaying the log and before the mount operation finishes, > the intention of that commit does not make any sense (at least as > of today). However reverting that behaviour is not enough, because > we can not simply discard all extents beyond i_size and then replay > logged extents, because we risk dropping extents beyond i_size created > in past transactions, for example: > > add prealloc extent beyond i_size > fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode > transaction commit > add another prealloc extent beyond i_size > fsync - triggers the fast fsync path > power failure > > In that scenario, we would drop the first extent and then replay the > second one. To fix this just make sure that all prealloc extents > beyond i_size are logged, and if we find too many (which is far from > a common case), fallback to a full transaction commit (like we do when > logging regular extents in the fast fsync path). > > Trivial reproducer: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo > $ sync > $ xfs_io -c "falloc -k 256K 1M" /mnt/foo > $ xfs_io -c "fsync" /mnt/foo > <power failure> > > # mount to replay log > $ mount /dev/sdb /mnt > # at this point the file only has one extent, at offset 0, size 256K > > A test case for fstests follows soon, covering multiple scenarios that > involve adding prealloc extents with previous shrinking truncates and > without such truncates. > > Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Added to next, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 10, 2018 at 01:50:21PM +0000, Sasha Levin wrote: > Hi, > > [This is an automated email] > > This commit has been processed because it contains a "Fixes:" tag, > fixing commit: c71bf099abdd Btrfs: Avoid orphan inodes cleanup while replaying log. > > The bot has also determined it's probably a bug fixing patch. (score: 6.2138) > > The bot has tested the following trees: v4.16.1, v4.15.16, v4.14.33, v4.9.93, v4.4.127. > > v4.16.1: Build OK! > v4.15.16: Build OK! > v4.14.33: Build OK! > v4.9.93: Build failed! Errors: > tree-log.c:2367:24: error: ‘struct btrfs_fs_info’ has no member named ‘sectorsize’ > tree-log.c:2367:24: error: ‘struct btrfs_fs_info’ has no member named ‘sectorsize’ > tree-log.c:4224:13: error: ‘struct inode’ has no member named ‘flags’; did you mean ‘i_flags’? > tree-log.c:4229:38: error: ‘struct inode’ has no member named ‘vfs_inode’ > tree-log.c:4239:4: error: implicit declaration of function ‘refcount_inc’; did you mean ‘i_readcount_inc’? [-Werror=implicit-function-declaration] 4.9 and older would need manual fixes to the patch. There are a few refactorings but that should be easy to resolve if somebody wants to backport port that. > > v4.4.127: Failed to apply! Possible dependencies: > 0132761017e0 ("btrfs: fix string and comment grammatical issues and typos") > > > -- > Thanks, > Sasha????{.n?+???????+%?????ݶ??w??{.n?+????{???k~????^n?r???z???h?????&???z??z?ޗ?+??+zf???h???~????i?????????z_???j:+v???)ߣ?m -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 70afd1085033..eb3a41269b0e 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2457,13 +2457,41 @@ static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb, if (ret) break; - /* for regular files, make sure corresponding - * orphan item exist. extents past the new EOF - * will be truncated later by orphan cleanup. + /* + * Before replaying extents, truncate the inode to its + * size. We need to do it now and not after log replay + * because before an fsync we can have prealloc extents + * added beyond the inode's i_size. If we did it after, + * through orphan cleanup for example, we would drop + * those prealloc extents just after replaying them. */ if (S_ISREG(mode)) { - ret = insert_orphan_item(wc->trans, root, - key.objectid); + struct inode *inode; + u64 from; + + inode = read_one_inode(root, key.objectid); + if (!inode) { + ret = -EIO; + break; + } + from = ALIGN(i_size_read(inode), + root->fs_info->sectorsize); + ret = btrfs_drop_extents(wc->trans, root, inode, + from, (u64)-1, 1); + /* + * If the nlink count is zero here, the iput + * will free the inode. We bump it to make + * sure it doesn't get freed until the link + * count fixup is done. + */ + if (!ret) { + if (inode->i_nlink == 0) + inc_nlink(inode); + /* Update link count and nbytes. */ + ret = btrfs_update_inode(wc->trans, + root, inode); + } + iput(inode); if (ret) break; } @@ -4348,6 +4376,31 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans, num++; } + /* + * Add all prealloc extents beyond the inode's i_size to make sure we + * don't lose them after doing a fast fsync and replaying the log. + */ + if (inode->flags & BTRFS_INODE_PREALLOC) { + struct rb_node *node; + + for (node = rb_last(&tree->map); node; node = rb_prev(node)) { + em = rb_entry(node, struct extent_map, rb_node); + if (em->start < i_size_read(&inode->vfs_inode)) + break; + if (!list_empty(&em->list)) + continue; + /* Same as above loop. */ + if (++num > 32768) { + list_del_init(&tree->modified_extents); + ret = -EFBIG; + goto process; + } + refcount_inc(&em->refs); + set_bit(EXTENT_FLAG_LOGGING, &em->flags); + list_add_tail(&em->list, &extents); + } + } + list_sort(NULL, &extents, extent_cmp); btrfs_get_logged_extents(inode, logged_list, logged_start, logged_end); /*