Message ID | 1408959780-4217-1-git-send-email-fdmanana@suse.com (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
On Mon, Aug 25, 2014 at 10:43:00AM +0100, Filipe Manana wrote: > While writing to a file, in inode.c:cow_file_range() (and same applies to > submit_compressed_extents()), after reserving an extent for the file data, > we create a new extent map for the written range and insert it into the > extent map cache. After that, we create an ordered operation, but if it > fails (due to a transient/temporary-ENOMEM), we return without dropping > that extent map, which points to a reserved extent that is freed when we > return. A subsequent incremental fsync (when the btrfs inode doesn't have > the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and > logs a file extent item based on that extent map, which points to a disk > extent that doesn't contain valid data - it was freed by us earlier, at this > point it might contain any random/garbage data. > > Therefore, if we reach an error condition when cowing a file range after > we added the new extent map to the cache, drop it from the cache before > returning. > > Some sequence of steps that lead to this: > > $ mkfs.btrfs -f /dev/sdd > $ mount -o commit=9999 /dev/sdd /mnt > $ cd /mnt > > $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo > $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096" > $ sync > > $ od -t x1 foo > 0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 > * > 0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 > * > 0020000 > > $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo > > # Now this write + fsync fail with -ENOMEM, which was returned by > # btrfs_add_ordered_extent() in inode.c:cow_file_range(). > $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo > $ xfs_io -c "fsync" foo > fsync: Cannot allocate memory > > # Now do a new write + fsync, which will succeed. Our previous > # -ENOMEM was a transient/temporary error. > $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo > $ xfs_io -c "fsync" foo > > # Our file content (in page cache) is now: > $ od -t x1 foo > 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 > * > 0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > * > 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee > * > 0050000 > > # Now reboot the machine, and mount the fs, so that fsync log replay > # takes place. > > # The file content is now weird, in particular the first 8Kb, which > # do not match our data before nor after the sync command above. > $ od -t x1 foo > 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee > * > 0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 > * > 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee > * > 0050000 > > # In fact these first 4Kb are a duplicate of the last 4kb block. > # The last write got an extent map/file extent item that points to > # the same disk extent that we got in the write+fsync that failed > # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to > # verify that: > > $ btrfs-debug-tree /dev/sdd > (...) > item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 > extent data disk byte 12582912 nr 8192 > extent data offset 0 nr 8192 ram 8192 > item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53 > extent data disk byte 0 nr 0 > extent data offset 0 nr 8192 ram 8192 > item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53 > extent data disk byte 12582912 nr 4096 > extent data offset 0 nr 4096 ram 4096 > > $ umount /dev/sdd > $ btrfsck /dev/sdd > Checking filesystem on /dev/sdd > UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f > checking extents > extent item 12582912 has multiple extent items > ref mismatch on [12582912 4096] extent item 1, found 2 > Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192 > backpointer mismatch on [12582912 4096] > Errors found in extent allocation tree or chunk allocation > checking free space cache > checking fs roots > root 5 inode 257 errors 1000, some csum missing > found 131074 bytes used err is 1 > total csum bytes: 4 > total tree bytes: 131072 > total fs tree bytes: 32768 > total extent tree bytes: 16384 > btree space waste bytes: 123404 > file data blocks allocated: 274432 > referenced 274432 > Btrfs v3.14.1-96-gcc7fd5a-dirty > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > fs/btrfs/inode.c | 12 +++++++++--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index c678dea..16e8146 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -792,8 +792,12 @@ retry: > ins.offset, > BTRFS_ORDERED_COMPRESSED, > async_extent->compress_type); > - if (ret) > + if (ret) { > + btrfs_drop_extent_cache(inode, async_extent->start, > + async_extent->start + > + async_extent->ram_size - 1, 0); > goto out_free_reserve; > + } > > /* > * clear dirty, set writeback and unlock the pages. What about the 'if (ret) {}' after btrfs_add_ordered_extent_compress()? It looks similar to this case. thanks, -liubo > @@ -985,14 +989,14 @@ static noinline int cow_file_range(struct inode *inode, > ret = btrfs_add_ordered_extent(inode, start, ins.objectid, > ram_size, cur_alloc_size, 0); > if (ret) > - goto out_reserve; > + goto out_drop_extent_cache; > > if (root->root_key.objectid == > BTRFS_DATA_RELOC_TREE_OBJECTID) { > ret = btrfs_reloc_clone_csums(inode, start, > cur_alloc_size); > if (ret) > - goto out_reserve; > + goto out_drop_extent_cache; > } > > if (disk_num_bytes < cur_alloc_size) > @@ -1020,6 +1024,8 @@ static noinline int cow_file_range(struct inode *inode, > out: > return ret; > > +out_drop_extent_cache: > + btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0); > out_reserve: > btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); > out_unlock: > -- > 1.9.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Aug 26, 2014 at 4:32 AM, Liu Bo <bo.li.liu@oracle.com> wrote: > On Mon, Aug 25, 2014 at 10:43:00AM +0100, Filipe Manana wrote: >> While writing to a file, in inode.c:cow_file_range() (and same applies to >> submit_compressed_extents()), after reserving an extent for the file data, >> we create a new extent map for the written range and insert it into the >> extent map cache. After that, we create an ordered operation, but if it >> fails (due to a transient/temporary-ENOMEM), we return without dropping >> that extent map, which points to a reserved extent that is freed when we >> return. A subsequent incremental fsync (when the btrfs inode doesn't have >> the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and >> logs a file extent item based on that extent map, which points to a disk >> extent that doesn't contain valid data - it was freed by us earlier, at this >> point it might contain any random/garbage data. >> >> Therefore, if we reach an error condition when cowing a file range after >> we added the new extent map to the cache, drop it from the cache before >> returning. >> >> Some sequence of steps that lead to this: >> >> $ mkfs.btrfs -f /dev/sdd >> $ mount -o commit=9999 /dev/sdd /mnt >> $ cd /mnt >> >> $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo >> $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096" >> $ sync >> >> $ od -t x1 foo >> 0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 >> * >> 0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 >> * >> 0020000 >> >> $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo >> >> # Now this write + fsync fail with -ENOMEM, which was returned by >> # btrfs_add_ordered_extent() in inode.c:cow_file_range(). >> $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo >> $ xfs_io -c "fsync" foo >> fsync: Cannot allocate memory >> >> # Now do a new write + fsync, which will succeed. Our previous >> # -ENOMEM was a transient/temporary error. >> $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo >> $ xfs_io -c "fsync" foo >> >> # Our file content (in page cache) is now: >> $ od -t x1 foo >> 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 >> * >> 0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> * >> 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> * >> 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee >> * >> 0050000 >> >> # Now reboot the machine, and mount the fs, so that fsync log replay >> # takes place. >> >> # The file content is now weird, in particular the first 8Kb, which >> # do not match our data before nor after the sync command above. >> $ od -t x1 foo >> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee >> * >> 0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 >> * >> 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> * >> 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee >> * >> 0050000 >> >> # In fact these first 4Kb are a duplicate of the last 4kb block. >> # The last write got an extent map/file extent item that points to >> # the same disk extent that we got in the write+fsync that failed >> # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to >> # verify that: >> >> $ btrfs-debug-tree /dev/sdd >> (...) >> item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 >> extent data disk byte 12582912 nr 8192 >> extent data offset 0 nr 8192 ram 8192 >> item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53 >> extent data disk byte 0 nr 0 >> extent data offset 0 nr 8192 ram 8192 >> item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53 >> extent data disk byte 12582912 nr 4096 >> extent data offset 0 nr 4096 ram 4096 >> >> $ umount /dev/sdd >> $ btrfsck /dev/sdd >> Checking filesystem on /dev/sdd >> UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f >> checking extents >> extent item 12582912 has multiple extent items >> ref mismatch on [12582912 4096] extent item 1, found 2 >> Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192 >> backpointer mismatch on [12582912 4096] >> Errors found in extent allocation tree or chunk allocation >> checking free space cache >> checking fs roots >> root 5 inode 257 errors 1000, some csum missing >> found 131074 bytes used err is 1 >> total csum bytes: 4 >> total tree bytes: 131072 >> total fs tree bytes: 32768 >> total extent tree bytes: 16384 >> btree space waste bytes: 123404 >> file data blocks allocated: 274432 >> referenced 274432 >> Btrfs v3.14.1-96-gcc7fd5a-dirty >> >> Signed-off-by: Filipe Manana <fdmanana@suse.com> >> --- >> fs/btrfs/inode.c | 12 +++++++++--- >> 1 file changed, 9 insertions(+), 3 deletions(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index c678dea..16e8146 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -792,8 +792,12 @@ retry: >> ins.offset, >> BTRFS_ORDERED_COMPRESSED, >> async_extent->compress_type); >> - if (ret) >> + if (ret) { >> + btrfs_drop_extent_cache(inode, async_extent->start, >> + async_extent->start + >> + async_extent->ram_size - 1, 0); >> goto out_free_reserve; >> + } >> >> /* >> * clear dirty, set writeback and unlock the pages. > > What about the 'if (ret) {}' after btrfs_add_ordered_extent_compress()? > It looks similar to this case. Similar but different, and not a problem. The problem is returning after adding the extent map to the modified list and before creating an ordered operation. For that case you mention, since the ordered operation was created, we return without removing the extent map and without returning the reserved space too - that's because removing the em and returning the space is done by btrfs_finish_ordered_io(), for which the fsync was to wait for (again, because the ordered operation exists). thanks > > thanks, > -liubo > >> @@ -985,14 +989,14 @@ static noinline int cow_file_range(struct inode *inode, >> ret = btrfs_add_ordered_extent(inode, start, ins.objectid, >> ram_size, cur_alloc_size, 0); >> if (ret) >> - goto out_reserve; >> + goto out_drop_extent_cache; >> >> if (root->root_key.objectid == >> BTRFS_DATA_RELOC_TREE_OBJECTID) { >> ret = btrfs_reloc_clone_csums(inode, start, >> cur_alloc_size); >> if (ret) >> - goto out_reserve; >> + goto out_drop_extent_cache; >> } >> >> if (disk_num_bytes < cur_alloc_size) >> @@ -1020,6 +1024,8 @@ static noinline int cow_file_range(struct inode *inode, >> out: >> return ret; >> >> +out_drop_extent_cache: >> + btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0); >> out_reserve: >> btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); >> out_unlock: >> -- >> 1.9.1 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Aug 26, 2014 at 08:56:18AM +0100, Filipe David Manana wrote: > On Tue, Aug 26, 2014 at 4:32 AM, Liu Bo <bo.li.liu@oracle.com> wrote: > > On Mon, Aug 25, 2014 at 10:43:00AM +0100, Filipe Manana wrote: > >> While writing to a file, in inode.c:cow_file_range() (and same applies to > >> submit_compressed_extents()), after reserving an extent for the file data, > >> we create a new extent map for the written range and insert it into the > >> extent map cache. After that, we create an ordered operation, but if it > >> fails (due to a transient/temporary-ENOMEM), we return without dropping > >> that extent map, which points to a reserved extent that is freed when we > >> return. A subsequent incremental fsync (when the btrfs inode doesn't have > >> the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and > >> logs a file extent item based on that extent map, which points to a disk > >> extent that doesn't contain valid data - it was freed by us earlier, at this > >> point it might contain any random/garbage data. > >> > >> Therefore, if we reach an error condition when cowing a file range after > >> we added the new extent map to the cache, drop it from the cache before > >> returning. > >> > >> Some sequence of steps that lead to this: > >> > >> $ mkfs.btrfs -f /dev/sdd > >> $ mount -o commit=9999 /dev/sdd /mnt > >> $ cd /mnt > >> > >> $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo > >> $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096" > >> $ sync > >> > >> $ od -t x1 foo > >> 0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 > >> * > >> 0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 > >> * > >> 0020000 > >> > >> $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo > >> > >> # Now this write + fsync fail with -ENOMEM, which was returned by > >> # btrfs_add_ordered_extent() in inode.c:cow_file_range(). > >> $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo > >> $ xfs_io -c "fsync" foo > >> fsync: Cannot allocate memory > >> > >> # Now do a new write + fsync, which will succeed. Our previous > >> # -ENOMEM was a transient/temporary error. > >> $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo > >> $ xfs_io -c "fsync" foo > >> > >> # Our file content (in page cache) is now: > >> $ od -t x1 foo > >> 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 > >> * > >> 0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > >> * > >> 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > >> * > >> 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee > >> * > >> 0050000 > >> > >> # Now reboot the machine, and mount the fs, so that fsync log replay > >> # takes place. > >> > >> # The file content is now weird, in particular the first 8Kb, which > >> # do not match our data before nor after the sync command above. > >> $ od -t x1 foo > >> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee > >> * > >> 0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 > >> * > >> 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > >> * > >> 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee > >> * > >> 0050000 > >> > >> # In fact these first 4Kb are a duplicate of the last 4kb block. > >> # The last write got an extent map/file extent item that points to > >> # the same disk extent that we got in the write+fsync that failed > >> # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to > >> # verify that: > >> > >> $ btrfs-debug-tree /dev/sdd > >> (...) > >> item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 > >> extent data disk byte 12582912 nr 8192 > >> extent data offset 0 nr 8192 ram 8192 > >> item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53 > >> extent data disk byte 0 nr 0 > >> extent data offset 0 nr 8192 ram 8192 > >> item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53 > >> extent data disk byte 12582912 nr 4096 > >> extent data offset 0 nr 4096 ram 4096 > >> > >> $ umount /dev/sdd > >> $ btrfsck /dev/sdd > >> Checking filesystem on /dev/sdd > >> UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f > >> checking extents > >> extent item 12582912 has multiple extent items > >> ref mismatch on [12582912 4096] extent item 1, found 2 > >> Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192 > >> backpointer mismatch on [12582912 4096] > >> Errors found in extent allocation tree or chunk allocation > >> checking free space cache > >> checking fs roots > >> root 5 inode 257 errors 1000, some csum missing > >> found 131074 bytes used err is 1 > >> total csum bytes: 4 > >> total tree bytes: 131072 > >> total fs tree bytes: 32768 > >> total extent tree bytes: 16384 > >> btree space waste bytes: 123404 > >> file data blocks allocated: 274432 > >> referenced 274432 > >> Btrfs v3.14.1-96-gcc7fd5a-dirty > >> > >> Signed-off-by: Filipe Manana <fdmanana@suse.com> > >> --- > >> fs/btrfs/inode.c | 12 +++++++++--- > >> 1 file changed, 9 insertions(+), 3 deletions(-) > >> > >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > >> index c678dea..16e8146 100644 > >> --- a/fs/btrfs/inode.c > >> +++ b/fs/btrfs/inode.c > >> @@ -792,8 +792,12 @@ retry: > >> ins.offset, > >> BTRFS_ORDERED_COMPRESSED, > >> async_extent->compress_type); > >> - if (ret) > >> + if (ret) { > >> + btrfs_drop_extent_cache(inode, async_extent->start, > >> + async_extent->start + > >> + async_extent->ram_size - 1, 0); > >> goto out_free_reserve; > >> + } My bad, I made a mistake, what I want is just sitting here ;) > >> > >> /* > >> * clear dirty, set writeback and unlock the pages. > > > > What about the 'if (ret) {}' after btrfs_add_ordered_extent_compress()? > > It looks similar to this case. > > Similar but different, and not a problem. > > The problem is returning after adding the extent map to the modified > list and before creating an ordered operation. > For that case you mention, since the ordered operation was created, we > return without removing the extent map and without returning the > reserved space too - that's because removing the em and returning the > space is done by btrfs_finish_ordered_io(), for which the fsync was to > wait for (again, because the ordered operation exists). Thanks for the explanation. -liubo > > thanks > > > > > thanks, > > -liubo > > > >> @@ -985,14 +989,14 @@ static noinline int cow_file_range(struct inode *inode, > >> ret = btrfs_add_ordered_extent(inode, start, ins.objectid, > >> ram_size, cur_alloc_size, 0); > >> if (ret) > >> - goto out_reserve; > >> + goto out_drop_extent_cache; > >> > >> if (root->root_key.objectid == > >> BTRFS_DATA_RELOC_TREE_OBJECTID) { > >> ret = btrfs_reloc_clone_csums(inode, start, > >> cur_alloc_size); > >> if (ret) > >> - goto out_reserve; > >> + goto out_drop_extent_cache; > >> } > >> > >> if (disk_num_bytes < cur_alloc_size) > >> @@ -1020,6 +1024,8 @@ static noinline int cow_file_range(struct inode *inode, > >> out: > >> return ret; > >> > >> +out_drop_extent_cache: > >> + btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0); > >> out_reserve: > >> btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); > >> out_unlock: > >> -- > >> 1.9.1 > >> > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > >> the body of a message to majordomo@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Filipe David Manana, > > "Reasonable men adapt themselves to the world. > Unreasonable men adapt the world to themselves. > That's why all progress depends on unreasonable men." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c678dea..16e8146 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -792,8 +792,12 @@ retry: ins.offset, BTRFS_ORDERED_COMPRESSED, async_extent->compress_type); - if (ret) + if (ret) { + btrfs_drop_extent_cache(inode, async_extent->start, + async_extent->start + + async_extent->ram_size - 1, 0); goto out_free_reserve; + } /* * clear dirty, set writeback and unlock the pages. @@ -985,14 +989,14 @@ static noinline int cow_file_range(struct inode *inode, ret = btrfs_add_ordered_extent(inode, start, ins.objectid, ram_size, cur_alloc_size, 0); if (ret) - goto out_reserve; + goto out_drop_extent_cache; if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) { ret = btrfs_reloc_clone_csums(inode, start, cur_alloc_size); if (ret) - goto out_reserve; + goto out_drop_extent_cache; } if (disk_num_bytes < cur_alloc_size) @@ -1020,6 +1024,8 @@ static noinline int cow_file_range(struct inode *inode, out: return ret; +out_drop_extent_cache: + btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0); out_reserve: btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); out_unlock:
While writing to a file, in inode.c:cow_file_range() (and same applies to submit_compressed_extents()), after reserving an extent for the file data, we create a new extent map for the written range and insert it into the extent map cache. After that, we create an ordered operation, but if it fails (due to a transient/temporary-ENOMEM), we return without dropping that extent map, which points to a reserved extent that is freed when we return. A subsequent incremental fsync (when the btrfs inode doesn't have the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and logs a file extent item based on that extent map, which points to a disk extent that doesn't contain valid data - it was freed by us earlier, at this point it might contain any random/garbage data. Therefore, if we reach an error condition when cowing a file range after we added the new extent map to the cache, drop it from the cache before returning. Some sequence of steps that lead to this: $ mkfs.btrfs -f /dev/sdd $ mount -o commit=9999 /dev/sdd /mnt $ cd /mnt $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096" $ sync $ od -t x1 foo 0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 * 0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 * 0020000 $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo # Now this write + fsync fail with -ENOMEM, which was returned by # btrfs_add_ordered_extent() in inode.c:cow_file_range(). $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo $ xfs_io -c "fsync" foo fsync: Cannot allocate memory # Now do a new write + fsync, which will succeed. Our previous # -ENOMEM was a transient/temporary error. $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo $ xfs_io -c "fsync" foo # Our file content (in page cache) is now: $ od -t x1 foo 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 * 0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0050000 # Now reboot the machine, and mount the fs, so that fsync log replay # takes place. # The file content is now weird, in particular the first 8Kb, which # do not match our data before nor after the sync command above. $ od -t x1 foo 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 * 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0050000 # In fact these first 4Kb are a duplicate of the last 4kb block. # The last write got an extent map/file extent item that points to # the same disk extent that we got in the write+fsync that failed # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to # verify that: $ btrfs-debug-tree /dev/sdd (...) item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 extent data disk byte 12582912 nr 8192 extent data offset 0 nr 8192 ram 8192 item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53 extent data disk byte 0 nr 0 extent data offset 0 nr 8192 ram 8192 item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53 extent data disk byte 12582912 nr 4096 extent data offset 0 nr 4096 ram 4096 $ umount /dev/sdd $ btrfsck /dev/sdd Checking filesystem on /dev/sdd UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f checking extents extent item 12582912 has multiple extent items ref mismatch on [12582912 4096] extent item 1, found 2 Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192 backpointer mismatch on [12582912 4096] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots root 5 inode 257 errors 1000, some csum missing found 131074 bytes used err is 1 total csum bytes: 4 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123404 file data blocks allocated: 274432 referenced 274432 Btrfs v3.14.1-96-gcc7fd5a-dirty Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/inode.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)