Message ID | 4240e179e2439dd1698798e2de79ec59990cbaa0.1712452660.git.wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Series | btrfs: fix wrong block_start calculation for btrfs_drop_extent_map_range() |
Hi,

> [BUG]
> During my extent_map cleanup/refactor, with more than too strict sanity
> checks, extent-map-tests::test_case_7() would crash my extent_map sanity
> checks.
>
> The problem is, after btrfs_drop_extent_map_range(), the resulted
> extent_map has a @block_start way too large.
> Meanwhile my btrfs_file_extent_item based members are returning a
> correct @disk_bytenr along with correct @offset.
>
> The extent map layout looks like this:
>
>      0        16K      32K       48K
>      | PINNED |        | Regular |
>
> The regular em at [32K, 48K) also has 32K @block_start.
>
> Then drop range [0, 36K), which should shrink the regular one to be
> [36K, 48K).
> However the @block_start is incorrect, we expect 32K + 4K, but got 52K.
>
> [CAUSE]
> Inside btrfs_drop_extent_map_range() function, if we hit an extent_map
> that covers the target range but is still beyond it, we need to split
> that extent map into half:
>
>      |<-- drop range -->|
>               |<----- existing extent_map --->|
>
> And if the extent map is not compressed, we need to forward
> extent_map::block_start by the difference between the end of drop range
> and the extent map start.
>
> However in that particular case, the difference is calculated using
> (start + len - em->start).
>
> The problem is @start can be modified if the drop range covers any
> pinned extent.
>
> This leads to wrong calculation, and would be caught by my later
> extent_map sanity checks, which checks the em::block_start against
> btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset.
>
> And unfortunately this is going to cause data corruption, as the
> splitted em is pointing an incorrect location, can cause either
> unexpected read error or wild writes.
>
> [FIX]
> Fix it by avoiding using @start completely, and use @end - em->start
> instead, which @end is exclusive bytenr number.
>
> And update the test case to verify the @block_start to prevent such
> problem from happening.
>
> CC: stable@vger.kernel.org # 6.7+
> Fixes: c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
> Signed-off-by: Qu Wenruo <wqu@suse.com>

$ git describe --contains c962098ca4af
v6.5-rc7~4^2

so it should be
CC: stable@vger.kernel.org # 6.5+

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2024/04/08
On Mon, Apr 08, 2024 at 08:00:15AM +0800, Wang Yugui wrote:
> Hi,
>
> > [BUG]
> > [...]
> >
> > CC: stable@vger.kernel.org # 6.7+
> > Fixes: c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> $ git describe --contains c962098ca4af
> v6.5-rc7~4^2
>
> so it should be
> CC: stable@vger.kernel.org # 6.5+

As the "Fixes:" commit was backported to the following kernel releases:
	6.1.47 6.4.12
it should go back to 6.1+ as well :)

But we can handle that when it hits Linus's tree.

thanks,

greg k-h
On Sun, Apr 7, 2024 at 2:18 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> [...]
>
> And unfortunately this is going to cause data corruption, as the
> splitted em is pointing an incorrect location, can cause either
> unexpected read error or wild writes.

It can't happen for either reads or writes actually.

As for writes, it can't happen because:

1) The issue only happens when skip_pinned is true, which is the only
case that adjusts the 'start' variable (parameter);

2) All IO paths pass false for the skip_pinned parameter, only
relocation passes true when replacing the bytenr in file extent items,
and the range it uses for btrfs_drop_extent_map_range() matches the
extent item's range, so it won't cover extent maps outside the range;

3) Extent maps for writes in progress are always pinned;

4) Before doing IO on a range we lock the range and wait for any
existing ordered extents in the range to complete, which results in
unpinning extent maps;

5) Extent maps for writes are created when running delalloc (or during
the write for direct IO), along with the ordered extent, and are
created as pinned.

With all these, I don't see how we can get a "wild write" or any
problem in a write path.

As for reads, it doesn't happen because of what's said in 2 regarding
the range passed to btrfs_drop_extent_map_range().

So as far as I can see, it's currently a harmless bug, and maybe it
always has been because the bad calculation has been there since 2008,
see below.
If it affected reads or writes, it would be easy to trigger with
fstests and fsx for example (fstests).

It's certainly a bug, it just doesn't have any consequences as far as
I can see, so the changelog should be updated.

>
> [FIX]
> Fix it by avoiding using @start completely, and use @end - em->start
> instead, which @end is exclusive bytenr number.
>
> And update the test case to verify the @block_start to prevent such
> problem from happening.
>
> CC: stable@vger.kernel.org # 6.7+
> Fixes: c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")

That commit doesn't influence how split->block_start is updated, only
split->start and split->len.
So I can't understand why you chose to blame that commit.

The bug was actually introduced in 2008 by the following commit:

3b951516ed70 ("Btrfs: Use the extent map cache to find the logical
disk block during data retries")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b951516ed703af0f6d82053937655ad69b60864

> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_map.c             | 2 +-
>  fs/btrfs/tests/extent-map-tests.c | 6 +++++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 471654cb65b0..955ce300e5a1 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -799,7 +799,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>  					split->block_len = em->block_len;
>  					split->orig_start = em->orig_start;
>  				} else {
> -					const u64 diff = start + len - em->start;
> +					const u64 diff = end - em->start;
>
>  					split->block_len = split->len;
>  					split->block_start += diff;
> diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
> index 253cce7ffecf..80e71c5cb7ab 100644
> --- a/fs/btrfs/tests/extent-map-tests.c
> +++ b/fs/btrfs/tests/extent-map-tests.c
> @@ -818,7 +818,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
>  		test_err("em->len is %llu, expected 16K", em->len);
>  		goto out;
>  	}
> -

Please avoid such accidental changes.

Thanks.

>  	free_extent_map(em);
>
>  	read_lock(&em_tree->lock);
> @@ -847,6 +846,11 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
>  		goto out;
>  	}
>
> +	if (em->block_start != SZ_32K + SZ_4K) {
> +		test_err("em->block_start is %llu, expected 36K", em->block_start);
> +		goto out;
> +	}
> +
>  	free_extent_map(em);
>
>  	read_lock(&em_tree->lock);
> --
> 2.44.0
>
On 2024/4/8 20:18, Filipe Manana wrote:
> On Sun, Apr 7, 2024 at 2:18 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> [...]
>>
>> And unfortunately this is going to cause data corruption, as the
>> splitted em is pointing an incorrect location, can cause either
>> unexpected read error or wild writes.
>
> It can't happen for either reads or writes actually.
>
> As for writes, it can't happen because:
>
> 1) The issue only happens when skip_pinned is true, which is the only
> case that adjusts the 'start' variable (parameter);
>
> 2) All IO paths pass false for the skip_pinned parameter, only
> relocation passes true when replacing the bytenr in file extent items,
> and the range it uses for btrfs_drop_extent_map_range() matches the
> extent item's range, so it won't cover extent maps outside the range;

Thankfully that's what I missed.

In that case we're fine.

> [...]
>
>> CC: stable@vger.kernel.org # 6.7+
>> Fixes: c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
>
> That commit doesn't influence how split->block_start is updated, only
> split->start and split->len.
> So I can't understand why you chose to blame that commit.

That patch removed the @len update when updating @start.

Before that patch every time we update @start, @len would be changed to
keep the end the same.

> The bug was actually introduced in 2008 by the following commit:
>
> 3b951516ed70 ("Btrfs: Use the extent map cache to find the logical
> disk block during data retries")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b951516ed703af0f6d82053937655ad69b60864

Nope, just before the offending patch, the code looks like this for
pinned extent maps:

		if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
			start = em_end;
			if (end != (u64)-1)
				len = start + len - em_end;
			goto next;
		}

Which is correct.

Thanks,
Qu

> [...]
On Mon, Apr 08, 2024 at 06:57:56AM +0200, Greg KH wrote:
> On Mon, Apr 08, 2024 at 08:00:15AM +0800, Wang Yugui wrote:
> > Hi,
> >
> > > [BUG]
> > > [...]
> > >
> > > CC: stable@vger.kernel.org # 6.7+
> > > Fixes: c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> >
> > $ git describe --contains c962098ca4af
> > v6.5-rc7~4^2
> >
> > so it should be
> > CC: stable@vger.kernel.org # 6.5+
>
> As the "Fixes:" commit was backported to the following kernel releases:
> 	6.1.47 6.4.12
> it should go back to 6.1+ as well :)

Determining all the versions requires one extra step to scan the
stable-queue.git for the commit. I think everybody does 'git describe
--contains COMMITID' based on Fixes: and then pick the version for CC:.
This is "best" we can promise for an average developer.

> But we can handle that when it hits Linus's tree.

I think it's easier for you to queue the patch to other versions at the
time you pick it from mails or Linus' tree, but from what I've seen so
far this is how it works.
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 471654cb65b0..955ce300e5a1 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -799,7 +799,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 					split->block_len = em->block_len;
 					split->orig_start = em->orig_start;
 				} else {
-					const u64 diff = start + len - em->start;
+					const u64 diff = end - em->start;
 
 					split->block_len = split->len;
 					split->block_start += diff;
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 253cce7ffecf..80e71c5cb7ab 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -818,7 +818,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 		test_err("em->len is %llu, expected 16K", em->len);
 		goto out;
 	}
-
 	free_extent_map(em);
 
 	read_lock(&em_tree->lock);
@@ -847,6 +846,11 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (em->block_start != SZ_32K + SZ_4K) {
+		test_err("em->block_start is %llu, expected 36K", em->block_start);
+		goto out;
+	}
+
 	free_extent_map(em);
 
 	read_lock(&em_tree->lock);
[BUG]
During my extent_map cleanup/refactor, with overly strict sanity
checks, extent-map-tests::test_case_7() would trip those checks and
crash.

The problem is, after btrfs_drop_extent_map_range(), the resulting
extent_map has a @block_start that is far too large.
Meanwhile my btrfs_file_extent_item based members are returning a
correct @disk_bytenr along with a correct @offset.

The extent map layout looks like this:

     0        16K      32K       48K
     | PINNED |        | Regular |

The regular em at [32K, 48K) also has a @block_start of 32K.

Then drop range [0, 36K), which should shrink the regular one to
[36K, 48K).
However the @block_start is incorrect: we expect 32K + 4K, but got 52K.

[CAUSE]
Inside btrfs_drop_extent_map_range(), if we hit an extent_map that
overlaps the drop range but extends beyond its end, we need to split
that extent map in two:

     |<-- drop range -->|
              |<----- existing extent_map --->|

And if the extent map is not compressed, we need to advance
extent_map::block_start by the difference between the end of the drop
range and the extent map start.

However in that particular case, the difference is calculated using
(start + len - em->start).

The problem is that @start can be modified if the drop range covers any
pinned extent.

This leads to a wrong calculation, and would be caught by my later
extent_map sanity checks, which check em::block_start against
btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset.

And unfortunately this is going to cause data corruption, as the split
em points to an incorrect location, which can cause either unexpected
read errors or wild writes.

[FIX]
Fix it by avoiding @start completely and using @end - em->start
instead, where @end is the exclusive end bytenr.

And update the test case to verify @block_start, to prevent such a
problem from happening again.

CC: stable@vger.kernel.org # 6.7+
Fixes: c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_map.c             | 2 +-
 fs/btrfs/tests/extent-map-tests.c | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)