Message ID | 1410926015-26820-1-git-send-email-quwenruo@cn.fujitsu.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote: > The following commit enhanced the merge_extent_mapping() to reduce > fragment in extent map tree, but it can't handle case which existing > lies before map_start: > 51f39 btrfs: Use right extent length when inserting overlap extent map. > > [BUG] > When existing extent map's start is before map_start, > the em->len will be minus, which will corrupt the extent map and fail to > insert the new extent map. > This will happen when someone get a large extent map, but when it is > going to insert it into extent map tree, some one has already commit > some write and split the huge extent into small parts. > > [REPRODUCER] > It is very easy to tiger using filebench with randomrw personality. > It is about 100% to reproduce when using 8G preallocated file in 60s > randonrw test. > > [FIX] > This patch can now handle any existing extent position. > Since it does not directly use existing->start, now it will find the > previous and next extent around map_start. > So the old existing->start < map_start bug will never happen again. > > [ENHANCE] > This patch will insert the best fitted extent map into extent map tree, > other than the oldest [map_start, map_start + sectorsize) or the > relatively newer but not perfect [map_start, existing->start). > > The patch will first search existing extent that does not intersects with > the desired map range [map_start, map_start + len). > The existing extent will be either before or behind map_start, and based > on the existing extent, we can find out the previous and next extent > around map_start. > > So the best fitted extent would be [prev->end, next->start). > For prev or next is not found, em->start would be prev->end and em->end > wold be next->start. > > With this patch, the fragment in extent map tree should be reduced much > more than the 51f39 commit and reduce an unneeded extent map tree search. > > Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > --- > fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++---------------- > 1 file changed, 57 insertions(+), 22 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 016c403..8039021 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -6191,21 +6191,60 @@ out_fail_inode: > goto out_fail; > } > > +/* Find next extent map of a given extent map, caller needs to ensure locks */ > +static struct extent_map *next_extent_map(struct extent_map *em) > +{ > + struct rb_node *next; > + > + next = rb_next(&em->rb_node); > + if (!next) > + return NULL; > + return container_of(next, struct extent_map, rb_node); > +} > + > +static struct extent_map *prev_extent_map(struct extent_map *em) > +{ > + struct rb_node *prev; > + > + prev = rb_prev(&em->rb_node); > + if (!prev) > + return NULL; > + return container_of(prev, struct extent_map, rb_node); > +} > + > /* helper for btfs_get_extent. Given an existing extent in the tree, > + * the existing extent is the nearest extent to map_start, > * and an extent that you want to insert, deal with overlap and insert > - * the new extent into the tree. > + * the best fitted new extent into the tree. > */ > static int merge_extent_mapping(struct extent_map_tree *em_tree, > struct extent_map *existing, > struct extent_map *em, > u64 map_start) > { > + struct extent_map *prev; > + struct extent_map *next; > + u64 start; > + u64 end; > u64 start_diff; > > BUG_ON(map_start < em->start || map_start >= extent_map_end(em)); > - start_diff = map_start - em->start; > - em->start = map_start; > - em->len = existing->start - em->start; > + > + if (existing->start > map_start) { > + next = existing; > + prev = prev_extent_map(next); > + } else { > + prev = existing; > + next = next_extent_map(prev); > + } > + > + start = prev ? extent_map_end(prev) : em->start; > + start = max_t(u64, start, em->start); > + end = next ? next->start : extent_map_end(em); > + end = min_t(u64, end, extent_map_end(em)); > + start_diff = start - em->start; > + em->start = start; > + em->len = end - start; > if (em->block_start < EXTENT_MAP_LAST_BYTE && > !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { > em->block_start += start_diff; > @@ -6482,25 +6521,21 @@ insert: > > ret = 0; > > - existing = lookup_extent_mapping(em_tree, start, len); > - if (existing && (existing->start > start || > - existing->start + existing->len <= start)) { > + existing = search_extent_mapping(em_tree, start, len); > + /* > + * existing will always be non-NULL, since there must be > + * extent causing the -EEXIST. > + */ > + if (start >= extent_map_end(existing) || > + start + len <= existing->start) { This will introduce something wrong, the 'else' part is 'em = existing;', and the condition is actually (start < extent_map_end(existing) && start + len > existing->start), which means the existing overlaps with [start, start+len). And one of overlapping cases is (existing->start > start), ie. em->start > start, this is against our rule of btrfs_get_extent, struct extent_map *btrfs_get_extent(...) { [...] insert: btrfs_release_path(path); if (em->start > start || extent_map_end(em) <= start) { btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed [%llu %llu]", em->start, em->len, start, len); err = -EIO; goto out; } [...] } thanks, -liubo > + /* > + * The existing extent map is the one nearest to > + * the [start, start + len) range which overlaps > + */ > + err = merge_extent_mapping(em_tree, existing, > + em, start); > free_extent_map(existing); > - existing = NULL; > - } > - if (!existing) { > - existing = lookup_extent_mapping(em_tree, em->start, > - em->len); > - if (existing) { > - err = merge_extent_mapping(em_tree, existing, > - em, start); > - free_extent_map(existing); > - if (err) { > - free_extent_map(em); > - em = NULL; > - } > - } else { > - err = -EIO; > + if (err) { > free_extent_map(em); > em = NULL; > } > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
-------- Original Message -------- Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map From: Liu Bo <bo.li.liu@oracle.com> To: Qu Wenruo <quwenruo@cn.fujitsu.com> Date: 2014?09?18? 12:21 > On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote: >> The following commit enhanced the merge_extent_mapping() to reduce >> fragment in extent map tree, but it can't handle case which existing >> lies before map_start: >> 51f39 btrfs: Use right extent length when inserting overlap extent map. >> >> [BUG] >> When existing extent map's start is before map_start, >> the em->len will be minus, which will corrupt the extent map and fail to >> insert the new extent map. >> This will happen when someone get a large extent map, but when it is >> going to insert it into extent map tree, some one has already commit >> some write and split the huge extent into small parts. >> >> [REPRODUCER] >> It is very easy to tiger using filebench with randomrw personality. >> It is about 100% to reproduce when using 8G preallocated file in 60s >> randonrw test. >> >> [FIX] >> This patch can now handle any existing extent position. >> Since it does not directly use existing->start, now it will find the >> previous and next extent around map_start. >> So the old existing->start < map_start bug will never happen again. >> >> [ENHANCE] >> This patch will insert the best fitted extent map into extent map tree, >> other than the oldest [map_start, map_start + sectorsize) or the >> relatively newer but not perfect [map_start, existing->start). >> >> The patch will first search existing extent that does not intersects with >> the desired map range [map_start, map_start + len). >> The existing extent will be either before or behind map_start, and based >> on the existing extent, we can find out the previous and next extent >> around map_start. >> >> So the best fitted extent would be [prev->end, next->start). >> For prev or next is not found, em->start would be prev->end and em->end >> wold be next->start. >> >> With this patch, the fragment in extent map tree should be reduced much >> more than the 51f39 commit and reduce an unneeded extent map tree search. >> >> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >> --- >> fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++---------------- >> 1 file changed, 57 insertions(+), 22 deletions(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 016c403..8039021 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -6191,21 +6191,60 @@ out_fail_inode: >> goto out_fail; >> } >> >> +/* Find next extent map of a given extent map, caller needs to ensure locks */ >> +static struct extent_map *next_extent_map(struct extent_map *em) >> +{ >> + struct rb_node *next; >> + >> + next = rb_next(&em->rb_node); >> + if (!next) >> + return NULL; >> + return container_of(next, struct extent_map, rb_node); >> +} >> + >> +static struct extent_map *prev_extent_map(struct extent_map *em) >> +{ >> + struct rb_node *prev; >> + >> + prev = rb_prev(&em->rb_node); >> + if (!prev) >> + return NULL; >> + return container_of(prev, struct extent_map, rb_node); >> +} >> + >> /* helper for btfs_get_extent. Given an existing extent in the tree, >> + * the existing extent is the nearest extent to map_start, >> * and an extent that you want to insert, deal with overlap and insert >> - * the new extent into the tree. >> + * the best fitted new extent into the tree. >> */ >> static int merge_extent_mapping(struct extent_map_tree *em_tree, >> struct extent_map *existing, >> struct extent_map *em, >> u64 map_start) >> { >> + struct extent_map *prev; >> + struct extent_map *next; >> + u64 start; >> + u64 end; >> u64 start_diff; >> >> BUG_ON(map_start < em->start || map_start >= extent_map_end(em)); >> - start_diff = map_start - em->start; >> - em->start = map_start; >> - em->len = existing->start - em->start; >> + >> + if (existing->start > map_start) { >> + next = existing; >> + prev = prev_extent_map(next); >> + } else { >> + prev = existing; >> + next = next_extent_map(prev); >> + } >> + >> + start = prev ? extent_map_end(prev) : em->start; >> + start = max_t(u64, start, em->start); >> + end = next ? next->start : extent_map_end(em); >> + end = min_t(u64, end, extent_map_end(em)); >> + start_diff = start - em->start; >> + em->start = start; >> + em->len = end - start; >> if (em->block_start < EXTENT_MAP_LAST_BYTE && >> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >> em->block_start += start_diff; >> @@ -6482,25 +6521,21 @@ insert: >> >> ret = 0; >> >> - existing = lookup_extent_mapping(em_tree, start, len); >> - if (existing && (existing->start > start || >> - existing->start + existing->len <= start)) { >> + existing = search_extent_mapping(em_tree, start, len); >> + /* >> + * existing will always be non-NULL, since there must be >> + * extent causing the -EEXIST. >> + */ >> + if (start >= extent_map_end(existing) || >> + start + len <= existing->start) { > This will introduce something wrong, the 'else' part is 'em = existing;', > and the condition is actually > (start < extent_map_end(existing) && start + len > existing->start), > which means the existing overlaps with [start, start+len). Nope, the else part is doing the right thing. Before the patch, going to the 'em = existing;' routine's condition is like the following: 1) existing returned by lookup_extent_mapping is not NULL 2) (existing->start > start || existing->start + existing->len <=start) is not met 1) implies the following condition: (in extent_map.c, __lookup_extent_mapping()) !!(end > existing->start && start < extent_map_end(existing)), which is equal to the following: start + len > existing->start(1) && start < extent_map_end(existing) (2) 2) is actually the following start >= existing->start (3) && start < extent_map_end(existing) (4) And the hidden condition len > 0(5) combining 1) and 2), you will find the real condition to go to 'em = existing' routine is what the patch does. Due to (5), (1) and (3) is the same condition, and (2) (4) is the same too. So the patch is OK. 'em = existing' condition is not broken. > > And one of overlapping cases is (existing->start > start), ie. em->start > start, this is > against our rule of btrfs_get_extent, Nope again, this overlapping in fact is quite normal in multithread random read/write. The files's [0~16) is a preallocated one, Thread A: write [4K, 8K) into the file, but not committed yet. extent map tree contains [0,16K) only Thread B: btrfs_get_extent() the map_start is 8K, len is 4K as an example grab a large em, take [0,16K), since [4K,8K) write is not committed. comes to insert: btrfs_release_path(path); Thread A: [4K, 8K) is not committed the extent map is now [0, 4K) [4K, 8K) [8K, 16K). Thread B: goes to insert: add_extent_mapping() the [0,16K) is overlapping, and the returned existing one is [8K, 16K). which contains the [map_start, map_start + len). > struct extent_map *btrfs_get_extent(...) > { > [...] > insert: > btrfs_release_path(path); > if (em->start > start || extent_map_end(em) <= start) { > btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed > [%llu %llu]", > em->start, em->len, start, len); > err = -EIO; > goto out; > } > [...] > } > > thanks, > -liubo > >> + /* >> + * The existing extent map is the one nearest to >> + * the [start, start + len) range which overlaps >> + */ >> + err = merge_extent_mapping(em_tree, existing, >> + em, start); >> free_extent_map(existing); >> - existing = NULL; >> - } >> - if (!existing) { >> - existing = lookup_extent_mapping(em_tree, em->start, >> - em->len); >> - if (existing) { >> - err = merge_extent_mapping(em_tree, existing, >> - em, start); >> - free_extent_map(existing); >> - if (err) { >> - free_extent_map(em); >> - em = NULL; >> - } >> - } else { >> - err = -EIO; >> + if (err) { >> free_extent_map(em); >> em = NULL; >> } >> -- >> 2.1.0 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
-------- Original Message -------- Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map From: Qu Wenruo <quwenruo@cn.fujitsu.com> To: <bo.li.liu@oracle.com> Date: 2014?09?18? 13:36 > > -------- Original Message -------- > Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to > insert best fitted extent map > From: Liu Bo <bo.li.liu@oracle.com> > To: Qu Wenruo <quwenruo@cn.fujitsu.com> > Date: 2014?09?18? 12:21 >> On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote: >>> The following commit enhanced the merge_extent_mapping() to reduce >>> fragment in extent map tree, but it can't handle case which existing >>> lies before map_start: >>> 51f39 btrfs: Use right extent length when inserting overlap extent map. >>> >>> [BUG] >>> When existing extent map's start is before map_start, >>> the em->len will be minus, which will corrupt the extent map and >>> fail to >>> insert the new extent map. >>> This will happen when someone get a large extent map, but when it is >>> going to insert it into extent map tree, some one has already commit >>> some write and split the huge extent into small parts. >>> >>> [REPRODUCER] >>> It is very easy to tiger using filebench with randomrw personality. >>> It is about 100% to reproduce when using 8G preallocated file in 60s >>> randonrw test. >>> >>> [FIX] >>> This patch can now handle any existing extent position. >>> Since it does not directly use existing->start, now it will find the >>> previous and next extent around map_start. >>> So the old existing->start < map_start bug will never happen again. >>> >>> [ENHANCE] >>> This patch will insert the best fitted extent map into extent map tree, >>> other than the oldest [map_start, map_start + sectorsize) or the >>> relatively newer but not perfect [map_start, existing->start). >>> >>> The patch will first search existing extent that does not intersects >>> with >>> the desired map range [map_start, map_start + len). >>> The existing extent will be either before or behind map_start, and >>> based >>> on the existing extent, we can find out the previous and next extent >>> around map_start. >>> >>> So the best fitted extent would be [prev->end, next->start). >>> For prev or next is not found, em->start would be prev->end and em->end >>> wold be next->start. >>> >>> With this patch, the fragment in extent map tree should be reduced much >>> more than the 51f39 commit and reduce an unneeded extent map tree >>> search. >>> >>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>> --- >>> fs/btrfs/inode.c | 79 >>> ++++++++++++++++++++++++++++++++++++++++---------------- >>> 1 file changed, 57 insertions(+), 22 deletions(-) >>> >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>> index 016c403..8039021 100644 >>> --- a/fs/btrfs/inode.c >>> +++ b/fs/btrfs/inode.c >>> @@ -6191,21 +6191,60 @@ out_fail_inode: >>> goto out_fail; >>> } >>> +/* Find next extent map of a given extent map, caller needs to >>> ensure locks */ >>> +static struct extent_map *next_extent_map(struct extent_map *em) >>> +{ >>> + struct rb_node *next; >>> + >>> + next = rb_next(&em->rb_node); >>> + if (!next) >>> + return NULL; >>> + return container_of(next, struct extent_map, rb_node); >>> +} >>> + >>> +static struct extent_map *prev_extent_map(struct extent_map *em) >>> +{ >>> + struct rb_node *prev; >>> + >>> + prev = rb_prev(&em->rb_node); >>> + if (!prev) >>> + return NULL; >>> + return container_of(prev, struct extent_map, rb_node); >>> +} >>> + >>> /* helper for btfs_get_extent. Given an existing extent in the tree, >>> + * the existing extent is the nearest extent to map_start, >>> * and an extent that you want to insert, deal with overlap and >>> insert >>> - * the new extent into the tree. >>> + * the best fitted new extent into the tree. >>> */ >>> static int merge_extent_mapping(struct extent_map_tree *em_tree, >>> struct extent_map *existing, >>> struct extent_map *em, >>> u64 map_start) >>> { >>> + struct extent_map *prev; >>> + struct extent_map *next; >>> + u64 start; >>> + u64 end; >>> u64 start_diff; >>> BUG_ON(map_start < em->start || map_start >= >>> extent_map_end(em)); >>> - start_diff = map_start - em->start; >>> - em->start = map_start; >>> - em->len = existing->start - em->start; >>> + >>> + if (existing->start > map_start) { >>> + next = existing; >>> + prev = prev_extent_map(next); >>> + } else { >>> + prev = existing; >>> + next = next_extent_map(prev); >>> + } >>> + >>> + start = prev ? extent_map_end(prev) : em->start; >>> + start = max_t(u64, start, em->start); >>> + end = next ? next->start : extent_map_end(em); >>> + end = min_t(u64, end, extent_map_end(em)); >>> + start_diff = start - em->start; >>> + em->start = start; >>> + em->len = end - start; >>> if (em->block_start < EXTENT_MAP_LAST_BYTE && >>> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >>> em->block_start += start_diff; >>> @@ -6482,25 +6521,21 @@ insert: >>> ret = 0; >>> - existing = lookup_extent_mapping(em_tree, start, len); >>> - if (existing && (existing->start > start || >>> - existing->start + existing->len <= start)) { >>> + existing = search_extent_mapping(em_tree, start, len); >>> + /* >>> + * existing will always be non-NULL, since there must be >>> + * extent causing the -EEXIST. >>> + */ >>> + if (start >= extent_map_end(existing) || >>> + start + len <= existing->start) { >> This will introduce something wrong, the 'else' part is 'em = >> existing;', >> and the condition is actually >> (start < extent_map_end(existing) && start + len > existing->start), >> which means the existing overlaps with [start, start+len). > Nope, the else part is doing the right thing. > > Before the patch, going to the 'em = existing;' routine's condition is > like the following: > 1) existing returned by lookup_extent_mapping is not NULL > 2) (existing->start > start || existing->start + existing->len > <=start) is not met > > 1) implies the following condition: (in extent_map.c, > __lookup_extent_mapping()) > !!(end > existing->start && start < extent_map_end(existing)), which > is equal to the following: > start + len > existing->start(1) && start < extent_map_end(existing) (2) > > 2) is actually the following > start >= existing->start (3) && start < extent_map_end(existing) (4) > > And the hidden condition len > 0(5) > combining 1) and 2), you will find the real condition to go to 'em = > existing' routine is what the patch does. > Due to (5), (1) and (3) is the same condition, and (2) (4) is the same > too. > So the patch is OK. 'em = existing' condition is not broken. > >> >> And one of overlapping cases is (existing->start > start), ie. >> em->start > start, this is >> against our rule of btrfs_get_extent, > Nope again, this overlapping in fact is quite normal in multithread > random read/write. > The files's [0~16) is a preallocated one, > Thread A: > write [4K, 8K) into the file, but not committed yet. > extent map tree contains [0,16K) only > Thread B: > btrfs_get_extent() > the map_start is 8K, len is 4K as an example > grab a large em, take [0,16K), since [4K,8K) write is not committed. > comes to insert: btrfs_release_path(path); > > Thread A: > [4K, 8K) is not committed typo here, [4K, 8K) is now committed > the extent map is now [0, 4K) [4K, 8K) [8K, 16K). > > Thread B: > goes to insert: add_extent_mapping() > the [0,16K) is overlapping, and the returned existing one is [8K, > 16K). > which contains the [map_start, map_start + len). > >> struct extent_map *btrfs_get_extent(...) >> { >> [...] >> insert: >> btrfs_release_path(path); >> if (em->start > start || extent_map_end(em) <= start) { >> btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed >> [%llu %llu]", >> em->start, em->len, start, len); >> err = -EIO; >> goto out; >> } >> [...] >> } >> >> thanks, >> -liubo >> >>> + /* >>> + * The existing extent map is the one nearest to >>> + * the [start, start + len) range which overlaps >>> + */ >>> + err = merge_extent_mapping(em_tree, existing, >>> + em, start); >>> free_extent_map(existing); >>> - existing = NULL; >>> - } >>> - if (!existing) { >>> - existing = lookup_extent_mapping(em_tree, em->start, >>> - em->len); >>> - if (existing) { >>> - err = merge_extent_mapping(em_tree, existing, >>> - em, start); >>> - free_extent_map(existing); >>> - if (err) { >>> - free_extent_map(em); >>> - em = NULL; >>> - } >>> - } else { >>> - err = -EIO; >>> + if (err) { >>> free_extent_map(em); >>> em = NULL; >>> } >>> -- >>> 2.1.0 >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 18, 2014 at 01:36:47PM +0800, Qu Wenruo wrote: > > -------- Original Message -------- > Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() > to insert best fitted extent map > From: Liu Bo <bo.li.liu@oracle.com> > To: Qu Wenruo <quwenruo@cn.fujitsu.com> > Date: 2014?09?18? 12:21 > >On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote: > >>The following commit enhanced the merge_extent_mapping() to reduce > >>fragment in extent map tree, but it can't handle case which existing > >>lies before map_start: > >>51f39 btrfs: Use right extent length when inserting overlap extent map. > >> > >>[BUG] > >>When existing extent map's start is before map_start, > >>the em->len will be minus, which will corrupt the extent map and fail to > >>insert the new extent map. > >>This will happen when someone get a large extent map, but when it is > >>going to insert it into extent map tree, some one has already commit > >>some write and split the huge extent into small parts. > >> > >>[REPRODUCER] > >>It is very easy to tiger using filebench with randomrw personality. > >>It is about 100% to reproduce when using 8G preallocated file in 60s > >>randonrw test. > >> > >>[FIX] > >>This patch can now handle any existing extent position. > >>Since it does not directly use existing->start, now it will find the > >>previous and next extent around map_start. > >>So the old existing->start < map_start bug will never happen again. > >> > >>[ENHANCE] > >>This patch will insert the best fitted extent map into extent map tree, > >>other than the oldest [map_start, map_start + sectorsize) or the > >>relatively newer but not perfect [map_start, existing->start). > >> > >>The patch will first search existing extent that does not intersects with > >>the desired map range [map_start, map_start + len). > >>The existing extent will be either before or behind map_start, and based > >>on the existing extent, we can find out the previous and next extent > >>around map_start. > >> > >>So the best fitted extent would be [prev->end, next->start). > >>For prev or next is not found, em->start would be prev->end and em->end > >>wold be next->start. > >> > >>With this patch, the fragment in extent map tree should be reduced much > >>more than the 51f39 commit and reduce an unneeded extent map tree search. > >> > >>Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> > >>Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > >>--- > >> fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++---------------- > >> 1 file changed, 57 insertions(+), 22 deletions(-) > >> > >>diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > >>index 016c403..8039021 100644 > >>--- a/fs/btrfs/inode.c > >>+++ b/fs/btrfs/inode.c > >>@@ -6191,21 +6191,60 @@ out_fail_inode: > >> goto out_fail; > >> } > >>+/* Find next extent map of a given extent map, caller needs to ensure locks */ > >>+static struct extent_map *next_extent_map(struct extent_map *em) > >>+{ > >>+ struct rb_node *next; > >>+ > >>+ next = rb_next(&em->rb_node); > >>+ if (!next) > >>+ return NULL; > >>+ return container_of(next, struct extent_map, rb_node); > >>+} > >>+ > >>+static struct extent_map *prev_extent_map(struct extent_map *em) > >>+{ > >>+ struct rb_node *prev; > >>+ > >>+ prev = rb_prev(&em->rb_node); > >>+ if (!prev) > >>+ return NULL; > >>+ return container_of(prev, struct extent_map, rb_node); > >>+} > >>+ > >> /* helper for btfs_get_extent. Given an existing extent in the tree, > >>+ * the existing extent is the nearest extent to map_start, > >> * and an extent that you want to insert, deal with overlap and insert > >>- * the new extent into the tree. > >>+ * the best fitted new extent into the tree. > >> */ > >> static int merge_extent_mapping(struct extent_map_tree *em_tree, > >> struct extent_map *existing, > >> struct extent_map *em, > >> u64 map_start) > >> { > >>+ struct extent_map *prev; > >>+ struct extent_map *next; > >>+ u64 start; > >>+ u64 end; > >> u64 start_diff; > >> BUG_ON(map_start < em->start || map_start >= extent_map_end(em)); > >>- start_diff = map_start - em->start; > >>- em->start = map_start; > >>- em->len = existing->start - em->start; > >>+ > >>+ if (existing->start > map_start) { > >>+ next = existing; > >>+ prev = prev_extent_map(next); > >>+ } else { > >>+ prev = existing; > >>+ next = next_extent_map(prev); > >>+ } > >>+ > >>+ start = prev ? extent_map_end(prev) : em->start; > >>+ start = max_t(u64, start, em->start); > >>+ end = next ? next->start : extent_map_end(em); > >>+ end = min_t(u64, end, extent_map_end(em)); > >>+ start_diff = start - em->start; > >>+ em->start = start; > >>+ em->len = end - start; > >> if (em->block_start < EXTENT_MAP_LAST_BYTE && > >> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { > >> em->block_start += start_diff; > >>@@ -6482,25 +6521,21 @@ insert: > >> ret = 0; > >>- existing = lookup_extent_mapping(em_tree, start, len); > >>- if (existing && (existing->start > start || > >>- existing->start + existing->len <= start)) { > >>+ existing = search_extent_mapping(em_tree, start, len); > >>+ /* > >>+ * existing will always be non-NULL, since there must be > >>+ * extent causing the -EEXIST. > >>+ */ > >>+ if (start >= extent_map_end(existing) || > >>+ start + len <= existing->start) { > >This will introduce something wrong, the 'else' part is 'em = existing;', > >and the condition is actually > >(start < extent_map_end(existing) && start + len > existing->start), > >which means the existing overlaps with [start, start+len). > Nope, the else part is doing the right thing. > > Before the patch, going to the 'em = existing;' routine's condition > is like the following: > 1) existing returned by lookup_extent_mapping is not NULL > 2) (existing->start > start || existing->start + existing->len > <=start) is not met > > 1) implies the following condition: (in extent_map.c, > __lookup_extent_mapping()) > !!(end > existing->start && start < extent_map_end(existing)), which > is equal to the following: > start + len > existing->start(1) && start < extent_map_end(existing) (2) > > 2) is actually the following > start >= existing->start (3) && start < extent_map_end(existing) (4) > > And the hidden condition len > 0(5) > combining 1) and 2), you will find the real condition to go to 'em = > existing' routine is what the patch does. > Due to (5), (1) and (3) is the same condition, and (2) (4) is the same too. > So the patch is OK. 'em = existing' condition is not broken. Okay, but think it twice, they're not same, original: (start >= existing->start && start < extent_map_end(existing)) this patch: (start < extent_map_end(existing) && start + len > existing->start) (start + len > existing->start) doesn't equal to start >= existing->start, here is a case of (start+len > existing->start) but (start <= existing->start). |--------| -->(existing) |--------| -->[start, start+len) And calling search_extent_mapping() doesn't make sure that (start >= existing->start) is true, either. > > > > >And one of overlapping cases is (existing->start > start), ie. em->start > start, this is > >against our rule of btrfs_get_extent, > Nope again, this overlapping in fact is quite normal in multithread > random read/write. > The files's [0~16) is a preallocated one, > Thread A: > write [4K, 8K) into the file, but not committed yet. > extent map tree contains [0,16K) only > Thread B: > btrfs_get_extent() > the map_start is 8K, len is 4K as an example > grab a large em, take [0,16K), since [4K,8K) write is not committed. > comes to insert: btrfs_release_path(path); > > Thread A: > [4K, 8K) is not committed > the extent map is now [0, 4K) [4K, 8K) [8K, 16K). > > Thread B: > goes to insert: add_extent_mapping() > the [0,16K) is overlapping, and the returned existing one is [8K, 16K). > which contains the [map_start, map_start + len). So this's an example of existing->start == start (both are 8K), not existing->start > start. See __extent_writepage_io(), { ... em = epd->get_extent(inode, page, pg_offset, cur, end - cur + 1, 1); if (IS_ERR_OR_NULL(em)) { SetPageError(page); ret = PTR_ERR_OR_ZERO(em); break; } extent_offset = cur - em->start; ^^^^^^^^^^^^^^^^^^^^^^^^^^^ it needs to be (em->start <= cur) ... } thanks, -liubo > > >struct extent_map *btrfs_get_extent(...) > >{ > > [...] > > insert: > > btrfs_release_path(path); > > if (em->start > start || extent_map_end(em) <= start) { > > btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed > > [%llu %llu]", > > em->start, em->len, start, len); > > err = -EIO; > > goto out; > > } > > [...] > >} > > > >thanks, > >-liubo > > > >>+ /* > >>+ * The existing extent map is the one nearest to > >>+ * the [start, start + len) range which overlaps > >>+ */ > >>+ err = merge_extent_mapping(em_tree, existing, > >>+ em, start); > >> free_extent_map(existing); > >>- existing = NULL; > >>- } > >>- if (!existing) { > >>- existing = lookup_extent_mapping(em_tree, em->start, > >>- em->len); > >>- if (existing) { > >>- err = merge_extent_mapping(em_tree, existing, > >>- em, start); > >>- free_extent_map(existing); > >>- if (err) { > >>- free_extent_map(em); > >>- em = NULL; > >>- } > >>- } else { > >>- err = -EIO; > >>+ if (err) { > >> free_extent_map(em); > >> em = NULL; > >> } > >>-- > >>2.1.0 > >> > >>-- > >>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > >>the body of a message to majordomo@vger.kernel.org > >>More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
-------- Original Message -------- Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map From: Liu Bo <bo.li.liu@oracle.com> To: Qu Wenruo <quwenruo@cn.fujitsu.com> Date: 2014?09?18? 15:33 > On Thu, Sep 18, 2014 at 01:36:47PM +0800, Qu Wenruo wrote: >> -------- Original Message -------- >> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() >> to insert best fitted extent map >> From: Liu Bo <bo.li.liu@oracle.com> >> To: Qu Wenruo <quwenruo@cn.fujitsu.com> >> Date: 2014?09?18? 12:21 >>> On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote: >>>> The following commit enhanced the merge_extent_mapping() to reduce >>>> fragment in extent map tree, but it can't handle case which existing >>>> lies before map_start: >>>> 51f39 btrfs: Use right extent length when inserting overlap extent map. >>>> >>>> [BUG] >>>> When existing extent map's start is before map_start, >>>> the em->len will be minus, which will corrupt the extent map and fail to >>>> insert the new extent map. >>>> This will happen when someone get a large extent map, but when it is >>>> going to insert it into extent map tree, some one has already commit >>>> some write and split the huge extent into small parts. >>>> >>>> [REPRODUCER] >>>> It is very easy to tiger using filebench with randomrw personality. >>>> It is about 100% to reproduce when using 8G preallocated file in 60s >>>> randonrw test. >>>> >>>> [FIX] >>>> This patch can now handle any existing extent position. >>>> Since it does not directly use existing->start, now it will find the >>>> previous and next extent around map_start. >>>> So the old existing->start < map_start bug will never happen again. >>>> >>>> [ENHANCE] >>>> This patch will insert the best fitted extent map into extent map tree, >>>> other than the oldest [map_start, map_start + sectorsize) or the >>>> relatively newer but not perfect [map_start, existing->start). >>>> >>>> The patch will first search existing extent that does not intersects with >>>> the desired map range [map_start, map_start + len). >>>> The existing extent will be either before or behind map_start, and based >>>> on the existing extent, we can find out the previous and next extent >>>> around map_start. >>>> >>>> So the best fitted extent would be [prev->end, next->start). >>>> For prev or next is not found, em->start would be prev->end and em->end >>>> wold be next->start. >>>> >>>> With this patch, the fragment in extent map tree should be reduced much >>>> more than the 51f39 commit and reduce an unneeded extent map tree search. >>>> >>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>> --- >>>> fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++---------------- >>>> 1 file changed, 57 insertions(+), 22 deletions(-) >>>> >>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>>> index 016c403..8039021 100644 >>>> --- a/fs/btrfs/inode.c >>>> +++ b/fs/btrfs/inode.c >>>> @@ -6191,21 +6191,60 @@ out_fail_inode: >>>> goto out_fail; >>>> } >>>> +/* Find next extent map of a given extent map, caller needs to ensure locks */ >>>> +static struct extent_map *next_extent_map(struct extent_map *em) >>>> +{ >>>> + struct rb_node *next; >>>> + >>>> + next = rb_next(&em->rb_node); >>>> + if (!next) >>>> + return NULL; >>>> + return container_of(next, struct extent_map, rb_node); >>>> +} >>>> + >>>> +static struct extent_map *prev_extent_map(struct extent_map *em) >>>> +{ >>>> + struct rb_node *prev; >>>> + >>>> + prev = rb_prev(&em->rb_node); >>>> + if (!prev) >>>> + return NULL; >>>> + return container_of(prev, struct extent_map, rb_node); >>>> +} >>>> + >>>> /* helper for btfs_get_extent. Given an existing extent in the tree, >>>> + * the existing extent is the nearest extent to map_start, >>>> * and an extent that you want to insert, deal with overlap and insert >>>> - * the new extent into the tree. >>>> + * the best fitted new extent into the tree. >>>> */ >>>> static int merge_extent_mapping(struct extent_map_tree *em_tree, >>>> struct extent_map *existing, >>>> struct extent_map *em, >>>> u64 map_start) >>>> { >>>> + struct extent_map *prev; >>>> + struct extent_map *next; >>>> + u64 start; >>>> + u64 end; >>>> u64 start_diff; >>>> BUG_ON(map_start < em->start || map_start >= extent_map_end(em)); >>>> - start_diff = map_start - em->start; >>>> - em->start = map_start; >>>> - em->len = existing->start - em->start; >>>> + >>>> + if (existing->start > map_start) { >>>> + next = existing; >>>> + prev = prev_extent_map(next); >>>> + } else { >>>> + prev = existing; >>>> + next = next_extent_map(prev); >>>> + } >>>> + >>>> + start = prev ? extent_map_end(prev) : em->start; >>>> + start = max_t(u64, start, em->start); >>>> + end = next ? next->start : extent_map_end(em); >>>> + end = min_t(u64, end, extent_map_end(em)); >>>> + start_diff = start - em->start; >>>> + em->start = start; >>>> + em->len = end - start; >>>> if (em->block_start < EXTENT_MAP_LAST_BYTE && >>>> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >>>> em->block_start += start_diff; >>>> @@ -6482,25 +6521,21 @@ insert: >>>> ret = 0; >>>> - existing = lookup_extent_mapping(em_tree, start, len); >>>> - if (existing && (existing->start > start || >>>> - existing->start + existing->len <= start)) { >>>> + existing = search_extent_mapping(em_tree, start, len); >>>> + /* >>>> + * existing will always be non-NULL, since there must be >>>> + * extent causing the -EEXIST. >>>> + */ >>>> + if (start >= extent_map_end(existing) || >>>> + start + len <= existing->start) { >>> This will introduce something wrong, the 'else' part is 'em = existing;', >>> and the condition is actually >>> (start < extent_map_end(existing) && start + len > existing->start), >>> which means the existing overlaps with [start, start+len). >> Nope, the else part is doing the right thing. >> >> Before the patch, going to the 'em = existing;' routine's condition >> is like the following: >> 1) existing returned by lookup_extent_mapping is not NULL >> 2) (existing->start > start || existing->start + existing->len >> <=start) is not met >> >> 1) implies the following condition: (in extent_map.c, >> __lookup_extent_mapping()) >> !!(end > existing->start && start < extent_map_end(existing)), which >> is equal to the following: >> start + len > existing->start(1) && start < extent_map_end(existing) (2) >> >> 2) is actually the following >> start >= existing->start (3) && start < extent_map_end(existing) (4) >> >> And the hidden condition len > 0(5) >> combining 1) and 2), you will find the real condition to go to 'em = >> existing' routine is what the patch does. >> Due to (5), (1) and (3) is the same condition, and (2) (4) is the same too. >> So the patch is OK. 'em = existing' condition is not broken. > Okay, but think it twice, they're not same, > > original: (start >= existing->start && start < extent_map_end(existing)) > this patch: (start < extent_map_end(existing) && start + len > existing->start) > > (start + len > existing->start) doesn't equal to start >= existing->start, > here is a case of (start+len > existing->start) but (start <= existing->start). > > |--------| -->(existing) > |--------| -->[start, start+len) > > And calling search_extent_mapping() doesn't make sure that > (start >= existing->start) is true, either. All right, that case is right and I'm wrong. I'll change the if() to use start >= existing->start. BTW, Any other problem? Thanks, Qu > >>> And one of overlapping cases is (existing->start > start), ie. em->start > start, this is >>> against our rule of btrfs_get_extent, >> Nope again, this overlapping in fact is quite normal in multithread >> random read/write. >> The files's [0~16) is a preallocated one, >> Thread A: >> write [4K, 8K) into the file, but not committed yet. >> extent map tree contains [0,16K) only >> Thread B: >> btrfs_get_extent() >> the map_start is 8K, len is 4K as an example >> grab a large em, take [0,16K), since [4K,8K) write is not committed. >> comes to insert: btrfs_release_path(path); >> >> Thread A: >> [4K, 8K) is not committed >> the extent map is now [0, 4K) [4K, 8K) [8K, 16K). >> >> Thread B: >> goes to insert: add_extent_mapping() >> the [0,16K) is overlapping, and the returned existing one is [8K, 16K). >> which contains the [map_start, map_start + len). > So this's an example of existing->start == start (both are 8K), not > existing->start > start. > > See __extent_writepage_io(), > > { > ... > em = epd->get_extent(inode, page, pg_offset, cur, > end - cur + 1, 1); > if (IS_ERR_OR_NULL(em)) { > SetPageError(page); > ret = PTR_ERR_OR_ZERO(em); > break; > } > > extent_offset = cur - em->start; > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ it needs to be (em->start <= cur) > > ... > } > > thanks, > -liubo > >>> struct extent_map *btrfs_get_extent(...) >>> { >>> [...] >>> insert: >>> btrfs_release_path(path); >>> if (em->start > start || extent_map_end(em) <= start) { >>> btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed >>> [%llu %llu]", >>> em->start, em->len, start, len); >>> err = -EIO; >>> goto out; >>> } >>> [...] >>> } >>> >>> thanks, >>> -liubo >>> >>>> + /* >>>> + * The existing extent map is the one nearest to >>>> + * the [start, start + len) range which overlaps >>>> + */ >>>> + err = merge_extent_mapping(em_tree, existing, >>>> + em, start); >>>> free_extent_map(existing); >>>> - existing = NULL; >>>> - } >>>> - if (!existing) { >>>> - existing = lookup_extent_mapping(em_tree, em->start, >>>> - em->len); >>>> - if (existing) { >>>> - err = merge_extent_mapping(em_tree, existing, >>>> - em, start); >>>> - free_extent_map(existing); >>>> - if (err) { >>>> - free_extent_map(em); >>>> - em = NULL; >>>> - } >>>> - } else { >>>> - err = -EIO; >>>> + if (err) { >>>> free_extent_map(em); >>>> em = NULL; >>>> } >>>> -- >>>> 2.1.0 >>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 18, 2014 at 03:58:26PM +0800, Qu Wenruo wrote: > [...] > > > >original: (start >= existing->start && start < extent_map_end(existing)) > >this patch: (start < extent_map_end(existing) && start + len > existing->start) > > > >(start + len > existing->start) doesn't equal to start >= existing->start, > >here is a case of (start+len > existing->start) but (start <= existing->start). > > > > |--------| -->(existing) > > |--------| -->[start, start+len) > > > >And calling search_extent_mapping() doesn't make sure that > >(start >= existing->start) is true, either. > All right, that case is right and I'm wrong. > I'll change the if() to use start >= existing->start. > > BTW, Any other problem? Others look good. thanks, -liubo > > Thanks, > Qu > > > >>>And one of overlapping cases is (existing->start > start), ie. em->start > start, this is > >>>against our rule of btrfs_get_extent, > >>Nope again, this overlapping in fact is quite normal in multithread > >>random read/write. > >>The files's [0~16) is a preallocated one, > >>Thread A: > >> write [4K, 8K) into the file, but not committed yet. > >> extent map tree contains [0,16K) only > >>Thread B: > >>btrfs_get_extent() > >> the map_start is 8K, len is 4K as an example > >> grab a large em, take [0,16K), since [4K,8K) write is not committed. > >> comes to insert: btrfs_release_path(path); > >> > >>Thread A: > >> [4K, 8K) is not committed > >> the extent map is now [0, 4K) [4K, 8K) [8K, 16K). > >> > >>Thread B: > >> goes to insert: add_extent_mapping() > >> the [0,16K) is overlapping, and the returned existing one is [8K, 16K). > >> which contains the [map_start, map_start + len). > >So this's an example of existing->start == start (both are 8K), not > >existing->start > start. > > > >See __extent_writepage_io(), > > > >{ > > ... > > em = epd->get_extent(inode, page, pg_offset, cur, > > end - cur + 1, 1); > > if (IS_ERR_OR_NULL(em)) { > > SetPageError(page); > > ret = PTR_ERR_OR_ZERO(em); > > break; > > } > > extent_offset = cur - em->start; > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ it needs to be (em->start <= cur) > > > > ... > >} > > > >thanks, > >-liubo > > > >>>struct extent_map *btrfs_get_extent(...) > >>>{ > >>> [...] > >>> insert: > >>> btrfs_release_path(path); > >>> if (em->start > start || extent_map_end(em) <= start) { > >>> btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed > >>> [%llu %llu]", > >>> em->start, em->len, start, len); > >>> err = -EIO; > >>> goto out; > >>> } > >>> [...] > >>>} > >>> > >>>thanks, > >>>-liubo > >>> > >>>>+ /* > >>>>+ * The existing extent map is the one nearest to > >>>>+ * the [start, start + len) range which overlaps > >>>>+ */ > >>>>+ err = merge_extent_mapping(em_tree, existing, > >>>>+ em, start); > >>>> free_extent_map(existing); > >>>>- existing = NULL; > >>>>- } > >>>>- if (!existing) { > >>>>- existing = lookup_extent_mapping(em_tree, em->start, > >>>>- em->len); > >>>>- if (existing) { > >>>>- err = merge_extent_mapping(em_tree, existing, > >>>>- em, start); > >>>>- free_extent_map(existing); > >>>>- if (err) { > >>>>- free_extent_map(em); > >>>>- em = NULL; > >>>>- } > >>>>- } else { > >>>>- err = -EIO; > >>>>+ if (err) { > >>>> free_extent_map(em); > >>>> em = NULL; > >>>> } > >>>>-- > >>>>2.1.0 > >>>> > >>>>-- > >>>>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > >>>>the body of a message to majordomo@vger.kernel.org > >>>>More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
-------- Original Message -------- Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map From: Liu Bo <bo.li.liu@oracle.com> To: Qu Wenruo <quwenruo@cn.fujitsu.com> Date: 2014?09?18? 16:20 > On Thu, Sep 18, 2014 at 03:58:26PM +0800, Qu Wenruo wrote: > [...] >>> original: (start >= existing->start && start < extent_map_end(existing)) >>> this patch: (start < extent_map_end(existing) && start + len > existing->start) >>> >>> (start + len > existing->start) doesn't equal to start >= existing->start, >>> here is a case of (start+len > existing->start) but (start <= existing->start). >>> >>> |--------| -->(existing) >>> |--------| -->[start, start+len) >>> >>> And calling search_extent_mapping() doesn't make sure that >>> (start >= existing->start) is true, either. >> All right, that case is right and I'm wrong. >> I'll change the if() to use start >= existing->start. >> >> BTW, Any other problem? > Others look good. > > thanks, > -liubo Would you mind me adding your reviewed-by? Thanks Qu >> Thanks, >> Qu >>>>> And one of overlapping cases is (existing->start > start), ie. em->start > start, this is >>>>> against our rule of btrfs_get_extent, >>>> Nope again, this overlapping in fact is quite normal in multithread >>>> random read/write. >>>> The files's [0~16) is a preallocated one, >>>> Thread A: >>>> write [4K, 8K) into the file, but not committed yet. >>>> extent map tree contains [0,16K) only >>>> Thread B: >>>> btrfs_get_extent() >>>> the map_start is 8K, len is 4K as an example >>>> grab a large em, take [0,16K), since [4K,8K) write is not committed. >>>> comes to insert: btrfs_release_path(path); >>>> >>>> Thread A: >>>> [4K, 8K) is not committed >>>> the extent map is now [0, 4K) [4K, 8K) [8K, 16K). >>>> >>>> Thread B: >>>> goes to insert: add_extent_mapping() >>>> the [0,16K) is overlapping, and the returned existing one is [8K, 16K). >>>> which contains the [map_start, map_start + len). >>> So this's an example of existing->start == start (both are 8K), not >>> existing->start > start. >>> >>> See __extent_writepage_io(), >>> >>> { >>> ... >>> em = epd->get_extent(inode, page, pg_offset, cur, >>> end - cur + 1, 1); >>> if (IS_ERR_OR_NULL(em)) { >>> SetPageError(page); >>> ret = PTR_ERR_OR_ZERO(em); >>> break; >>> } >>> extent_offset = cur - em->start; >>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^ it needs to be (em->start <= cur) >>> >>> ... >>> } >>> >>> thanks, >>> -liubo >>> >>>>> struct extent_map *btrfs_get_extent(...) >>>>> { >>>>> [...] >>>>> insert: >>>>> btrfs_release_path(path); >>>>> if (em->start > start || extent_map_end(em) <= start) { >>>>> btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed >>>>> [%llu %llu]", >>>>> em->start, em->len, start, len); >>>>> err = -EIO; >>>>> goto out; >>>>> } >>>>> [...] >>>>> } >>>>> >>>>> thanks, >>>>> -liubo >>>>> >>>>>> + /* >>>>>> + * The existing extent map is the one nearest to >>>>>> + * the [start, start + len) range which overlaps >>>>>> + */ >>>>>> + err = merge_extent_mapping(em_tree, existing, >>>>>> + em, start); >>>>>> free_extent_map(existing); >>>>>> - existing = NULL; >>>>>> - } >>>>>> - if (!existing) { >>>>>> - existing = lookup_extent_mapping(em_tree, em->start, >>>>>> - em->len); >>>>>> - if (existing) { >>>>>> - err = merge_extent_mapping(em_tree, existing, >>>>>> - em, start); >>>>>> - free_extent_map(existing); >>>>>> - if (err) { >>>>>> - free_extent_map(em); >>>>>> - em = NULL; >>>>>> - } >>>>>> - } else { >>>>>> - err = -EIO; >>>>>> + if (err) { >>>>>> free_extent_map(em); >>>>>> em = NULL; >>>>>> } >>>>>> -- >>>>>> 2.1.0 >>>>>> >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>>>>> the body of a message to majordomo@vger.kernel.org >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 18, 2014 at 04:24:48PM +0800, Qu Wenruo wrote: > > -------- Original Message -------- > Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() > to insert best fitted extent map > From: Liu Bo <bo.li.liu@oracle.com> > To: Qu Wenruo <quwenruo@cn.fujitsu.com> > Date: 2014?09?18? 16:20 > >On Thu, Sep 18, 2014 at 03:58:26PM +0800, Qu Wenruo wrote: > >[...] > >>>original: (start >= existing->start && start < extent_map_end(existing)) > >>>this patch: (start < extent_map_end(existing) && start + len > existing->start) > >>> > >>>(start + len > existing->start) doesn't equal to start >= existing->start, > >>>here is a case of (start+len > existing->start) but (start <= existing->start). > >>> > >>> |--------| -->(existing) > >>> |--------| -->[start, start+len) > >>> > >>>And calling search_extent_mapping() doesn't make sure that > >>>(start >= existing->start) is true, either. > >>All right, that case is right and I'm wrong. > >>I'll change the if() to use start >= existing->start. > >> > >>BTW, Any other problem? > >Others look good. > > > >thanks, > >-liubo > Would you mind me adding your reviewed-by? > Sure. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> thanks, -liubo > Thanks > Qu > >>Thanks, > >>Qu > >>>>>And one of overlapping cases is (existing->start > start), ie. em->start > start, this is > >>>>>against our rule of btrfs_get_extent, > >>>>Nope again, this overlapping in fact is quite normal in multithread > >>>>random read/write. > >>>>The files's [0~16) is a preallocated one, > >>>>Thread A: > >>>> write [4K, 8K) into the file, but not committed yet. > >>>> extent map tree contains [0,16K) only > >>>>Thread B: > >>>>btrfs_get_extent() > >>>> the map_start is 8K, len is 4K as an example > >>>> grab a large em, take [0,16K), since [4K,8K) write is not committed. > >>>> comes to insert: btrfs_release_path(path); > >>>> > >>>>Thread A: > >>>> [4K, 8K) is not committed > >>>> the extent map is now [0, 4K) [4K, 8K) [8K, 16K). > >>>> > >>>>Thread B: > >>>> goes to insert: add_extent_mapping() > >>>> the [0,16K) is overlapping, and the returned existing one is [8K, 16K). > >>>> which contains the [map_start, map_start + len). > >>>So this's an example of existing->start == start (both are 8K), not > >>>existing->start > start. > >>> > >>>See __extent_writepage_io(), > >>> > >>>{ > >>> ... > >>> em = epd->get_extent(inode, page, pg_offset, cur, > >>> end - cur + 1, 1); > >>> if (IS_ERR_OR_NULL(em)) { > >>> SetPageError(page); > >>> ret = PTR_ERR_OR_ZERO(em); > >>> break; > >>> } > >>> extent_offset = cur - em->start; > >>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^ it needs to be (em->start <= cur) > >>> > >>> ... > >>>} > >>> > >>>thanks, > >>>-liubo > >>> > >>>>>struct extent_map *btrfs_get_extent(...) > >>>>>{ > >>>>> [...] > >>>>> insert: > >>>>> btrfs_release_path(path); > >>>>> if (em->start > start || extent_map_end(em) <= start) { > >>>>> btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed > >>>>> [%llu %llu]", > >>>>> em->start, em->len, start, len); > >>>>> err = -EIO; > >>>>> goto out; > >>>>> } > >>>>> [...] > >>>>>} > >>>>> > >>>>>thanks, > >>>>>-liubo > >>>>> > >>>>>>+ /* > >>>>>>+ * The existing extent map is the one nearest to > >>>>>>+ * the [start, start + len) range which overlaps > >>>>>>+ */ > >>>>>>+ err = merge_extent_mapping(em_tree, existing, > >>>>>>+ em, start); > >>>>>> free_extent_map(existing); > >>>>>>- existing = NULL; > >>>>>>- } > >>>>>>- if (!existing) { > >>>>>>- existing = lookup_extent_mapping(em_tree, em->start, > >>>>>>- em->len); > >>>>>>- if (existing) { > >>>>>>- err = merge_extent_mapping(em_tree, existing, > >>>>>>- em, start); > >>>>>>- free_extent_map(existing); > >>>>>>- if (err) { > >>>>>>- free_extent_map(em); > >>>>>>- em = NULL; > >>>>>>- } > >>>>>>- } else { > >>>>>>- err = -EIO; > >>>>>>+ if (err) { > >>>>>> free_extent_map(em); > >>>>>> em = NULL; > >>>>>> } > >>>>>>-- > >>>>>>2.1.0 > >>>>>> > >>>>>>-- > >>>>>>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > >>>>>>the body of a message to majordomo@vger.kernel.org > >>>>>>More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > The following commit enhanced the merge_extent_mapping() to reduce > fragment in extent map tree, but it can't handle case which existing > lies before map_start: > 51f39 btrfs: Use right extent length when inserting overlap extent map. > > [BUG] > When existing extent map's start is before map_start, > the em->len will be minus, which will corrupt the extent map and fail to > insert the new extent map. > This will happen when someone get a large extent map, but when it is > going to insert it into extent map tree, some one has already commit > some write and split the huge extent into small parts. This sounds like very deterministic to me. Any reason to not add tests to the sanity tests that exercise this/these case/cases? Thanks > > [REPRODUCER] > It is very easy to tiger using filebench with randomrw personality. > It is about 100% to reproduce when using 8G preallocated file in 60s > randonrw test. > > [FIX] > This patch can now handle any existing extent position. > Since it does not directly use existing->start, now it will find the > previous and next extent around map_start. > So the old existing->start < map_start bug will never happen again. > > [ENHANCE] > This patch will insert the best fitted extent map into extent map tree, > other than the oldest [map_start, map_start + sectorsize) or the > relatively newer but not perfect [map_start, existing->start). > > The patch will first search existing extent that does not intersects with > the desired map range [map_start, map_start + len). > The existing extent will be either before or behind map_start, and based > on the existing extent, we can find out the previous and next extent > around map_start. > > So the best fitted extent would be [prev->end, next->start). > For prev or next is not found, em->start would be prev->end and em->end > wold be next->start. > > With this patch, the fragment in extent map tree should be reduced much > more than the 51f39 commit and reduce an unneeded extent map tree search. > > Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > --- > fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++---------------- > 1 file changed, 57 insertions(+), 22 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 016c403..8039021 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -6191,21 +6191,60 @@ out_fail_inode: > goto out_fail; > } > > +/* Find next extent map of a given extent map, caller needs to ensure locks */ > +static struct extent_map *next_extent_map(struct extent_map *em) > +{ > + struct rb_node *next; > + > + next = rb_next(&em->rb_node); > + if (!next) > + return NULL; > + return container_of(next, struct extent_map, rb_node); > +} > + > +static struct extent_map *prev_extent_map(struct extent_map *em) > +{ > + struct rb_node *prev; > + > + prev = rb_prev(&em->rb_node); > + if (!prev) > + return NULL; > + return container_of(prev, struct extent_map, rb_node); > +} > + > /* helper for btfs_get_extent. Given an existing extent in the tree, > + * the existing extent is the nearest extent to map_start, > * and an extent that you want to insert, deal with overlap and insert > - * the new extent into the tree. > + * the best fitted new extent into the tree. > */ > static int merge_extent_mapping(struct extent_map_tree *em_tree, > struct extent_map *existing, > struct extent_map *em, > u64 map_start) > { > + struct extent_map *prev; > + struct extent_map *next; > + u64 start; > + u64 end; > u64 start_diff; > > BUG_ON(map_start < em->start || map_start >= extent_map_end(em)); > - start_diff = map_start - em->start; > - em->start = map_start; > - em->len = existing->start - em->start; > + > + if (existing->start > map_start) { > + next = existing; > + prev = prev_extent_map(next); > + } else { > + prev = existing; > + next = next_extent_map(prev); > + } > + > + start = prev ? extent_map_end(prev) : em->start; > + start = max_t(u64, start, em->start); > + end = next ? next->start : extent_map_end(em); > + end = min_t(u64, end, extent_map_end(em)); > + start_diff = start - em->start; > + em->start = start; > + em->len = end - start; > if (em->block_start < EXTENT_MAP_LAST_BYTE && > !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { > em->block_start += start_diff; > @@ -6482,25 +6521,21 @@ insert: > > ret = 0; > > - existing = lookup_extent_mapping(em_tree, start, len); > - if (existing && (existing->start > start || > - existing->start + existing->len <= start)) { > + existing = search_extent_mapping(em_tree, start, len); > + /* > + * existing will always be non-NULL, since there must be > + * extent causing the -EEXIST. > + */ > + if (start >= extent_map_end(existing) || > + start + len <= existing->start) { > + /* > + * The existing extent map is the one nearest to > + * the [start, start + len) range which overlaps > + */ > + err = merge_extent_mapping(em_tree, existing, > + em, start); > free_extent_map(existing); > - existing = NULL; > - } > - if (!existing) { > - existing = lookup_extent_mapping(em_tree, em->start, > - em->len); > - if (existing) { > - err = merge_extent_mapping(em_tree, existing, > - em, start); > - free_extent_map(existing); > - if (err) { > - free_extent_map(em); > - em = NULL; > - } > - } else { > - err = -EIO; > + if (err) { > free_extent_map(em); > em = NULL; > } > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
-------- Original Message -------- Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map From: Filipe David Manana <fdmanana@gmail.com> To: Qu Wenruo <quwenruo@cn.fujitsu.com> Date: 2014?09?18? 21:16 > On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> The following commit enhanced the merge_extent_mapping() to reduce >> fragment in extent map tree, but it can't handle case which existing >> lies before map_start: >> 51f39 btrfs: Use right extent length when inserting overlap extent map. >> >> [BUG] >> When existing extent map's start is before map_start, >> the em->len will be minus, which will corrupt the extent map and fail to >> insert the new extent map. >> This will happen when someone get a large extent map, but when it is >> going to insert it into extent map tree, some one has already commit >> some write and split the huge extent into small parts. > This sounds like very deterministic to me. > Any reason to not add tests to the sanity tests that exercise > this/these case/cases? Yes, thanks for the informing. Will add the test case for it soon. Thanks, Qu > > Thanks > >> [REPRODUCER] >> It is very easy to tiger using filebench with randomrw personality. >> It is about 100% to reproduce when using 8G preallocated file in 60s >> randonrw test. >> >> [FIX] >> This patch can now handle any existing extent position. >> Since it does not directly use existing->start, now it will find the >> previous and next extent around map_start. >> So the old existing->start < map_start bug will never happen again. >> >> [ENHANCE] >> This patch will insert the best fitted extent map into extent map tree, >> other than the oldest [map_start, map_start + sectorsize) or the >> relatively newer but not perfect [map_start, existing->start). >> >> The patch will first search existing extent that does not intersects with >> the desired map range [map_start, map_start + len). >> The existing extent will be either before or behind map_start, and based >> on the existing extent, we can find out the previous and next extent >> around map_start. >> >> So the best fitted extent would be [prev->end, next->start). >> For prev or next is not found, em->start would be prev->end and em->end >> wold be next->start. >> >> With this patch, the fragment in extent map tree should be reduced much >> more than the 51f39 commit and reduce an unneeded extent map tree search. >> >> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >> --- >> fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++---------------- >> 1 file changed, 57 insertions(+), 22 deletions(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 016c403..8039021 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -6191,21 +6191,60 @@ out_fail_inode: >> goto out_fail; >> } >> >> +/* Find next extent map of a given extent map, caller needs to ensure locks */ >> +static struct extent_map *next_extent_map(struct extent_map *em) >> +{ >> + struct rb_node *next; >> + >> + next = rb_next(&em->rb_node); >> + if (!next) >> + return NULL; >> + return container_of(next, struct extent_map, rb_node); >> +} >> + >> +static struct extent_map *prev_extent_map(struct extent_map *em) >> +{ >> + struct rb_node *prev; >> + >> + prev = rb_prev(&em->rb_node); >> + if (!prev) >> + return NULL; >> + return container_of(prev, struct extent_map, rb_node); >> +} >> + >> /* helper for btfs_get_extent. Given an existing extent in the tree, >> + * the existing extent is the nearest extent to map_start, >> * and an extent that you want to insert, deal with overlap and insert >> - * the new extent into the tree. >> + * the best fitted new extent into the tree. >> */ >> static int merge_extent_mapping(struct extent_map_tree *em_tree, >> struct extent_map *existing, >> struct extent_map *em, >> u64 map_start) >> { >> + struct extent_map *prev; >> + struct extent_map *next; >> + u64 start; >> + u64 end; >> u64 start_diff; >> >> BUG_ON(map_start < em->start || map_start >= extent_map_end(em)); >> - start_diff = map_start - em->start; >> - em->start = map_start; >> - em->len = existing->start - em->start; >> + >> + if (existing->start > map_start) { >> + next = existing; >> + prev = prev_extent_map(next); >> + } else { >> + prev = existing; >> + next = next_extent_map(prev); >> + } >> + >> + start = prev ? extent_map_end(prev) : em->start; >> + start = max_t(u64, start, em->start); >> + end = next ? next->start : extent_map_end(em); >> + end = min_t(u64, end, extent_map_end(em)); >> + start_diff = start - em->start; >> + em->start = start; >> + em->len = end - start; >> if (em->block_start < EXTENT_MAP_LAST_BYTE && >> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >> em->block_start += start_diff; >> @@ -6482,25 +6521,21 @@ insert: >> >> ret = 0; >> >> - existing = lookup_extent_mapping(em_tree, start, len); >> - if (existing && (existing->start > start || >> - existing->start + existing->len <= start)) { >> + existing = search_extent_mapping(em_tree, start, len); >> + /* >> + * existing will always be non-NULL, since there must be >> + * extent causing the -EEXIST. >> + */ >> + if (start >= extent_map_end(existing) || >> + start + len <= existing->start) { >> + /* >> + * The existing extent map is the one nearest to >> + * the [start, start + len) range which overlaps >> + */ >> + err = merge_extent_mapping(em_tree, existing, >> + em, start); >> free_extent_map(existing); >> - existing = NULL; >> - } >> - if (!existing) { >> - existing = lookup_extent_mapping(em_tree, em->start, >> - em->len); >> - if (existing) { >> - err = merge_extent_mapping(em_tree, existing, >> - em, start); >> - free_extent_map(existing); >> - if (err) { >> - free_extent_map(em); >> - em = NULL; >> - } >> - } else { >> - err = -EIO; >> + if (err) { >> free_extent_map(em); >> em = NULL; >> } >> -- >> 2.1.0 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > -------- Original Message -------- > Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert > best fitted extent map > From: Filipe David Manana <fdmanana@gmail.com> > To: Qu Wenruo <quwenruo@cn.fujitsu.com> > Date: 2014?09?18? 21:16 >> >> On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> >> wrote: >>> >>> The following commit enhanced the merge_extent_mapping() to reduce >>> fragment in extent map tree, but it can't handle case which existing >>> lies before map_start: >>> 51f39 btrfs: Use right extent length when inserting overlap extent map. >>> >>> [BUG] >>> When existing extent map's start is before map_start, >>> the em->len will be minus, which will corrupt the extent map and fail to >>> insert the new extent map. >>> This will happen when someone get a large extent map, but when it is >>> going to insert it into extent map tree, some one has already commit >>> some write and split the huge extent into small parts. >> >> This sounds like very deterministic to me. >> Any reason to not add tests to the sanity tests that exercise >> this/these case/cases? > > Yes, thanks for the informing. > Will add the test case for it soon. Hi Qu, Any progress on the test? This is a very important one IMHO, not only because of the bad consequences of the bug (extent map corruption, leading to all sorts of chaos), but also because this problem was not found by the full xfstests suite on several developer machines. thanks > > Thanks, > Qu > >> >> Thanks >> >>> [REPRODUCER] >>> It is very easy to tiger using filebench with randomrw personality. >>> It is about 100% to reproduce when using 8G preallocated file in 60s >>> randonrw test. >>> >>> [FIX] >>> This patch can now handle any existing extent position. >>> Since it does not directly use existing->start, now it will find the >>> previous and next extent around map_start. >>> So the old existing->start < map_start bug will never happen again. >>> >>> [ENHANCE] >>> This patch will insert the best fitted extent map into extent map tree, >>> other than the oldest [map_start, map_start + sectorsize) or the >>> relatively newer but not perfect [map_start, existing->start). >>> >>> The patch will first search existing extent that does not intersects with >>> the desired map range [map_start, map_start + len). >>> The existing extent will be either before or behind map_start, and based >>> on the existing extent, we can find out the previous and next extent >>> around map_start. >>> >>> So the best fitted extent would be [prev->end, next->start). >>> For prev or next is not found, em->start would be prev->end and em->end >>> wold be next->start. >>> >>> With this patch, the fragment in extent map tree should be reduced much >>> more than the 51f39 commit and reduce an unneeded extent map tree search. >>> >>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>> --- >>> fs/btrfs/inode.c | 79 >>> ++++++++++++++++++++++++++++++++++++++++---------------- >>> 1 file changed, 57 insertions(+), 22 deletions(-) >>> >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>> index 016c403..8039021 100644 >>> --- a/fs/btrfs/inode.c >>> +++ b/fs/btrfs/inode.c >>> @@ -6191,21 +6191,60 @@ out_fail_inode: >>> goto out_fail; >>> } >>> >>> +/* Find next extent map of a given extent map, caller needs to ensure >>> locks */ >>> +static struct extent_map *next_extent_map(struct extent_map *em) >>> +{ >>> + struct rb_node *next; >>> + >>> + next = rb_next(&em->rb_node); >>> + if (!next) >>> + return NULL; >>> + return container_of(next, struct extent_map, rb_node); >>> +} >>> + >>> +static struct extent_map *prev_extent_map(struct extent_map *em) >>> +{ >>> + struct rb_node *prev; >>> + >>> + prev = rb_prev(&em->rb_node); >>> + if (!prev) >>> + return NULL; >>> + return container_of(prev, struct extent_map, rb_node); >>> +} >>> + >>> /* helper for btfs_get_extent. Given an existing extent in the tree, >>> + * the existing extent is the nearest extent to map_start, >>> * and an extent that you want to insert, deal with overlap and insert >>> - * the new extent into the tree. >>> + * the best fitted new extent into the tree. >>> */ >>> static int merge_extent_mapping(struct extent_map_tree *em_tree, >>> struct extent_map *existing, >>> struct extent_map *em, >>> u64 map_start) >>> { >>> + struct extent_map *prev; >>> + struct extent_map *next; >>> + u64 start; >>> + u64 end; >>> u64 start_diff; >>> >>> BUG_ON(map_start < em->start || map_start >= >>> extent_map_end(em)); >>> - start_diff = map_start - em->start; >>> - em->start = map_start; >>> - em->len = existing->start - em->start; >>> + >>> + if (existing->start > map_start) { >>> + next = existing; >>> + prev = prev_extent_map(next); >>> + } else { >>> + prev = existing; >>> + next = next_extent_map(prev); >>> + } >>> + >>> + start = prev ? extent_map_end(prev) : em->start; >>> + start = max_t(u64, start, em->start); >>> + end = next ? next->start : extent_map_end(em); >>> + end = min_t(u64, end, extent_map_end(em)); >>> + start_diff = start - em->start; >>> + em->start = start; >>> + em->len = end - start; >>> if (em->block_start < EXTENT_MAP_LAST_BYTE && >>> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >>> em->block_start += start_diff; >>> @@ -6482,25 +6521,21 @@ insert: >>> >>> ret = 0; >>> >>> - existing = lookup_extent_mapping(em_tree, start, len); >>> - if (existing && (existing->start > start || >>> - existing->start + existing->len <= start)) { >>> + existing = search_extent_mapping(em_tree, start, len); >>> + /* >>> + * existing will always be non-NULL, since there must be >>> + * extent causing the -EEXIST. >>> + */ >>> + if (start >= extent_map_end(existing) || >>> + start + len <= existing->start) { >>> + /* >>> + * The existing extent map is the one nearest to >>> + * the [start, start + len) range which overlaps >>> + */ >>> + err = merge_extent_mapping(em_tree, existing, >>> + em, start); >>> free_extent_map(existing); >>> - existing = NULL; >>> - } >>> - if (!existing) { >>> - existing = lookup_extent_mapping(em_tree, >>> em->start, >>> - em->len); >>> - if (existing) { >>> - err = merge_extent_mapping(em_tree, >>> existing, >>> - em, start); >>> - free_extent_map(existing); >>> - if (err) { >>> - free_extent_map(em); >>> - em = NULL; >>> - } >>> - } else { >>> - err = -EIO; >>> + if (err) { >>> free_extent_map(em); >>> em = NULL; >>> } >>> -- >>> 2.1.0 >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> >
-------- Original Message -------- Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map From: Filipe David Manana <fdmanana@gmail.com> To: Qu Wenruo <quwenruo@cn.fujitsu.com> Date: 2014?10?08? 20:08 > On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> -------- Original Message -------- >> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert >> best fitted extent map >> From: Filipe David Manana <fdmanana@gmail.com> >> To: Qu Wenruo <quwenruo@cn.fujitsu.com> >> Date: 2014?09?18? 21:16 >>> On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> >>> wrote: >>>> The following commit enhanced the merge_extent_mapping() to reduce >>>> fragment in extent map tree, but it can't handle case which existing >>>> lies before map_start: >>>> 51f39 btrfs: Use right extent length when inserting overlap extent map. >>>> >>>> [BUG] >>>> When existing extent map's start is before map_start, >>>> the em->len will be minus, which will corrupt the extent map and fail to >>>> insert the new extent map. >>>> This will happen when someone get a large extent map, but when it is >>>> going to insert it into extent map tree, some one has already commit >>>> some write and split the huge extent into small parts. >>> This sounds like very deterministic to me. >>> Any reason to not add tests to the sanity tests that exercise >>> this/these case/cases? >> Yes, thanks for the informing. >> Will add the test case for it soon. > Hi Qu, > > Any progress on the test? > > This is a very important one IMHO, not only because of the bad > consequences of the bug (extent map corruption, leading to all sorts > of chaos), but also because this problem was not found by the full > xfstests suite on several developer machines. > > thanks Still trying to reproduce it under xfstest framework. But even followiiing the FileBench randomrw behavior(1 thread random read 1 thread random write on preallocated space), I still failed to reproduce it. Still investigating how to reproduce it. Worst case may be add a new C program into src of xfstests? Thanks, Qu > >> Thanks, >> Qu >> >>> Thanks >>> >>>> [REPRODUCER] >>>> It is very easy to tiger using filebench with randomrw personality. >>>> It is about 100% to reproduce when using 8G preallocated file in 60s >>>> randonrw test. >>>> >>>> [FIX] >>>> This patch can now handle any existing extent position. >>>> Since it does not directly use existing->start, now it will find the >>>> previous and next extent around map_start. >>>> So the old existing->start < map_start bug will never happen again. >>>> >>>> [ENHANCE] >>>> This patch will insert the best fitted extent map into extent map tree, >>>> other than the oldest [map_start, map_start + sectorsize) or the >>>> relatively newer but not perfect [map_start, existing->start). >>>> >>>> The patch will first search existing extent that does not intersects with >>>> the desired map range [map_start, map_start + len). >>>> The existing extent will be either before or behind map_start, and based >>>> on the existing extent, we can find out the previous and next extent >>>> around map_start. >>>> >>>> So the best fitted extent would be [prev->end, next->start). >>>> For prev or next is not found, em->start would be prev->end and em->end >>>> wold be next->start. >>>> >>>> With this patch, the fragment in extent map tree should be reduced much >>>> more than the 51f39 commit and reduce an unneeded extent map tree search. >>>> >>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>> --- >>>> fs/btrfs/inode.c | 79 >>>> ++++++++++++++++++++++++++++++++++++++++---------------- >>>> 1 file changed, 57 insertions(+), 22 deletions(-) >>>> >>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>>> index 016c403..8039021 100644 >>>> --- a/fs/btrfs/inode.c >>>> +++ b/fs/btrfs/inode.c >>>> @@ -6191,21 +6191,60 @@ out_fail_inode: >>>> goto out_fail; >>>> } >>>> >>>> +/* Find next extent map of a given extent map, caller needs to ensure >>>> locks */ >>>> +static struct extent_map *next_extent_map(struct extent_map *em) >>>> +{ >>>> + struct rb_node *next; >>>> + >>>> + next = rb_next(&em->rb_node); >>>> + if (!next) >>>> + return NULL; >>>> + return container_of(next, struct extent_map, rb_node); >>>> +} >>>> + >>>> +static struct extent_map *prev_extent_map(struct extent_map *em) >>>> +{ >>>> + struct rb_node *prev; >>>> + >>>> + prev = rb_prev(&em->rb_node); >>>> + if (!prev) >>>> + return NULL; >>>> + return container_of(prev, struct extent_map, rb_node); >>>> +} >>>> + >>>> /* helper for btfs_get_extent. Given an existing extent in the tree, >>>> + * the existing extent is the nearest extent to map_start, >>>> * and an extent that you want to insert, deal with overlap and insert >>>> - * the new extent into the tree. >>>> + * the best fitted new extent into the tree. >>>> */ >>>> static int merge_extent_mapping(struct extent_map_tree *em_tree, >>>> struct extent_map *existing, >>>> struct extent_map *em, >>>> u64 map_start) >>>> { >>>> + struct extent_map *prev; >>>> + struct extent_map *next; >>>> + u64 start; >>>> + u64 end; >>>> u64 start_diff; >>>> >>>> BUG_ON(map_start < em->start || map_start >= >>>> extent_map_end(em)); >>>> - start_diff = map_start - em->start; >>>> - em->start = map_start; >>>> - em->len = existing->start - em->start; >>>> + >>>> + if (existing->start > map_start) { >>>> + next = existing; >>>> + prev = prev_extent_map(next); >>>> + } else { >>>> + prev = existing; >>>> + next = next_extent_map(prev); >>>> + } >>>> + >>>> + start = prev ? extent_map_end(prev) : em->start; >>>> + start = max_t(u64, start, em->start); >>>> + end = next ? next->start : extent_map_end(em); >>>> + end = min_t(u64, end, extent_map_end(em)); >>>> + start_diff = start - em->start; >>>> + em->start = start; >>>> + em->len = end - start; >>>> if (em->block_start < EXTENT_MAP_LAST_BYTE && >>>> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >>>> em->block_start += start_diff; >>>> @@ -6482,25 +6521,21 @@ insert: >>>> >>>> ret = 0; >>>> >>>> - existing = lookup_extent_mapping(em_tree, start, len); >>>> - if (existing && (existing->start > start || >>>> - existing->start + existing->len <= start)) { >>>> + existing = search_extent_mapping(em_tree, start, len); >>>> + /* >>>> + * existing will always be non-NULL, since there must be >>>> + * extent causing the -EEXIST. >>>> + */ >>>> + if (start >= extent_map_end(existing) || >>>> + start + len <= existing->start) { >>>> + /* >>>> + * The existing extent map is the one nearest to >>>> + * the [start, start + len) range which overlaps >>>> + */ >>>> + err = merge_extent_mapping(em_tree, existing, >>>> + em, start); >>>> free_extent_map(existing); >>>> - existing = NULL; >>>> - } >>>> - if (!existing) { >>>> - existing = lookup_extent_mapping(em_tree, >>>> em->start, >>>> - em->len); >>>> - if (existing) { >>>> - err = merge_extent_mapping(em_tree, >>>> existing, >>>> - em, start); >>>> - free_extent_map(existing); >>>> - if (err) { >>>> - free_extent_map(em); >>>> - em = NULL; >>>> - } >>>> - } else { >>>> - err = -EIO; >>>> + if (err) { >>>> free_extent_map(em); >>>> em = NULL; >>>> } >>>> -- >>>> 2.1.0 >>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Oct 9, 2014 at 1:28 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > -------- Original Message -------- > Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert > best fitted extent map > From: Filipe David Manana <fdmanana@gmail.com> > To: Qu Wenruo <quwenruo@cn.fujitsu.com> > Date: 2014?10?08? 20:08 >> >> On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> >> wrote: >>> >>> -------- Original Message -------- >>> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to >>> insert >>> best fitted extent map >>> From: Filipe David Manana <fdmanana@gmail.com> >>> To: Qu Wenruo <quwenruo@cn.fujitsu.com> >>> Date: 2014?09?18? 21:16 >>>> >>>> On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> >>>> wrote: >>>>> >>>>> The following commit enhanced the merge_extent_mapping() to reduce >>>>> fragment in extent map tree, but it can't handle case which existing >>>>> lies before map_start: >>>>> 51f39 btrfs: Use right extent length when inserting overlap extent map. >>>>> >>>>> [BUG] >>>>> When existing extent map's start is before map_start, >>>>> the em->len will be minus, which will corrupt the extent map and fail >>>>> to >>>>> insert the new extent map. >>>>> This will happen when someone get a large extent map, but when it is >>>>> going to insert it into extent map tree, some one has already commit >>>>> some write and split the huge extent into small parts. >>>> >>>> This sounds like very deterministic to me. >>>> Any reason to not add tests to the sanity tests that exercise >>>> this/these case/cases? >>> >>> Yes, thanks for the informing. >>> Will add the test case for it soon. >> >> Hi Qu, >> >> Any progress on the test? >> >> This is a very important one IMHO, not only because of the bad >> consequences of the bug (extent map corruption, leading to all sorts >> of chaos), but also because this problem was not found by the full >> xfstests suite on several developer machines. >> >> thanks > > Still trying to reproduce it under xfstest framework. That's the problem, wasn't apparently reproducible (or detectable at least) by anyone with xfstests. > But even followiiing the FileBench randomrw behavior(1 thread random read 1 > thread random write on preallocated space), > I still failed to reproduce it. > > Still investigating how to reproduce it. > Worst case may be add a new C program into src of xfstests? How about the sanity tests (fs/btrfs/tests/*.c)? Create an empty map tree, add some extent maps, then try to merge some new extent maps that used to fail before this fix. Seems simple, no? thanks Qu > > Thanks, > Qu > >> >>> Thanks, >>> Qu >>> >>>> Thanks >>>> >>>>> [REPRODUCER] >>>>> It is very easy to tiger using filebench with randomrw personality. >>>>> It is about 100% to reproduce when using 8G preallocated file in 60s >>>>> randonrw test. >>>>> >>>>> [FIX] >>>>> This patch can now handle any existing extent position. >>>>> Since it does not directly use existing->start, now it will find the >>>>> previous and next extent around map_start. >>>>> So the old existing->start < map_start bug will never happen again. >>>>> >>>>> [ENHANCE] >>>>> This patch will insert the best fitted extent map into extent map tree, >>>>> other than the oldest [map_start, map_start + sectorsize) or the >>>>> relatively newer but not perfect [map_start, existing->start). >>>>> >>>>> The patch will first search existing extent that does not intersects >>>>> with >>>>> the desired map range [map_start, map_start + len). >>>>> The existing extent will be either before or behind map_start, and >>>>> based >>>>> on the existing extent, we can find out the previous and next extent >>>>> around map_start. >>>>> >>>>> So the best fitted extent would be [prev->end, next->start). >>>>> For prev or next is not found, em->start would be prev->end and em->end >>>>> wold be next->start. >>>>> >>>>> With this patch, the fragment in extent map tree should be reduced much >>>>> more than the 51f39 commit and reduce an unneeded extent map tree >>>>> search. >>>>> >>>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>>> --- >>>>> fs/btrfs/inode.c | 79 >>>>> ++++++++++++++++++++++++++++++++++++++++---------------- >>>>> 1 file changed, 57 insertions(+), 22 deletions(-) >>>>> >>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>>>> index 016c403..8039021 100644 >>>>> --- a/fs/btrfs/inode.c >>>>> +++ b/fs/btrfs/inode.c >>>>> @@ -6191,21 +6191,60 @@ out_fail_inode: >>>>> goto out_fail; >>>>> } >>>>> >>>>> +/* Find next extent map of a given extent map, caller needs to ensure >>>>> locks */ >>>>> +static struct extent_map *next_extent_map(struct extent_map *em) >>>>> +{ >>>>> + struct rb_node *next; >>>>> + >>>>> + next = rb_next(&em->rb_node); >>>>> + if (!next) >>>>> + return NULL; >>>>> + return container_of(next, struct extent_map, rb_node); >>>>> +} >>>>> + >>>>> +static struct extent_map *prev_extent_map(struct extent_map *em) >>>>> +{ >>>>> + struct rb_node *prev; >>>>> + >>>>> + prev = rb_prev(&em->rb_node); >>>>> + if (!prev) >>>>> + return NULL; >>>>> + return container_of(prev, struct extent_map, rb_node); >>>>> +} >>>>> + >>>>> /* helper for btfs_get_extent. Given an existing extent in the >>>>> tree, >>>>> + * the existing extent is the nearest extent to map_start, >>>>> * and an extent that you want to insert, deal with overlap and >>>>> insert >>>>> - * the new extent into the tree. >>>>> + * the best fitted new extent into the tree. >>>>> */ >>>>> static int merge_extent_mapping(struct extent_map_tree *em_tree, >>>>> struct extent_map *existing, >>>>> struct extent_map *em, >>>>> u64 map_start) >>>>> { >>>>> + struct extent_map *prev; >>>>> + struct extent_map *next; >>>>> + u64 start; >>>>> + u64 end; >>>>> u64 start_diff; >>>>> >>>>> BUG_ON(map_start < em->start || map_start >= >>>>> extent_map_end(em)); >>>>> - start_diff = map_start - em->start; >>>>> - em->start = map_start; >>>>> - em->len = existing->start - em->start; >>>>> + >>>>> + if (existing->start > map_start) { >>>>> + next = existing; >>>>> + prev = prev_extent_map(next); >>>>> + } else { >>>>> + prev = existing; >>>>> + next = next_extent_map(prev); >>>>> + } >>>>> + >>>>> + start = prev ? extent_map_end(prev) : em->start; >>>>> + start = max_t(u64, start, em->start); >>>>> + end = next ? next->start : extent_map_end(em); >>>>> + end = min_t(u64, end, extent_map_end(em)); >>>>> + start_diff = start - em->start; >>>>> + em->start = start; >>>>> + em->len = end - start; >>>>> if (em->block_start < EXTENT_MAP_LAST_BYTE && >>>>> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >>>>> em->block_start += start_diff; >>>>> @@ -6482,25 +6521,21 @@ insert: >>>>> >>>>> ret = 0; >>>>> >>>>> - existing = lookup_extent_mapping(em_tree, start, len); >>>>> - if (existing && (existing->start > start || >>>>> - existing->start + existing->len <= start)) { >>>>> + existing = search_extent_mapping(em_tree, start, len); >>>>> + /* >>>>> + * existing will always be non-NULL, since there must >>>>> be >>>>> + * extent causing the -EEXIST. >>>>> + */ >>>>> + if (start >= extent_map_end(existing) || >>>>> + start + len <= existing->start) { >>>>> + /* >>>>> + * The existing extent map is the one nearest >>>>> to >>>>> + * the [start, start + len) range which >>>>> overlaps >>>>> + */ >>>>> + err = merge_extent_mapping(em_tree, existing, >>>>> + em, start); >>>>> free_extent_map(existing); >>>>> - existing = NULL; >>>>> - } >>>>> - if (!existing) { >>>>> - existing = lookup_extent_mapping(em_tree, >>>>> em->start, >>>>> - em->len); >>>>> - if (existing) { >>>>> - err = merge_extent_mapping(em_tree, >>>>> existing, >>>>> - em, start); >>>>> - free_extent_map(existing); >>>>> - if (err) { >>>>> - free_extent_map(em); >>>>> - em = NULL; >>>>> - } >>>>> - } else { >>>>> - err = -EIO; >>>>> + if (err) { >>>>> free_extent_map(em); >>>>> em = NULL; >>>>> } >>>>> -- >>>>> 2.1.0 >>>>> >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >>>>> in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>>> >>>> >> >> >
-------- Original Message -------- Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map From: Filipe David Manana <fdmanana@gmail.com> To: Qu Wenruo <quwenruo@cn.fujitsu.com> Date: 2014?10?09? 18:27 > On Thu, Oct 9, 2014 at 1:28 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> -------- Original Message -------- >> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert >> best fitted extent map >> From: Filipe David Manana <fdmanana@gmail.com> >> To: Qu Wenruo <quwenruo@cn.fujitsu.com> >> Date: 2014?10?08? 20:08 >>> On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> >>> wrote: >>>> -------- Original Message -------- >>>> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to >>>> insert >>>> best fitted extent map >>>> From: Filipe David Manana <fdmanana@gmail.com> >>>> To: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>> Date: 2014?09?18? 21:16 >>>>> On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> >>>>> wrote: >>>>>> The following commit enhanced the merge_extent_mapping() to reduce >>>>>> fragment in extent map tree, but it can't handle case which existing >>>>>> lies before map_start: >>>>>> 51f39 btrfs: Use right extent length when inserting overlap extent map. >>>>>> >>>>>> [BUG] >>>>>> When existing extent map's start is before map_start, >>>>>> the em->len will be minus, which will corrupt the extent map and fail >>>>>> to >>>>>> insert the new extent map. >>>>>> This will happen when someone get a large extent map, but when it is >>>>>> going to insert it into extent map tree, some one has already commit >>>>>> some write and split the huge extent into small parts. >>>>> This sounds like very deterministic to me. >>>>> Any reason to not add tests to the sanity tests that exercise >>>>> this/these case/cases? >>>> Yes, thanks for the informing. >>>> Will add the test case for it soon. >>> Hi Qu, >>> >>> Any progress on the test? >>> >>> This is a very important one IMHO, not only because of the bad >>> consequences of the bug (extent map corruption, leading to all sorts >>> of chaos), but also because this problem was not found by the full >>> xfstests suite on several developer machines. >>> >>> thanks >> Still trying to reproduce it under xfstest framework. > That's the problem, wasn't apparently reproducible (or detectable at > least) by anyone with xfstests. I'll try to build a C program to behave the same of filebench and to see if it works. At least with filebench, it can be triggered in 60s with 100% possibility to reproduce. > >> But even followiiing the FileBench randomrw behavior(1 thread random read 1 >> thread random write on preallocated space), >> I still failed to reproduce it. >> >> Still investigating how to reproduce it. >> Worst case may be add a new C program into src of xfstests? > How about the sanity tests (fs/btrfs/tests/*.c)? Create an empty map > tree, add some extent maps, then try to merge some new extent maps > that used to fail before this fix. Seems simple, no? > > thanks Qu It needs concurrency read and write(commit) to trigger it, I am not sure it can be reproduced in sanity tests since it seems not commit things and lacks multithread facility. I'll give it a try on the filebench-behavior C program first, and then sanity tests if former doesn't work at all Thanks, Qu > > >> Thanks, >> Qu >> >>>> Thanks, >>>> Qu >>>> >>>>> Thanks >>>>> >>>>>> [REPRODUCER] >>>>>> It is very easy to tiger using filebench with randomrw personality. >>>>>> It is about 100% to reproduce when using 8G preallocated file in 60s >>>>>> randonrw test. >>>>>> >>>>>> [FIX] >>>>>> This patch can now handle any existing extent position. >>>>>> Since it does not directly use existing->start, now it will find the >>>>>> previous and next extent around map_start. >>>>>> So the old existing->start < map_start bug will never happen again. >>>>>> >>>>>> [ENHANCE] >>>>>> This patch will insert the best fitted extent map into extent map tree, >>>>>> other than the oldest [map_start, map_start + sectorsize) or the >>>>>> relatively newer but not perfect [map_start, existing->start). >>>>>> >>>>>> The patch will first search existing extent that does not intersects >>>>>> with >>>>>> the desired map range [map_start, map_start + len). >>>>>> The existing extent will be either before or behind map_start, and >>>>>> based >>>>>> on the existing extent, we can find out the previous and next extent >>>>>> around map_start. >>>>>> >>>>>> So the best fitted extent would be [prev->end, next->start). >>>>>> For prev or next is not found, em->start would be prev->end and em->end >>>>>> wold be next->start. >>>>>> >>>>>> With this patch, the fragment in extent map tree should be reduced much >>>>>> more than the 51f39 commit and reduce an unneeded extent map tree >>>>>> search. >>>>>> >>>>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>>>> --- >>>>>> fs/btrfs/inode.c | 79 >>>>>> ++++++++++++++++++++++++++++++++++++++++---------------- >>>>>> 1 file changed, 57 insertions(+), 22 deletions(-) >>>>>> >>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>>>>> index 016c403..8039021 100644 >>>>>> --- a/fs/btrfs/inode.c >>>>>> +++ b/fs/btrfs/inode.c >>>>>> @@ -6191,21 +6191,60 @@ out_fail_inode: >>>>>> goto out_fail; >>>>>> } >>>>>> >>>>>> +/* Find next extent map of a given extent map, caller needs to ensure >>>>>> locks */ >>>>>> +static struct extent_map *next_extent_map(struct extent_map *em) >>>>>> +{ >>>>>> + struct rb_node *next; >>>>>> + >>>>>> + next = rb_next(&em->rb_node); >>>>>> + if (!next) >>>>>> + return NULL; >>>>>> + return container_of(next, struct extent_map, rb_node); >>>>>> +} >>>>>> + >>>>>> +static struct extent_map *prev_extent_map(struct extent_map *em) >>>>>> +{ >>>>>> + struct rb_node *prev; >>>>>> + >>>>>> + prev = rb_prev(&em->rb_node); >>>>>> + if (!prev) >>>>>> + return NULL; >>>>>> + return container_of(prev, struct extent_map, rb_node); >>>>>> +} >>>>>> + >>>>>> /* helper for btfs_get_extent. Given an existing extent in the >>>>>> tree, >>>>>> + * the existing extent is the nearest extent to map_start, >>>>>> * and an extent that you want to insert, deal with overlap and >>>>>> insert >>>>>> - * the new extent into the tree. >>>>>> + * the best fitted new extent into the tree. >>>>>> */ >>>>>> static int merge_extent_mapping(struct extent_map_tree *em_tree, >>>>>> struct extent_map *existing, >>>>>> struct extent_map *em, >>>>>> u64 map_start) >>>>>> { >>>>>> + struct extent_map *prev; >>>>>> + struct extent_map *next; >>>>>> + u64 start; >>>>>> + u64 end; >>>>>> u64 start_diff; >>>>>> >>>>>> BUG_ON(map_start < em->start || map_start >= >>>>>> extent_map_end(em)); >>>>>> - start_diff = map_start - em->start; >>>>>> - em->start = map_start; >>>>>> - em->len = existing->start - em->start; >>>>>> + >>>>>> + if (existing->start > map_start) { >>>>>> + next = existing; >>>>>> + prev = prev_extent_map(next); >>>>>> + } else { >>>>>> + prev = existing; >>>>>> + next = next_extent_map(prev); >>>>>> + } >>>>>> + >>>>>> + start = prev ? extent_map_end(prev) : em->start; >>>>>> + start = max_t(u64, start, em->start); >>>>>> + end = next ? next->start : extent_map_end(em); >>>>>> + end = min_t(u64, end, extent_map_end(em)); >>>>>> + start_diff = start - em->start; >>>>>> + em->start = start; >>>>>> + em->len = end - start; >>>>>> if (em->block_start < EXTENT_MAP_LAST_BYTE && >>>>>> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >>>>>> em->block_start += start_diff; >>>>>> @@ -6482,25 +6521,21 @@ insert: >>>>>> >>>>>> ret = 0; >>>>>> >>>>>> - existing = lookup_extent_mapping(em_tree, start, len); >>>>>> - if (existing && (existing->start > start || >>>>>> - existing->start + existing->len <= start)) { >>>>>> + existing = search_extent_mapping(em_tree, start, len); >>>>>> + /* >>>>>> + * existing will always be non-NULL, since there must >>>>>> be >>>>>> + * extent causing the -EEXIST. >>>>>> + */ >>>>>> + if (start >= extent_map_end(existing) || >>>>>> + start + len <= existing->start) { >>>>>> + /* >>>>>> + * The existing extent map is the one nearest >>>>>> to >>>>>> + * the [start, start + len) range which >>>>>> overlaps >>>>>> + */ >>>>>> + err = merge_extent_mapping(em_tree, existing, >>>>>> + em, start); >>>>>> free_extent_map(existing); >>>>>> - existing = NULL; >>>>>> - } >>>>>> - if (!existing) { >>>>>> - existing = lookup_extent_mapping(em_tree, >>>>>> em->start, >>>>>> - em->len); >>>>>> - if (existing) { >>>>>> - err = merge_extent_mapping(em_tree, >>>>>> existing, >>>>>> - em, start); >>>>>> - free_extent_map(existing); >>>>>> - if (err) { >>>>>> - free_extent_map(em); >>>>>> - em = NULL; >>>>>> - } >>>>>> - } else { >>>>>> - err = -EIO; >>>>>> + if (err) { >>>>>> free_extent_map(em); >>>>>> em = NULL; >>>>>> } >>>>>> -- >>>>>> 2.1.0 >>>>>> >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >>>>>> in >>>>>> the body of a message to majordomo@vger.kernel.org >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >>>>> >>> > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 10, 2014 at 3:39 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > -------- Original Message -------- > Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert > best fitted extent map > From: Filipe David Manana <fdmanana@gmail.com> > To: Qu Wenruo <quwenruo@cn.fujitsu.com> > Date: 2014?10?09? 18:27 >> >> On Thu, Oct 9, 2014 at 1:28 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >>> >>> -------- Original Message -------- >>> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to >>> insert >>> best fitted extent map >>> From: Filipe David Manana <fdmanana@gmail.com> >>> To: Qu Wenruo <quwenruo@cn.fujitsu.com> >>> Date: 2014?10?08? 20:08 >>>> >>>> On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> >>>> wrote: >>>>> >>>>> -------- Original Message -------- >>>>> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to >>>>> insert >>>>> best fitted extent map >>>>> From: Filipe David Manana <fdmanana@gmail.com> >>>>> To: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>>> Date: 2014?09?18? 21:16 >>>>>> >>>>>> On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> >>>>>> wrote: >>>>>>> >>>>>>> The following commit enhanced the merge_extent_mapping() to reduce >>>>>>> fragment in extent map tree, but it can't handle case which existing >>>>>>> lies before map_start: >>>>>>> 51f39 btrfs: Use right extent length when inserting overlap extent >>>>>>> map. >>>>>>> >>>>>>> [BUG] >>>>>>> When existing extent map's start is before map_start, >>>>>>> the em->len will be minus, which will corrupt the extent map and fail >>>>>>> to >>>>>>> insert the new extent map. >>>>>>> This will happen when someone get a large extent map, but when it is >>>>>>> going to insert it into extent map tree, some one has already commit >>>>>>> some write and split the huge extent into small parts. >>>>>> >>>>>> This sounds like very deterministic to me. >>>>>> Any reason to not add tests to the sanity tests that exercise >>>>>> this/these case/cases? >>>>> >>>>> Yes, thanks for the informing. >>>>> Will add the test case for it soon. >>>> >>>> Hi Qu, >>>> >>>> Any progress on the test? >>>> >>>> This is a very important one IMHO, not only because of the bad >>>> consequences of the bug (extent map corruption, leading to all sorts >>>> of chaos), but also because this problem was not found by the full >>>> xfstests suite on several developer machines. >>>> >>>> thanks >>> >>> Still trying to reproduce it under xfstest framework. >> >> That's the problem, wasn't apparently reproducible (or detectable at >> least) by anyone with xfstests. > > I'll try to build a C program to behave the same of filebench and to see if > it works. > At least with filebench, it can be triggered in 60s with 100% possibility to > reproduce. >> >> >>> But even followiiing the FileBench randomrw behavior(1 thread random read >>> 1 >>> thread random write on preallocated space), >>> I still failed to reproduce it. >>> >>> Still investigating how to reproduce it. >>> Worst case may be add a new C program into src of xfstests? >> >> How about the sanity tests (fs/btrfs/tests/*.c)? Create an empty map >> tree, add some extent maps, then try to merge some new extent maps >> that used to fail before this fix. Seems simple, no? >> >> thanks Qu > > It needs concurrency read and write(commit) to trigger it, I am not sure it > can be reproduced in sanity tests > since it seems not commit things and lacks multithread facility. Hum? Why does concurrency or persistence matters? Let's review the problem. So you fixed the function inode.c:merge_extent_mapping(). That function merges a new extent map (not in the extent map tree) with an existing extent map (which is in the tree). The issue was that the merge was incorrect for some cases - producing a bad extent map (compared to the rest of the existing extent maps) that either overlaps existing ones or introduces incorrect gaps, etc - doesn't really matter the reason. Now, this function is run while holding the write lock of the inode's extent map tree. So why does concurrency (or persistence) matters here? Why can't we have a sanity test that simply reproduces a scenario where immediately after attempting to merge extent maps, we get an (in-memory) extent map that is incorrect? Also, you don't need to go to such great lengths as writing a C program that mimics filebench's behaviour. The issue can be reproduced from user space without file write and read concurrency as well, using only common tools like fallocate (or xfs_io), dd and filefrag. See the thread at: http://www.spinics.net/lists/linux-btrfs/msg38045.html thanks > > I'll give it a try on the filebench-behavior C program first, and then > sanity tests if former doesn't work at all > > Thanks, > > Qu >> >> >> >>> Thanks, >>> Qu >>> >>>>> Thanks, >>>>> Qu >>>>> >>>>>> Thanks >>>>>> >>>>>>> [REPRODUCER] >>>>>>> It is very easy to tiger using filebench with randomrw personality. >>>>>>> It is about 100% to reproduce when using 8G preallocated file in 60s >>>>>>> randonrw test. >>>>>>> >>>>>>> [FIX] >>>>>>> This patch can now handle any existing extent position. >>>>>>> Since it does not directly use existing->start, now it will find the >>>>>>> previous and next extent around map_start. >>>>>>> So the old existing->start < map_start bug will never happen again. >>>>>>> >>>>>>> [ENHANCE] >>>>>>> This patch will insert the best fitted extent map into extent map >>>>>>> tree, >>>>>>> other than the oldest [map_start, map_start + sectorsize) or the >>>>>>> relatively newer but not perfect [map_start, existing->start). >>>>>>> >>>>>>> The patch will first search existing extent that does not intersects >>>>>>> with >>>>>>> the desired map range [map_start, map_start + len). >>>>>>> The existing extent will be either before or behind map_start, and >>>>>>> based >>>>>>> on the existing extent, we can find out the previous and next extent >>>>>>> around map_start. >>>>>>> >>>>>>> So the best fitted extent would be [prev->end, next->start). >>>>>>> For prev or next is not found, em->start would be prev->end and >>>>>>> em->end >>>>>>> wold be next->start. >>>>>>> >>>>>>> With this patch, the fragment in extent map tree should be reduced >>>>>>> much >>>>>>> more than the 51f39 commit and reduce an unneeded extent map tree >>>>>>> search. >>>>>>> >>>>>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> >>>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>>>>> --- >>>>>>> fs/btrfs/inode.c | 79 >>>>>>> ++++++++++++++++++++++++++++++++++++++++---------------- >>>>>>> 1 file changed, 57 insertions(+), 22 deletions(-) >>>>>>> >>>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>>>>>> index 016c403..8039021 100644 >>>>>>> --- a/fs/btrfs/inode.c >>>>>>> +++ b/fs/btrfs/inode.c >>>>>>> @@ -6191,21 +6191,60 @@ out_fail_inode: >>>>>>> goto out_fail; >>>>>>> } >>>>>>> >>>>>>> +/* Find next extent map of a given extent map, caller needs to >>>>>>> ensure >>>>>>> locks */ >>>>>>> +static struct extent_map *next_extent_map(struct extent_map *em) >>>>>>> +{ >>>>>>> + struct rb_node *next; >>>>>>> + >>>>>>> + next = rb_next(&em->rb_node); >>>>>>> + if (!next) >>>>>>> + return NULL; >>>>>>> + return container_of(next, struct extent_map, rb_node); >>>>>>> +} >>>>>>> + >>>>>>> +static struct extent_map *prev_extent_map(struct extent_map *em) >>>>>>> +{ >>>>>>> + struct rb_node *prev; >>>>>>> + >>>>>>> + prev = rb_prev(&em->rb_node); >>>>>>> + if (!prev) >>>>>>> + return NULL; >>>>>>> + return container_of(prev, struct extent_map, rb_node); >>>>>>> +} >>>>>>> + >>>>>>> /* helper for btfs_get_extent. Given an existing extent in the >>>>>>> tree, >>>>>>> + * the existing extent is the nearest extent to map_start, >>>>>>> * and an extent that you want to insert, deal with overlap and >>>>>>> insert >>>>>>> - * the new extent into the tree. >>>>>>> + * the best fitted new extent into the tree. >>>>>>> */ >>>>>>> static int merge_extent_mapping(struct extent_map_tree *em_tree, >>>>>>> struct extent_map *existing, >>>>>>> struct extent_map *em, >>>>>>> u64 map_start) >>>>>>> { >>>>>>> + struct extent_map *prev; >>>>>>> + struct extent_map *next; >>>>>>> + u64 start; >>>>>>> + u64 end; >>>>>>> u64 start_diff; >>>>>>> >>>>>>> BUG_ON(map_start < em->start || map_start >= >>>>>>> extent_map_end(em)); >>>>>>> - start_diff = map_start - em->start; >>>>>>> - em->start = map_start; >>>>>>> - em->len = existing->start - em->start; >>>>>>> + >>>>>>> + if (existing->start > map_start) { >>>>>>> + next = existing; >>>>>>> + prev = prev_extent_map(next); >>>>>>> + } else { >>>>>>> + prev = existing; >>>>>>> + next = next_extent_map(prev); >>>>>>> + } >>>>>>> + >>>>>>> + start = prev ? extent_map_end(prev) : em->start; >>>>>>> + start = max_t(u64, start, em->start); >>>>>>> + end = next ? next->start : extent_map_end(em); >>>>>>> + end = min_t(u64, end, extent_map_end(em)); >>>>>>> + start_diff = start - em->start; >>>>>>> + em->start = start; >>>>>>> + em->len = end - start; >>>>>>> if (em->block_start < EXTENT_MAP_LAST_BYTE && >>>>>>> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { >>>>>>> em->block_start += start_diff; >>>>>>> @@ -6482,25 +6521,21 @@ insert: >>>>>>> >>>>>>> ret = 0; >>>>>>> >>>>>>> - existing = lookup_extent_mapping(em_tree, start, >>>>>>> len); >>>>>>> - if (existing && (existing->start > start || >>>>>>> - existing->start + existing->len <= start)) { >>>>>>> + existing = search_extent_mapping(em_tree, start, >>>>>>> len); >>>>>>> + /* >>>>>>> + * existing will always be non-NULL, since there must >>>>>>> be >>>>>>> + * extent causing the -EEXIST. >>>>>>> + */ >>>>>>> + if (start >= extent_map_end(existing) || >>>>>>> + start + len <= existing->start) { >>>>>>> + /* >>>>>>> + * The existing extent map is the one nearest >>>>>>> to >>>>>>> + * the [start, start + len) range which >>>>>>> overlaps >>>>>>> + */ >>>>>>> + err = merge_extent_mapping(em_tree, existing, >>>>>>> + em, start); >>>>>>> free_extent_map(existing); >>>>>>> - existing = NULL; >>>>>>> - } >>>>>>> - if (!existing) { >>>>>>> - existing = lookup_extent_mapping(em_tree, >>>>>>> em->start, >>>>>>> - em->len); >>>>>>> - if (existing) { >>>>>>> - err = merge_extent_mapping(em_tree, >>>>>>> existing, >>>>>>> - em, >>>>>>> start); >>>>>>> - free_extent_map(existing); >>>>>>> - if (err) { >>>>>>> - free_extent_map(em); >>>>>>> - em = NULL; >>>>>>> - } >>>>>>> - } else { >>>>>>> - err = -EIO; >>>>>>> + if (err) { >>>>>>> free_extent_map(em); >>>>>>> em = NULL; >>>>>>> } >>>>>>> -- >>>>>>> 2.1.0 >>>>>>> >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe >>>>>>> linux-btrfs" >>>>>>> in >>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>> >>>>>> >>>>>> >>>> >> >> >
-------- Original Message -------- Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map From: Filipe David Manana <fdmanana@gmail.com> To: Qu Wenruo <quwenruo@cn.fujitsu.com> Date: 2014?10?10? 16:08 > On Fri, Oct 10, 2014 at 3:39 AM, Qu Wenruo<quwenruo@cn.fujitsu.com> wrote: >> -------- Original Message -------- >> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert >> best fitted extent map >> From: Filipe David Manana<fdmanana@gmail.com> >> To: Qu Wenruo<quwenruo@cn.fujitsu.com> >> Date: 2014?10?09? 18:27 >>> On Thu, Oct 9, 2014 at 1:28 AM, Qu Wenruo<quwenruo@cn.fujitsu.com> wrote: >>>> -------- Original Message -------- >>>> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to >>>> insert >>>> best fitted extent map >>>> From: Filipe David Manana<fdmanana@gmail.com> >>>> To: Qu Wenruo<quwenruo@cn.fujitsu.com> >>>> Date: 2014?10?08? 20:08 >>>>> On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo<quwenruo@cn.fujitsu.com> >>>>> wrote: >>>>>> -------- Original Message -------- >>>>>> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to >>>>>> insert >>>>>> best fitted extent map >>>>>> From: Filipe David Manana<fdmanana@gmail.com> >>>>>> To: Qu Wenruo<quwenruo@cn.fujitsu.com> >>>>>> Date: 2014?09?18? 21:16 >>>>>>> On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo<quwenruo@cn.fujitsu.com> >>>>>>> wrote: >>>>>>>> The following commit enhanced the merge_extent_mapping() to reduce >>>>>>>> fragment in extent map tree, but it can't handle case which existing >>>>>>>> lies before map_start: >>>>>>>> 51f39 btrfs: Use right extent length when inserting overlap extent >>>>>>>> map. >>>>>>>> >>>>>>>> [BUG] >>>>>>>> When existing extent map's start is before map_start, >>>>>>>> the em->len will be minus, which will corrupt the extent map and fail >>>>>>>> to >>>>>>>> insert the new extent map. >>>>>>>> This will happen when someone get a large extent map, but when it is >>>>>>>> going to insert it into extent map tree, some one has already commit >>>>>>>> some write and split the huge extent into small parts. >>>>>>> This sounds like very deterministic to me. >>>>>>> Any reason to not add tests to the sanity tests that exercise >>>>>>> this/these case/cases? >>>>>> Yes, thanks for the informing. >>>>>> Will add the test case for it soon. >>>>> Hi Qu, >>>>> >>>>> Any progress on the test? >>>>> >>>>> This is a very important one IMHO, not only because of the bad >>>>> consequences of the bug (extent map corruption, leading to all sorts >>>>> of chaos), but also because this problem was not found by the full >>>>> xfstests suite on several developer machines. >>>>> >>>>> thanks >>>> Still trying to reproduce it under xfstest framework. >>> That's the problem, wasn't apparently reproducible (or detectable at >>> least) by anyone with xfstests. >> I'll try to build a C program to behave the same of filebench and to see if >> it works. >> At least with filebench, it can be triggered in 60s with 100% possibility to >> reproduce. >>>> But even followiiing the FileBench randomrw behavior(1 thread random read >>>> 1 >>>> thread random write on preallocated space), >>>> I still failed to reproduce it. >>>> >>>> Still investigating how to reproduce it. >>>> Worst case may be add a new C program into src of xfstests? >>> How about the sanity tests (fs/btrfs/tests/*.c)? Create an empty map >>> tree, add some extent maps, then try to merge some new extent maps >>> that used to fail before this fix. Seems simple, no? >>> >>> thanks Qu >> It needs concurrency read and write(commit) to trigger it, I am not sure it >> can be reproduced in sanity tests >> since it seems not commit things and lacks multithread facility. > Hum? > Why does concurrency or persistence matters? > > Let's review the problem. > So you fixed the function inode.c:merge_extent_mapping(). That > function merges a new extent map (not in the extent map tree) with an > existing extent map (which is in the tree). > The issue was that the merge was incorrect for some cases - producing > a bad extent map (compared to the rest of the existing extent maps) > that either overlaps existing ones or introduces incorrect gaps, etc - > doesn't really matter the reason. > Now, this function is run while holding the write lock of the inode's > extent map tree. > So why does concurrency (or persistence) matters here? It is true that the patch only fixed the above merge problem. But the bug involves more. 1) the direct cause. existing extent map's start is smaller than map_start in merge_extent_mapping(). 2) the root cause. As described in my V2 patch, there is a window between btrfs_release_path() and write_lock(em_tree->lock) in btrfs_get_extent(), and under concurrency, one may get a big extent map converted from on-disk file extent, and during the windows, a commit happens and the original extent map is split into several small ones. So 1) will happen and cause the bug. At least the reporter's filebench reproducer can be explained like above, and that's why concurrency is needed to trigger the bug under such circumstance. > Why can't we have a sanity test that simply reproduces a scenario > where immediately after attempting to merge extent maps, we get an > (in-memory) extent map that is incorrect? There is other situation triggering the bug(just like the mail below), but the above known circumstance needs concurrency to let commit happen during the small window. So at least sanity test can't trigger it under the *concurrency* circumstance. Maybe the fallocate/dd/filefrag circumstance will fit more for xfs test case. > Also, you don't need to go to such great lengths as writing a C > program that mimics filebench's behaviour. > The issue can be reproduced from user space without file write and > read concurrency as well, using only common tools like fallocate (or > xfs_io), dd and filefrag. See the thread at: > http://www.spinics.net/lists/linux-btrfs/msg38045.html > thanks Thanks for the information a lot!! I succeeded in reproduce the bug with a 10G file just using fallcate + dd(from /dev/zero) + filefrag. Although already good enough, I'll dig more to find out the circumstance the bug is triggered. Thanks, Qu >> I'll give it a try on the filebench-behavior C program first, and then >> sanity tests if former doesn't work at all > >> Thanks, >> >> Qu >>> >>>> Thanks, >>>> Qu >>>> >>>>>> Thanks, >>>>>> Qu >>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 016c403..8039021 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6191,21 +6191,60 @@ out_fail_inode: goto out_fail; } +/* Find next extent map of a given extent map, caller needs to ensure locks */ +static struct extent_map *next_extent_map(struct extent_map *em) +{ + struct rb_node *next; + + next = rb_next(&em->rb_node); + if (!next) + return NULL; + return container_of(next, struct extent_map, rb_node); +} + +static struct extent_map *prev_extent_map(struct extent_map *em) +{ + struct rb_node *prev; + + prev = rb_prev(&em->rb_node); + if (!prev) + return NULL; + return container_of(prev, struct extent_map, rb_node); +} + /* helper for btfs_get_extent. Given an existing extent in the tree, + * the existing extent is the nearest extent to map_start, * and an extent that you want to insert, deal with overlap and insert - * the new extent into the tree. + * the best fitted new extent into the tree. */ static int merge_extent_mapping(struct extent_map_tree *em_tree, struct extent_map *existing, struct extent_map *em, u64 map_start) { + struct extent_map *prev; + struct extent_map *next; + u64 start; + u64 end; u64 start_diff; BUG_ON(map_start < em->start || map_start >= extent_map_end(em)); - start_diff = map_start - em->start; - em->start = map_start; - em->len = existing->start - em->start; + + if (existing->start > map_start) { + next = existing; + prev = prev_extent_map(next); + } else { + prev = existing; + next = next_extent_map(prev); + } + + start = prev ? extent_map_end(prev) : em->start; + start = max_t(u64, start, em->start); + end = next ? next->start : extent_map_end(em); + end = min_t(u64, end, extent_map_end(em)); + start_diff = start - em->start; + em->start = start; + em->len = end - start; if (em->block_start < EXTENT_MAP_LAST_BYTE && !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { em->block_start += start_diff; @@ -6482,25 +6521,21 @@ insert: ret = 0; - existing = lookup_extent_mapping(em_tree, start, len); - if (existing && (existing->start > start || - existing->start + existing->len <= start)) { + existing = search_extent_mapping(em_tree, start, len); + /* + * existing will always be non-NULL, since there must be + * extent causing the -EEXIST. + */ + if (start >= extent_map_end(existing) || + start + len <= existing->start) { + /* + * The existing extent map is the one nearest to + * the [start, start + len) range which overlaps + */ + err = merge_extent_mapping(em_tree, existing, + em, start); free_extent_map(existing); - existing = NULL; - } - if (!existing) { - existing = lookup_extent_mapping(em_tree, em->start, - em->len); - if (existing) { - err = merge_extent_mapping(em_tree, existing, - em, start); - free_extent_map(existing); - if (err) { - free_extent_map(em); - em = NULL; - } - } else { - err = -EIO; + if (err) { free_extent_map(em); em = NULL; }
The following commit enhanced the merge_extent_mapping() to reduce fragment in extent map tree, but it can't handle case which existing lies before map_start: 51f39 btrfs: Use right extent length when inserting overlap extent map. [BUG] When existing extent map's start is before map_start, the em->len will be minus, which will corrupt the extent map and fail to insert the new extent map. This will happen when someone get a large extent map, but when it is going to insert it into extent map tree, some one has already commit some write and split the huge extent into small parts. [REPRODUCER] It is very easy to tiger using filebench with randomrw personality. It is about 100% to reproduce when using 8G preallocated file in 60s randonrw test. [FIX] This patch can now handle any existing extent position. Since it does not directly use existing->start, now it will find the previous and next extent around map_start. So the old existing->start < map_start bug will never happen again. [ENHANCE] This patch will insert the best fitted extent map into extent map tree, other than the oldest [map_start, map_start + sectorsize) or the relatively newer but not perfect [map_start, existing->start). The patch will first search existing extent that does not intersects with the desired map range [map_start, map_start + len). The existing extent will be either before or behind map_start, and based on the existing extent, we can find out the previous and next extent around map_start. So the best fitted extent would be [prev->end, next->start). For prev or next is not found, em->start would be prev->end and em->end wold be next->start. With this patch, the fragment in extent map tree should be reduced much more than the 51f39 commit and reduce an unneeded extent map tree search. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> --- fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 57 insertions(+), 22 deletions(-)