Message ID | 20220516165416.171196-9-p.raghav@samsung.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v4,01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze | expand |
On 16/05/2022 18:55, Pankaj Raghav wrote: > Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. > These are fixed at these locations so that recovery tools can reliably > retrieve the superblocks even if one of the mirror gets corrupted. > > power of 2 zone sizes align at these offsets irrespective of their > value but non power of 2 zone sizes will not align. > > To make sure the first zone at mirror 1 and mirror 2 align, write zero > operation is performed to move the write pointer of the first zone to > the expected offset. This operation is performed only after a zone reset > of the first zone, i.e., when the second zone that contains the sb is FULL. Hi Pankaj, stupid question. Npo2 devices still have a zone size being a multiple of 4k don't they? If not, we'd need to also have a tail padding of the superblock zones, in order to move the WP of these zones to the end, so the sb-log states match up.
On 2022-05-17 08:50, Johannes Thumshirn wrote: > On 16/05/2022 18:55, Pankaj Raghav wrote: >> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. >> These are fixed at these locations so that recovery tools can reliably >> retrieve the superblocks even if one of the mirror gets corrupted. >> >> power of 2 zone sizes align at these offsets irrespective of their >> value but non power of 2 zone sizes will not align. >> >> To make sure the first zone at mirror 1 and mirror 2 align, write zero >> operation is performed to move the write pointer of the first zone to >> the expected offset. This operation is performed only after a zone reset >> of the first zone, i.e., when the second zone that contains the sb is FULL. > > Hi Pankaj, stupid question. Npo2 devices still have a zone size being a > multiple of 4k don't they? > > If not, we'd need to also have a tail padding of the superblock zones, in order > to move the WP of these zones to the end, so the sb-log states match up. Hi Johannes, NPO2 zoned devices has a minimum alignment requirement of 1MB. See commit: `btrfs: zoned: relax the alignment constraint for zoned devices` It will naturally align to 4k. So I don't think we need special handling there with tail padding.
On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote: > Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. > These are fixed at these locations so that recovery tools can reliably > retrieve the superblocks even if one of the mirror gets corrupted. > > power of 2 zone sizes align at these offsets irrespective of their > value but non power of 2 zone sizes will not align. > > To make sure the first zone at mirror 1 and mirror 2 align, write zero > operation is performed to move the write pointer of the first zone to > the expected offset. This operation is performed only after a zone reset > of the first zone, i.e., when the second zone that contains the sb is FULL. Is it a good idea to do the "write zeros", instead of a plain "set write pointer"? I assume setting write pointer is instant, while writing potentially hundreds of megabytes may take significiant time. As the functions may be called from random contexts, the increased time may become a problem. > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> > --- > fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 63 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c > index 3023c871e..805aeaa76 100644 > --- a/fs/btrfs/zoned.c > +++ b/fs/btrfs/zoned.c > @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) > return 0; > } > > +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone, > + int mirror, u64 *wp_ret) > +{ > + u64 offset = 0; > + int ret = 0; > + > + ASSERT(!is_power_of_two_u64(zone->len)); > + ASSERT(zone->wp == zone->start); > + ASSERT(mirror != 0); This could simply accept 0 as the mirror offset too, the calculation is trivial. > + > + switch (mirror) { > + case 1: > + div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT, > + zone->len, &offset); > + break; > + case 2: > + div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT, > + zone->len, &offset); > + break; > + } > + > + ret = blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0); > + if (ret) > + return ret; > + > + zone->wp += offset; > + zone->cond = BLK_ZONE_COND_IMP_OPEN; > + *wp_ret = zone->wp << SECTOR_SHIFT; > + > + return 0; > +} > + > static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > - int rw, u64 *bytenr_ret) > + int rw, int mirror, u64 *bytenr_ret) > { > u64 wp; > int ret; > + bool zones_empty = false; > > if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { > *bytenr_ret = zones[0].start << SECTOR_SHIFT; > @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > if (ret != -ENOENT && ret < 0) > return ret; > > + if (ret == -ENOENT) > + zones_empty = true; > + > if (rw == WRITE) { > struct blk_zone *reset = NULL; > + bool is_sb_offset_write_req = false; > + u32 reset_zone_nr = -1; > > - if (wp == zones[0].start << SECTOR_SHIFT) > + if (wp == zones[0].start << SECTOR_SHIFT) { > reset = &zones[0]; > - else if (wp == zones[1].start << SECTOR_SHIFT) > + reset_zone_nr = 0; > + } else if (wp == zones[1].start << SECTOR_SHIFT) { > reset = &zones[1]; > + reset_zone_nr = 1; > + } > + > + /* > + * Non po2 zone sizes will not align naturally at > + * mirror 1 (512GB) and mirror 2 (4TB). The wp of the > + * 1st zone in those superblock mirrors need to be > + * moved to align at those offsets. > + */ Please move this comment to the helper fill_sb_wp_offset itself, there it's more discoverable. > + is_sb_offset_write_req = > + (zones_empty || (reset_zone_nr == 0)) && mirror && > + !is_power_of_2(zones[0].len); Accepting 0 as the mirror number would also get rid of this wild expression substituting and 'if'. > > if (reset && reset->cond != BLK_ZONE_COND_EMPTY) { > ASSERT(sb_zone_is_full(reset)); > @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > reset->cond = BLK_ZONE_COND_EMPTY; > reset->wp = reset->start; > } > + > + if (is_sb_offset_write_req) { And get rid of the conditional. The point of supporting both po2 and nonpo2 is to hide any implementation details to wrappers as much as possible. > + ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp); > + if (ret) > + return ret; > + } > + > } else if (ret != -ENOENT) { > /* > * For READ, we want the previous one. Move write pointer to
On 2022-05-17 14:42, David Sterba wrote: > On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote: >> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. >> These are fixed at these locations so that recovery tools can reliably >> retrieve the superblocks even if one of the mirror gets corrupted. >> >> power of 2 zone sizes align at these offsets irrespective of their >> value but non power of 2 zone sizes will not align. >> >> To make sure the first zone at mirror 1 and mirror 2 align, write zero >> operation is performed to move the write pointer of the first zone to >> the expected offset. This operation is performed only after a zone reset >> of the first zone, i.e., when the second zone that contains the sb is FULL. > > Is it a good idea to do the "write zeros", instead of a plain "set write > pointer"? I assume setting write pointer is instant, while writing > potentially hundreds of megabytes may take significiant time. As the > functions may be called from random contexts, the increased time may > become a problem. > Unfortunately it is not possible to just move the WP in zoned devices. The only alternative that I could use is to do write zeroes which are natively supported by some devices such as ZNS. It would be nice to know if someone had a better solution to this instead of doing write zeroes in zoned devices. >> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> >> --- >> fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++---- >> 1 file changed, 63 insertions(+), 5 deletions(-) >> >> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c >> index 3023c871e..805aeaa76 100644 >> --- a/fs/btrfs/zoned.c >> +++ b/fs/btrfs/zoned.c >> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) >> return 0; >> } >> >> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone, >> + int mirror, u64 *wp_ret) >> +{ >> + u64 offset = 0; >> + int ret = 0; >> + >> + ASSERT(!is_power_of_two_u64(zone->len)); >> + ASSERT(zone->wp == zone->start); >> + ASSERT(mirror != 0); > > This could simply accept 0 as the mirror offset too, the calculation is > trivial. > Ok. I will fix it up! >> + >> + switch (mirror) { >> + case 1: >> + div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT, >> + zone->len, &offset); >> + break; >> + case 2: >> + div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT, >> + zone->len, &offset); >> + break; >> + } >> + >> + ret = blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0); >> + if (ret) >> + return ret; >> + >> + /* >> + * Non po2 zone sizes will not align naturally at >> + * mirror 1 (512GB) and mirror 2 (4TB). The wp of the >> + * 1st zone in those superblock mirrors need to be >> + * moved to align at those offsets. >> + */ > > Please move this comment to the helper fill_sb_wp_offset itself, there > it's more discoverable. > Ok. >> + is_sb_offset_write_req = >> + (zones_empty || (reset_zone_nr == 0)) && mirror && >> + !is_power_of_2(zones[0].len); > > Accepting 0 as the mirror number would also get rid of this wild > expression substituting and 'if'. > >> >> if (reset && reset->cond != BLK_ZONE_COND_EMPTY) { >> ASSERT(sb_zone_is_full(reset)); >> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, >> reset->cond = BLK_ZONE_COND_EMPTY; >> reset->wp = reset->start; >> } >> + >> + if (is_sb_offset_write_req) { > > And get rid of the conditional. The point of supporting both po2 and > nonpo2 is to hide any implementation details to wrappers as much as > possible. > Alright. I will move the logic to the wrapper instead of having the conditional in this function. >> + ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp); >> + if (ret) >> + return ret; >> + } >> + >> } else if (ret != -ENOENT) { >> /* >> * For READ, we want the previous one. Move write pointer to Thanks for your comments.
On 18/05/2022 11:17, Pankaj Raghav wrote: > On 2022-05-17 14:42, David Sterba wrote: >> On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote: >>> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. >>> These are fixed at these locations so that recovery tools can reliably >>> retrieve the superblocks even if one of the mirror gets corrupted. >>> >>> power of 2 zone sizes align at these offsets irrespective of their >>> value but non power of 2 zone sizes will not align. >>> >>> To make sure the first zone at mirror 1 and mirror 2 align, write zero >>> operation is performed to move the write pointer of the first zone to >>> the expected offset. This operation is performed only after a zone reset >>> of the first zone, i.e., when the second zone that contains the sb is FULL. >> Is it a good idea to do the "write zeros", instead of a plain "set write >> pointer"? I assume setting write pointer is instant, while writing >> potentially hundreds of megabytes may take significiant time. As the >> functions may be called from random contexts, the increased time may >> become a problem. >> > Unfortunately it is not possible to just move the WP in zoned devices. > The only alternative that I could use is to do write zeroes which are > natively supported by some devices such as ZNS. It would be nice to know > if someone had a better solution to this instead of doing write zeroes > in zoned devices. > I have another question. In case we need to pad the sb zone with a write zeros and have a power fail between the write-zeros and the regular super-block write, what happens? I know this padding is only done for the backup super blocks, never the less it can happen and it can happen when the primary super block is also corrupted. AFAIU we're then trying to reach out for a backup super block, look at the write pointer and it only contains zeros but no super block, as only the write-zeros has reached the device and not the super block write. How is this situation handled? Thanks, Johannes
On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote: > Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. > These are fixed at these locations so that recovery tools can reliably > retrieve the superblocks even if one of the mirror gets corrupted. > > power of 2 zone sizes align at these offsets irrespective of their > value but non power of 2 zone sizes will not align. > > To make sure the first zone at mirror 1 and mirror 2 align, write zero > operation is performed to move the write pointer of the first zone to > the expected offset. This operation is performed only after a zone reset > of the first zone, i.e., when the second zone that contains the sb is FULL. > > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> > --- > fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 63 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c > index 3023c871e..805aeaa76 100644 > --- a/fs/btrfs/zoned.c > +++ b/fs/btrfs/zoned.c > @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) > return 0; > } > > +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone, > + int mirror, u64 *wp_ret) > +{ > + u64 offset = 0; > + int ret = 0; > + > + ASSERT(!is_power_of_two_u64(zone->len)); > + ASSERT(zone->wp == zone->start); > + ASSERT(mirror != 0); > + > + switch (mirror) { > + case 1: > + div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT, > + zone->len, &offset); > + break; > + case 2: > + div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT, > + zone->len, &offset); > + break; > + } > + > + ret = blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0); > + if (ret) > + return ret; > + > + zone->wp += offset; > + zone->cond = BLK_ZONE_COND_IMP_OPEN; > + *wp_ret = zone->wp << SECTOR_SHIFT; > + > + return 0; > +} > + > static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > - int rw, u64 *bytenr_ret) > + int rw, int mirror, u64 *bytenr_ret) > { > u64 wp; > int ret; > + bool zones_empty = false; > > if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { > *bytenr_ret = zones[0].start << SECTOR_SHIFT; > @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > if (ret != -ENOENT && ret < 0) > return ret; > > + if (ret == -ENOENT) > + zones_empty = true; > + I think, we don't need this. We need to issue the zeroout when zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) after sending ZONE_RESET if necessary. No? > if (rw == WRITE) { > struct blk_zone *reset = NULL; > + bool is_sb_offset_write_req = false; > + u32 reset_zone_nr = -1; > > - if (wp == zones[0].start << SECTOR_SHIFT) > + if (wp == zones[0].start << SECTOR_SHIFT) { > reset = &zones[0]; > - else if (wp == zones[1].start << SECTOR_SHIFT) > + reset_zone_nr = 0; > + } else if (wp == zones[1].start << SECTOR_SHIFT) { > reset = &zones[1]; > + reset_zone_nr = 1; > + } > + > + /* > + * Non po2 zone sizes will not align naturally at > + * mirror 1 (512GB) and mirror 2 (4TB). The wp of the > + * 1st zone in those superblock mirrors need to be > + * moved to align at those offsets. > + */ > + is_sb_offset_write_req = > + (zones_empty || (reset_zone_nr == 0)) && mirror && > + !is_power_of_2(zones[0].len); > > if (reset && reset->cond != BLK_ZONE_COND_EMPTY) { > ASSERT(sb_zone_is_full(reset)); > @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > reset->cond = BLK_ZONE_COND_EMPTY; > reset->wp = reset->start; > } > + > + if (is_sb_offset_write_req) { > + ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp); > + if (ret) > + return ret; > + } > + > } else if (ret != -ENOENT) { > /* > * For READ, we want the previous one. Move write pointer to > @@ -851,7 +909,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, > if (ret != BTRFS_NR_SB_LOG_ZONES) > return -EIO; > > - return sb_log_location(bdev, zones, rw, bytenr_ret); > + return sb_log_location(bdev, zones, rw, mirror, bytenr_ret); > } > > int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, > @@ -877,7 +935,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, > > return sb_log_location(device->bdev, > &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror], > - rw, bytenr_ret); > + rw, mirror, bytenr_ret); > } > > static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo, > -- > 2.25.1 >
On 5/19/22 09:57, Johannes Thumshirn wrote: >> Unfortunately it is not possible to just move the WP in zoned devices. >> The only alternative that I could use is to do write zeroes which are >> natively supported by some devices such as ZNS. It would be nice to know >> if someone had a better solution to this instead of doing write zeroes >> in zoned devices. >> > > I have another question. In case we need to pad the sb zone with a write > zeros and have a power fail between the write-zeros and the regular > super-block write, what happens? I know this padding is only done for the > backup super blocks, never the less it can happen and it can happen when > the primary super block is also corrupted. > > AFAIU we're then trying to reach out for a backup super block, look at the > write pointer and it only contains zeros but no super block, as only the > write-zeros has reached the device and not the super block write. > > How is this situation handled? > That is a very good point. I did think about this situation while adding padding to the mirror superblock with write zeroes. If the drive is **less than 4TB** and with the **primary superblock corrupted**, then it will be an issue with the situation you have described for npo2 drives. That situation is not handled here. Ofc this is not an issue when we have the second mirror at 4TB for bigger drives. Do you have some ideas in mind for this failure mode? > Thanks, > Johannes
On 5/19/22 09:59, Naohiro Aota wrote: >> + >> static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, >> - int rw, u64 *bytenr_ret) >> + int rw, int mirror, u64 *bytenr_ret) >> { >> u64 wp; >> int ret; >> + bool zones_empty = false; >> >> if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { >> *bytenr_ret = zones[0].start << SECTOR_SHIFT; >> @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, >> if (ret != -ENOENT && ret < 0) >> return ret; >> >> + if (ret == -ENOENT) >> + zones_empty = true; >> + > > I think, we don't need this. We need to issue the zeroout when > zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) after sending > ZONE_RESET if necessary. No? > Yeah. I added this extra check to cover all the cases possible. But you are right that it is enough to issue zeroout only for this condition: zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) as both the zones empty is not likely to happen. >> if (rw == WRITE) { >> struct blk_zone *reset = NULL; >> + bool is_sb_offset_write_req = false; >> + u32 reset_zone_nr = -1; >> >> - if (wp == zones[0].start << SECTOR_SHIFT) >> + if (wp == zones[0].start << SECTOR_SHIFT) { >> reset = &zones[0]; >> - else if (wp == zones[1].start << SECTOR_SHIFT) >> + reset_zone_nr = 0;
On 20/05/2022 11:07, Pankaj Raghav wrote: > On 5/19/22 09:57, Johannes Thumshirn wrote: >>> Unfortunately it is not possible to just move the WP in zoned devices. >>> The only alternative that I could use is to do write zeroes which are >>> natively supported by some devices such as ZNS. It would be nice to know >>> if someone had a better solution to this instead of doing write zeroes >>> in zoned devices. >>> >> >> I have another question. In case we need to pad the sb zone with a write >> zeros and have a power fail between the write-zeros and the regular >> super-block write, what happens? I know this padding is only done for the >> backup super blocks, never the less it can happen and it can happen when >> the primary super block is also corrupted. >> >> AFAIU we're then trying to reach out for a backup super block, look at the >> write pointer and it only contains zeros but no super block, as only the >> write-zeros has reached the device and not the super block write. >> >> How is this situation handled? >> > That is a very good point. I did think about this situation while adding > padding to the mirror superblock with write zeroes. If the drive is > **less than 4TB** and with the **primary superblock corrupted**, then it > will be an issue with the situation you have described for npo2 drives. > That situation is not handled here. Ofc this is not an issue when we > have the second mirror at 4TB for bigger drives. Do you have some ideas > in mind for this failure mode? The only idea I have for this is creating a bounce buffer, write the padding and the super-block into the buffer and then submit it. But that's too ugly to live. And it would involve changing non-zoned super-block writing code, which I think is way to risky.
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 3023c871e..805aeaa76 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) return 0; } +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone, + int mirror, u64 *wp_ret) +{ + u64 offset = 0; + int ret = 0; + + ASSERT(!is_power_of_two_u64(zone->len)); + ASSERT(zone->wp == zone->start); + ASSERT(mirror != 0); + + switch (mirror) { + case 1: + div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT, + zone->len, &offset); + break; + case 2: + div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT, + zone->len, &offset); + break; + } + + ret = blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0); + if (ret) + return ret; + + zone->wp += offset; + zone->cond = BLK_ZONE_COND_IMP_OPEN; + *wp_ret = zone->wp << SECTOR_SHIFT; + + return 0; +} + static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, - int rw, u64 *bytenr_ret) + int rw, int mirror, u64 *bytenr_ret) { u64 wp; int ret; + bool zones_empty = false; if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { *bytenr_ret = zones[0].start << SECTOR_SHIFT; @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, if (ret != -ENOENT && ret < 0) return ret; + if (ret == -ENOENT) + zones_empty = true; + if (rw == WRITE) { struct blk_zone *reset = NULL; + bool is_sb_offset_write_req = false; + u32 reset_zone_nr = -1; - if (wp == zones[0].start << SECTOR_SHIFT) + if (wp == zones[0].start << SECTOR_SHIFT) { reset = &zones[0]; - else if (wp == zones[1].start << SECTOR_SHIFT) + reset_zone_nr = 0; + } else if (wp == zones[1].start << SECTOR_SHIFT) { reset = &zones[1]; + reset_zone_nr = 1; + } + + /* + * Non po2 zone sizes will not align naturally at + * mirror 1 (512GB) and mirror 2 (4TB). The wp of the + * 1st zone in those superblock mirrors need to be + * moved to align at those offsets. + */ + is_sb_offset_write_req = + (zones_empty || (reset_zone_nr == 0)) && mirror && + !is_power_of_2(zones[0].len); if (reset && reset->cond != BLK_ZONE_COND_EMPTY) { ASSERT(sb_zone_is_full(reset)); @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, reset->cond = BLK_ZONE_COND_EMPTY; reset->wp = reset->start; } + + if (is_sb_offset_write_req) { + ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp); + if (ret) + return ret; + } + } else if (ret != -ENOENT) { /* * For READ, we want the previous one. Move write pointer to @@ -851,7 +909,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, if (ret != BTRFS_NR_SB_LOG_ZONES) return -EIO; - return sb_log_location(bdev, zones, rw, bytenr_ret); + return sb_log_location(bdev, zones, rw, mirror, bytenr_ret); } int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, @@ -877,7 +935,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, return sb_log_location(device->bdev, &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror], - rw, bytenr_ret); + rw, mirror, bytenr_ret); } static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. These are fixed at these locations so that recovery tools can reliably retrieve the superblocks even if one of the mirror gets corrupted. power of 2 zone sizes align at these offsets irrespective of their value but non power of 2 zone sizes will not align. To make sure the first zone at mirror 1 and mirror 2 align, write zero operation is performed to move the write pointer of the first zone to the expected offset. This operation is performed only after a zone reset of the first zone, i.e., when the second zone that contains the sb is FULL. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> --- fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 63 insertions(+), 5 deletions(-)