Message ID | fe07f3ca7b17b6739cff8ab228d57bdbea0c447b.1614760899.git.naohiro.aota@wdc.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fixes for zoned mode | expand |
On Wed, Mar 03, 2021 at 05:55:47PM +0900, Naohiro Aota wrote: > This commit moves the location of superblock logging zones basing on the > fixed address instead of the fixed zone number. > > By locating the superblock zones using fixed addresses, we can scan a > dumped file system image without the zone information. And, no drawbacks > exist. > > The following zones are reserved as the circular buffer on zoned btrfs. > - The primary superblock: zone at LBA 0 and the next zone > - The first copy: zone at LBA 16G and the next zone > - The second copy: zone at LBA 256G and the next zone > > If the location of the zones are outside of disk, we don't record the > superblock copy. > > The addresses are much larger than the usual superblock copies locations. > The copies' locations are decided to support possible future larger zone > size, not to overlap the log zones. We support zone size up to 8GB. One thing I don't see is that the reserved space for superblock is fixed regardless of the actual device zone size. In exclude_super_stripes. 0-16G for primary ... and now what, 16G would be the next copy thus reserving 16 up to 32G So the 64G offset for the 1st copy is more suitable: 0 - 16G primary 64G - 80G 1st copy 256G - 272G 2nd copy This still does not sound great because it just builds on the original offsets from 10 years ago. The device sizes are expected to be in terabytes but all the superblocks are in the first terabyte. What if we do that like 0 - 16G 1T - 1T+16G 8T - 8T+16G The HDD sizes start somewhere at 4T so the first two copies cover the small sizes, larger have all three copies. But we could go wild even more, like 0/4T/16T. I'm not sure if the capacities for non-HDD are going to be also that large, I could not find anything specific, the only existing ZNS is some DC ZN540 but no details. We need to get this right (best effort), so I'll postpone this patch until it's all sorted.
On 2021/03/05 0:20, David Sterba wrote: > On Wed, Mar 03, 2021 at 05:55:47PM +0900, Naohiro Aota wrote: >> This commit moves the location of superblock logging zones basing on the >> fixed address instead of the fixed zone number. >> >> By locating the superblock zones using fixed addresses, we can scan a >> dumped file system image without the zone information. And, no drawbacks >> exist. >> >> The following zones are reserved as the circular buffer on zoned btrfs. >> - The primary superblock: zone at LBA 0 and the next zone >> - The first copy: zone at LBA 16G and the next zone >> - The second copy: zone at LBA 256G and the next zone >> >> If the location of the zones are outside of disk, we don't record the >> superblock copy. >> >> The addresses are much larger than the usual superblock copies locations. >> The copies' locations are decided to support possible future larger zone >> size, not to overlap the log zones. We support zone size up to 8GB. > > One thing I don't see is that the reserved space for superblock is fixed > regardless of the actual device zone size. In exclude_super_stripes. > > 0-16G for primary > ... and now what, 16G would be the next copy thus reserving 16 up to 32G > > So the 64G offset for the 1st copy is more suitable: > > 0 - 16G primary > 64G - 80G 1st copy > 256G - 272G 2nd copy > > This still does not sound great because it just builds on the original > offsets from 10 years ago. The device sizes are expected to be in > terabytes but all the superblocks are in the first terabyte. I do not see an issue with that. For HDDs, one would ideally want each copy under a different head but determining which head serves which LBA is not possible with standard commands. LBAs are generally distributed initially across one head (platter side) up to one or more zones, then goes on the next head backward (other side of the same platter), and on to the following head/platter. So distribution is first vertical then goes inward (and when reaching middle of the platter, everything starts again from the spindle outward). 0/64G/256G likely gives you different heads. No way to tell for certain though. > What if we do that like > > 0 - 16G > 1T - 1T+16G > 8T - 8T+16G > > The HDD sizes start somewhere at 4T so the first two copies cover the > small sizes, larger have all three copies. But we could go wild even > more, like 0/4T/16T. That would work for HDDs. We are at 20T with SMR now and the lowest SMR capacity is 14T. For regular disks, yes, 4T is kind of the starting point for enterprise drives. Consumer/NAS drives can start lower though, at 1T or 2T. To be able able to cover all cases nicely, I would suggest not exceeding 1T for the SB copies. > I'm not sure if the capacities for non-HDD are going to be also that > large, I could not find anything specific, the only existing ZNS is some > DC ZN540 but no details. That one is sampling at 2T capacity now. This likely will be a lower boundary and higher capacities will be available. Not sure yet up to what point. Likely, different models at 4T, 6T, 8T, 16T... will be available. So kind of the same story as for HDDs. Keeping the SB copies within the first TB will allow supporting all models. So I kind of like your initial suggestion: 0 - 16G primary 64G - 80G 1st copy 256G - 272G 2nd copy And we could even do: 0 - 16G primary 128G - 160G 1st copy 512G - 544G 2nd copy Which would also safely allow larger zone sizes beyond 8G for ZNS too. (I do not think this will happen anytime soon though, but with these values, we are safer). > > We need to get this right (best effort), so I'll postpone this patch > until it's all sorted. >
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 1324bb6c3946..b8f50dc9fbb0 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -24,6 +24,15 @@ /* Number of superblock log zones */ #define BTRFS_NR_SB_LOG_ZONES 2 +/* Location of superblock log zones */ +#define BTRFS_FIRST_SB_LOG_ZONE SZ_16G +#define BTRFS_SECOND_SB_LOG_ZONE (256ULL * SZ_1G) +#define BTRFS_FIRST_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_FIRST_SB_LOG_ZONE) +#define BTRFS_SECOND_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_SECOND_SB_LOG_ZONE) + +/* Max size of supported zone size */ +#define BTRFS_MAX_ZONE_SIZE SZ_8G + static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void *data) { struct blk_zone *zones = data; @@ -112,10 +121,9 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones, /* * The following zones are reserved as the circular buffer on ZONED btrfs. - * - The primary superblock: zones 0 and 1 - * - The first copy: zones 16 and 17 - * - The second copy: zones 1024 or zone at 256GB which is minimum, and - * the following one + * - The primary superblock: zone at LBA 0 and the next zone + * - The first copy: zone at LBA 16G and the next zone + * - The second copy: zone at LBA 256G and the next zone */ static inline u32 sb_zone_number(int shift, int mirror) { @@ -123,8 +131,8 @@ static inline u32 sb_zone_number(int shift, int mirror) switch (mirror) { case 0: return 0; - case 1: return 16; - case 2: return min_t(u64, btrfs_sb_offset(mirror) >> shift, 1024); + case 1: return 1 << (BTRFS_FIRST_SB_LOG_ZONE_SHIFT - shift); + case 2: return 1 << (BTRFS_SECOND_SB_LOG_ZONE_SHIFT - shift); } return 0; @@ -300,10 +308,25 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device) zone_sectors = bdev_zone_sectors(bdev); } - nr_sectors = bdev_nr_sectors(bdev); /* Check if it's power of 2 (see is_power_of_2) */ ASSERT(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0); zone_info->zone_size = zone_sectors << SECTOR_SHIFT; + + /* + * We must reject devices with a zone size larger than 8GB to avoid + * overlap between the super block primary and first copy fixed log + * locations. + */ + if (zone_info->zone_size > BTRFS_MAX_ZONE_SIZE) { + btrfs_err_in_rcu(fs_info, + "zoned: %s: zone size %llu is too large", + rcu_str_deref(device->name), + zone_info->zone_size); + ret = -EINVAL; + goto out; + } + + nr_sectors = bdev_nr_sectors(bdev); zone_info->zone_size_shift = ilog2(zone_info->zone_size); zone_info->max_zone_append_size = (u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT;
This commit moves the location of superblock logging zones basing on the fixed address instead of the fixed zone number. By locating the superblock zones using fixed addresses, we can scan a dumped file system image without the zone information. And, no drawbacks exist. The following zones are reserved as the circular buffer on zoned btrfs. - The primary superblock: zone at LBA 0 and the next zone - The first copy: zone at LBA 16G and the next zone - The second copy: zone at LBA 256G and the next zone If the location of the zones are outside of disk, we don't record the superblock copy. The addresses are much larger than the usual superblock copies locations. The copies' locations are decided to support possible future larger zone size, not to overlap the log zones. We support zone size up to 8GB. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> --- fs/btrfs/zoned.c | 37 ++++++++++++++++++++++++++++++------- 1 file changed, 30 insertions(+), 7 deletions(-)