Message ID | cover.1612433345.git.naohiro.aota@wdc.com (mailing list archive) |
---|---|
Series | btrfs: zoned block device support |
On Thu, Feb 04, 2021 at 07:21:39PM +0900, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs. Some of the patches
> in the previous series are already merged as preparation patches.

Moved from for-next to misc-next.

> * Log-structured superblock
>
> Superblock (and its copies) is the only data structure in btrfs which
> has a fixed location on a device. Since we cannot overwrite in a
> sequential write required zone, we cannot place superblock in the
> zone.
>
> This series implements superblock log writing. It uses two zones as a
> circular buffer to write updated superblocks. Once the first zone is filled
> up, start writing into the second zone. The first zone will be reset once
> both zones are filled. We can determine the position of the latest
> superblock by reading the write pointer information from a device.

About that, in this patchset it's still leaving superblock at the fixed
zone number while we want it at a fixed location, spanning 2 zones
regardless of their size.

On 10/02/2021 21:02, David Sterba wrote:
> About that, in this patchset it's still leaving superblock at the fixed
> zone number while we want it at a fixed location, spanning 2 zones
> regardless of their size.

We'll always need 2 zones or otherwise we won't be powercut safe.

On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
> On 10/02/2021 21:02, David Sterba wrote:
> > About that, in this patchset it's still leaving superblock at the fixed
> > zone number while we want it at a fixed location, spanning 2 zones
> > regardless of their size.
>
> We'll always need 2 zones or otherwise we won't be powercut safe.

Yes we do, that hasn't changed.

On 11/02/2021 16:21, David Sterba wrote:
> On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
>> We'll always need 2 zones or otherwise we won't be powercut safe.
>
> Yes we do, that hasn't changed.

OK, that I don't understand. With the log-structured superblocks on a zoned
filesystem, we're writing a new superblock until the 1st zone is filled.
Then we advance to the second zone. As soon as we wrote a superblock to
the second zone we can reset the first.

If we only use one zone, we would need to write until its end, reset and
start writing again from the beginning. But if a powercut happens between
reset and first write after the reset, we end up with no superblock.

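The power-cut argument above can be checked with a small simulation. This is only a sketch under stated assumptions: the `Zone` class, the four-slot zone capacity, and the writer functions are hypothetical names made up for illustration, not btrfs code. Each writer yields after every device operation, so the checker can "cut power" between any two operations and see whether a superblock is still readable.

```python
SB_PER_ZONE = 4  # hypothetical: superblock slots per zone, tiny for illustration


class Zone:
    """A sequential-write-required zone: append at the write pointer or reset."""

    def __init__(self):
        self.blocks = []  # superblock generations written so far

    def full(self):
        return len(self.blocks) == SB_PER_ZONE

    def append(self, generation):
        assert not self.full(), "cannot write past the end of a zone"
        self.blocks.append(generation)

    def reset(self):
        self.blocks = []


def latest(zones):
    """Newest superblock generation across all zones, or None if none exists."""
    found = [z.blocks[-1] for z in zones if z.blocks]
    return max(found) if found else None


def write_two_zones(zones, generations):
    """Two-zone ring: reset the old zone only after the other holds a newer copy."""
    cur = 0
    for gen in generations:
        if zones[cur].full():
            cur ^= 1  # the other zone was reset earlier, so it is empty
        zones[cur].append(gen)
        yield
        other = zones[cur ^ 1]
        if other.blocks:  # it only holds older superblocks now
            other.reset()
            yield


def write_one_zone(zones, generations):
    """Single zone: it must be reset before the next superblock can be written."""
    zone = zones[0]
    for gen in generations:
        if zone.full():
            zone.reset()
            yield  # a power cut here leaves no superblock at all
        zone.append(gen)
        yield


def survives_any_power_cut(writer, nr_zones, nr_writes):
    """Stop after every device operation and check a superblock is readable."""
    zones = [Zone() for _ in range(nr_zones)]
    for _ in writer(zones, range(1, nr_writes + 1)):
        if latest(zones) is None:
            return False
    return True


if __name__ == "__main__":
    print("two zones:", survives_any_power_cut(write_two_zones, 2, 20))
    print("one zone: ", survives_any_power_cut(write_one_zone, 1, 20))
```

The two-zone writer follows the ordering described in this message (reset the first zone once the second holds a superblock); resetting only when both zones are full, as in the cover letter, keeps the same property because the other zone still holds the newest copy at reset time.
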
On Thu, Feb 11, 2021 at 03:26:04PM +0000, Johannes Thumshirn wrote:
> OK, that I don't understand. With the log-structured superblocks on a zoned
> filesystem, we're writing a new superblock until the 1st zone is filled.
> Then we advance to the second zone. As soon as we wrote a superblock to
> the second zone we can reset the first.
> If we only use one zone,

No, that can't work and nobody suggests that.

> we would need to write until its end, reset and
> start writing again from the beginning. But if a powercut happens between
> reset and first write after the reset, we end up with no superblock.

What I'm saying, and what we discussed on Slack in December, is that we
can't fix the zone number for the 1st and 2nd copy of superblock like it
is now in sb_zone_number.

The primary superblock must be there for any reference and to actually
let the tools learn about the incompat bits.

The 1st copy is now fixed zone 16, which depends on the zone size. The
idea is to define the superblock offsets to start at given offsets,
where the ring buffer has the two consecutive zones, regardless of their
size.

primary:  0
1st copy: 16G
2nd copy: 256G

Due to the variability of the zones in future devices, we'll reserve a
space at the superblock interval, assuming the zone sizes can grow up to
several gigabytes. Current working number is 1G, with some safety margin
the reserved ranges would be (eg. for a 4G zone size):

primary:  0 up to 8G
1st copy: 16G up to 24G
2nd copy: 256G up to 262G

It is wasteful but we want to be future proof and expecting disk sizes
from tens of terabytes to a hundred terabytes, it's not significant
loss of space.

If the zone sizes can be expected higher than 4G, the 1st copy can be
defined at 64G, that would leave us some margin until somebody thinks
that 32G zones are a great idea.

On 11/02/2021 16:48, David Sterba wrote:
> The 1st copy is now fixed zone 16, which depends on the zone size. The
> idea is to define the superblock offsets to start at given offsets,
> where the ring buffer has the two consecutive zones, regardless of their
> size.
>
> primary:  0
> 1st copy: 16G
> 2nd copy: 256G
>
> Due to the variability of the zones in future devices, we'll reserve a
> space at the superblock interval, assuming the zone sizes can grow up to
> several gigabytes. Current working number is 1G, with some safety margin
> the reserved ranges would be (eg. for a 4G zone size):
>
> primary:  0 up to 8G
> 1st copy: 16G up to 24G
> 2nd copy: 256G up to 262G
>
> If the zone sizes can be expected higher than 4G, the 1st copy can be
> defined at 64G, that would leave us some margin until somebody thinks
> that 32G zones are a great idea.

We've been talking about this today and our proposal would be as follows:

Primary SB is two zones starting at LBA 0
Secondary SB the two zones starting with the zone that contains the address 16G
Third SB the two zones starting with the zone that contains the address 256G
or not present if the disk is too small.

This would make it safe until a zone size of 8GB and we'd have adjacent
superblock log zones then.

How does that sound?

Byte,
Johannes

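To make the proposal concrete, here is a minimal sketch of the arithmetic it implies: each log starts at the zone containing a fixed byte address and spans that zone plus the next. The function name `sb_log_range` and the `SB_ADDRESSES` table are hypothetical, chosen for illustration only; the addresses themselves (0, 16G, 256G) come from the proposal.

```python
GiB = 1 << 30
MiB = 1 << 20

# Byte addresses named in the proposal above.
SB_ADDRESSES = {"primary": 0, "secondary": 16 * GiB, "third": 256 * GiB}


def sb_log_range(addr, zone_size):
    """Return (first_zone, start_byte, end_byte) of the two-zone log for addr.

    The log starts at the zone that contains addr and spans that zone and
    the next one. zone_size is in bytes; end_byte is exclusive.
    """
    first_zone = addr // zone_size
    start = first_zone * zone_size
    return first_zone, start, start + 2 * zone_size


if __name__ == "__main__":
    for zone_size in (256 * MiB, 4 * GiB, 8 * GiB):
        print(f"zone size {zone_size // MiB} MiB")
        for name, addr in SB_ADDRESSES.items():
            zone, start, end = sb_log_range(addr, zone_size)
            print(f"  {name:9s} zones {zone},{zone + 1}  "
                  f"bytes [{start / GiB:g}G, {end / GiB:g}G)")
```

With an 8G zone size the primary log ends exactly where the secondary log begins, which matches the "safe until a zone size of 8GB" and "adjacent superblock log zones" remarks.
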
On Mon, Feb 15, 2021 at 04:58:05PM +0000, Johannes Thumshirn wrote:
> We've been talking about this today and our proposal would be as follows:
> Primary SB is two zones starting at LBA 0
> Secondary SB the two zones starting with the zone that contains the address 16G
> Third SB the two zones starting with the zone that contains the address 256G
> or not present if the disk is too small.
>
> This would make it safe until a zone size of 8GB and we'd have adjacent
> superblock log zones then.
>
> How does that sound?

That we're on the same page regarding the superblock writes.

On Mon, Feb 15, 2021 at 04:58:05PM +0000, Johannes Thumshirn wrote:
> We've been talking about this today and our proposal would be as follows:
> Primary SB is two zones starting at LBA 0
> Secondary SB the two zones starting with the zone that contains the address 16G

For the secondary SB on a file system < 16GB, what do you think of
using the last two zones (or zones #2, #3 will do)? Then we can
ensure we have two SB copies even on such a file system.

> Third SB the two zones starting with the zone that contains the address 256G
> or not present if the disk is too small.
>
> This would make it safe until a zone size of 8GB and we'd have adjacent
> superblock log zones then.
>
> How does that sound?

On Tue, Feb 16, 2021 at 01:33:28PM +0900, Naohiro Aota wrote:
> For the secondary SB on a file system < 16GB, what do you think of
> using the last two zones (or zones #2, #3 will do)? Then we can
> ensure we have two SB copies even on such a file system.

For real hardware I think this is not relevant but for the emulated mode
we need to deal with that case. The reserved size is wasteful and this
will become noticeable for devices < 16G but I'd rather keep the logic
simple and not care much about this corner case.

So, the superblock range would be reserved and if there's not enough to
store the secondary sb, then don't.

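The "if there's not enough room, then don't" rule can be sketched as a simple presence check. This is an illustration under assumptions, not the actual implementation: `present_copies` is a hypothetical helper, and it assumes a copy is skipped whenever its two log zones do not both fit on the device.

```python
GiB = 1 << 30
MiB = 1 << 20

# Assumed byte addresses of the three superblock logs, as proposed earlier
# in the thread.
SB_ADDRESSES = (("primary", 0), ("secondary", 16 * GiB), ("third", 256 * GiB))


def present_copies(zone_size, device_size):
    """List the superblock copies whose two log zones fit on the device."""
    nr_zones = device_size // zone_size
    present = []
    for name, addr in SB_ADDRESSES:
        first_zone = addr // zone_size
        # Both the zone containing addr and the following zone must exist.
        if first_zone + 2 <= nr_zones:
            present.append(name)
    return present


if __name__ == "__main__":
    for size in (8 * GiB, 32 * GiB, 1024 * GiB):
        print(f"{size // GiB}G device:", present_copies(256 * MiB, size))
```

A device smaller than 16G then carries only the primary log, which is the corner case accepted above in exchange for simpler logic.
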
On Tue, Feb 16, 2021 at 12:46:11PM +0100, David Sterba wrote:
> For real hardware I think this is not relevant but for the emulated mode
> we need to deal with that case. The reserved size is wasteful and this
> will become noticeable for devices < 16G but I'd rather keep the logic
> simple and not care much about this corner case. So, the superblock
> range would be reserved and if there's not enough to store the secondary
> sb, then don't.

Sure. That works. I'm running xfstests with these new SB
locations. Once it passes, I'll post the patch.

One corner case is left. What should we do with a zone size > 8G? In this
case, the primary SB zones and the 1st copy SB zones overlap. I know
this is unrealistic for real hardware, but you can still create such a
device with null_blk.

1) Use the following zones (zones #2, #3) as the primary SB zones
2) Do not write the primary SBs
3) Reject to mkfs

To keep the logic simple, method #3 would be appropriate here?

Technically, all the log zones overlap with zone size > 128 GB. I'm
considering rejecting mkfs in this insane case anyway.

On Mon, Feb 22, 2021 at 04:50:43PM +0900, Naohiro Aota wrote:
> > For real hardware I think this is not relevant but for the emulated mode
> > we need to deal with that case. The reserved size is wasteful and this
> > will become noticeable for devices < 16G but I'd rather keep the logic
> > simple and not care much about this corner case. So, the superblock
> > range would be reserved and if there's not enough to store the secondary
> > sb, then don't.
>
> Sure. That works. I'm running xfstests with these new SB
> locations. Once it passes, I'll post the patch.
>
> One corner case is left. What should we do with a zone size > 8G? In this
> case, the primary SB zones and the 1st copy SB zones overlap. I know
> this is unrealistic for real hardware, but you can still create such a
> device with null_blk.
>
> 1) Use the following zones (zones #2, #3) as the primary SB zones
> 2) Do not write the primary SBs
> 3) Reject to mkfs
>
> To keep the logic simple, method #3 would be appropriate here?
>
> Technically, all the log zones overlap with zone size > 128 GB. I'm
> considering rejecting mkfs in this insane case anyway.

The 8G zone size idea is to buy us some time to support future hardware.
Once this won't suffice we'll add an incompat bit like BIGZONES that will
allow larger zone sizes. At that time we'll probably have a better idea
about an exact number. So it's #3.
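A short sketch of why 8G is the cutoff and what option #3 amounts to. The names `logs_overlap`, `check_zone_size` and `MAX_ZONE_SIZE` are hypothetical and only illustrate the arithmetic discussed in the thread; the real mkfs check may look different.

```python
GiB = 1 << 30
MiB = 1 << 20

MAX_ZONE_SIZE = 8 * GiB        # limit discussed in the thread
SECONDARY_ADDR = 16 * GiB      # address whose zone starts the 1st copy log


def logs_overlap(zone_size):
    """True if the primary log (zones 0 and 1) reaches into the 1st copy log."""
    primary_end = 2 * zone_size
    secondary_start = (SECONDARY_ADDR // zone_size) * zone_size
    return secondary_start < primary_end


def check_zone_size(zone_size):
    """Option #3: refuse to create the filesystem when the zones are too large."""
    if zone_size > MAX_ZONE_SIZE:
        raise ValueError(
            f"zone size {zone_size} exceeds {MAX_ZONE_SIZE}: "
            "superblock log zones would overlap"
        )


if __name__ == "__main__":
    for zone_size in (256 * MiB, 8 * GiB, 16 * GiB):
        print(f"{zone_size // MiB} MiB zones overlap: {logs_overlap(zone_size)}")
```

The overlap test and the size limit agree: the zone containing 16G has index at least 2 exactly when the zone size is at most 8G, so rejecting larger zones at mkfs time rules out every overlapping layout without extra special cases.
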