Message ID: 20150317132514.GA52950@bfoster.bfoster (mailing list archive)
State: New, archived
On Tue, Mar 17, 2015 at 09:25:15AM -0400, Brian Foster wrote:
> On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> > Hi Folks,
> >
> > As I told many people at Vault last week, I wrote a document outlining how we should modify the on-disk structures of XFS to support host aware SMR drives on the (long) plane flights to Boston.
> >
> > TL;DR: not a lot of change to the XFS kernel code is required, no specific SMR awareness is needed by the kernel code. Only relatively minor tweaks to the on-disk format will be needed and most of the userspace changes are relatively straight forward, too.
> >
> > The source for that document can be found in this git tree here:
> >
> > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> >
> > in the file design/xfs-smr-structure.asciidoc. Alternatively, pull it straight from cgit:
> >
> > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> >
> > Or there is a pdf version built from the current TOT on the xfs.org wiki here:
> >
> > http://xfs.org/index.php/Host_Aware_SMR_architecture
> >
> > Happy reading!
>
> Hi Dave,
>
> Thanks for sharing this. Here are some thoughts/notes/questions/etc. from a first pass. This is mostly XFS oriented and I'll try to break it down by section.
>
> I've also attached a diff to the original doc with some typo fixes and whatnot. Feel free to just fold it into the original doc if you like.
>
> == Concepts
>
> - With regard to the assumption that the CMR region is not spread around the drive, I saw at least one presentation at Vault that suggested otherwise (the skylight one iirc). That said, it was theoretical and based on a drive-managed drive. It is in no way clear to me whether that is something to expect for host-managed drives.

AFAIK, the CMR region is contiguous. The skylight paper spells it out pretty clearly that it is a contiguous 20-25GB region on the outer edge of the seagate drives. Other vendors I've spoken to indicate that the region in host managed drives is also contiguous and at the outer edge, and some vendors have indicated they have much more of it than the seagate drives analysed in the skylight paper.

If it is not contiguous, then we can use DM to make that problem go away. i.e. use DM to stitch the CMR zones back together into a contiguous LBA region. Then we can size AGs in the data device to map to the size of the individual disjoint CMR regions, and we have a neat, well aligned, isolated solution to the problem without having to modify the XFS code at all.

> - It isn't clear to me here and in other places whether you propose to use the CMR regions as a "metadata device" or require some other randomly writeable storage to serve that purpose.

CMR as the "metadata device" if there is nothing else we can use. I'd really like to see hybrid drives with the "CMR" zone being the flash region in the drive....

> == Journal modifications
>
> - The tail->head log zeroing behavior on mount comes to mind here. Maybe the writes are still sequential and it's not a problem, but we should consider that with the proposition. It's probably not critical as we do have the out of using the cmr region here (as noted). I assume we can also cleanly relocate the log without breaking anything else (e.g., the current location is performance oriented rather than architectural, yes?).

We place the log anywhere in the data device LBA space. You might want to go look up what L_AGNUM does in mkfs. :)
And if we can use the CMR region for the log, then that's what we'll do - "no modifications required" is always the best solution.

> == Data zones
>
> - Will this actually support data overwrite or will that return error?

We'll support data overwrite. xfs_get_blocks() will need to detect overwrite....

> - TBH, I've never looked at realtime functionality so I don't grok the high level approach yet. I'm wondering... have you considered a design based on reflink and copy-on-write?

Yes, I have. Complex, invasive and we don't even have basic reflink infrastructure yet. Such a solution pushes us a couple of years out, as opposed to having something before the end of the year...

> I know the current plan is to disentangle the reflink tree from the rmap tree, but my understanding is the reflink tree is still in the pipeline. Assuming we have that functionality, it seems like there's potential to use it to overcome some of the overwrite complexity.

There isn't much overwrite complexity - it's simply clearing bits in a zone bitmap to indicate free space, allocating new blocks and then rewriting bmbt extent records. It's fairly simple, really ;)

> Just as a handwaving example, use the per-zone inode to hold an additional reference to each allocated extent in the zone, thus all writes are handled as if the file had a clone. If the only reference drops to the zoneino, the extent is freed and thus stale wrt to the zone cleaner logic.
>
> I suspect we would still need an allocation strategy, but I expect we're going to have zone metadata regardless that will help deal with that. Note that the current sparse inode proposal includes an allocation range limit mechanism (for the inode record overlaps an ag boundary case), which could potentially be used/extended to build something on top of the existing allocator for zone allocation (e.g., if we had some kind of zone record with the write pointer that indicated where it's safe to allocate from). Again, just thinking out loud here.

Yup, but the bitmap allocator doesn't have support for many of the btree allocator controls. It's a simple, fast, deterministic allocator, and we only need it to track freed space in the zones as all allocation from the zones is going to be sequential...

> == Zone cleaner
>
> - Paragraph 3 - "fixpel?" I would have just fixed this, but I can't figure out what it's supposed to say. ;)
>
> - The idea sounds sane, but the dependency on userspace for a critical fs mechanism sounds a bit scary to be honest. Is in kernel allocation going to throttle/depend on background work in the userspace cleaner in the event of low writeable free space?

Of course. ENOSPC always throttles ;)

I expect the cleaner will work zone group at a time; locking new, non-cleaner based allocations out of the zone group while it cleans zones. This means the cleaner should always be able to make progress w.r.t. ENOSPC - it gets triggered on a zone group before it runs out of clean zones for freespace defrag purposes....

I also expect that the cleaner won't be used in many bulk storage applications as data is never deleted. I also expect that XFS-SMR won't be used for general purpose storage applications - that's what solid state storage will be used for - and so the cleaner is not something we need to focus a lot of time and effort on.

And the thing that distributed storage guys should love: if we put the cleaner in userspace, then they can *write their own cleaners* that are customised to their own storage algorithms.
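As a rough sketch of the cleaner policy described above - walk a zone group, find zones that are mostly stale, relocate the remaining live data and reset them - consider the following userspace-style C outline. The zone_summary layout and the read_zone_summary(), move_live_extents() and reset_zone() helpers are hypothetical placeholders standing in for the proposed /.zones/ interface and an xfs_fsr-style data mover; none of them exist today, so this compiles but will not link as-is.

/*
 * Minimal sketch of a userspace zone cleaner policy. All structures and
 * helpers below are hypothetical placeholders, not an existing XFS or
 * libxfs interface.
 */
#include <stdbool.h>
#include <stdint.h>

struct zone_summary {
        uint64_t        zs_total_blocks;        /* blocks covered by the zone */
        uint64_t        zs_live_blocks;         /* blocks still referenced */
};

/* Hypothetical: read the summary from the zone's /.zones/ inode. */
extern int read_zone_summary(unsigned int zone, struct zone_summary *zs);
/* Hypothetical: xfs_fsr-style relocation of all live extents in a zone. */
extern int move_live_extents(unsigned int zone);
/* Hypothetical: reset the zone write pointer once the zone holds no live data. */
extern int reset_zone(unsigned int zone);

/* Policy knob: clean once at least 75% of the zone's blocks are stale. */
static bool zone_needs_cleaning(const struct zone_summary *zs)
{
        return zs->zs_live_blocks * 4 <= zs->zs_total_blocks;
}

/*
 * Work one zone group at a time. A real cleaner would first lock
 * non-cleaner allocations out of the zone group, as described above,
 * so that it can always make forward progress towards clean zones.
 */
void clean_zone_group(unsigned int first_zone, unsigned int nr_zones)
{
        for (unsigned int z = first_zone; z < first_zone + nr_zones; z++) {
                struct zone_summary zs;

                if (read_zone_summary(z, &zs) < 0)
                        continue;
                if (!zone_needs_cleaning(&zs))
                        continue;
                if (move_live_extents(z) == 0)
                        reset_zone(z);  /* all live data moved; zone reusable */
        }
}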
> What if that userspace thing dies, etc.? I suppose an implementation with as much mechanism in libxfs as possible allows us greatest flexibility to go in either direction here.

If the cleaner dies or can't make progress, we ENOSPC. Whether the cleaner is in kernel or userspace is irrelevant to how we handle such cases.

> - I'm also wondering how much real overlap there is in xfs_fsr (another thing I haven't really looked at :) beyond that it calls swapext. E.g., cleaning a zone sounds like it must map back to N files that could have allocated extents in the zone vs. considering individual files for defragmentation, fragmentation of the parent file may not be as much of a consideration as resetting zones, etc. It sounds like a separate tool might be warranted, even if there is code to steal from fsr. :)

As I implied above, zone cleaning is addressing exactly the same problem as we are currently working on in xfs_fsr: defragmenting free space.

> == Reverse mapping btrees
>
> - This is something I still need to grok, perhaps just because the rmap code isn't available yet. But I'll note that this does seem like another bit that could be unnecessary if we could get away with using the traditional allocator.
>
> == Mkfs
>
> - We have references to the "metadata device" as well as random write regions. Similar to my question above, is there an expectation of a separate physical metadata device or is that terminology for the random write regions?

"metadata device" == "data device" == "CMR" == "random write region"

> Finally, some general/summary notes:
>
> - Some kind of data structure outline would eventually make a nice addition to this document. I understand it's probably too early yet, but we are talking about new per-zone inodes, new and interesting relationships between AGs and zones (?), etc. Fine grained detail is not required, but an outline or visual that describes the high-level mappings goes a long way to facilitate reasoning about the design.

Sure, a plane flight is not long enough to do this. Future revisions, as the structure is clarified.

> - A big question I had (and something that is touched on down thread wrt to embedded flash) is whether the random write zones are runtime configurable. If so, couldn't this facilitate use of existing AG metadata (now that I think of it, it's not clear to me whether the realtime mechanism excludes or coexists with AGs)?

the "realtime device" contains only user data. It contains no filesystem metadata at all. That separation of user data and filesystem metadata is what makes it so appealing for supporting SMR devices....

> IOW, we obviously need this kind of space for inodes, dirs, xattrs, btrees, etc. regardless. It would be interesting if we had the added flexibility to align it with AGs.

I'm trying to keep the solution as simple as possible. No alignment, single whole disk only, metadata in the "data device" on CMR and user data in "real time" zones on SMR.

> diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
> index dd959ab..2fea88f 100644

Oh, there's a patch. Thanks! ;)

Cheers,

Dave.
On Wed, Mar 18, 2015 at 08:28:35AM +1100, Dave Chinner wrote:
> On Tue, Mar 17, 2015 at 09:25:15AM -0400, Brian Foster wrote:
> > On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> > > Hi Folks,
> > >
> > > As I told many people at Vault last week, I wrote a document outlining how we should modify the on-disk structures of XFS to support host aware SMR drives on the (long) plane flights to Boston.
> > >
> > > TL;DR: not a lot of change to the XFS kernel code is required, no specific SMR awareness is needed by the kernel code. Only relatively minor tweaks to the on-disk format will be needed and most of the userspace changes are relatively straight forward, too.
> > >
> > > The source for that document can be found in this git tree here:
> > >
> > > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> > >
> > > in the file design/xfs-smr-structure.asciidoc. Alternatively, pull it straight from cgit:
> > >
> > > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> > >
> > > Or there is a pdf version built from the current TOT on the xfs.org wiki here:
> > >
> > > http://xfs.org/index.php/Host_Aware_SMR_architecture
> > >
> > > Happy reading!
> >
> > Hi Dave,
> >
> > Thanks for sharing this. Here are some thoughts/notes/questions/etc. from a first pass. This is mostly XFS oriented and I'll try to break it down by section.
> >
> > I've also attached a diff to the original doc with some typo fixes and whatnot. Feel free to just fold it into the original doc if you like.
> >
> > == Concepts
> >
> > - With regard to the assumption that the CMR region is not spread around the drive, I saw at least one presentation at Vault that suggested otherwise (the skylight one iirc). That said, it was theoretical and based on a drive-managed drive. It is in no way clear to me whether that is something to expect for host-managed drives.
>
> AFAIK, the CMR region is contiguous. The skylight paper spells it out pretty clearly that it is a contiguous 20-25GB region on the outer edge of the seagate drives. Other vendors I've spoken to indicate that the region in host managed drives is also contiguous and at the outer edge, and some vendors have indicated they have much more of it than the seagate drives analysed in the skylight paper.
>
> If it is not contiguous, then we can use DM to make that problem go away. i.e. use DM to stitch the CMR zones back together into a contiguous LBA region. Then we can size AGs in the data device to map to the size of the individual disjoint CMR regions, and we have a neat, well aligned, isolated solution to the problem without having to modify the XFS code at all.
>

Looking back at the slides, that was apparently one of the emulated drives. So I guess that bit was more oriented towards showcasing the experimental method than to suggest how one of the drives works. Regardless, it seems reasonable to me to use dm to stitch things together (or go the other direction and split things up) if need be.

> > - It isn't clear to me here and in other places whether you propose to use the CMR regions as a "metadata device" or require some other randomly writeable storage to serve that purpose.
>
> CMR as the "metadata device" if there is nothing else we can use. I'd really like to see hybrid drives with the "CMR" zone being the flash region in the drive....
>

Ok.
> > == Journal modifications
> >
> > - The tail->head log zeroing behavior on mount comes to mind here. Maybe the writes are still sequential and it's not a problem, but we should consider that with the proposition. It's probably not critical as we do have the out of using the cmr region here (as noted). I assume we can also cleanly relocate the log without breaking anything else (e.g., the current location is performance oriented rather than architectural, yes?).
>
> We place the log anywhere in the data device LBA space. You might want to go look up what L_AGNUM does in mkfs. :)
>
> And if we can use the CMR region for the log, then that's what we'll do - "no modifications required" is always the best solution.
>
> > == Data zones
> >
> > - Will this actually support data overwrite or will that return error?
>
> We'll support data overwrite. xfs_get_blocks() will need to detect overwrite....
>
> > - TBH, I've never looked at realtime functionality so I don't grok the high level approach yet. I'm wondering... have you considered a design based on reflink and copy-on-write?
>
> Yes, I have. Complex, invasive and we don't even have basic reflink infrastructure yet. Such a solution pushes us a couple of years out, as opposed to having something before the end of the year...
>

It certainly would take longer to implement, but the point is that it's a potential reuse of a mechanism we already plan to implement. I suppose a zone aware allocation is a simpler problem for now and we can revisit it down the road.

> > I know the current plan is to disentangle the reflink tree from the rmap tree, but my understanding is the reflink tree is still in the pipeline. Assuming we have that functionality, it seems like there's potential to use it to overcome some of the overwrite complexity.
>
> There isn't much overwrite complexity - it's simply clearing bits in a zone bitmap to indicate free space, allocating new blocks and then rewriting bmbt extent records. It's fairly simple, really ;)
>

Perhaps, but it's not really the act of marking blocks allocated or free that I was interested in. It's the combination of managing the zone write constraints in the write path and the allocator, finding free blocks vs. stale blocks, etc. (e.g., the "extent lifecycle" for lack of a better term).

> > Just as a handwaving example, use the per-zone inode to hold an additional reference to each allocated extent in the zone, thus all writes are handled as if the file had a clone. If the only reference drops to the zoneino, the extent is freed and thus stale wrt to the zone cleaner logic.
> >
> > I suspect we would still need an allocation strategy, but I expect we're going to have zone metadata regardless that will help deal with that. Note that the current sparse inode proposal includes an allocation range limit mechanism (for the inode record overlaps an ag boundary case), which could potentially be used/extended to build something on top of the existing allocator for zone allocation (e.g., if we had some kind of zone record with the write pointer that indicated where it's safe to allocate from). Again, just thinking out loud here.
>
> Yup, but the bitmap allocator doesn't have support for many of the btree allocator controls. It's a simple, fast, deterministic allocator, and we only need it to track freed space in the zones as all allocation from the zones is going to be sequential...
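To make the zone bitmap scheme discussed above concrete, here is a minimal sketch in C, assuming a single 256MB zone of 4k blocks, a write pointer for sequential allocation, and a bitmap in which a set bit means the block still holds live data. The structure and helper names are illustrative only; this is not the proposed on-disk format or any existing XFS allocator code.

/*
 * Illustrative per-zone space tracking: sequential allocation at the
 * write pointer, bit-clearing on free, and a whole-zone reset once no
 * live data remains. Names and sizes are made up for the example.
 */
#include <stdbool.h>
#include <stdint.h>

#define ZONE_BLOCKS     65536U                  /* 256MB zone / 4k blocks */

struct zone_space {
        uint32_t        zs_write_ptr;                   /* next block to write */
        uint32_t        zs_live;                        /* blocks with live data */
        uint64_t        zs_bitmap[ZONE_BLOCKS / 64];    /* bit set = live data */
};

/* Allocation is append-only: blocks are handed out at the write pointer. */
static int64_t zone_alloc(struct zone_space *zs, uint32_t len)
{
        uint32_t start = zs->zs_write_ptr;

        if (start + len > ZONE_BLOCKS)
                return -1;              /* zone full - caller tries another zone */
        for (uint32_t b = start; b < start + len; b++)
                zs->zs_bitmap[b / 64] |= 1ULL << (b % 64);
        zs->zs_write_ptr += len;
        zs->zs_live += len;
        return start;
}

/* Overwrite/unlink only clears bits; the space is stale, not yet reusable. */
static void zone_free(struct zone_space *zs, uint32_t start, uint32_t len)
{
        for (uint32_t b = start; b < start + len; b++)
                zs->zs_bitmap[b / 64] &= ~(1ULL << (b % 64));
        zs->zs_live -= len;
}

/* Only a zone with no live blocks left can be reset and written again. */
static bool zone_can_reset(const struct zone_space *zs)
{
        return zs->zs_live == 0;
}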
Right, the point is that the traditional allocator has some mechanisms that might facilitate zone compliant allocation provided we have the associated zone metadata. E.g., the allocation range mechanism facilitates allocation within a particular zone, within a "usable" range of a zone, or across a wider set of zones of similar state, depending on the allocator implementation details.

Anyways, I don't want to hijack this thread too much. :) I might send you something separately for a sanity check or brainstorming purposes.

> > == Zone cleaner
> >
> > - Paragraph 3 - "fixpel?" I would have just fixed this, but I can't figure out what it's supposed to say. ;)
> >
> > - The idea sounds sane, but the dependency on userspace for a critical fs mechanism sounds a bit scary to be honest. Is in kernel allocation going to throttle/depend on background work in the userspace cleaner in the event of low writeable free space?
>
> Of course. ENOSPC always throttles ;)
>

Heh. :)

> I expect the cleaner will work zone group at a time; locking new, non-cleaner based allocations out of the zone group while it cleans zones. This means the cleaner should always be able to make progress w.r.t. ENOSPC - it gets triggered on a zone group before it runs out of clean zones for freespace defrag purposes....
>

There are some interesting allocation dynamics going on here that aren't fully clear to me. E.g., on the one hand we want zone groups to be fairly large to help manage the zone count, on the other we're potentially locking out a TB-sized zone group at a time while the userspace tool does its thing..? I take it this means we'll also want some way to actually do zone-cleaning allocations (i.e., for the extents copied from the cleaned zones) from this zone group from the userspace tool while other general users are locked out. Even with that, incorporating any kind of locality into the allocator seems futile if the target zone group for an independently active file could be locked down at any given point in time. Maybe 256MB zone groups mean that's less of a practical issue..? I'm probably reading too far into it at this point... :P

> I also expect that the cleaner won't be used in many bulk storage applications as data is never deleted. I also expect that XFS-SMR won't be used for general purpose storage applications - that's what solid state storage will be used for - and so the cleaner is not something we need to focus a lot of time and effort on.
>
> And the thing that distributed storage guys should love: if we put the cleaner in userspace, then they can *write their own cleaners* that are customised to their own storage algorithms.
>
> > What if that userspace thing dies, etc.? I suppose an implementation with as much mechanism in libxfs as possible allows us greatest flexibility to go in either direction here.
>
> If the cleaner dies or can't make progress, we ENOSPC. Whether the cleaner is in kernel or userspace is irrelevant to how we handle such cases.
>
> > - I'm also wondering how much real overlap there is in xfs_fsr (another thing I haven't really looked at :) beyond that it calls swapext. E.g., cleaning a zone sounds like it must map back to N files that could have allocated extents in the zone vs. considering individual files for defragmentation, fragmentation of the parent file may not be as much of a consideration as resetting zones, etc. It sounds like a separate tool might be warranted, even if there is code to steal from fsr. :)
> As I implied above, zone cleaning is addressing exactly the same problem as we are currently working on in xfs_fsr: defragmenting free space.
>

Ah, Ok. That is an interesting connection. There also seems to be an interesting correlation between zone cleaning and overwrite handling + unlink/truncate + discard handling (if you represent a zone with an inode that tracks a particular fsb range and references "stale" blocks before they are ultimately freed).

> > == Reverse mapping btrees
> >
> > - This is something I still need to grok, perhaps just because the rmap code isn't available yet. But I'll note that this does seem like another bit that could be unnecessary if we could get away with using the traditional allocator.
> >
> > == Mkfs
> >
> > - We have references to the "metadata device" as well as random write regions. Similar to my question above, is there an expectation of a separate physical metadata device or is that terminology for the random write regions?
>
> "metadata device" == "data device" == "CMR" == "random write region"
>
> > Finally, some general/summary notes:
> >
> > - Some kind of data structure outline would eventually make a nice addition to this document. I understand it's probably too early yet, but we are talking about new per-zone inodes, new and interesting relationships between AGs and zones (?), etc. Fine grained detail is not required, but an outline or visual that describes the high-level mappings goes a long way to facilitate reasoning about the design.
>
> Sure, a plane flight is not long enough to do this. Future revisions, as the structure is clarified.
>

Of course. :)

> > - A big question I had (and something that is touched on down thread wrt to embedded flash) is whether the random write zones are runtime configurable. If so, couldn't this facilitate use of existing AG metadata (now that I think of it, it's not clear to me whether the realtime mechanism excludes or coexists with AGs)?
>
> the "realtime device" contains only user data. It contains no filesystem metadata at all. That separation of user data and filesystem metadata is what makes it so appealing for supporting SMR devices....
>
> > IOW, we obviously need this kind of space for inodes, dirs, xattrs, btrees, etc. regardless. It would be interesting if we had the added flexibility to align it with AGs.
>
> I'm trying to keep the solution as simple as possible. No alignment, single whole disk only, metadata in the "data device" on CMR and user data in "real time" zones on SMR.
>

Understood. From the commentary here and our irc discussion, my takeaway is that the primary objective is to get to some kind of SMR capable solution sooner rather than later. Beyond that, you have concerns about the complexity of making the current format work with smr drives. That all sounds reasonable to me. I get a bit more concerned when we start talking about implementing solutions to the same problems we've mostly solved with the existing algorithms, such as zone reservation vs. preallocation, zone group rotoring vs. ag rotoring, etc. At some point, I think it will be worth taking a harder look at whether we could reuse the more traditional layout and algorithms...

Brian

> > diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
> > index dd959ab..2fea88f 100644
>
> Oh, there's a patch. Thanks! ;)
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
index dd959ab..2fea88f 100644
--- a/design/xfs-smr-structure.asciidoc
+++ b/design/xfs-smr-structure.asciidoc
@@ -95,7 +95,7 @@ going to need a special directory to expose this information. It would be
 useful to have a ".zones" directory hanging off the root directory that
 contains all the zone allocation inodes so userspace can simply open them.
 
-THis biggest issue that has come to light here is the number of zones in a
+This biggest issue that has come to light here is the number of zones in a
 device. Zones are typically 256MB in size, and so we are looking at 4,000
 zones/TB. For a 10TB drive, that's 40,000 zones we have to keep track of. And
 if the devices keep getting larger at the expected rate, we're going to have to
@@ -112,24 +112,24 @@ also have other benefits...
 While it seems like tracking free space is trivial for the purposes of
 allocation (and it is!), the complexity comes when we start to delete or
 overwrite data. Suddenly zones no longer contain contiguous ranges of valid
-data; they have "freed" extents in the middle of them that contian stale data.
+data; they have "freed" extents in the middle of them that contain stale data.
 We can't use that "stale space" until the entire zone is made up of "stale"
 extents. Hence we need a Cleaner.
 
 === Zone Cleaner
 
 The purpose of the cleaner is to find zones that are mostly stale space and
-consolidate the remaining referenced data into a new, contigious zone, enabling
+consolidate the remaining referenced data into a new, contiguous zone, enabling
 us to then "clean" the stale zone and make it available for writing new data
 again.
 
-The real complexity here is finding the owner of the data that needs to be move,
-but we are in the process of solving that with the reverse mapping btree and
-parent pointer functionality. This gives us the mechanism by which we can
+The real complexity here is finding the owner of the data that needs to be
+moved, but we are in the process of solving that with the reverse mapping btree
+and parent pointer functionality. This gives us the mechanism by which we can
 quickly re-organise files that have extents in zones that need cleaning.
 
 The key word here is "reorganise". We have a tool that already reorganises file
-layout: xfs_fsr. The "Cleaner" is a finely targetted policy for xfs_fsr -
+layout: xfs_fsr. The "Cleaner" is a finely targeted policy for xfs_fsr -
 instead of trying to minimise fixpel fragments, it finds zones that need
 cleaning by reading their summary info from the /.zones/ directory and analysing
 the free bitmap state if there is a high enough percentage of stale blocks. From
@@ -142,7 +142,7 @@ Hence we don't actually need any major new data moving functionality in the
 kernel to enable this, except maybe an event channel for the kernel to tell
 xfs_fsr it needs to do some cleaning work.
 
-If we arrange zones into zoen groups, we also have a method for keeping new
+If we arrange zones into zone groups, we also have a method for keeping new
 allocations out of regions we are re-organising. That is, we need to be able to
 mark zone groups as "read only" so the kernel will not attempt to allocate from
 them while the cleaner is running and re-organising the data within the zones in
@@ -166,17 +166,17 @@ inode to track the zone's owner information.
 == Mkfs
 
 Mkfs is going to have to integrate with the userspace zbc libraries to query the
-layout of zones from the underlying disk and then do some magic to lay out al
+layout of zones from the underlying disk and then do some magic to lay out all
 the necessary metadata correctly. I don't see there being any significant
 challenge to doing this, but we will need a stable libzbc API to work with and
-it will need ot be packaged by distros.
+it will need to be packaged by distros.
 
-If mkfs cannot find ensough random write space for the amount of metadata we
-need to track all the space in the sequential write zones and a decent amount of
-internal fielsystem metadata (inodes, etc) then it will need to fail. Drive
-vendors are going to need to provide sufficient space in these regions for us
-to be able to make use of it, otherwise we'll simply not be able to do what we
-need to do.
+If mkfs cannot find enough random write space for the amount of metadata we need
+to track all the space in the sequential write zones and a decent amount of
+internal filesystem metadata (inodes, etc) then it will need to fail. Drive
+vendors are going to need to provide sufficient space in these regions for us to
+be able to make use of it, otherwise we'll simply not be able to do what we need
+to do.
 
 mkfs will need to initialise all the zone allocation inodes, reset all the zone
 write pointers, create the /.zones directory, place the log in an appropriate
@@ -187,13 +187,13 @@ place and initialise the metadata device as well.
 Because we've limited the metadata to a section of the drive that can be
 overwritten, we don't have to make significant changes to xfs_repair. It will
 need to be taught about the multiple zone allocation bitmaps for it's space
-reference checking, but otherwise all the infrastructure we need ifor using
+reference checking, but otherwise all the infrastructure we need for using
 bitmaps for verifying used space should already be there.
 
-THere be dragons waiting for us if we don't have random write zones for
+There be dragons waiting for us if we don't have random write zones for
 metadata. If that happens, we cannot repair metadata in place and we will have
 to redesign xfs_repair from the ground up to support such functionality. That's
-jus tnot going to happen, so we'll need drives with a significant amount of
+just not going to happen, so we'll need drives with a significant amount of
 random write space for all our metadata......
 
 == Quantification of Random Write Zone Capacity
@@ -214,7 +214,7 @@ performance, replace the CMR region with a SSD....
 
 The allocator will need to learn about multiple allocation zones based on
 bitmaps. They aren't really allocation groups, but the initialisation and
-iteration of them is going to be similar to allocation groups. To get use going
+iteration of them is going to be similar to allocation groups. To get us going
 we can do some simple mapping between inode AG and data AZ mapping so that we
 keep some form of locality to related data (e.g. grouping of data by parent
 directory).
@@ -273,19 +273,19 @@ location, the current location or anywhere in between. The only guarantee that
 we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
 least be in a position at or past the location of the fsync.
 
-Hence before a filesystem runs journal recovery, all it's zone allocation write
+Hence before a filesystem runs journal recovery, all its zone allocation write
 pointers need to be set to what the drive thinks they are, and all of the zone
 allocation beyond the write pointer need to be cleared. We could do this during
 log recovery in kernel, but that means we need full ZBC awareness in log
 recovery to iterate and query all the zones.
 
-Hence it's not clear if we want to do this in userspace as that has it's own
-problems e.g. we'd need to have xfs.fsck detect that it's a smr filesystem and
+Hence it's not clear if we want to do this in userspace as that has its own
+problems e.g. we'd need to have xfs.fsck detect that it's an smr filesystem and
 perform that recovery, or write a mount.xfs helper that does it prior to
 mounting the filesystem. Either way, we need to synchronise the on-disk
 filesystem state to the internal disk zone state before doing anything else.
 
-This needs more thought, because I have a nagging suspiscion that we need to do
+This needs more thought, because I have a nagging suspicion that we need to do
 this write pointer resynchronisation *after log recovery* has completed so we
 can determine if we've got to now go and free extents that the filesystem has
 allocated and are referenced by some inode out there. This, again, will require