[ANNOUNCE] xfs: Supporting Host Aware SMR Drives

Message ID 20150317132514.GA52950@bfoster.bfoster (mailing list archive)
State New, archived

Brian Foster March 17, 2015, 1:25 p.m. UTC
On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> Hi Folks,
> 
> As I told many people at Vault last week, I wrote a document
> outlining how we should modify the on-disk structures of XFS to
> support host aware SMR drives on the (long) plane flights to Boston.
> 
> TL;DR: not a lot of change to the XFS kernel code is required, no
> specific SMR awareness is needed by the kernel code.  Only
> relatively minor tweaks to the on-disk format will be needed and
> most of the userspace changes are relatively straight forward, too.
> 
> The source for that document can be found in this git tree here:
> 
> git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> 
> in the file design/xfs-smr-structure.asciidoc. Alternatively,
> pull it straight from cgit:
> 
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> 
> Or there is a pdf version built from the current TOT on the xfs.org
> wiki here:
> 
> http://xfs.org/index.php/Host_Aware_SMR_architecture
> 
> Happy reading!
> 

Hi Dave,

Thanks for sharing this. Here are some thoughts/notes/questions/etc.
from a first pass. This is mostly XFS oriented and I'll try to break it
down by section.

I've also attached a diff to the original doc with some typo fixes and
whatnot. Feel free to just fold it into the original doc if you like.

== Concepts

- With regard to the assumption that the CMR region is not spread around
the drive, I saw at least one presentation at Vault that suggested
otherwise (the skylight one iirc). That said, it was theoretical and
based on a drive-managed drive. It is in no way clear to me whether that
is something to expect for host-managed drives.

- It isn't clear to me here and in other places whether you propose to
use the CMR regions as a "metadata device" or require some other
randomly writeable storage to serve that purpose.

== Journal modifications

- The tail->head log zeroing behavior on mount comes to mind here. Maybe
the writes are still sequential and it's not a problem, but we should
consider that with the proposition. It's probably not critical as we do
have the out of using the cmr region here (as noted). I assume we can
also cleanly relocate the log without breaking anything else (e.g., the
current location is performance oriented rather than architectural,
yes?).

== Data zones

- Will this actually support data overwrite or will that return error?

- TBH, I've never looked at realtime functionality so I don't grok the
high level approach yet. I'm wondering... have you considered a design
based on reflink and copy-on-write? I know the current plan is to
disentangle the reflink tree from the rmap tree, but my understanding is
the reflink tree is still in the pipeline. Assuming we have that
functionality, it seems like there's potential to use it to overcome
some of the overwrite complexity. Just as a handwaving example, use the
per-zone inode to hold an additional reference to each allocated extent
in the zone, thus all writes are handled as if the file had a clone. If
the only reference drops to the zoneino, the extent is freed and thus
stale wrt the zone cleaner logic.

I suspect we would still need an allocation strategy, but I expect we're
going to have zone metadata regardless that will help deal with that.
Note that the current sparse inode proposal includes an allocation range
limit mechanism (for the inode record overlaps an ag boundary case),
which could potentially be used/extended to build something on top of
the existing allocator for zone allocation (e.g., if we had some kind of
zone record with the write pointer that indicated where it's safe to
allocate from). Again, just thinking out loud here.

== Zone cleaner

- Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
figure out what it's supposed to say. ;)

- The idea sounds sane, but the dependency on userspace for a critical
fs mechanism sounds a bit scary to be honest. Is in kernel allocation
going to throttle/depend on background work in the userspace cleaner in
the event of low writeable free space? What if that userspace thing
dies, etc.? I suppose an implementation with as much mechanism in libxfs
as possible allows us greatest flexibility to go in either direction
here.

- I'm also wondering how much real overlap there is in xfs_fsr (another
thing I haven't really looked at :) beyond that it calls swapext.
E.g., cleaning a zone sounds like it must map back to N files that could
have allocated extents in the zone vs. considering individual files for
defragmentation, fragmentation of the parent file may not be as much of
a consideration as resetting zones, etc. It sounds like a separate tool
might be warranted, even if there is code to steal from fsr. :)

== Reverse mapping btrees

- This is something I still need to grok, perhaps just because the rmap
code isn't available yet. But I'll note that this does seem like
another bit that could be unnecessary if we could get away with using
the traditional allocator.

== Mkfs

- We have references to the "metadata device" as well as random write
regions. Similar to my question above, is there an expectation of a
separate physical metadata device or is that terminology for the random
write regions?

Finally, some general/summary notes:

- Some kind of data structure outline would eventually make a nice
addition to this document. I understand it's probably too early yet,
but we are talking about new per-zone inodes, new and interesting
relationships between AGs and zones (?), etc. Fine grained detail is not
required, but an outline or visual that describes the high-level
mappings goes a long way to facilitate reasoning about the design.

- A big question I had (and something that is touched on down thread wrt
embedded flash) is whether the random write zones are runtime
configurable. If so, couldn't this facilitate use of existing AG
metadata (now that I think of it, it's not clear to me whether the
realtime mechanism excludes or coexists with AGs)? IOW, we obviously
need this kind of space for inodes, dirs, xattrs, btrees, etc.
regardless. It would be interesting if we had the added flexibility to
align it with AGs.

Thanks again!

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

Comments

Dave Chinner March 17, 2015, 9:28 p.m. UTC | #1
On Tue, Mar 17, 2015 at 09:25:15AM -0400, Brian Foster wrote:
> On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> > Hi Folks,
> > 
> > As I told many people at Vault last week, I wrote a document
> > outlining how we should modify the on-disk structures of XFS to
> > support host aware SMR drives on the (long) plane flights to Boston.
> > 
> > TL;DR: not a lot of change to the XFS kernel code is required, no
> > specific SMR awareness is needed by the kernel code.  Only
> > relatively minor tweaks to the on-disk format will be needed and
> > most of the userspace changes are relatively straight forward, too.
> > 
> > The source for that document can be found in this git tree here:
> > 
> > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> > 
> > in the file design/xfs-smr-structure.asciidoc. Alternatively,
> > pull it straight from cgit:
> > 
> > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> > 
> > Or there is a pdf version built from the current TOT on the xfs.org
> > wiki here:
> > 
> > http://xfs.org/index.php/Host_Aware_SMR_architecture
> > 
> > Happy reading!
> > 
> 
> Hi Dave,
> 
> Thanks for sharing this. Here are some thoughts/notes/questions/etc.
> from a first pass. This is mostly XFS oriented and I'll try to break it
> down by section.
> 
> I've also attached a diff to the original doc with some typo fixes and
> whatnot. Feel free to just fold it into the original doc if you like.
> 
> == Concepts
> 
> - With regard to the assumption that the CMR region is not spread around
> the drive, I saw at least one presentation at Vault that suggested
> otherwise (the skylight one iirc). That said, it was theoretical and
> based on a drive-managed drive. It is in no way clear to me whether that
> is something to expect for host-managed drives.

AFAIK, the CMR region is contiguous. The skylight paper spells it
out pretty clearly that it is a contiguous 20-25GB region on the
outer edge of the Seagate drives. Other vendors I've spoken to
indicate that the region in host managed drives is also contiguous
and at the outer edge, and some vendors have indicated they have
much more of it than the Seagate drives analysed in the skylight
paper.

If it is not contiguous, then we can use DM to make that problem go
away. i.e. use DM to stitch the CMR zones back together into a
contiguous LBA region. Then we can size AGs in the data device to
map to the size of the individual disjoint CMR regions, and we
have a neat, well aligned, isolated solution to the problem without
having to modify the XFS code at all.
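
To make the DM stitching idea concrete, here is a rough sketch in Python
that builds a dm-linear table concatenating disjoint CMR regions into one
contiguous LBA range. The device path and the region offsets are made-up
values for illustration only. The one thing taken as given is the usual
dm-linear table line layout: logical start, length, "linear", backing
device, offset on that device, all in 512-byte sectors.

```python
def linear_table(device, regions):
    """Build dm-linear table lines that map disjoint regions to one LBA range.

    regions: list of (start_sector, length_sectors) on the backing device.
    Each output line is "<logical_start> <length> linear <device> <offset>".
    """
    lines = []
    logical_start = 0
    for start, length in sorted(regions):
        lines.append(f"{logical_start} {length} linear {device} {start}")
        logical_start += length
    return lines


# Hypothetical layout: two disjoint CMR regions on a made-up device.
table = linear_table("/dev/sdX", [(0, 2048), (1000000, 4096)])
print("\n".join(table))
```

Each disjoint region ends up as a fixed slice of the stitched device, which
is what would let AGs be sized to match the individual regions.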

> - It isn't clear to me here and in other places whether you propose to
> use the CMR regions as a "metadata device" or require some other
> randomly writeable storage to serve that purpose.

CMR as the "metadata device" if there is nothing else we can use.
I'd really like to see hybrid drives with the "CMR" zone being the
flash region in the drive....

> == Journal modifications
> 
> - The tail->head log zeroing behavior on mount comes to mind here. Maybe
> the writes are still sequential and it's not a problem, but we should
> consider that with the proposition.  It's probably not critical as we do
> have the out of using the cmr region here (as noted). I assume we can
> also cleanly relocate the log without breaking anything else (e.g., the
> current location is performance oriented rather than architectural,
> yes?).

We place the log anywhere in the data device LBA space. You might
want to go look up what L_AGNUM does in mkfs. :)

And if we can use the CMR region for the log, then that's what we'll
do - "no modifications required" is always the best solution.

> == Data zones
> 
> - Will this actually support data overwrite or will that return error?

We'll support data overwrite. xfs_get_blocks() will need to detect
overwrite....

> - TBH, I've never looked at realtime functionality so I don't grok the
> high level approach yet. I'm wondering... have you considered a design
> based on reflink and copy-on-write?

Yes, I have. Complex, invasive and we don't even have basic reflink
infrastructure yet. Such a solution pushes us a couple of years
out, as opposed to having something before the end of the year...

> I know the current plan is to
> disentangle the reflink tree from the rmap tree, but my understanding is
> the reflink tree is still in the pipeline. Assuming we have that
> functionality, it seems like there's potential to use it to overcome
> some of the overwrite complexity.

There isn't much overwrite complexity - it's simply clearing bits
in a zone bitmap to indicate free space, allocating new blocks and
then rewriting bmbt extent records. It's fairly simple, really ;)
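
A rough sketch of that overwrite sequence, in Python. This is not XFS
code: the Zone class, its fields and the overwrite() helper are
hypothetical names, and the bitmap is a plain list. It only shows the
three steps: clear the old bits, allocate new blocks at the write
pointer, and hand back the extent that the bmbt record would be
rewritten to point at.

```python
class Zone:
    """Toy model of one sequential-write zone (hypothetical, for illustration)."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.write_pointer = 0            # next block that may be written
        self.in_use = [False] * nblocks   # bitmap: True = holds live data

    def allocate(self, count):
        """Allocate sequentially at the write pointer; return the start block."""
        if self.write_pointer + count > self.nblocks:
            raise OSError("ENOSPC: zone is full")
        start = self.write_pointer
        for blk in range(start, start + count):
            self.in_use[blk] = True
        self.write_pointer += count
        return start

    def free(self, start, count):
        """Clear bitmap bits; the space stays stale until the zone is reset."""
        for blk in range(start, start + count):
            self.in_use[blk] = False


def overwrite(zone, extent):
    """Overwrite an extent (start, count) and return the new (start, count).

    The new blocks are allocated first so that a full zone fails before
    anything is freed.
    """
    start, count = extent
    new_start = zone.allocate(count)
    zone.free(start, count)
    return (new_start, count)


zone = Zone(nblocks=16)
ext = (zone.allocate(4), 4)   # initial write lands at blocks 0-3
ext = overwrite(zone, ext)    # rewrite lands at blocks 4-7
print(ext, zone.write_pointer, zone.in_use[:8])
```

Note that the write pointer never moves backwards: the freed blocks 0-3
are stale space, not reusable space, until the whole zone is reset.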

> Just as a handwaving example, use the
> per-zone inode to hold an additional reference to each allocated extent
> in the zone, thus all writes are handled as if the file had a clone. If
> the only reference drops to the zoneino, the extent is freed and thus
> stale wrt the zone cleaner logic.
> 
> I suspect we would still need an allocation strategy, but I expect we're
> going to have zone metadata regardless that will help deal with that.
> Note that the current sparse inode proposal includes an allocation range
> limit mechanism (for the inode record overlaps an ag boundary case),
> which could potentially be used/extended to build something on top of
> the existing allocator for zone allocation (e.g., if we had some kind of
> zone record with the write pointer that indicated where it's safe to
> allocate from). Again, just thinking out loud here.

Yup, but the bitmap allocator doesn't have support for many of the
btree allocator controls.  It's a simple, fast, deterministic
allocator, and we only need it to track freed space in the zones
as all allocation from the zones is going to be sequential...

> == Zone cleaner
> 
> - Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
> figure out what it's supposed to say. ;)
> 
> - The idea sounds sane, but the dependency on userspace for a critical
> fs mechanism sounds a bit scary to be honest. Is in kernel allocation
> going to throttle/depend on background work in the userspace cleaner in
> the event of low writeable free space?

Of course. ENOSPC always throttles ;)

I expect the cleaner will work zone group at a time; locking new,
non-cleaner based allocations out of the zone group while it cleans
zones. This means the cleaner should always be able to make progress
w.r.t. ENOSPC - it gets triggered on a zone group before it runs out
of clean zones for freespace defrag purposes....
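
As a very rough sketch of one cleaner pass over a zone group, in Python:
the dict layout, the clean_zone_group() name and the 50% stale threshold
are all assumptions for illustration, not anything from the design doc.
Locking other allocations out of the group is assumed to have happened
before the pass starts and is not modelled.

```python
def clean_zone_group(zones, min_stale_fraction=0.5):
    """Toy cleaner pass over one zone group (hypothetical, for illustration).

    zones: list of dicts with "size" (blocks), "written" (blocks written so
    far) and "live" (lengths of the extents that are still referenced).
    Live extents from mostly-stale zones are copied into another zone with
    enough room at its write pointer, then the source zone is reset.
    Returns the number of zones reset.
    """
    reset = 0
    for victim in zones:
        need = sum(victim["live"])
        stale = victim["written"] - need
        if victim["written"] == 0 or stale / victim["size"] < min_stale_fraction:
            continue
        dest = next((z for z in zones if z is not victim
                     and z["size"] - z["written"] >= need), None)
        if dest is None:
            continue  # nowhere to copy to; this is the ENOSPC case
        dest["live"].extend(victim["live"])
        dest["written"] += need
        victim["live"], victim["written"] = [], 0   # zone reset
        reset += 1
    return reset


zones = [
    {"size": 16, "written": 16, "live": [2, 1]},   # mostly stale
    {"size": 16, "written": 16, "live": [8, 6]},   # mostly live
    {"size": 16, "written": 0, "live": []},        # clean
]
print(clean_zone_group(zones), zones[0], zones[2])
```

The pass only makes progress while some zone in the group still has room
for the live data, which is why the cleaner has to be triggered before
the group runs out of clean zones.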

I also expect that the cleaner won't be used in many bulk storage
applications as data is never deleted. I also expect that XFS-SMR
won't be used for general purpose storage applications - that's what
solid state storage will be used for - and so the cleaner is not
something we need to focus a lot of time and effort on.

And the thing that distributed storage guys should love: if we put
the cleaner in userspace, then they can *write their own cleaners*
that are customised to their own storage algorithms.

> What if that userspace thing
> dies, etc.? I suppose an implementation with as much mechanism in libxfs
> as possible allows us greatest flexibility to go in either direction
> here.

If the cleaner dies or can't make progress, we ENOSPC. Whether the
cleaner is in kernel or userspace is irrelevant to how we handle
such cases.

> - I'm also wondering how much real overlap there is in xfs_fsr (another
> thing I haven't really looked at :) beyond that it calls swapext.
> E.g., cleaning a zone sounds like it must map back to N files that could
> have allocated extents in the zone vs. considering individual files for
> defragmentation, fragmentation of the parent file may not be as much of
> a consideration as resetting zones, etc. It sounds like a separate tool
> might be warranted, even if there is code to steal from fsr. :)

As I implied above, zone cleaning is addressing exactly the same
problem as we are currently working on in xfs_fsr: defragmenting
free space.

> == Reverse mapping btrees
> 
> - This is something I still need to grok, perhaps just because the rmap
> code isn't available yet. But I'll note that this does seem like
> another bit that could be unnecessary if we could get away with using
> the traditional allocator.
> 
> == Mkfs
> 
> - We have references to the "metadata device" as well as random write
> regions. Similar to my question above, is there an expectation of a
> separate physical metadata device or is that terminology for the random
> write regions?

"metadata device" == "data device" == "CMR" == "random write region"

> Finally, some general/summary notes:
> 
> - Some kind of data structure outline would eventually make a nice
> addition to this document. I understand it's probably too early yet,
> but we are talking about new per-zone inodes, new and interesting
> relationships between AGs and zones (?), etc. Fine grained detail is not
> required, but an outline or visual that describes the high-level
> mappings goes a long way to facilitate reasoning about the design.

Sure, a plane flight is not long enough to do this. Future
revisions, as the structure is clarified.

> - A big question I had (and something that is touched on down thread wrt
> embedded flash) is whether the random write zones are runtime
> configurable. If so, couldn't this facilitate use of existing AG
> metadata (now that I think of it, it's not clear to me whether the
> realtime mechanism excludes or coexists with AGs)?

The "realtime device" contains only user data. It contains no
filesystem metadata at all. That separation of user data and
filesystem metadata is what makes it so appealing for supporting SMR
devices....

> IOW, we obviously
> need this kind of space for inodes, dirs, xattrs, btrees, etc.
> regardless. It would be interesting if we had the added flexibility to
> align it with AGs.

I'm trying to keep the solution as simple as possible. No alignment,
single whole disk only, metadata in the "data device" on CMR and
user data in "real time" zones on SMR.

> diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
> index dd959ab..2fea88f 100644

Oh, there's a patch. Thanks! ;)

Cheers,

Dave.

Brian Foster March 21, 2015, 2:48 p.m. UTC | #2
On Wed, Mar 18, 2015 at 08:28:35AM +1100, Dave Chinner wrote:
> On Tue, Mar 17, 2015 at 09:25:15AM -0400, Brian Foster wrote:
> > On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> > > Hi Folks,
> > > 
> > > As I told many people at Vault last week, I wrote a document
> > > outlining how we should modify the on-disk structures of XFS to
> > > support host aware SMR drives on the (long) plane flights to Boston.
> > > 
> > > TL;DR: not a lot of change to the XFS kernel code is required, no
> > > specific SMR awareness is needed by the kernel code.  Only
> > > relatively minor tweaks to the on-disk format will be needed and
> > > most of the userspace changes are relatively straight forward, too.
> > > 
> > > The source for that document can be found in this git tree here:
> > > 
> > > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> > > 
> > > in the file design/xfs-smr-structure.asciidoc. Alternatively,
> > > pull it straight from cgit:
> > > 
> > > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> > > 
> > > Or there is a pdf version built from the current TOT on the xfs.org
> > > wiki here:
> > > 
> > > http://xfs.org/index.php/Host_Aware_SMR_architecture
> > > 
> > > Happy reading!
> > > 
> > 
> > Hi Dave,
> > 
> > Thanks for sharing this. Here are some thoughts/notes/questions/etc.
> > from a first pass. This is mostly XFS oriented and I'll try to break it
> > down by section.
> > 
> > I've also attached a diff to the original doc with some typo fixes and
> > whatnot. Feel free to just fold it into the original doc if you like.
> > 
> > == Concepts
> > 
> > - With regard to the assumption that the CMR region is not spread around
> > the drive, I saw at least one presentation at Vault that suggested
> > otherwise (the skylight one iirc). That said, it was theoretical and
> > based on a drive-managed drive. It is in no way clear to me whether that
> > is something to expect for host-managed drives.
> 
> AFAIK, the CMR region is contiguous. The skylight paper spells it
> out pretty clearly that it is a contiguous 20-25GB region on the
> outer edge of the Seagate drives. Other vendors I've spoken to
> indicate that the region in host managed drives is also contiguous
> and at the outer edge, and some vendors have indicated they have
> much more of it than the Seagate drives analysed in the skylight
> paper.
> 
> If it is not contiguous, then we can use DM to make that problem go
> away. i.e. use DM to stitch the CMR zones back together into a
> contiguous LBA region. Then we can size AGs in the data device to
> map to the size of the individual disjoint CMR regions, and we
> have a neat, well aligned, isolated solution to the problem without
> having to modify the XFS code at all.
> 

Looking back at the slides, that was apparently one of the emulated
drives. So I guess that bit was more oriented towards showcasing the
experimental method than to suggest how one of the drives works.
Regardless, it seems reasonable to me to use dm to stitch things
together (or go the other direction and split things up) if need be.

> > - It isn't clear to me here and in other places whether you propose to
> > use the CMR regions as a "metadata device" or require some other
> > randomly writeable storage to serve that purpose.
> 
> CMR as the "metadata device" if there is nothing else we can use.
> I'd really like to see hybrid drives with the "CMR" zone being the
> flash region in the drive....
> 

Ok.

> > == Journal modifications
> > 
> > - The tail->head log zeroing behavior on mount comes to mind here. Maybe
> > the writes are still sequential and it's not a problem, but we should
> > consider that with the proposition.  It's probably not critical as we do
> > have the out of using the cmr region here (as noted). I assume we can
> > also cleanly relocate the log without breaking anything else (e.g., the
> > current location is performance oriented rather than architectural,
> > yes?).
> 
> We place the log anywhere in the data device LBA space. You might
> want to go look up what L_AGNUM does in mkfs. :)
> 
> And if we can use the CMR region for the log, then that's what we'll
> do - "no modifications required" is always the best solution.
> 
> > == Data zones
> > 
> > - Will this actually support data overwrite or will that return error?
> 
> We'll support data overwrite. xfs_get_blocks() will need to detect
> overwrite....
> 
> > - TBH, I've never looked at realtime functionality so I don't grok the
> > high level approach yet. I'm wondering... have you considered a design
> > based on reflink and copy-on-write?
> 
> Yes, I have. Complex, invasive and we don't even have basic reflink
> infrastructure yet. Such a solution pushes us a couple of years
> out, as opposed to having something before the end of the year...
> 

It certainly would take longer to implement, but the point is that it's
a potential reuse of a mechanism we already plan to implement. I suppose
a zone aware allocation is a simpler problem for now and we can
revisit it down the road.

> > I know the current plan is to
> > disentangle the reflink tree from the rmap tree, but my understanding is
> > the reflink tree is still in the pipeline. Assuming we have that
> > functionality, it seems like there's potential to use it to overcome
> > some of the overwrite complexity.
> 
> There isn't much overwrite complexity - it's simply clearing bits
> in a zone bitmap to indicate free space, allocating new blocks and
> then rewriting bmbt extent records. It's fairly simple, really ;)
> 

Perhaps, but it's not really the act of marking blocks allocated or free
that I was interested in. It's the combination of managing the zone
write constraints in the write path and the allocator, finding free
blocks vs. stale blocks, etc. (e.g., the "extent lifecycle" for lack of
a better term).

> > Just as a handwaving example, use the
> > per-zone inode to hold an additional reference to each allocated extent
> > in the zone, thus all writes are handled as if the file had a clone. If
> > the only reference drops to the zoneino, the extent is freed and thus
> > stale wrt the zone cleaner logic.
> > 
> > I suspect we would still need an allocation strategy, but I expect we're
> > going to have zone metadata regardless that will help deal with that.
> > Note that the current sparse inode proposal includes an allocation range
> > limit mechanism (for the inode record overlaps an ag boundary case),
> > which could potentially be used/extended to build something on top of
> > the existing allocator for zone allocation (e.g., if we had some kind of
> > zone record with the write pointer that indicated where it's safe to
> > allocate from). Again, just thinking out loud here.
> 
> Yup, but the bitmap allocator doesn't have support for many of the
> btree allocator controls.  It's a simple, fast, deterministic
> allocator, and we only need it to track freed space in the zones
> as all allocation from the zones is going to be sequential...
> 

Right, the point is that the traditional allocator has some mechanisms
that might facilitate zone compliant allocation provided we have the
associated zone metadata. E.g., the allocation range mechanism
facilitates allocation within a particular zone, within a "usable" range
of a zone, or across a wider set of zones of similar state, depending on
the allocator implementation details.
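
A handwaving sketch of the range-limited idea, in Python. Everything
here is hypothetical: the zone record is reduced to a start, a length
and a write pointer, and a sorted list of free extents stands in for
whatever the existing allocator tracks. It only shows how a usable range
derived from the write pointer could constrain a first-fit search.

```python
def usable_range(zone_start, zone_len, write_pointer):
    """Return the [lo, hi) block range of a zone that is safe to allocate from.

    Blocks below the write pointer are already written, so only the tail of
    the zone is usable.
    """
    return (max(zone_start, write_pointer), zone_start + zone_len)


def alloc_in_range(free_extents, count, lo, hi):
    """First-fit allocation restricted to [lo, hi).

    free_extents: list of (start, length) free extents.
    Returns the start block, or None if nothing inside the range fits.
    """
    for start, length in sorted(free_extents):
        s = max(start, lo)
        e = min(start + length, hi)
        if e - s >= count:
            return s
    return None


lo, hi = usable_range(zone_start=1000, zone_len=100, write_pointer=1040)
print(alloc_in_range([(900, 50), (1010, 90)], 20, lo, hi))
```

Widening [lo, hi) to span several zones of similar state would give the
"wider set of zones" case without changing the search itself.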

Anyways, I don't want to hijack this thread too much. :) I might send
you something separately for a sanity check or brainstorming purposes.

> > == Zone cleaner
> > 
> > - Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
> > figure out what it's supposed to say. ;)
> > 
> > - The idea sounds sane, but the dependency on userspace for a critical
> > fs mechanism sounds a bit scary to be honest. Is in kernel allocation
> > going to throttle/depend on background work in the userspace cleaner in
> > the event of low writeable free space?
> 
> Of course. ENOSPC always throttles ;)
> 

Heh. :)

> I expect the cleaner will work zone group at a time; locking new,
> non-cleaner based allocations out of the zone group while it cleans
> zones. This means the cleaner should always be able to make progress
> w.r.t. ENOSPC - it gets triggered on a zone group before it runs out
> of clean zones for freespace defrag purposes....
> 

There are some interesting allocation dynamics going on here that aren't
fully clear to me. E.g., on the one hand we want zone groups to be
fairly large to help manage the zone count, on the other we're
potentially locking out a TB-sized zone group at a time while the
userspace tool does its thing..? I take it this means we'll also want
some way to actually do zone-cleaning allocations (i.e., the extents
copied from the cleaned zones) from this zone from the userspace tool
while other general users are locked out. Even with that, incorporating
any kind of locality into the allocator seems futile if the target zone
group for an independently active file could be locked down at any given
point in time.

Maybe 256MB zone groups means that's less of a practical issue..? I'm
probably reading too far into it at this point... :P

> I also expect that the cleaner won't be used in many bulk storage
> applications as data is never deleted. I also expect that XFS-SMR
> won't be used for general purpose storage applications - that's what
> solid state storage will be used for - and so the cleaner is not
> something we need to focus a lot of time and effort on.
> 
> And the thing that distributed storage guys should love: if we put
> the cleaner in userspace, then they can *write their own cleaners*
> that are customised to their own storage algorithms.
> 
> > What if that userspace thing
> > dies, etc.? I suppose an implementation with as much mechanism in libxfs
> > as possible allows us greatest flexibility to go in either direction
> > here.
> 
> If the cleaner dies or can't make progress, we ENOSPC. Whether the
> cleaner is in kernel or userspace is irrelevant to how we handle
> such cases.
> 
> > - I'm also wondering how much real overlap there is in xfs_fsr (another
> > thing I haven't really looked at :) beyond that it calls swapext.
> > E.g., cleaning a zone sounds like it must map back to N files that could
> > have allocated extents in the zone vs. considering individual files for
> > defragmentation, fragmentation of the parent file may not be as much of
> > a consideration as resetting zones, etc. It sounds like a separate tool
> > might be warranted, even if there is code to steal from fsr. :)
> 
> As I implied above, zone cleaning is addressing exactly the same
> problem as we are currently working on in xfs_fsr: defragmenting
> free space.
> 

Ah, Ok. That is an interesting connection. There also seems to be an
interesting correlation between zone cleaning and overwrite handling +
unlink/truncate + discard handling (if you represent a zone with an
inode that tracks a particular fsb range and references "stale" blocks
before they are ultimately freed).

> > == Reverse mapping btrees
> > 
> > - This is something I still need to grok, perhaps just because the rmap
> > code isn't available yet. But I'll note that this does seem like
> > another bit that could be unnecessary if we could get away with using
> > the traditional allocator.
> > 
> > == Mkfs
> > 
> > - We have references to the "metadata device" as well as random write
> > regions. Similar to my question above, is there an expectation of a
> > separate physical metadata device or is that terminology for the random
> > write regions?
> 
> "metadata device" == "data device" == "CMR" == "random write region"
> 
> > Finally, some general/summary notes:
> > 
> > - Some kind of data structure outline would eventually make a nice
> > addition to this document. I understand it's probably too early yet,
> > but we are talking about new per-zone inodes, new and interesting
> > relationships between AGs and zones (?), etc. Fine grained detail is not
> > required, but an outline or visual that describes the high-level
> > mappings goes a long way to facilitate reasoning about the design.
> 
> Sure, a plane flight is not long enough to do this. Future
> revisions, as the structure is clarified.
> 

Of course. :)

> > - A big question I had (and something that is touched on down thread wrt
> > embedded flash) is whether the random write zones are runtime
> > configurable. If so, couldn't this facilitate use of existing AG
> > metadata (now that I think of it, it's not clear to me whether the
> > realtime mechanism excludes or coexists with AGs)?
> 
> The "realtime device" contains only user data. It contains no
> filesystem metadata at all. That separation of user data and
> filesystem metadata is what makes it so appealing for supporting SMR
> devices....
> 
> > IOW, we obviously
> > need this kind of space for inodes, dirs, xattrs, btrees, etc.
> > regardless. It would be interesting if we had the added flexibility to
> > align it with AGs.
> 
> I'm trying to keep the solution as simple as possible. No alignment,
> single whole disk only, metadata in the "data device" on CMR and
> user data in "real time" zones on SMR.
> 

Understood. From the commentary here and our IRC discussion, my
takeaway is that the primary objective is to get to some kind of SMR capable
solution sooner rather than later. Beyond that, you have concerns about
the complexity of making the current format work with SMR drives. That
all sounds reasonable to me.

I get a bit more concerned when we start talking about implementing
solutions to the same problems we've mostly solved with the existing
algorithms, such as zone reservation vs. preallocation, zone group
rotoring vs. ag rotoring, etc. At some point, I think it will be worth
taking a harder look at whether we could reuse the more traditional
layout and algorithms...

Brian

> > diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
> > index dd959ab..2fea88f 100644
> 
> Oh, there's a patch. Thanks! ;)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
index dd959ab..2fea88f 100644
--- a/design/xfs-smr-structure.asciidoc
+++ b/design/xfs-smr-structure.asciidoc
@@ -95,7 +95,7 @@  going to need a special directory to expose this information. It would be useful
 to have a ".zones" directory hanging off the root directory that contains all
 the zone allocation inodes so userspace can simply open them.
 
-THis biggest issue that has come to light here is the number of zones in a
+This biggest issue that has come to light here is the number of zones in a
 device. Zones are typically 256MB in size, and so we are looking at 4,000
 zones/TB. For a 10TB drive, that's 40,000 zones we have to keep track of. And if
 the devices keep getting larger at the expected rate, we're going to have to
@@ -112,24 +112,24 @@  also have other benefits...
 While it seems like tracking free space is trivial for the purposes of
 allocation (and it is!), the complexity comes when we start to delete or
 overwrite data. Suddenly zones no longer contain contiguous ranges of valid
-data; they have "freed" extents in the middle of them that contian stale data.
+data; they have "freed" extents in the middle of them that contain stale data.
 We can't use that "stale space" until the entire zone is made up of "stale"
 extents. Hence we need a Cleaner.
 
 === Zone Cleaner
 
 The purpose of the cleaner is to find zones that are mostly stale space and
-consolidate the remaining referenced data into a new, contigious zone, enabling
+consolidate the remaining referenced data into a new, contiguous zone, enabling
 us to then "clean" the stale zone and make it available for writing new data
 again.
 
-The real complexity here is finding the owner of the data that needs to be move,
-but we are in the process of solving that with the reverse mapping btree and
-parent pointer functionality. This gives us the mechanism by which we can
+The real complexity here is finding the owner of the data that needs to be
+moved, but we are in the process of solving that with the reverse mapping btree
+and parent pointer functionality. This gives us the mechanism by which we can
 quickly re-organise files that have extents in zones that need cleaning.
 
 The key word here is "reorganise". We have a tool that already reorganises file
-layout: xfs_fsr. The "Cleaner" is a finely targetted policy for xfs_fsr -
+layout: xfs_fsr. The "Cleaner" is a finely targeted policy for xfs_fsr -
 instead of trying to minimise fixpel fragments, it finds zones that need
 cleaning by reading their summary info from the /.zones/ directory and analysing
 the free bitmap state if there is a high enough percentage of stale blocks. From
@@ -142,7 +142,7 @@  Hence we don't actually need any major new data moving functionality in the
 kernel to enable this, except maybe an event channel for the kernel to tell
 xfs_fsr it needs to do some cleaning work.
 
-If we arrange zones into zoen groups, we also have a method for keeping new
+If we arrange zones into zone groups, we also have a method for keeping new
 allocations out of regions we are re-organising. That is, we need to be able to
 mark zone groups as "read only" so the kernel will not attempt to allocate from
 them while the cleaner is running and re-organising the data within the zones in
@@ -166,17 +166,17 @@  inode to track the zone's owner information.
 == Mkfs
 
 Mkfs is going to have to integrate with the userspace zbc libraries to query the
-layout of zones from the underlying disk and then do some magic to lay out al
+layout of zones from the underlying disk and then do some magic to lay out all
 the necessary metadata correctly. I don't see there being any significant
 challenge to doing this, but we will need a stable libzbc API to work with and
-it will need ot be packaged by distros.
+it will need to be packaged by distros.
 
-If mkfs cannot find ensough random write space for the amount of metadata we
-need to track all the space in the sequential write zones and a decent amount of
-internal fielsystem metadata (inodes, etc) then it will need to fail. Drive
-vendors are going to need to provide sufficient space in these regions for us
-to be able to make use of it, otherwise we'll simply not be able to do what we
-need to do.
+If mkfs cannot find enough random write space for the amount of metadata we need
+to track all the space in the sequential write zones and a decent amount of
+internal filesystem metadata (inodes, etc) then it will need to fail. Drive
+vendors are going to need to provide sufficient space in these regions for us to
+be able to make use of it, otherwise we'll simply not be able to do what we need
+to do.
 
 mkfs will need to initialise all the zone allocation inodes, reset all the zone
 write pointers, create the /.zones directory, place the log in an appropriate
@@ -187,13 +187,13 @@  place and initialise the metadata device as well.
 Because we've limited the metadata to a section of the drive that can be
 overwritten, we don't have to make significant changes to xfs_repair. It will
 need to be taught about the multiple zone allocation bitmaps for it's space
-reference checking, but otherwise all the infrastructure we need ifor using
+reference checking, but otherwise all the infrastructure we need for using
 bitmaps for verifying used space should already be there.
 
-THere be dragons waiting for us if we don't have random write zones for
+There be dragons waiting for us if we don't have random write zones for
 metadata. If that happens, we cannot repair metadata in place and we will have
 to redesign xfs_repair from the ground up to support such functionality. That's
-jus tnot going to happen, so we'll need drives with a significant amount of
+just not going to happen, so we'll need drives with a significant amount of
 random write space for all our metadata......
 
 == Quantification of Random Write Zone Capacity
@@ -214,7 +214,7 @@  performance, replace the CMR region with a SSD....
 
 The allocator will need to learn about multiple allocation zones based on
 bitmaps. They aren't really allocation groups, but the initialisation and
-iteration of them is going to be similar to allocation groups. To get use going
+iteration of them is going to be similar to allocation groups. To get us going
 we can do some simple mapping between inode AG and data AZ mapping so that we
 keep some form of locality to related data (e.g. grouping of data by parent
 directory).
@@ -273,19 +273,19 @@  location, the current location or anywhere in between. The only guarantee that
 we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
 least be in a position at or past the location of the fsync.
 
-Hence before a filesystem runs journal recovery, all it's zone allocation write
+Hence before a filesystem runs journal recovery, all its zone allocation write
 pointers need to be set to what the drive thinks they are, and all of the zone
 allocation beyond the write pointer need to be cleared. We could do this during
 log recovery in kernel, but that means we need full ZBC awareness in log
 recovery to iterate and query all the zones.
 
-Hence it's not clear if we want to do this in userspace as that has it's own
-problems e.g. we'd need to  have xfs.fsck detect that it's a smr filesystem and
+Hence it's not clear if we want to do this in userspace as that has its own
+problems e.g. we'd need to  have xfs.fsck detect that it's an smr filesystem and
 perform that recovery, or write a mount.xfs helper that does it prior to
 mounting the filesystem. Either way, we need to synchronise the on-disk
 filesystem state to the internal disk zone state before doing anything else.
 
-This needs more thought, because I have a nagging suspiscion that we need to do
+This needs more thought, because I have a nagging suspicion that we need to do
 this write pointer resynchronisation *after log recovery* has completed so we
 can determine if we've got to now go and free extents that the filesystem has
 allocated and are referenced by some inode out there. This, again, will require