diff mbox series

mkfs.xfs: don't go into multidisk mode if there is only one stripe

Message ID 20181004175839.18736-1-idryomov@gmail.com (mailing list archive)
State Deferred, archived
Headers show
Series mkfs.xfs: don't go into multidisk mode if there is only one stripe | expand

Commit Message

Ilya Dryomov Oct. 4, 2018, 5:58 p.m. UTC
rbd devices report the following geometry:

  $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
  512
  512
  4194304
  4194304

(4M is unnecessarily high and will probably be made configurable and
changed to 64K in the future.  By default, the new bluestore backend
does double-write for I/Os smaller than 64K.)

If pbsz != iomin, mkfs.xfs goes into multidisk mode and, under the
assumption that larger multidisk filesystems will have more devices,
chooses a higher agcount.  Though rbd devices are indeed backed by
multiple OSD devices, it appears that high agcount actually degrades
the performance with multiple rbd devices on the same host.

Commit 9a106b5fbb88 ("mkfs.xfs: Don't stagger AG for a single disk")
has set a precedent for treating sunit == swidth specially.  Take it
one step further.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 mkfs/xfs_mkfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Eric Sandeen Oct. 4, 2018, 6:33 p.m. UTC | #1
On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> rbd devices report the following geometry:
> 
>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
>   512
>   512
>   4194304
>   4194304
> 
> (4M is unnecessarily high and will probably be made configurable and
> changed to 64K in the future.  By default, the new bluestore backend
> does double-write for I/Os smaller than 64K.)
> 
> If pbsz != iomin, mkfs.xfs goes into multidisk mode and, under the
> assumption that larger multidisk filesystems will have more devices,
> chooses a higher agcount.  Though rbd devices are indeed backed by
> multiple OSD devices, it appears that high agcount actually degrades
> the performance with multiple rbd devices on the same host.
> 
> Commit 9a106b5fbb88 ("mkfs.xfs: Don't stagger AG for a single disk")
> has set a precedent for treating sunit == swidth specially.  Take it
> one step further.
> 
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> ---
>  mkfs/xfs_mkfs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 2e53c1e83b6a..c3efa30005a2 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -2650,8 +2650,8 @@ _("agsize (%s) not a multiple of fs blk size (%d)\n"),
>  				(cfg->dblocks % cfg->agcount != 0);
>  	} else {
>  		calc_default_ag_geometry(cfg->blocklog, cfg->dblocks,
> -					 cfg->dsunit, &cfg->agsize,
> -					 &cfg->agcount);
> +					 cfg->dsunit != cfg->dswidth,
> +					 &cfg->agsize, &cfg->agcount);
>  	}
>  }

I think this makes reasonable sense - it's tough to argue that storage is
advertising parallelism (multidisk) if swidth == sunit, IMHO.  I'd be curious
to know what others think though, and I may go back and review the definitions
of what these values actually mean and how we choose to interpret them :)

TBH though if the functionality is OK (and I think it is) I'd rather pass
dsunit and dswidth both to the function, and let /it/ make the determination
about multidisk as opposed to leaving  that up to the callers.  Seems like
I had some other reason to do that as well, though I'm not remembering it
right now ...

-Eric
Ilya Dryomov Oct. 4, 2018, 6:56 p.m. UTC | #2
On Thu, Oct 4, 2018 at 8:33 PM Eric Sandeen <sandeen@sandeen.net> wrote:
>
> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> > rbd devices report the following geometry:
> >
> >   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> >   512
> >   512
> >   4194304
> >   4194304
> >
> > (4M is unnecessarily high and will probably be made configurable and
> > changed to 64K in the future.  By default, the new bluestore backend
> > does double-write for I/Os smaller than 64K.)
> >
> > If pbsz != iomin, mkfs.xfs goes into multidisk mode and, under the
> > assumption that larger multidisk filesystems will have more devices,
> > chooses a higher agcount.  Though rbd devices are indeed backed by
> > multiple OSD devices, it appears that high agcount actually degrades
> > the performance with multiple rbd devices on the same host.
> >
> > Commit 9a106b5fbb88 ("mkfs.xfs: Don't stagger AG for a single disk")
> > has set a precedent for treating sunit == swidth specially.  Take it
> > one step further.
> >
> > Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> > ---
> >  mkfs/xfs_mkfs.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> > index 2e53c1e83b6a..c3efa30005a2 100644
> > --- a/mkfs/xfs_mkfs.c
> > +++ b/mkfs/xfs_mkfs.c
> > @@ -2650,8 +2650,8 @@ _("agsize (%s) not a multiple of fs blk size (%d)\n"),
> >                               (cfg->dblocks % cfg->agcount != 0);
> >       } else {
> >               calc_default_ag_geometry(cfg->blocklog, cfg->dblocks,
> > -                                      cfg->dsunit, &cfg->agsize,
> > -                                      &cfg->agcount);
> > +                                      cfg->dsunit != cfg->dswidth,
> > +                                      &cfg->agsize, &cfg->agcount);
> >       }
> >  }
>
> I think this makes reasonable sense - it's tough to argue that storage is
> advertising parallelism (multidisk) if swidth == sunit, IMHO.  I'd be curious
> to know what others think though, and I may go back and review the definitions
> of what these values actually mean and how we choose to interpret them :)
>
> TBH though if the functionality is OK (and I think it is) I'd rather pass
> dsunit and dswidth both to the function, and let /it/ make the determination
> about multidisk as opposed to leaving  that up to the callers.  Seems like
> I had some other reason to do that as well, though I'm not remembering it
> right now ...

Yeah, there is another slightly inconsistent caller in repair/sb.c
(although "dsunit" and "dswidth | dsunit" are probably always the same
in practice, given the checks in xfs_mkfs.c).  This is more of an RFC,
I'll make the change in v2.

Thanks,

                Ilya
Dave Chinner Oct. 4, 2018, 10:29 p.m. UTC | #3
On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> > rbd devices report the following geometry:
> > 
> >   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> >   512
> >   512
> >   4194304
> >   4194304

dm-thinp does this as well. THis is from the thinp device created
by tests/generic/459:

512
4096
65536
65536

And I've also seen some hardware raid controllers do this, too,
because they only expose the stripe width in their enquiry page
rather than stripe unit and stripe width.

IOWs, this behaviour isn't really specific to Ceph's rbd device, and
it does occur on multi-disk devices that have something layered over
the top (dm-thinp, hardware raid, etc). As such, I don't think
there's a "one size fits all" solution and so someone is going to
have to tweak mkfs settings to have it do the right thing for their
storage subsystem....

Indeed, does Ceph really needs 4MB aligned filesystem IO? What
performance benefit does that give over setting iomin=PAGE_SIZE and
ioopt=0? (i.e mkfs.xfs -d sunit=0,swidth=0 or just mounting with -o
noalign)


Cheers,

Dave.
Ilya Dryomov Oct. 5, 2018, 11:27 a.m. UTC | #4
On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> > On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> > > rbd devices report the following geometry:
> > >
> > >   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> > >   512
> > >   512
> > >   4194304
> > >   4194304
>
> dm-thinp does this as well. THis is from the thinp device created
> by tests/generic/459:
>
> 512
> 4096
> 65536
> 65536

(adding Mike)

... and that 300M filesystem ends up with 8 AGs, when normally you get
4 AGs for anything less than 4T.  Is that really intended?

AFAIK dm-thinp reports these values for the same exact reason as rbd:
we are passing up the information about the efficient I/O size.  In the
case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
top of a RAID array, I suspect it would pass up the array's preferred
sizes, as long as they are a proper factor of the thinp block size.

The high agcount on dm-thinp has come up before and you suggested that
dm-thinp should report iomin == ioopt (i.e. sunit == swidth).  If that
was the right fix back in 2014, mkfs.xfs must have regressed:

  https://marc.info/?l=linux-xfs&m=137783388617206&w=2
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fdfb4c8c1a9fc8dd8cf8eeb4e3ed83573b375285

>
> And I've also seen some hardware raid controllers do this, too,
> because they only expose the stripe width in their enquiry page
> rather than stripe unit and stripe width.
>
> IOWs, this behaviour isn't really specific to Ceph's rbd device, and
> it does occur on multi-disk devices that have something layered over
> the top (dm-thinp, hardware raid, etc). As such, I don't think
> there's a "one size fits all" solution and so someone is going to
> have to tweak mkfs settings to have it do the right thing for their
> storage subsystem....

FWIW I was surprised to see that calc_default_ag_geometry() doesn't
look at swidth and just assumes that there will be "more parallelism
available".  I expected it to be based on swidth to sunit ratio (i.e.
sw).  sw is supposed to be the multiplier equal to the number of
data-bearing disks, so it's the first thing that comes to mind for
a parallelism estimate.

I'd argue that hardware RAID administrators are much more likely to pay
attention to the output of mkfs.xfs and be able to tweak the settings
to work around broken controllers that only expose stripe width.

>
> Indeed, does Ceph really needs 4MB aligned filesystem IO? What
> performance benefit does that give over setting iomin=PAGE_SIZE and
> ioopt=0? (i.e mkfs.xfs -d sunit=0,swidth=0 or just mounting with -o
> noalign)

As I said in the patch, 4M is unnecessarily high.  However, bluestore
is typically configured with a 64K block size and does journal anything
smaller than that.

Thanks,

                Ilya
Eric Sandeen Oct. 5, 2018, 1:51 p.m. UTC | #5
On 10/5/18 6:27 AM, Ilya Dryomov wrote:
> On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
>>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
>>>> rbd devices report the following geometry:
>>>>
>>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
>>>>   512
>>>>   512
>>>>   4194304
>>>>   4194304
>>
>> dm-thinp does this as well. THis is from the thinp device created
>> by tests/generic/459:
>>
>> 512
>> 4096
>> 65536
>> 65536
> 
> (adding Mike)
> 
> ... and that 300M filesystem ends up with 8 AGs, when normally you get
> 4 AGs for anything less than 4T.  Is that really intended?

Well, yes.  Multi-disk mode gives you more AGs, how many more is scaled
by fs size.

        /*
         * For the multidisk configs we choose an AG count based on the number
         * of data blocks available, trying to keep the number of AGs higher
         * than the single disk configurations. This makes the assumption that
         * larger filesystems have more parallelism available to them.
         */

For really tiny filesystems we cut down the number of AGs, but in general
if the storage "told" us it has parallelism, mkfs uses it by default.

> AFAIK dm-thinp reports these values for the same exact reason as rbd:
> we are passing up the information about the efficient I/O size.  In the
> case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> top of a RAID array, I suspect it would pass up the array's preferred
> sizes, as long as they are a proper factor of the thinp block size.
> 
> The high agcount on dm-thinp has come up before and you suggested that
> dm-thinp should report iomin == ioopt (i.e. sunit == swidth).  If that
> was the right fix back in 2014, mkfs.xfs must have regressed:
> 
>   https://marc.info/?l=linux-xfs&m=137783388617206&w=2

Well, that question started out as being about higher AG counts.  And
it did come from the stripe geometry advertised, but I don't think
Dave's suggestion was intended to keep the AG count lower, it was just
explaining that the values seemed wrong.  It was a bit of an odd situation,
the existence of stripe geometry bumped up AG count but then mkfs decided
that a 512 byte "stripe unit" wasn't legit, and set it back to zero.  But
kept the higher AG count.  So there a few related issues there.

>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fdfb4c8c1a9fc8dd8cf8eeb4e3ed83573b375285

So that sets them to the same size, but I don't think that change was
ever expected to change the AG count heuristic behavior, which has been
there for probably a decade or more.

>>
>> And I've also seen some hardware raid controllers do this, too,
>> because they only expose the stripe width in their enquiry page
>> rather than stripe unit and stripe width.

(which should be considered semi broken hardware, no?)

>> IOWs, this behaviour isn't really specific to Ceph's rbd device, and
>> it does occur on multi-disk devices that have something layered over
>> the top (dm-thinp, hardware raid, etc). As such, I don't think
>> there's a "one size fits all" solution and so someone is going to
>> have to tweak mkfs settings to have it do the right thing for their
>> storage subsystem....
> 
> FWIW I was surprised to see that calc_default_ag_geometry() doesn't
> look at swidth and just assumes that there will be "more parallelism
> available".  I expected it to be based on swidth to sunit ratio (i.e.
> sw).  sw is supposed to be the multiplier equal to the number of
> data-bearing disks, so it's the first thing that comes to mind for
> a parallelism estimate.
> 
> I'd argue that hardware RAID administrators are much more likely to pay
> attention to the output of mkfs.xfs and be able to tweak the settings
> to work around broken controllers that only expose stripe width.

Yeah, this starts to get a little philosophical.  We don't want to second
guess geometry or try to figure out what the raid array "really meant" if
it's sending weird numbers. [1]

But at the end of the day, it seems reasonable to always apply the
"swidth/sunit = number of data disks" rule  (which we apply in reverse when
we tell people how to manually figure out stripe widths) and stop treating
sunit==swidth as any indication of parallelism.

Dave, do you have any problem with changing the behavior to only go into
multidisk if swidth > sunit?  The more I think about it, the more it makes
sense to me.

-Eric


[1] for example,
    99a3f52 mkfs: avoid divide-by-zero when hardware reports optimal i/o size as 0
Mike Snitzer Oct. 5, 2018, 2:50 p.m. UTC | #6
On Fri, Oct 05 2018 at  7:27am -0400,
Ilya Dryomov <idryomov@gmail.com> wrote:

> On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> > > On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> > > > rbd devices report the following geometry:
> > > >
> > > >   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> > > >   512
> > > >   512
> > > >   4194304
> > > >   4194304
> >
> > dm-thinp does this as well. THis is from the thinp device created
> > by tests/generic/459:
> >
> > 512
> > 4096
> > 65536
> > 65536
> 
> (adding Mike)
> 
> ... and that 300M filesystem ends up with 8 AGs, when normally you get
> 4 AGs for anything less than 4T.  Is that really intended?
> 
> AFAIK dm-thinp reports these values for the same exact reason as rbd:
> we are passing up the information about the efficient I/O size.  In the
> case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> top of a RAID array, I suspect it would pass up the array's preferred
> sizes, as long as they are a proper factor of the thinp block size.

Right, see pool_io_hints() for all the logic thinp uses to consume block
core's blk_stack_limits() provided limits.. thinp can override if the
underlying limits are _not_ a factor of the thinp's blocksize.

> The high agcount on dm-thinp has come up before and you suggested that
> dm-thinp should report iomin == ioopt (i.e. sunit == swidth).  If that
> was the right fix back in 2014, mkfs.xfs must have regressed:
> 
>   https://marc.info/?l=linux-xfs&m=137783388617206&w=2
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fdfb4c8c1a9fc8dd8cf8eeb4e3ed83573b375285

Yeah, if we're getting larger AG count again, certainly seems like
mkfs.xfs regressed.

Mike
Eric Sandeen Oct. 5, 2018, 2:55 p.m. UTC | #7
On 10/5/18 9:50 AM, Mike Snitzer wrote:
> On Fri, Oct 05 2018 at  7:27am -0400,
> Ilya Dryomov <idryomov@gmail.com> wrote:
> 
>> On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
>>>
>>> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
>>>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
>>>>> rbd devices report the following geometry:
>>>>>
>>>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
>>>>>   512
>>>>>   512
>>>>>   4194304
>>>>>   4194304
>>>
>>> dm-thinp does this as well. THis is from the thinp device created
>>> by tests/generic/459:
>>>
>>> 512
>>> 4096
>>> 65536
>>> 65536
>>
>> (adding Mike)
>>
>> ... and that 300M filesystem ends up with 8 AGs, when normally you get
>> 4 AGs for anything less than 4T.  Is that really intended?
>>
>> AFAIK dm-thinp reports these values for the same exact reason as rbd:
>> we are passing up the information about the efficient I/O size.  In the
>> case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
>> top of a RAID array, I suspect it would pass up the array's preferred
>> sizes, as long as they are a proper factor of the thinp block size.
> 
> Right, see pool_io_hints() for all the logic thinp uses to consume block
> core's blk_stack_limits() provided limits.. thinp can override if the
> underlying limits are _not_ a factor of the thinp's blocksize.
> 
>> The high agcount on dm-thinp has come up before and you suggested that
>> dm-thinp should report iomin == ioopt (i.e. sunit == swidth).  If that
>> was the right fix back in 2014, mkfs.xfs must have regressed:
>>
>>   https://marc.info/?l=linux-xfs&m=137783388617206&w=2
>>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fdfb4c8c1a9fc8dd8cf8eeb4e3ed83573b375285
> 
> Yeah, if we're getting larger AG count again, certainly seems like
> mkfs.xfs regressed.

Our emails crossed ;) - no, there's no regression.  Dave's comment about
the thinp values weren't related to the mkfs AG heuristic.

-Eric
Ilya Dryomov Oct. 5, 2018, 5:21 p.m. UTC | #8
On Fri, Oct 5, 2018 at 4:55 PM Eric Sandeen <sandeen@sandeen.net> wrote:
>
>
>
> On 10/5/18 9:50 AM, Mike Snitzer wrote:
> > On Fri, Oct 05 2018 at  7:27am -0400,
> > Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> >> On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
> >>>
> >>> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> >>>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> >>>>> rbd devices report the following geometry:
> >>>>>
> >>>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> >>>>>   512
> >>>>>   512
> >>>>>   4194304
> >>>>>   4194304
> >>>
> >>> dm-thinp does this as well. THis is from the thinp device created
> >>> by tests/generic/459:
> >>>
> >>> 512
> >>> 4096
> >>> 65536
> >>> 65536
> >>
> >> (adding Mike)
> >>
> >> ... and that 300M filesystem ends up with 8 AGs, when normally you get
> >> 4 AGs for anything less than 4T.  Is that really intended?
> >>
> >> AFAIK dm-thinp reports these values for the same exact reason as rbd:
> >> we are passing up the information about the efficient I/O size.  In the
> >> case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> >> top of a RAID array, I suspect it would pass up the array's preferred
> >> sizes, as long as they are a proper factor of the thinp block size.
> >
> > Right, see pool_io_hints() for all the logic thinp uses to consume block
> > core's blk_stack_limits() provided limits.. thinp can override if the
> > underlying limits are _not_ a factor of the thinp's blocksize.
> >
> >> The high agcount on dm-thinp has come up before and you suggested that
> >> dm-thinp should report iomin == ioopt (i.e. sunit == swidth).  If that
> >> was the right fix back in 2014, mkfs.xfs must have regressed:
> >>
> >>   https://marc.info/?l=linux-xfs&m=137783388617206&w=2
> >>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fdfb4c8c1a9fc8dd8cf8eeb4e3ed83573b375285
> >
> > Yeah, if we're getting larger AG count again, certainly seems like
> > mkfs.xfs regressed.
>
> Our emails crossed ;) - no, there's no regression.  Dave's comment about
> the thinp values weren't related to the mkfs AG heuristic.

I was going by "This fixes an issue reported where mkfs.xfs would
create more XFS Allocation Groups on thinp volumes than on a normal
linear LV of comparable size" in the commit message (Jul 2014, 3.17
kernel).

I've just went through xfsprogs releases v3.1.8 (Mar 2012) through
v3.2.2 (Dec 2014) with dm-thinp on 3.18.108 kernel and they all do the
same high agcount thing, so there appears to be no regression indeed.

Thanks,

                Ilya
Dave Chinner Oct. 5, 2018, 11:27 p.m. UTC | #9
On Fri, Oct 05, 2018 at 08:51:59AM -0500, Eric Sandeen wrote:
> On 10/5/18 6:27 AM, Ilya Dryomov wrote:
> > On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> >>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> >>>> rbd devices report the following geometry:
> >>>>
> >>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> >>>>   512
> >>>>   512
> >>>>   4194304
> >>>>   4194304
> >>
> >> dm-thinp does this as well. THis is from the thinp device created
> >> by tests/generic/459:
> >>
> >> 512
> >> 4096
> >> 65536
> >> 65536
> > 
> > (adding Mike)
> > 
> > ... and that 300M filesystem ends up with 8 AGs, when normally you get
> > 4 AGs for anything less than 4T.  Is that really intended?
> 
> Well, yes.  Multi-disk mode gives you more AGs, how many more is scaled
> by fs size.
> 
>         /*
>          * For the multidisk configs we choose an AG count based on the number
>          * of data blocks available, trying to keep the number of AGs higher
>          * than the single disk configurations. This makes the assumption that
>          * larger filesystems have more parallelism available to them.
>          */
> 
> For really tiny filesystems we cut down the number of AGs, but in general
> if the storage "told" us it has parallelism, mkfs uses it by default.

We only keep the number of AGs down on single disks because of the
seek penalty it causes spinning disks. It's a trade off between
parallelism and seek time.

> > AFAIK dm-thinp reports these values for the same exact reason as rbd:
> > we are passing up the information about the efficient I/O size.  In the
> > case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> > top of a RAID array, I suspect it would pass up the array's preferred
> > sizes, as long as they are a proper factor of the thinp block size.

dm-thinp is passing up it's allocation chunk size, not the
underlying device geometry. dm-thinp might be tuning it's chunk size
to match the underlying storage, but that's irrelevant to XFS.

That's because dm-thinp is a virtual mapping device in the same way
the OS provides virtually mapped memory to users. That it, there is
no relationship between the block device address space index and the
location on disk. Hence the seek times between different regions of
the block device address space are not linear or predictable. 

Hence dm-thinp completely changes the parallelism vs seek time
trade-off the filesystem layout makes.  We can't optimise for
minimal seek time anymore because we don't know the physical layout
of the storage, so all we care about is alignment to the block
device chunk size.

i.e. what we want to do is give dm-thinp IO that is optimal (e.g.
large aligned writes for streaming IO) and we don't want to leave
lots of little unused holes in the dmthinp mapping that waste space.
To do this, we need to ensure minimal allocator contention occurs,
and hence we allow more concurrency in allocation by inreasing the
AG count, knowing that we can't make the seek time problem any worse
by doing this.

i.e. we're not using sunit/swidth on dm-thinp to optimise physical
device layout. We're using it to optimise for contiguous space usage
and minimise the allocation load on dm-thinp. Optimising the layout
for physical storage is dm-thinp's problem, not ours.

> >> And I've also seen some hardware raid controllers do this, too,
> >> because they only expose the stripe width in their enquiry page
> >> rather than stripe unit and stripe width.
> 
> (which should be considered semi broken hardware, no?)

Yes. That should be fixed by the vendor or with mkfs CLI options.
We're not going to change default beahviour to cater for broken
hardware.

> >> IOWs, this behaviour isn't really specific to Ceph's rbd device, and
> >> it does occur on multi-disk devices that have something layered over
> >> the top (dm-thinp, hardware raid, etc). As such, I don't think
> >> there's a "one size fits all" solution and so someone is going to
> >> have to tweak mkfs settings to have it do the right thing for their
> >> storage subsystem....
> > 
> > FWIW I was surprised to see that calc_default_ag_geometry() doesn't
> > look at swidth and just assumes that there will be "more parallelism
> > available".  I expected it to be based on swidth to sunit ratio (i.e.
> > sw).  sw is supposed to be the multiplier equal to the number of
> > data-bearing disks, so it's the first thing that comes to mind for
> > a parallelism estimate.
> > 
> > I'd argue that hardware RAID administrators are much more likely to pay
> > attention to the output of mkfs.xfs and be able to tweak the settings
> > to work around broken controllers that only expose stripe width.
> 
> Yeah, this starts to get a little philosophical.  We don't want to second
> guess geometry or try to figure out what the raid array "really meant" if
> it's sending weird numbers. [1]
> 
> But at the end of the day, it seems reasonable to always apply the
> "swidth/sunit = number of data disks" rule  (which we apply in reverse when
> we tell people how to manually figure out stripe widths) and stop treating
> sunit==swidth as any indication of parallelism.

But swidth/sunit does not mean "number of data disks".

They represent a pair of alignment constraints that indicate how we
should align larger objects during allocation. Small objects are
filesystem block aligned, objects larger than "sunit" are sunit
aligned, and objects larger than swidth are swidth aligned if the
swalloc mount option is used.

These are /generic/ alignment characteristics. While they were
originally derived from RAID characteristics, they have far wider
scope of use than just for configuring RAID devices. e.g. thinp,
exposing image file extent size hints as filesystem allocation
alignments similar to thinp, selecting what aspect of a multi-level
stacked RAID made up of hundreds of disks the filesystem should
align to, aligning to internal SSD structures (be it raid, erase
page sizes, etc), optimising for OSD block sizes, remote replication
block size constraints, helping DAX align allocations to huge page
sizes, etc.

My point is that just looking at sunit/swidth as "the number of data
disks" completely ignores the many other uses we've found for it
over the last 20 years. In that time, it's almost always been the
case that devices requiring alignment have not been bound by the
seek time constraints of a single spinning spindle, and the default
behaviour reflects that.

> Dave, do you have any problem with changing the behavior to only go into
> multidisk if swidth > sunit?  The more I think about it, the more it makes
> sense to me.

Changing the existing behaviour doesn't make much sense to me. :)

Cheers,

Dave.
Ilya Dryomov Oct. 6, 2018, 12:17 p.m. UTC | #10
On Sat, Oct 6, 2018 at 1:27 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Oct 05, 2018 at 08:51:59AM -0500, Eric Sandeen wrote:
> > On 10/5/18 6:27 AM, Ilya Dryomov wrote:
> > > On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> > >>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> > >>>> rbd devices report the following geometry:
> > >>>>
> > >>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> > >>>>   512
> > >>>>   512
> > >>>>   4194304
> > >>>>   4194304
> > >>
> > >> dm-thinp does this as well. THis is from the thinp device created
> > >> by tests/generic/459:
> > >>
> > >> 512
> > >> 4096
> > >> 65536
> > >> 65536
> > >
> > > (adding Mike)
> > >
> > > ... and that 300M filesystem ends up with 8 AGs, when normally you get
> > > 4 AGs for anything less than 4T.  Is that really intended?
> >
> > Well, yes.  Multi-disk mode gives you more AGs, how many more is scaled
> > by fs size.
> >
> >         /*
> >          * For the multidisk configs we choose an AG count based on the number
> >          * of data blocks available, trying to keep the number of AGs higher
> >          * than the single disk configurations. This makes the assumption that
> >          * larger filesystems have more parallelism available to them.
> >          */
> >
> > For really tiny filesystems we cut down the number of AGs, but in general
> > if the storage "told" us it has parallelism, mkfs uses it by default.
>
> We only keep the number of AGs down on single disks because of the
> seek penalty it causes spinning disks. It's a trade off between
> parallelism and seek time.

If it's primarily about seek times, why aren't you looking at rotational
attribute for that?

>
> > > AFAIK dm-thinp reports these values for the same exact reason as rbd:
> > > we are passing up the information about the efficient I/O size.  In the
> > > case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> > > top of a RAID array, I suspect it would pass up the array's preferred
> > > sizes, as long as they are a proper factor of the thinp block size.
>
> dm-thinp is passing up it's allocation chunk size, not the
> underlying device geometry. dm-thinp might be tuning it's chunk size
> to match the underlying storage, but that's irrelevant to XFS.

I think the thinp chunk size is more about whether you just want thin
provisioning or plan to do a lot of snapshotting, etc.  dm-thinp passes
up the underlying device geometry if it's more demanding than the thinp
chunk size.  Here is dm-thinp with 64K chunk size on top of mdraid:

  $ blockdev --getss --getpbsz --getiomin --getioopt /dev/mapper/vg1-thin1
  512
  512
  524288
  1048576

>
> That's because dm-thinp is a virtual mapping device in the same way
> the OS provides virtually mapped memory to users. That it, there is
> no relationship between the block device address space index and the
> location on disk. Hence the seek times between different regions of
> the block device address space are not linear or predictable.
>
> Hence dm-thinp completely changes the parallelism vs seek time
> trade-off the filesystem layout makes.  We can't optimise for
> minimal seek time anymore because we don't know the physical layout
> of the storage, so all we care about is alignment to the block
> device chunk size.
>
> i.e. what we want to do is give dm-thinp IO that is optimal (e.g.
> large aligned writes for streaming IO) and we don't want to leave
> lots of little unused holes in the dmthinp mapping that waste space.
> To do this, we need to ensure minimal allocator contention occurs,
> and hence we allow more concurrency in allocation by inreasing the
> AG count, knowing that we can't make the seek time problem any worse
> by doing this.

And yet dm-thinp presents itself as rotational if (at least one of) the
underlying disk(s) is marked as rotational.

As it is, we get the nomultidisk trade-parallelism-for-seek-times
behaviour on bare SSD devices, but dm-thinp on top of a single HDD
device is regarded up to 8 (XFS_MULTIDISK_AGLOG - XFS_NOMULTIDISK_AGLOG)
times more parallel...

>
> i.e. we're not using sunit/swidth on dm-thinp to optimise physical
> device layout. We're using it to optimise for contiguous space usage
> and minimise the allocation load on dm-thinp. Optimising the layout
> for physical storage is dm-thinp's problem, not ours.
>
> > >> And I've also seen some hardware raid controllers do this, too,
> > >> because they only expose the stripe width in their enquiry page
> > >> rather than stripe unit and stripe width.
> >
> > (which should be considered semi broken hardware, no?)
>
> Yes. That should be fixed by the vendor or with mkfs CLI options.
> We're not going to change default beahviour to cater for broken
> hardware.
>
> > >> IOWs, this behaviour isn't really specific to Ceph's rbd device, and
> > >> it does occur on multi-disk devices that have something layered over
> > >> the top (dm-thinp, hardware raid, etc). As such, I don't think
> > >> there's a "one size fits all" solution and so someone is going to
> > >> have to tweak mkfs settings to have it do the right thing for their
> > >> storage subsystem....
> > >
> > > FWIW I was surprised to see that calc_default_ag_geometry() doesn't
> > > look at swidth and just assumes that there will be "more parallelism
> > > available".  I expected it to be based on swidth to sunit ratio (i.e.
> > > sw).  sw is supposed to be the multiplier equal to the number of
> > > data-bearing disks, so it's the first thing that comes to mind for
> > > a parallelism estimate.
> > >
> > > I'd argue that hardware RAID administrators are much more likely to pay
> > > attention to the output of mkfs.xfs and be able to tweak the settings
> > > to work around broken controllers that only expose stripe width.
> >
> > Yeah, this starts to get a little philosophical.  We don't want to second
> > guess geometry or try to figure out what the raid array "really meant" if
> > it's sending weird numbers. [1]
> >
> > But at the end of the day, it seems reasonable to always apply the
> > "swidth/sunit = number of data disks" rule  (which we apply in reverse when
> > we tell people how to manually figure out stripe widths) and stop treating
> > sunit==swidth as any indication of parallelism.
>
> But swidth/sunit does not mean "number of data disks".
>
> They represent a pair of alignment constraints that indicate how we
> should align larger objects during allocation. Small objects are
> filesystem block aligned, objects larger than "sunit" are sunit
> aligned, and objects larger than swidth are swidth aligned if the
> swalloc mount option is used.
>
> These are /generic/ alignment characteristics. While they were
> originally derived from RAID characteristics, they have far wider
> scope of use than just for configuring RAID devices. e.g. thinp,
> exposing image file extent size hints as filesystem allocation
> alignments similar to thinp, selecting what aspect of a multi-level
> stacked RAID made up of hundreds of disks the filesystem should
> align to, aligning to internal SSD structures (be it raid, erase
> page sizes, etc), optimising for OSD block sizes, remote replication
> block size constraints, helping DAX align allocations to huge page
> sizes, etc.

Exactly, they are generic data alignment characteristics useful for
both physical and virtual devices.  However, mkfs.xfs uses a heuristic
that conflates them with agcount through the physics of the underlying
device which it can't really reason about, especially in the virtual
or network case.

>
> My point is that just looking at sunit/swidth as "the number of data
> disks" completely ignores the many other uses we've found for it
> over the last 20 years. In that time, it's almost always been the
> case that devices requiring alignment have not been bound by the
> seek time constraints of a single spinning spindle, and the default
> behaviour reflects that.
>
> > Dave, do you have any problem with changing the behavior to only go into
> > multidisk if swidth > sunit?  The more I think about it, the more it makes
> > sense to me.
>
> Changing the existing behaviour doesn't make much sense to me. :)

The existing behaviour is to create 4 AGs on both spinning rust and
e.g. Intel DC P3700.  If I then put dm-thinp on top of that spinner,
it's suddenly deemed worthy of 32 AGs.  The issue here is that unlike
other filesystems, XFS is inherently parallel and perfectly capable of
subjecting it to 32 concurrent write streams.  This is pretty silly.

You agreed that broken RAID controllers that expose "sunit == swidth"
are their vendor's or administrator's problem.  The vast majority of
SSD devices in wide use either expose nothing or lie.  The information
about internal page size or erase block size is either hard to get or
not public.

Can you give an example of a use case that would be negatively affected
if this heuristic was switched from "sunit" to "sunit < swidth"?

Thanks,

                Ilya
Dave Chinner Oct. 6, 2018, 11:20 p.m. UTC | #11
On Sat, Oct 06, 2018 at 02:17:54PM +0200, Ilya Dryomov wrote:
> On Sat, Oct 6, 2018 at 1:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Oct 05, 2018 at 08:51:59AM -0500, Eric Sandeen wrote:
> > > On 10/5/18 6:27 AM, Ilya Dryomov wrote:
> > > > On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >>
> > > >> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> > > >>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> > > >>>> rbd devices report the following geometry:
> > > >>>>
> > > >>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> > > >>>>   512
> > > >>>>   512
> > > >>>>   4194304
> > > >>>>   4194304
> > > >>
> > > >> dm-thinp does this as well. THis is from the thinp device created
> > > >> by tests/generic/459:
> > > >>
> > > >> 512
> > > >> 4096
> > > >> 65536
> > > >> 65536
> > > >
> > > > (adding Mike)
> > > >
> > > > ... and that 300M filesystem ends up with 8 AGs, when normally you get
> > > > 4 AGs for anything less than 4T.  Is that really intended?
> > >
> > > Well, yes.  Multi-disk mode gives you more AGs, how many more is scaled
> > > by fs size.
> > >
> > >         /*
> > >          * For the multidisk configs we choose an AG count based on the number
> > >          * of data blocks available, trying to keep the number of AGs higher
> > >          * than the single disk configurations. This makes the assumption that
> > >          * larger filesystems have more parallelism available to them.
> > >          */
> > >
> > > For really tiny filesystems we cut down the number of AGs, but in general
> > > if the storage "told" us it has parallelism, mkfs uses it by default.
> >
> > We only keep the number of AGs down on single disks because of the
> > seek penalty it causes spinning disks. It's a trade off between
> > parallelism and seek time.
> 
> If it's primarily about seek times, why aren't you looking at rotational
> attribute for that?

Historically speaking, "rotational" hasn't been a reliable indicator
of device seek behaviour or alignment requirements. It was anasty
hack for people wanting to optimise for SSDs and most of those
optimisations were things we could already do with sunit/swidth
(such as aligning to internal SSD page sizes and/or erase blocks).

> > > > AFAIK dm-thinp reports these values for the same exact reason as rbd:
> > > > we are passing up the information about the efficient I/O size.  In the
> > > > case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> > > > top of a RAID array, I suspect it would pass up the array's preferred
> > > > sizes, as long as they are a proper factor of the thinp block size.
> >
> > dm-thinp is passing up it's allocation chunk size, not the
> > underlying device geometry. dm-thinp might be tuning it's chunk size
> > to match the underlying storage, but that's irrelevant to XFS.
> 
> I think the thinp chunk size is more about whether you just want thin
> provisioning or plan to do a lot of snapshotting, etc.  dm-thinp passes
> up the underlying device geometry if it's more demanding than the thinp
> chunk size.  Here is dm-thinp with 64K chunk size on top of mdraid:
> 
>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/mapper/vg1-thin1
>   512
>   512
>   524288
>   1048576

That's how iomin/ioopt are supposed to be propagated on layered
devices. i.e. the layer with the largest values bubbles to the top,
and the filesystem aligns to that.

That doesn't change the fact that thinp and other COW-based block
devices fundamentally isolate the filesystem from the physical
storage properties. The filesystem sees the result of the COW
behaviour in the block device and it's allocated algorithm, not the
physical block device properties.

Last time I looked, dm-thinp did first-free allocation, which means
it fills the block device from one end to the other regardless of
how many widely spaced IOs are in progress from the filesystems.
That means all the new writes end up being sequential from dm-thinp
rather than causing seek storms because they are being written to 32
different locations across the block device. IOWs, a properly
implemented COW-based thinp device should be able to handle much
higher random write IO workloads than if the filesystem was placed
directly on the same block device.

IOWs, dm-thinp does not behave how one expects a rotational device
to behave even when it is placed on a rotational device. We have to
optimise filesystem behaviour differently for dm-thinp.

> > That's because dm-thinp is a virtual mapping device in the same way
> > the OS provides virtually mapped memory to users. That it, there is
> > no relationship between the block device address space index and the
> > location on disk. Hence the seek times between different regions of
> > the block device address space are not linear or predictable.
> >
> > Hence dm-thinp completely changes the parallelism vs seek time
> > trade-off the filesystem layout makes.  We can't optimise for
> > minimal seek time anymore because we don't know the physical layout
> > of the storage, so all we care about is alignment to the block
> > device chunk size.
> >
> > i.e. what we want to do is give dm-thinp IO that is optimal (e.g.
> > large aligned writes for streaming IO) and we don't want to leave
> > lots of little unused holes in the dmthinp mapping that waste space.
> > To do this, we need to ensure minimal allocator contention occurs,
> > and hence we allow more concurrency in allocation by inreasing the
> > AG count, knowing that we can't make the seek time problem any worse
> > by doing this.
> 
> And yet dm-thinp presents itself as rotational if (at least one of) the
> underlying disk(s) is marked as rotational.

Which, as per above, means rotational devices don't all behave like
you'd expect a spinning spindle to behave. i.e. It's not an
indication of a specific, consistent device model that we can
optimise for.

> As it is, we get the nomultidisk trade-parallelism-for-seek-times
> behaviour on bare SSD devices, but dm-thinp on top of a single HDD
> device is regarded up to 8 (XFS_MULTIDISK_AGLOG - XFS_NOMULTIDISK_AGLOG)
> times more parallel...

Yes, that's expected. The single SSD case has to take into account
the really slow, cheap SSDs that aren't much better than spinning
disks right through to high end nvme drives.

It's easy to drown a slow SSD, just like it's easy to drown a single
spindle. But there's /very few/ applications that can drive a high
end nvme SSD to be allocation bound on a 4 AG XFS filesystem because
of how fast the IO is. As such, I'm yet to hear of reports of XFS
allocation concurrency bottlenecks in production workloads on nvme
SSDs.

Defaults are a trade off.  There is no "one size fits all" solution,
so we end up with defaults that are a compromise of "doesn't suck
for the majority of use cases". That means there might be some
unexpected default behaviours, but that doesn't mean they are wrong.

> > These are /generic/ alignment characteristics. While they were
> > originally derived from RAID characteristics, they have far wider
> > scope of use than just for configuring RAID devices. e.g. thinp,
> > exposing image file extent size hints as filesystem allocation
> > alignments similar to thinp, selecting what aspect of a multi-level
> > stacked RAID made up of hundreds of disks the filesystem should
> > align to, aligning to internal SSD structures (be it raid, erase
> > page sizes, etc), optimising for OSD block sizes, remote replication
> > block size constraints, helping DAX align allocations to huge page
> > sizes, etc.
> 
> Exactly, they are generic data alignment characteristics useful for
> both physical and virtual devices.  However, mkfs.xfs uses a heuristic
> that conflates them with agcount through the physics of the underlying
> device which it can't really reason about, especially in the virtual
> or network case.

Yet it's a heuristic that has served use well for 20 years. Yes,
we'v been madly conflating allocation concurrency with storage that
requires alignment since long before XFS was ported to Linux.

The defaults are appropriate for the vast majority of installations
and use cases. The defaults are not ideal for everyone, but there's
years of thought, observation, problem solving and knowledge behind
them.

If you don't like the defaults, then override them on the command
line, post your benchmarked improvements and make the argument why
this particular workload and tuning is better to everyone.

> > My point is that just looking at sunit/swidth as "the number of data
> > disks" completely ignores the many other uses we've found for it
> > over the last 20 years. In that time, it's almost always been the
> > case that devices requiring alignment have not been bound by the
> > seek time constraints of a single spinning spindle, and the default
> > behaviour reflects that.
> >
> > > Dave, do you have any problem with changing the behavior to only go into
> > > multidisk if swidth > sunit?  The more I think about it, the more it makes
> > > sense to me.
> >
> > Changing the existing behaviour doesn't make much sense to me. :)
> 
> The existing behaviour is to create 4 AGs on both spinning rust and
> e.g. Intel DC P3700.

That's a really bad example.  The p3700 has internal RAID with a
128k page size that it doesn't expose to iomin/ioopt. It has
*really* bad IO throughput for sub-128k sized or aligned IO (think
100x slower, not just a little). It's a device that absolutely
should be exposing preferred alignment characteristics to the
filesystem...

> If I then put dm-thinp on top of that spinner,
> it's suddenly deemed worthy of 32 AGs.  The issue here is that unlike
> other filesystems, XFS is inherently parallel and perfectly capable of
> subjecting it to 32 concurrent write streams.  This is pretty silly.

COW algorithms linearise and serialise concurrent write streams -
that's exactly what they are designed to do and why they perform so
well on random write workloads.  Optimising the filesystem layout
and characteristics to take advantage of COW algorithms in the
storage laye is not "pretty silly" - it's the smart thing to do
because the dm-thinp COW algorithms are only as good as the garbage
they are fed.

> You agreed that broken RAID controllers that expose "sunit == swidth"
> are their vendor's or administrator's problem. 

No I didn't - I said that raid controllers that only advertise sunit
or swidth are broken. Advertising sunit == swidth is a valid thing
to do - we really only need a single alignment value for hardware
RAID w/ NVRAM caches: the IO size/alignment needed to avoid RMW
cycles.

> The vast majority of
> SSD devices in wide use either expose nothing or lie.  The information
> about internal page size or erase block size is either hard to get or
> not public.

Hence, like the broken RAID controller case, we don't try to
optimise for them.  If they expose those things (and the p3700 case
demonstrates that they should!) then we'll automatically optimise
the filesystem for their physical characteristics.

> Can you give an example of a use case that would be negatively affected
> if this heuristic was switched from "sunit" to "sunit < swidth"?

Any time you only know a single alignment characteristic of the
underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
iomin = ioopt, multi-level RAID constructs where only the largest
alignment requirement is exposed, RAID1 devices exposing their chunk
size, remote replication chunk alignment (because remote rep. is
slow and so we need more concurrency to keep the pipeline full),
etc.

Cheers,

Dave.
Eric Sandeen Oct. 7, 2018, 12:14 a.m. UTC | #12
On 10/6/18 6:20 PM, Dave Chinner wrote:
>> Can you give an example of a use case that would be negatively affected
>> if this heuristic was switched from "sunit" to "sunit < swidth"?
> Any time you only know a single alignment characteristic of the
> underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
> iomin = ioopt, multi-level RAID constructs where only the largest
> alignment requirement is exposed, RAID1 devices exposing their chunk
> size, remote replication chunk alignment (because remote rep. is
> slow and so we need more concurrency to keep the pipeline full),
> etc.

So the tl;dr here is "given any iomin > 512, we should infer low seek
latency and parallelism and adjust geometry accordingly?"

-Eric
Ilya Dryomov Oct. 7, 2018, 1:54 p.m. UTC | #13
On Sun, Oct 7, 2018 at 1:20 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Oct 06, 2018 at 02:17:54PM +0200, Ilya Dryomov wrote:
> > On Sat, Oct 6, 2018 at 1:27 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Oct 05, 2018 at 08:51:59AM -0500, Eric Sandeen wrote:
> > > > On 10/5/18 6:27 AM, Ilya Dryomov wrote:
> > > > > On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >>
> > > > >> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> > > > >>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> > > > >>>> rbd devices report the following geometry:
> > > > >>>>
> > > > >>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> > > > >>>>   512
> > > > >>>>   512
> > > > >>>>   4194304
> > > > >>>>   4194304
> > > > >>
> > > > >> dm-thinp does this as well. THis is from the thinp device created
> > > > >> by tests/generic/459:
> > > > >>
> > > > >> 512
> > > > >> 4096
> > > > >> 65536
> > > > >> 65536
> > > > >
> > > > > (adding Mike)
> > > > >
> > > > > ... and that 300M filesystem ends up with 8 AGs, when normally you get
> > > > > 4 AGs for anything less than 4T.  Is that really intended?
> > > >
> > > > Well, yes.  Multi-disk mode gives you more AGs, how many more is scaled
> > > > by fs size.
> > > >
> > > >         /*
> > > >          * For the multidisk configs we choose an AG count based on the number
> > > >          * of data blocks available, trying to keep the number of AGs higher
> > > >          * than the single disk configurations. This makes the assumption that
> > > >          * larger filesystems have more parallelism available to them.
> > > >          */
> > > >
> > > > For really tiny filesystems we cut down the number of AGs, but in general
> > > > if the storage "told" us it has parallelism, mkfs uses it by default.
> > >
> > > We only keep the number of AGs down on single disks because of the
> > > seek penalty it causes spinning disks. It's a trade off between
> > > parallelism and seek time.
> >
> > If it's primarily about seek times, why aren't you looking at rotational
> > attribute for that?
>
> Historically speaking, "rotational" hasn't been a reliable indicator
> of device seek behaviour or alignment requirements. It was anasty
> hack for people wanting to optimise for SSDs and most of those
> optimisations were things we could already do with sunit/swidth
> (such as aligning to internal SSD page sizes and/or erase blocks).

Thank you for the detailed replies, Dave.  Some excerpts definitely
deserve a big block comment near calc_default_ag_geometry().

>
> > > > > AFAIK dm-thinp reports these values for the same exact reason as rbd:
> > > > > we are passing up the information about the efficient I/O size.  In the
> > > > > case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> > > > > top of a RAID array, I suspect it would pass up the array's preferred
> > > > > sizes, as long as they are a proper factor of the thinp block size.
> > >
> > > dm-thinp is passing up it's allocation chunk size, not the
> > > underlying device geometry. dm-thinp might be tuning it's chunk size
> > > to match the underlying storage, but that's irrelevant to XFS.
> >
> > I think the thinp chunk size is more about whether you just want thin
> > provisioning or plan to do a lot of snapshotting, etc.  dm-thinp passes
> > up the underlying device geometry if it's more demanding than the thinp
> > chunk size.  Here is dm-thinp with 64K chunk size on top of mdraid:
> >
> >   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/mapper/vg1-thin1
> >   512
> >   512
> >   524288
> >   1048576
>
> That's how iomin/ioopt are supposed to be propagated on layered
> devices. i.e. the layer with the largest values bubbles to the top,
> and the filesystem aligns to that.
>
> That doesn't change the fact that thinp and other COW-based block
> devices fundamentally isolate the filesystem from the physical
> storage properties. The filesystem sees the result of the COW
> behaviour in the block device and it's allocated algorithm, not the
> physical block device properties.
>
> Last time I looked, dm-thinp did first-free allocation, which means
> it fills the block device from one end to the other regardless of
> how many widely spaced IOs are in progress from the filesystems.
> That means all the new writes end up being sequential from dm-thinp
> rather than causing seek storms because they are being written to 32
> different locations across the block device. IOWs, a properly
> implemented COW-based thinp device should be able to handle much
> higher random write IO workloads than if the filesystem was placed
> directly on the same block device.
>
> IOWs, dm-thinp does not behave how one expects a rotational device
> to behave even when it is placed on a rotational device. We have to
> optimise filesystem behaviour differently for dm-thinp.
>
> > > That's because dm-thinp is a virtual mapping device in the same way
> > > the OS provides virtually mapped memory to users. That it, there is
> > > no relationship between the block device address space index and the
> > > location on disk. Hence the seek times between different regions of
> > > the block device address space are not linear or predictable.
> > >
> > > Hence dm-thinp completely changes the parallelism vs seek time
> > > trade-off the filesystem layout makes.  We can't optimise for
> > > minimal seek time anymore because we don't know the physical layout
> > > of the storage, so all we care about is alignment to the block
> > > device chunk size.
> > >
> > > i.e. what we want to do is give dm-thinp IO that is optimal (e.g.
> > > large aligned writes for streaming IO) and we don't want to leave
> > > lots of little unused holes in the dmthinp mapping that waste space.
> > > To do this, we need to ensure minimal allocator contention occurs,
> > > and hence we allow more concurrency in allocation by inreasing the
> > > AG count, knowing that we can't make the seek time problem any worse
> > > by doing this.
> >
> > And yet dm-thinp presents itself as rotational if (at least one of) the
> > underlying disk(s) is marked as rotational.
>
> Which, as per above, means rotational devices don't all behave like
> you'd expect a spinning spindle to behave. i.e. It's not an
> indication of a specific, consistent device model that we can
> optimise for.
>
> > As it is, we get the nomultidisk trade-parallelism-for-seek-times
> > behaviour on bare SSD devices, but dm-thinp on top of a single HDD
> > device is regarded up to 8 (XFS_MULTIDISK_AGLOG - XFS_NOMULTIDISK_AGLOG)
> > times more parallel...
>
> Yes, that's expected. The single SSD case has to take into account
> the really slow, cheap SSDs that aren't much better than spinning
> disks right through to high end nvme drives.
>
> It's easy to drown a slow SSD, just like it's easy to drown a single
> spindle. But there's /very few/ applications that can drive a high
> end nvme SSD to be allocation bound on a 4 AG XFS filesystem because
> of how fast the IO is. As such, I'm yet to hear of reports of XFS
> allocation concurrency bottlenecks in production workloads on nvme
> SSDs.
>
> Defaults are a trade off.  There is no "one size fits all" solution,
> so we end up with defaults that are a compromise of "doesn't suck
> for the majority of use cases". That means there might be some
> unexpected default behaviours, but that doesn't mean they are wrong.
>
> > > These are /generic/ alignment characteristics. While they were
> > > originally derived from RAID characteristics, they have far wider
> > > scope of use than just for configuring RAID devices. e.g. thinp,
> > > exposing image file extent size hints as filesystem allocation
> > > alignments similar to thinp, selecting what aspect of a multi-level
> > > stacked RAID made up of hundreds of disks the filesystem should
> > > align to, aligning to internal SSD structures (be it raid, erase
> > > page sizes, etc), optimising for OSD block sizes, remote replication
> > > block size constraints, helping DAX align allocations to huge page
> > > sizes, etc.
> >
> > Exactly, they are generic data alignment characteristics useful for
> > both physical and virtual devices.  However, mkfs.xfs uses a heuristic
> > that conflates them with agcount through the physics of the underlying
> > device which it can't really reason about, especially in the virtual
> > or network case.
>
> Yet it's a heuristic that has served use well for 20 years. Yes,
> we'v been madly conflating allocation concurrency with storage that
> requires alignment since long before XFS was ported to Linux.
>
> The defaults are appropriate for the vast majority of installations
> and use cases. The defaults are not ideal for everyone, but there's
> years of thought, observation, problem solving and knowledge behind
> them.
>
> If you don't like the defaults, then override them on the command
> line, post your benchmarked improvements and make the argument why
> this particular workload and tuning is better to everyone.

We were doing some tests with XFS on top of rbd, with multiple rbd
devices on the same client machine (pretty beefy in terms of number of
cores and RAM).  We ran the smallfile benchmark and found that with
just a dozen devices mkfs.xfs with -d agcount=4 improved per device
throughput by over 2x, with no degradation but rather a small bump on
a single device.  Mark (CCed) can share the details.

This prompted me to look at what mkfs.xfs was doing.  The default
behaviour just didn't make much sense until now, particularly the SSD
case.

>
> > > My point is that just looking at sunit/swidth as "the number of data
> > > disks" completely ignores the many other uses we've found for it
> > > over the last 20 years. In that time, it's almost always been the
> > > case that devices requiring alignment have not been bound by the
> > > seek time constraints of a single spinning spindle, and the default
> > > behaviour reflects that.
> > >
> > > > Dave, do you have any problem with changing the behavior to only go into
> > > > multidisk if swidth > sunit?  The more I think about it, the more it makes
> > > > sense to me.
> > >
> > > Changing the existing behaviour doesn't make much sense to me. :)
> >
> > The existing behaviour is to create 4 AGs on both spinning rust and
> > e.g. Intel DC P3700.
>
> That's a really bad example.  The p3700 has internal RAID with a
> 128k page size that it doesn't expose to iomin/ioopt. It has
> *really* bad IO throughput for sub-128k sized or aligned IO (think
> 100x slower, not just a little). It's a device that absolutely
> should be exposing preferred alignment characteristics to the
> filesystem...

That was deliberate, as an example of a device that is generally
considered to be pretty good -- even though the internal stripe size is
public information, it's not passed through the stack.

>
> > If I then put dm-thinp on top of that spinner,
> > it's suddenly deemed worthy of 32 AGs.  The issue here is that unlike
> > other filesystems, XFS is inherently parallel and perfectly capable of
> > subjecting it to 32 concurrent write streams.  This is pretty silly.
>
> COW algorithms linearise and serialise concurrent write streams -
> that's exactly what they are designed to do and why they perform so
> well on random write workloads.  Optimising the filesystem layout
> and characteristics to take advantage of COW algorithms in the
> storage laye is not "pretty silly" - it's the smart thing to do
> because the dm-thinp COW algorithms are only as good as the garbage
> they are fed.

Right, that's pretty much what I expected to hear.  But I picked on
dm-thinp because Mike has attempted to make mkfs.xfs go with the lower
agcount for dm-thinp in fdfb4c8c1a9f ("dm thin: set minimum_io_size to
pool's data block size").  Both the commit message and his reply in
this thread indicate that he wasn't just following your suggestion to
set both iomin and ioopt to the same value, but actually intended it to
defeat the agcount heuristic.  Mike, are there different expectations
here?

>
> > You agreed that broken RAID controllers that expose "sunit == swidth"
> > are their vendor's or administrator's problem.
>
> No I didn't - I said that raid controllers that only advertise sunit
> or swidth are broken. Advertising sunit == swidth is a valid thing
> to do - we really only need a single alignment value for hardware
> RAID w/ NVRAM caches: the IO size/alignment needed to avoid RMW
> cycles.
>
> > The vast majority of
> > SSD devices in wide use either expose nothing or lie.  The information
> > about internal page size or erase block size is either hard to get or
> > not public.
>
> Hence, like the broken RAID controller case, we don't try to
> optimise for them.  If they expose those things (and the p3700 case
> demonstrates that they should!) then we'll automatically optimise
> the filesystem for their physical characteristics.
>
> > Can you give an example of a use case that would be negatively affected
> > if this heuristic was switched from "sunit" to "sunit < swidth"?
>
> Any time you only know a single alignment characteristic of the
> underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
> iomin = ioopt, multi-level RAID constructs where only the largest
> alignment requirement is exposed, RAID1 devices exposing their chunk
> size, remote replication chunk alignment (because remote rep. is
> slow and so we need more concurrency to keep the pipeline full),
> etc.

That's a good point, I didn't think of arrays with battery/flash
backed caches.

Can you poke holes in "sunit && swidth"?  It sounds like under "RAID
controllers that only advertise sunit or swidth are broken" it wouldn't
affect any RAID cases we care to optimize for and given that single HDD
and single SSD cases are considered to be the same and that most
devices report ioopt = 0 anyway that front is covered too.

Thanks,

                Ilya
Dave Chinner Oct. 10, 2018, 12:28 a.m. UTC | #14
On Sun, Oct 07, 2018 at 03:54:57PM +0200, Ilya Dryomov wrote:
> On Sun, Oct 7, 2018 at 1:20 AM Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Oct 06, 2018 at 02:17:54PM +0200, Ilya Dryomov wrote:
> > > On Sat, Oct 6, 2018 at 1:27 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Fri, Oct 05, 2018 at 08:51:59AM -0500, Eric Sandeen wrote:
> > The defaults are appropriate for the vast majority of installations
> > and use cases. The defaults are not ideal for everyone, but there's
> > years of thought, observation, problem solving and knowledge behind
> > them.
> >
> > If you don't like the defaults, then override them on the command
> > line, post your benchmarked improvements and make the argument why
> > this particular workload and tuning is better to everyone.
> 
> We were doing some tests with XFS on top of rbd, with multiple rbd
> devices on the same client machine (pretty beefy in terms of number of
> cores and RAM).  We ran the smallfile benchmark and found that with
> just a dozen devices mkfs.xfs with -d agcount=4 improved per device
> throughput by over 2x, with no degradation but rather a small bump on
> a single device.  Mark (CCed) can share the details.

So you're hosting multiple filesystems on the same spindles, then
running an IO intensive write workload that distributes the IO
across the entire filesystem on each filesystem and te result is you
see a massive amount of "random" write IO at the back end.

That's not really a general workload - that's driving a complex
setup to the point of degradation via targetted stress. How many
production workloads do you see that run concurrent write IO to 32
AGs at once across multiple devices and sustain them for long enough
that back end seek load is the perofrmance limitation?

i.e. what you are effectively testing is how many concurrent IO
streams the back end can support before performance is compromised.
Reducing the number of AGs per filesystem reduces the number of
streams from a fixed number of filesystems. I'm sure if you bump the
number of devices up by a factor of 8 (you reduced AG count by that
factor) you'd end up with the same performance degradation because
you have an identical number of concurrent write streams...

But the real question is this: how many of the filesystems in a
multi-tenant hosting situation like this actually sustain full write
stream concurrency for any extended period of time? I'd expect that
it's a very low percentage (needs writes sustained modifications to
at least 32 different directories to trigger these write patterns),
and those that do only do so in relatively short bursts. IOWs, 32
AGs can make sense from a storage stack longevity and aging
perspective, whilst the potential negative performance impact of
allowing greating concurrency can be largely ignored because no
hosted workload ever uses that /potential/.

IME, multi-tenented hosting requires much more careful configuration
than tuning for a highly stressful IO workload.  It requires knowing
what the typical workloads are and what their IO patterns look like
for the hosting that is being done....

> > > > Changing the existing behaviour doesn't make much sense to me. :)
> > >
> > > The existing behaviour is to create 4 AGs on both spinning rust and
> > > e.g. Intel DC P3700.
> >
> > That's a really bad example.  The p3700 has internal RAID with a
> > 128k page size that it doesn't expose to iomin/ioopt. It has
> > *really* bad IO throughput for sub-128k sized or aligned IO (think
> > 100x slower, not just a little). It's a device that absolutely
> > should be exposing preferred alignment characteristics to the
> > filesystem...
> 
> That was deliberate, as an example of a device that is generally
> considered to be pretty good -- even though the internal stripe size is
> public information, it's not passed through the stack.

*cough*

The p3700 is considered to be a device to avoid around here. Not
only is performance completely non-deterministic (i.e. garbage) when
the IO is not 128k aligned/sized, it was full of firmware bugs, too.
e.g. it had data corruption bugs w/ sub-4k IOs and discard
operations could hang the device and/or corrupt data.  Yeah, yet
more hardware bugs that were initially blamed on XFS....

> > > If I then put dm-thinp on top of that spinner,
> > > it's suddenly deemed worthy of 32 AGs.  The issue here is that unlike
> > > other filesystems, XFS is inherently parallel and perfectly capable of
> > > subjecting it to 32 concurrent write streams.  This is pretty silly.
> >
> > COW algorithms linearise and serialise concurrent write streams -
> > that's exactly what they are designed to do and why they perform so
> > well on random write workloads.  Optimising the filesystem layout
> > and characteristics to take advantage of COW algorithms in the
> > storage laye is not "pretty silly" - it's the smart thing to do
> > because the dm-thinp COW algorithms are only as good as the garbage
> > they are fed.
> 
> Right, that's pretty much what I expected to hear.  But I picked on
> dm-thinp because Mike has attempted to make mkfs.xfs go with the lower
> agcount for dm-thinp in fdfb4c8c1a9f ("dm thin: set minimum_io_size to
> pool's data block size").

But the commit didn't do that. That whole discussion was about
dm-thinp having an invalid iomin/ioopt configuration (which was
iomin != 0, ioopt == 0), not about the number of AG the filesystem
ended up with. Maybe the original bug raised by a user was about
mkfs differences between normal and thin LVs, but that wasn't the
issue that needed fixing.

> Both the commit message and his reply in
> this thread indicate that he wasn't just following your suggestion to
> set both iomin and ioopt to the same value, but actually intended it to
> defeat the agcount heuristic.  Mike, are there different expectations
> here?

If dm-thinp is trying to defeat a filesystem's mkfs layout by
screwing around with iomin/ioopt configuration, then that is a
layering violation at architectural, design and implementation
levels. This is not a game that block devices should be playing -
they provide information about their preferred IO sizes, and the
filesystem decides what to do from there.

If you want the filesystem to change behaviour, then you change the
filesystem code or use non-default filesystem options.  Every
filesystem has their own take on how to optimise for different block
device configurations, and if the block device is trying to game one
of the filesystems, then it's going to have unintended adverse
impact on what other filesystems do. Block devices must be
filesystem agnostic, because they don't know what data they are
going to contain.

Cheers,

Dave.
RIc Wheeler Nov. 29, 2018, 1:53 p.m. UTC | #15
On 10/6/18 8:14 PM, Eric Sandeen wrote:
> On 10/6/18 6:20 PM, Dave Chinner wrote:
>>> Can you give an example of a use case that would be negatively affected
>>> if this heuristic was switched from "sunit" to "sunit < swidth"?
>> Any time you only know a single alignment characteristic of the
>> underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
>> iomin = ioopt, multi-level RAID constructs where only the largest
>> alignment requirement is exposed, RAID1 devices exposing their chunk
>> size, remote replication chunk alignment (because remote rep. is
>> slow and so we need more concurrency to keep the pipeline full),
>> etc.
> So the tl;dr here is "given any iomin > 512, we should infer low seek
> latency and parallelism and adjust geometry accordingly?"
>
> -Eric

Chiming in late here, but I do think that every decade or two (no disrespect to 
xfs!), it is worth having a second look at how the storage has changed under us.

The workload that has lots of file systems pounding on a shared device for 
example is one way to lay out container storage.

No argument about documenting how to fix this with command line tweaks for now, 
but maybe this would be a good topic for the next LSF/MM shared track of file & 
storage people to debate?

Ric
Dave Chinner Nov. 29, 2018, 9:48 p.m. UTC | #16
On Thu, Nov 29, 2018 at 08:53:39AM -0500, Ric Wheeler wrote:
> On 10/6/18 8:14 PM, Eric Sandeen wrote:
> >On 10/6/18 6:20 PM, Dave Chinner wrote:
> >>>Can you give an example of a use case that would be negatively affected
> >>>if this heuristic was switched from "sunit" to "sunit < swidth"?
> >>Any time you only know a single alignment characteristic of the
> >>underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
> >>iomin = ioopt, multi-level RAID constructs where only the largest
> >>alignment requirement is exposed, RAID1 devices exposing their chunk
> >>size, remote replication chunk alignment (because remote rep. is
> >>slow and so we need more concurrency to keep the pipeline full),
> >>etc.
> >So the tl;dr here is "given any iomin > 512, we should infer low seek
> >latency and parallelism and adjust geometry accordingly?"
> >
> >-Eric
> 
> Chiming in late here, but I do think that every decade or two (no
> disrespect to xfs!), it is worth having a second look at how the
> storage has changed under us.
> 
> The workload that has lots of file systems pounding on a shared
> device for example is one way to lay out container storage.

The problem is that defaults can't cater for every use case.
And in this case, we've got nothing to tell us that this is
aggregated/shared storage rather than "the fileystem owns the
entire device".

> No argument about documenting how to fix this with command line
> tweaks for now, but maybe this would be a good topic for the next
> LSF/MM shared track of file & storage people to debate?

Doubt it - this is really only an XFS problem at this point.

i.e. if we can't infer what the user wants from existing
information, then I don't see how the storage is going to be able to
tell us anything different, either.  i.e. somewhere in the stack the
user is going to have to tell the block device that this is
aggregated storage.

But even then, if it's aggregated solid state storage, we still want
to make use of the concurency on increased AG count because there is
no seek penalty like spinning drives end up with. Or if the
aggregated storage is thinly provisioned, the AG count of filesystem
just doesn't matter because the IO is going to be massively
randomised (i.e take random seek penalties) by the thinp layout.

So there's really no good way of "guessing" whether aggregated
storage should or shouldn't use elevated AG counts even if the
storage says "this is aggregated storage". The user still has to
give us some kind of explict hint about how the filesystem should
be configured.

What we need is for a solid, reliable detection hueristic to be
suggested by the people that need this functionality before there's
anything we can talk about.

Cheers,

Dave.
RIc Wheeler Nov. 29, 2018, 11:53 p.m. UTC | #17
On 11/29/18 4:48 PM, Dave Chinner wrote:
> On Thu, Nov 29, 2018 at 08:53:39AM -0500, Ric Wheeler wrote:
>> On 10/6/18 8:14 PM, Eric Sandeen wrote:
>>> On 10/6/18 6:20 PM, Dave Chinner wrote:
>>>>> Can you give an example of a use case that would be negatively affected
>>>>> if this heuristic was switched from "sunit" to "sunit < swidth"?
>>>> Any time you only know a single alignment characteristic of the
>>>> underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
>>>> iomin = ioopt, multi-level RAID constructs where only the largest
>>>> alignment requirement is exposed, RAID1 devices exposing their chunk
>>>> size, remote replication chunk alignment (because remote rep. is
>>>> slow and so we need more concurrency to keep the pipeline full),
>>>> etc.
>>> So the tl;dr here is "given any iomin > 512, we should infer low seek
>>> latency and parallelism and adjust geometry accordingly?"
>>>
>>> -Eric
>> Chiming in late here, but I do think that every decade or two (no
>> disrespect to xfs!), it is worth having a second look at how the
>> storage has changed under us.
>>
>> The workload that has lots of file systems pounding on a shared
>> device for example is one way to lay out container storage.
> The problem is that defaults can't cater for every use case.
> And in this case, we've got nothing to tell us that this is
> aggregated/shared storage rather than "the fileystem owns the
> entire device".
>
>> No argument about documenting how to fix this with command line
>> tweaks for now, but maybe this would be a good topic for the next
>> LSF/MM shared track of file & storage people to debate?
> Doubt it - this is really only an XFS problem at this point.
>
> i.e. if we can't infer what the user wants from existing
> information, then I don't see how the storage is going to be able to
> tell us anything different, either.  i.e. somewhere in the stack the
> user is going to have to tell the block device that this is
> aggregated storage.
>
> But even then, if it's aggregated solid state storage, we still want
> to make use of the concurency on increased AG count because there is
> no seek penalty like spinning drives end up with. Or if the
> aggregated storage is thinly provisioned, the AG count of filesystem
> just doesn't matter because the IO is going to be massively
> randomised (i.e take random seek penalties) by the thinp layout.
>
> So there's really no good way of "guessing" whether aggregated
> storage should or shouldn't use elevated AG counts even if the
> storage says "this is aggregated storage". The user still has to
> give us some kind of explict hint about how the filesystem should
> be configured.
>
> What we need is for a solid, reliable detection hueristic to be
> suggested by the people that need this functionality before there's
> anything we can talk about.
>
> Cheers,
>
> Dave.

I think that is exactly the kind of discussion that the shared file/storage 
track is good for. Other file systems also need to accommodate/probe behind the 
fictitious visible storage device layer...  Specifically, is there something we 
can add  per block device to help here?  Number of independent devices or a map 
of those regions?

Regards,

Ric
Dave Chinner Nov. 30, 2018, 2:25 a.m. UTC | #18
On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
> On 11/29/18 4:48 PM, Dave Chinner wrote:
> >On Thu, Nov 29, 2018 at 08:53:39AM -0500, Ric Wheeler wrote:
> >>On 10/6/18 8:14 PM, Eric Sandeen wrote:
> >>>On 10/6/18 6:20 PM, Dave Chinner wrote:
> >>>>>Can you give an example of a use case that would be negatively affected
> >>>>>if this heuristic was switched from "sunit" to "sunit < swidth"?
> >>>>Any time you only know a single alignment characteristic of the
> >>>>underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
> >>>>iomin = ioopt, multi-level RAID constructs where only the largest
> >>>>alignment requirement is exposed, RAID1 devices exposing their chunk
> >>>>size, remote replication chunk alignment (because remote rep. is
> >>>>slow and so we need more concurrency to keep the pipeline full),
> >>>>etc.
> >>>So the tl;dr here is "given any iomin > 512, we should infer low seek
> >>>latency and parallelism and adjust geometry accordingly?"
> >>>
> >>>-Eric
> >>Chiming in late here, but I do think that every decade or two (no
> >>disrespect to xfs!), it is worth having a second look at how the
> >>storage has changed under us.
> >>
> >>The workload that has lots of file systems pounding on a shared
> >>device for example is one way to lay out container storage.
> >The problem is that defaults can't cater for every use case.
> >And in this case, we've got nothing to tell us that this is
> >aggregated/shared storage rather than "the fileystem owns the
> >entire device".
> >
> >>No argument about documenting how to fix this with command line
> >>tweaks for now, but maybe this would be a good topic for the next
> >>LSF/MM shared track of file & storage people to debate?
> >Doubt it - this is really only an XFS problem at this point.
> >
> >i.e. if we can't infer what the user wants from existing
> >information, then I don't see how the storage is going to be able to
> >tell us anything different, either.  i.e. somewhere in the stack the
> >user is going to have to tell the block device that this is
> >aggregated storage.
> >
> >But even then, if it's aggregated solid state storage, we still want
> >to make use of the concurency on increased AG count because there is
> >no seek penalty like spinning drives end up with. Or if the
> >aggregated storage is thinly provisioned, the AG count of filesystem
> >just doesn't matter because the IO is going to be massively
> >randomised (i.e take random seek penalties) by the thinp layout.
> >
> >So there's really no good way of "guessing" whether aggregated
> >storage should or shouldn't use elevated AG counts even if the
> >storage says "this is aggregated storage". The user still has to
> >give us some kind of explict hint about how the filesystem should
> >be configured.
> >
> >What we need is for a solid, reliable detection hueristic to be
> >suggested by the people that need this functionality before there's
> >anything we can talk about.
> 
> I think that is exactly the kind of discussion that the shared
> file/storage track is good for.

Yes, but why on earth do we need to wait 6 months to have that
conversation. Start it now...

> Other file systems also need to
> accommodate/probe behind the fictitious visible storage device
> layer... Specifically, is there something we can add per block
> device to help here? Number of independent devices

That's how mkfs.xfs used to do stripe unit/stripe width calculations
automatically on MD devices back in the 2000s. We got rid of that
for more generaly applicable configuration information such as
minimum/optimal IO sizes so we could expose equivalent alignment
information from lots of different types of storage device....

> or a map of
> those regions?

Not sure what this means or how we'd use it.

Cheers,

Dave.
RIc Wheeler Nov. 30, 2018, 6 p.m. UTC | #19
On 11/30/18 7:55 AM, Dave Chinner wrote:
> On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
>> On 11/29/18 4:48 PM, Dave Chinner wrote:
>>> On Thu, Nov 29, 2018 at 08:53:39AM -0500, Ric Wheeler wrote:
>>>> On 10/6/18 8:14 PM, Eric Sandeen wrote:
>>>>> On 10/6/18 6:20 PM, Dave Chinner wrote:
>>>>>>> Can you give an example of a use case that would be negatively affected
>>>>>>> if this heuristic was switched from "sunit" to "sunit < swidth"?
>>>>>> Any time you only know a single alignment characteristic of the
>>>>>> underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
>>>>>> iomin = ioopt, multi-level RAID constructs where only the largest
>>>>>> alignment requirement is exposed, RAID1 devices exposing their chunk
>>>>>> size, remote replication chunk alignment (because remote rep. is
>>>>>> slow and so we need more concurrency to keep the pipeline full),
>>>>>> etc.
>>>>> So the tl;dr here is "given any iomin > 512, we should infer low seek
>>>>> latency and parallelism and adjust geometry accordingly?"
>>>>>
>>>>> -Eric
>>>> Chiming in late here, but I do think that every decade or two (no
>>>> disrespect to xfs!), it is worth having a second look at how the
>>>> storage has changed under us.
>>>>
>>>> The workload that has lots of file systems pounding on a shared
>>>> device for example is one way to lay out container storage.
>>> The problem is that defaults can't cater for every use case.
>>> And in this case, we've got nothing to tell us that this is
>>> aggregated/shared storage rather than "the fileystem owns the
>>> entire device".
>>>
>>>> No argument about documenting how to fix this with command line
>>>> tweaks for now, but maybe this would be a good topic for the next
>>>> LSF/MM shared track of file & storage people to debate?
>>> Doubt it - this is really only an XFS problem at this point.
>>>
>>> i.e. if we can't infer what the user wants from existing
>>> information, then I don't see how the storage is going to be able to
>>> tell us anything different, either.  i.e. somewhere in the stack the
>>> user is going to have to tell the block device that this is
>>> aggregated storage.
>>>
>>> But even then, if it's aggregated solid state storage, we still want
>>> to make use of the concurency on increased AG count because there is
>>> no seek penalty like spinning drives end up with. Or if the
>>> aggregated storage is thinly provisioned, the AG count of filesystem
>>> just doesn't matter because the IO is going to be massively
>>> randomised (i.e take random seek penalties) by the thinp layout.
>>>
>>> So there's really no good way of "guessing" whether aggregated
>>> storage should or shouldn't use elevated AG counts even if the
>>> storage says "this is aggregated storage". The user still has to
>>> give us some kind of explict hint about how the filesystem should
>>> be configured.
>>>
>>> What we need is for a solid, reliable detection hueristic to be
>>> suggested by the people that need this functionality before there's
>>> anything we can talk about.
>> I think that is exactly the kind of discussion that the shared
>> file/storage track is good for.
> Yes, but why on earth do we need to wait 6 months to have that
> conversation. Start it now...


Sure, that is definitely a good idea - added in some of the storage lists to 
this reply. No perfect all encompassing block layer list that I know of.


>
>> Other file systems also need to
>> accommodate/probe behind the fictitious visible storage device
>> layer... Specifically, is there something we can add per block
>> device to help here? Number of independent devices
> That's how mkfs.xfs used to do stripe unit/stripe width calculations
> automatically on MD devices back in the 2000s. We got rid of that
> for more generaly applicable configuration information such as
> minimum/optimal IO sizes so we could expose equivalent alignment
> information from lots of different types of storage device....
>
>> or a map of
>> those regions?
> Not sure what this means or how we'd use it.
>
> Cheers,
>
> Dave.

What I was thinking of was a way of giving up a good outline of how many 
independent regions that are behind one "virtual" block device like a ceph rbd 
or device mapper device. My assumption is that we are trying to lay down (at 
least one) allocation group per region.

What we need to optimize for includes:

     * how many independent regions are there?

     * what are the boundaries of those regions?

     * optimal IO size/alignment/etc

Some of that we have, but the current assumptions don't work well for all device 
types.

Regards,

Ric
Mark Nelson Nov. 30, 2018, 6:05 p.m. UTC | #20
On 11/30/18 12:00 PM, Ric Wheeler wrote:
> On 11/30/18 7:55 AM, Dave Chinner wrote:
>> On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
>>> On 11/29/18 4:48 PM, Dave Chinner wrote:
>>>> On Thu, Nov 29, 2018 at 08:53:39AM -0500, Ric Wheeler wrote:
>>>>> On 10/6/18 8:14 PM, Eric Sandeen wrote:
>>>>>> On 10/6/18 6:20 PM, Dave Chinner wrote:
>>>>>>>> Can you give an example of a use case that would be negatively 
>>>>>>>> affected
>>>>>>>> if this heuristic was switched from "sunit" to "sunit < swidth"?
>>>>>>> Any time you only know a single alignment characteristic of the
>>>>>>> underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
>>>>>>> iomin = ioopt, multi-level RAID constructs where only the largest
>>>>>>> alignment requirement is exposed, RAID1 devices exposing their 
>>>>>>> chunk
>>>>>>> size, remote replication chunk alignment (because remote rep. is
>>>>>>> slow and so we need more concurrency to keep the pipeline full),
>>>>>>> etc.
>>>>>> So the tl;dr here is "given any iomin > 512, we should infer low 
>>>>>> seek
>>>>>> latency and parallelism and adjust geometry accordingly?"
>>>>>>
>>>>>> -Eric
>>>>> Chiming in late here, but I do think that every decade or two (no
>>>>> disrespect to xfs!), it is worth having a second look at how the
>>>>> storage has changed under us.
>>>>>
>>>>> The workload that has lots of file systems pounding on a shared
>>>>> device for example is one way to lay out container storage.
>>>> The problem is that defaults can't cater for every use case.
>>>> And in this case, we've got nothing to tell us that this is
>>>> aggregated/shared storage rather than "the fileystem owns the
>>>> entire device".
>>>>
>>>>> No argument about documenting how to fix this with command line
>>>>> tweaks for now, but maybe this would be a good topic for the next
>>>>> LSF/MM shared track of file & storage people to debate?
>>>> Doubt it - this is really only an XFS problem at this point.
>>>>
>>>> i.e. if we can't infer what the user wants from existing
>>>> information, then I don't see how the storage is going to be able to
>>>> tell us anything different, either.  i.e. somewhere in the stack the
>>>> user is going to have to tell the block device that this is
>>>> aggregated storage.
>>>>
>>>> But even then, if it's aggregated solid state storage, we still want
>>>> to make use of the concurency on increased AG count because there is
>>>> no seek penalty like spinning drives end up with. Or if the
>>>> aggregated storage is thinly provisioned, the AG count of filesystem
>>>> just doesn't matter because the IO is going to be massively
>>>> randomised (i.e take random seek penalties) by the thinp layout.
>>>>
>>>> So there's really no good way of "guessing" whether aggregated
>>>> storage should or shouldn't use elevated AG counts even if the
>>>> storage says "this is aggregated storage". The user still has to
>>>> give us some kind of explict hint about how the filesystem should
>>>> be configured.
>>>>
>>>> What we need is for a solid, reliable detection hueristic to be
>>>> suggested by the people that need this functionality before there's
>>>> anything we can talk about.
>>> I think that is exactly the kind of discussion that the shared
>>> file/storage track is good for.
>> Yes, but why on earth do we need to wait 6 months to have that
>> conversation. Start it now...
>
>
> Sure, that is definitely a good idea - added in some of the storage 
> lists to this reply. No perfect all encompassing block layer list that 
> I know of.
>
>
>>
>>> Other file systems also need to
>>> accommodate/probe behind the fictitious visible storage device
>>> layer... Specifically, is there something we can add per block
>>> device to help here? Number of independent devices
>> That's how mkfs.xfs used to do stripe unit/stripe width calculations
>> automatically on MD devices back in the 2000s. We got rid of that
>> for more generaly applicable configuration information such as
>> minimum/optimal IO sizes so we could expose equivalent alignment
>> information from lots of different types of storage device....
>>
>>> or a map of
>>> those regions?
>> Not sure what this means or how we'd use it.
>>
>> Cheers,
>>
>> Dave.
>
> What I was thinking of was a way of giving up a good outline of how 
> many independent regions that are behind one "virtual" block device 
> like a ceph rbd or device mapper device. My assumption is that we are 
> trying to lay down (at least one) allocation group per region.
>
> What we need to optimize for includes:
>
>     * how many independent regions are there?
>
>     * what are the boundaries of those regions?
>
>     * optimal IO size/alignment/etc
>
> Some of that we have, but the current assumptions don't work well for 
> all device types.
>
> Regards,
>
> Ric
>
I won't comment on the details as there are others here that are far 
more knowledgeable than I am, but at a high level I think your idea is 
absolutely fantastic from the standpoint of making this decision process 
more explicit.


Mark
Dave Chinner Dec. 1, 2018, 4:35 a.m. UTC | #21
On Fri, Nov 30, 2018 at 01:00:52PM -0500, Ric Wheeler wrote:
> On 11/30/18 7:55 AM, Dave Chinner wrote:
> >On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
> >>Other file systems also need to
> >>accommodate/probe behind the fictitious visible storage device
> >>layer... Specifically, is there something we can add per block
> >>device to help here? Number of independent devices
> >That's how mkfs.xfs used to do stripe unit/stripe width calculations
> >automatically on MD devices back in the 2000s. We got rid of that
> >for more generaly applicable configuration information such as
> >minimum/optimal IO sizes so we could expose equivalent alignment
> >information from lots of different types of storage device....
> >
> >>or a map of
> >>those regions?
> >Not sure what this means or how we'd use it.
> >Dave.
> 
> What I was thinking of was a way of giving up a good outline of how
> many independent regions that are behind one "virtual" block device
> like a ceph rbd or device mapper device. My assumption is that we
> are trying to lay down (at least one) allocation group per region.
> 
> What we need to optimize for includes:
> 
>     * how many independent regions are there?
> 
>     * what are the boundaries of those regions?
> 
>     * optimal IO size/alignment/etc
> 
> Some of that we have, but the current assumptions don't work well
> for all device types.

Oh, so essential "independent regions" of the storage device. I
wrote this in 2008:

http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption#Failure_Domains

This was derived from the ideas in prototype code I wrote in ~2007
to try to optimise file layout and load distribution across linear
concats of multi-TB RAID6 luns. Some of that work was published
long after I left SGI:

https://marc.info/?l=linux-xfs&m=123441191222714&w=2

Essentially, independent regions - called "Logical
Extension Groups", or "legs" of the filesystem - and would
essentially be an aggregation of AGs in that region. The
concept was that we'd move the geometry information from the
superblock into the legs, and so we could have different AG
geoemetry optimies for each independent leg of the filesystem.

eg the SSD region could have numerous small AGs, the large,
contiguous RAID6 part could have maximally size AGs or even make use
of the RT allocator for free space management instead of the
AG/btree allocator. Basically it was seen as a mechanism for getting
rid of needing to specify block devices as command line or mount
options.

Fundamentally, though, it was based on the concept that Linux would
eventually grow an interface for the block device/volume manager to
tell the filesystem where the independent regions in the device
were(*), but that's not something that has ever appeared. If you can
provide an indepedent region map in an easy to digest format (e.g. a
set of {offset, len, geometry} tuples), then we can obviously make
use of it in XFS....

Cheers,

Dave.

(*) Basically provide a linux version of the functionality Irix
volume managers had provided filesystems since the late 80s....
RIc Wheeler Dec. 1, 2018, 8:52 p.m. UTC | #22
On 11/30/18 11:35 PM, Dave Chinner wrote:
> On Fri, Nov 30, 2018 at 01:00:52PM -0500, Ric Wheeler wrote:
>> On 11/30/18 7:55 AM, Dave Chinner wrote:
>>> On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
>>>> Other file systems also need to
>>>> accommodate/probe behind the fictitious visible storage device
>>>> layer... Specifically, is there something we can add per block
>>>> device to help here? Number of independent devices
>>> That's how mkfs.xfs used to do stripe unit/stripe width calculations
>>> automatically on MD devices back in the 2000s. We got rid of that
>>> for more generaly applicable configuration information such as
>>> minimum/optimal IO sizes so we could expose equivalent alignment
>>> information from lots of different types of storage device....
>>>
>>>> or a map of
>>>> those regions?
>>> Not sure what this means or how we'd use it.
>>> Dave.
>> What I was thinking of was a way of giving up a good outline of how
>> many independent regions that are behind one "virtual" block device
>> like a ceph rbd or device mapper device. My assumption is that we
>> are trying to lay down (at least one) allocation group per region.
>>
>> What we need to optimize for includes:
>>
>>      * how many independent regions are there?
>>
>>      * what are the boundaries of those regions?
>>
>>      * optimal IO size/alignment/etc
>>
>> Some of that we have, but the current assumptions don't work well
>> for all device types.
> Oh, so essential "independent regions" of the storage device. I
> wrote this in 2008:
>
> http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption#Failure_Domains
>
> This was derived from the ideas in prototype code I wrote in ~2007
> to try to optimise file layout and load distribution across linear
> concats of multi-TB RAID6 luns. Some of that work was published
> long after I left SGI:
>
> https://marc.info/?l=linux-xfs&m=123441191222714&w=2
>
> Essentially, independent regions - called "Logical
> Extension Groups", or "legs" of the filesystem - and would
> essentially be an aggregation of AGs in that region. The
> concept was that we'd move the geometry information from the
> superblock into the legs, and so we could have different AG
> geoemetry optimies for each independent leg of the filesystem.
>
> eg the SSD region could have numerous small AGs, the large,
> contiguous RAID6 part could have maximally size AGs or even make use
> of the RT allocator for free space management instead of the
> AG/btree allocator. Basically it was seen as a mechanism for getting
> rid of needing to specify block devices as command line or mount
> options.
>
> Fundamentally, though, it was based on the concept that Linux would
> eventually grow an interface for the block device/volume manager to
> tell the filesystem where the independent regions in the device
> were(*), but that's not something that has ever appeared. If you can
> provide an indepedent region map in an easy to digest format (e.g. a
> set of {offset, len, geometry} tuples), then we can obviously make
> use of it in XFS....
>
> Cheers,
>
> Dave.
>
> (*) Basically provide a linux version of the functionality Irix
> volume managers had provided filesystems since the late 80s....
>
Hi Dave,

This is exactly the kind of thing I think would be useful.  We might want to 
have a distinct value (like the rotational) that indicates this is a device with 
multiple "legs" so that normally we query that and don't have to look for the 
more complicated information.

Regards,

Ric
diff mbox series

Patch

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 2e53c1e83b6a..c3efa30005a2 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -2650,8 +2650,8 @@  _("agsize (%s) not a multiple of fs blk size (%d)\n"),
 				(cfg->dblocks % cfg->agcount != 0);
 	} else {
 		calc_default_ag_geometry(cfg->blocklog, cfg->dblocks,
-					 cfg->dsunit, &cfg->agsize,
-					 &cfg->agcount);
+					 cfg->dsunit != cfg->dswidth,
+					 &cfg->agsize, &cfg->agcount);
 	}
 }