[0/5] *** Introduce new space allocation algorithm ***

Message ID	20241104014439.3786609-1-zhangshida@kylinos.cn (mailing list archive)
Headers	show Received: from mail-pf1-f177.google.com (mail-pf1-f177.google.com [209.85.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4D46716132E; Mon, 4 Nov 2024 01:44:53 +0000 (UTC) From: zhangshida <starzhangzsd@gmail.com> To: djwong@kernel.org, dchinner@redhat.com, leo.lilong@huawei.com, wozizhi@huawei.com, osandov@fb.com, xiang@kernel.org, zhangjiachen.jaycee@bytedance.com Cc: linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org, zhangshida@kylinos.cn, starzhangzsd@gmail.com Subject: [PATCH 0/5] * Introduce new space allocation algorithm * Date: Mon, 4 Nov 2024 09:44:34 +0800 Message-Id: <20241104014439.3786609-1-zhangshida@kylinos.cn> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	* Introduce new space allocation algorithm * \| expand [0/5] * Introduce new space allocation algorithm * [1/5] xfs: add two wrappers for iterating ags in a AF [2/5] xfs: add two mp member to record the alloction field layout [3/5] xfs: add mount options as a way to change the AF layout [4/5] xfs: add infrastructure to support AF allocation algorithm [5/5] xfs: modify the logic to comply with AF rules

Stephen Zhang Nov. 4, 2024, 1:44 a.m. UTC

From: Shida Zhang <zhangshida@kylinos.cn>

Hi all,

Recently, we've been encounter xfs problems from our two
major users continuously.
They are all manifested as the same phonomenon: a xfs 
filesystem can't touch new file when there are nearly
half of the available space even with sparse inode enabled.

It turns out that the filesystem is too fragmented to have
enough continuous free space to create a new file.

Life still has to goes on. 
But from our users' perspective, worse than the situation
that xfs is hard to use is that xfs is non-able to use, 
since even one single file can't be created now. 

So we try to introduce a new space allocation algorithm to
solve this.

To achieve that, we try to propose a new concept:
   Allocation Fields, where its name is borrowed from the 
mathmatical concepts(Groups,Rings,Fields), will be 
abbrivated as AF in the rest of the article. 

what is a AF?
An one-pic-to-say-it-all version of explaination:

|<--------+ af 0 +-------->|<--+ af 1 +-->| af 2|
|------------------------------------------------+
| ag 0 | ag 1 | ag 2 | ag 3| ag 4 | ag 5 | ag 6 |
+------------------------------------------------+

A text-based definition of AF:
1.An AF is a incore-only concept comparing with the on-disk
  AG concept.
2.An AF is consisted of a continuous series of AGs. 
3.Lower AFs will NEVER go to higher AFs for allocation if 
  it can complete it in the current AF.

Rule 3 can serve as a barrier between the AF to slow down
the over-speed extending of fragmented pieces. 

With these patches applied, the code logic will be exactly
the same as the original code logic, unless you run with the
extra mount opiton. For example:
   mount -o af1=1 $dev $mnt

That will change the default AF layout:

|<--------+ af 0 +--------->| 
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------

to :

|<-----+ af 0 +----->|<af 1>| 
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------

So the 'af1=1' here means the start agno is one ag away from
the m_sb.agcount.

We did some tests verify it. You can verify it yourself
by running the following the command:

1. Create an 1g sized img file and formated it as xfs:
  dd if=/dev/zero of=test.img bs=1M count=1024
  mkfs.xfs -f test.img
  sync
2. Make a mount directory:
  mkdir mnt
3. Run the auto_frag.sh script, which will call another scripts
  frag.sh. These scripts will be attached in the mail. 
  To enable the AF, run:
    ./auto_frag.sh 1
  To disable the AF, run:
    ./auto_frag.sh 0

Please feel free to communicate with us if you have any thoughts
about these problems.

Cheers,
Shida


Shida Zhang (5):
  xfs: add two wrappers for iterating ags in a AF
  xfs: add two mp member to record the alloction field layout
  xfs: add mount options as a way to change the AF layout
  xfs: add infrastructure to support AF allocation algorithm
  xfs: modify the logic to comply with AF rules

 fs/xfs/libxfs/xfs_ag.h         | 17 ++++++++++++
 fs/xfs/libxfs/xfs_alloc.c      | 20 ++++++++++++++-
 fs/xfs/libxfs/xfs_alloc.h      |  2 ++
 fs/xfs/libxfs/xfs_bmap.c       | 47 ++++++++++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_bmap_btree.c |  2 ++
 fs/xfs/xfs_mount.h             |  3 +++
 fs/xfs/xfs_super.c             | 12 ++++++++-
 7 files changed, 99 insertions(+), 4 deletions(-)

Stephen Zhang Nov. 4, 2024, 1:49 a.m. UTC | #1

Hi all,

I just send the scripts to test these series here.

Cheers,
Shida

zhangshida <starzhangzsd@gmail.com> 于2024年11月4日周一 09:44写道：
>
> From: Shida Zhang <zhangshida@kylinos.cn>
>
> Hi all,
>
> Recently, we've been encounter xfs problems from our two
> major users continuously.
> They are all manifested as the same phonomenon: a xfs
> filesystem can't touch new file when there are nearly
> half of the available space even with sparse inode enabled.
>
> It turns out that the filesystem is too fragmented to have
> enough continuous free space to create a new file.
>
> Life still has to goes on.
> But from our users' perspective, worse than the situation
> that xfs is hard to use is that xfs is non-able to use,
> since even one single file can't be created now.
>
> So we try to introduce a new space allocation algorithm to
> solve this.
>
> To achieve that, we try to propose a new concept:
>    Allocation Fields, where its name is borrowed from the
> mathmatical concepts(Groups,Rings,Fields), will be
> abbrivated as AF in the rest of the article.
>
> what is a AF?
> An one-pic-to-say-it-all version of explaination:
>
> |<--------+ af 0 +-------->|<--+ af 1 +-->| af 2|
> |------------------------------------------------+
> | ag 0 | ag 1 | ag 2 | ag 3| ag 4 | ag 5 | ag 6 |
> +------------------------------------------------+
>
> A text-based definition of AF:
> 1.An AF is a incore-only concept comparing with the on-disk
>   AG concept.
> 2.An AF is consisted of a continuous series of AGs.
> 3.Lower AFs will NEVER go to higher AFs for allocation if
>   it can complete it in the current AF.
>
> Rule 3 can serve as a barrier between the AF to slow down
> the over-speed extending of fragmented pieces.
>
> With these patches applied, the code logic will be exactly
> the same as the original code logic, unless you run with the
> extra mount opiton. For example:
>    mount -o af1=1 $dev $mnt
>
> That will change the default AF layout:
>
> |<--------+ af 0 +--------->|
> |----------------------------
> | ag 0 | ag 1 | ag 2 | ag 3 |
> +----------------------------
>
> to :
>
> |<-----+ af 0 +----->|<af 1>|
> |----------------------------
> | ag 0 | ag 1 | ag 2 | ag 3 |
> +----------------------------
>
> So the 'af1=1' here means the start agno is one ag away from
> the m_sb.agcount.
>
> We did some tests verify it. You can verify it yourself
> by running the following the command:
>
> 1. Create an 1g sized img file and formated it as xfs:
>   dd if=/dev/zero of=test.img bs=1M count=1024
>   mkfs.xfs -f test.img
>   sync
> 2. Make a mount directory:
>   mkdir mnt
> 3. Run the auto_frag.sh script, which will call another scripts
>   frag.sh. These scripts will be attached in the mail.
>   To enable the AF, run:
>     ./auto_frag.sh 1
>   To disable the AF, run:
>     ./auto_frag.sh 0
>
> Please feel free to communicate with us if you have any thoughts
> about these problems.
>
> Cheers,
> Shida
>
>
> Shida Zhang (5):
>   xfs: add two wrappers for iterating ags in a AF
>   xfs: add two mp member to record the alloction field layout
>   xfs: add mount options as a way to change the AF layout
>   xfs: add infrastructure to support AF allocation algorithm
>   xfs: modify the logic to comply with AF rules
>
>  fs/xfs/libxfs/xfs_ag.h         | 17 ++++++++++++
>  fs/xfs/libxfs/xfs_alloc.c      | 20 ++++++++++++++-
>  fs/xfs/libxfs/xfs_alloc.h      |  2 ++
>  fs/xfs/libxfs/xfs_bmap.c       | 47 ++++++++++++++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_bmap_btree.c |  2 ++
>  fs/xfs/xfs_mount.h             |  3 +++
>  fs/xfs/xfs_super.c             | 12 ++++++++-
>  7 files changed, 99 insertions(+), 4 deletions(-)
>
> --
> 2.33.0
>

Dave Chinner Nov. 4, 2024, 3:32 a.m. UTC | #2

On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> From: Shida Zhang <zhangshida@kylinos.cn>
> 
> Hi all,
> 
> Recently, we've been encounter xfs problems from our two
> major users continuously.
> They are all manifested as the same phonomenon: a xfs 
> filesystem can't touch new file when there are nearly
> half of the available space even with sparse inode enabled.

What application is causing this, and does using extent size hints
make the problem go away?

Also, xfs_info and xfs_spaceman free space histograms would be
useful information.

> It turns out that the filesystem is too fragmented to have
> enough continuous free space to create a new file.

> Life still has to goes on. 
> But from our users' perspective, worse than the situation
> that xfs is hard to use is that xfs is non-able to use, 
> since even one single file can't be created now. 
> 
> So we try to introduce a new space allocation algorithm to
> solve this.
> 
> To achieve that, we try to propose a new concept:
>    Allocation Fields, where its name is borrowed from the 
> mathmatical concepts(Groups,Rings,Fields), will be 

I have no idea what this means. We don't have rings or fields,
and an allocation group is simply a linear address space range.
Please explain this concept (pointers to definitions and algorithms
appreciated!)

> abbrivated as AF in the rest of the article. 
> 
> what is a AF?
> An one-pic-to-say-it-all version of explaination:
> 
> |<--------+ af 0 +-------->|<--+ af 1 +-->| af 2|
> |------------------------------------------------+
> | ag 0 | ag 1 | ag 2 | ag 3| ag 4 | ag 5 | ag 6 |
> +------------------------------------------------+
> 
> A text-based definition of AF:
> 1.An AF is a incore-only concept comparing with the on-disk
>   AG concept.
> 2.An AF is consisted of a continuous series of AGs. 
> 3.Lower AFs will NEVER go to higher AFs for allocation if 
>   it can complete it in the current AF.
> 
> Rule 3 can serve as a barrier between the AF to slow down
> the over-speed extending of fragmented pieces. 

To a point, yes. But it's not really a reliable solution, because
directories are rotored across all AGs. Hence if the workload is
running across multiple AGs, then all of the AFs can be being
fragmented at the same time.

Given that I don't know how an application controls what AF it's
files are located in, I can't really say much more than that.

> With these patches applied, the code logic will be exactly
> the same as the original code logic, unless you run with the
> extra mount opiton. For example:
>    mount -o af1=1 $dev $mnt
> 
> That will change the default AF layout:
> 
> |<--------+ af 0 +--------->| 
> |----------------------------
> | ag 0 | ag 1 | ag 2 | ag 3 |
> +----------------------------
> 
> to :
> 
> |<-----+ af 0 +----->|<af 1>| 
> |----------------------------
> | ag 0 | ag 1 | ag 2 | ag 3 |
> +----------------------------
> 
> So the 'af1=1' here means the start agno is one ag away from
> the m_sb.agcount.

Yup, so kinda what we did back in 2006 in a proprietary SGI NAS
product with "concat groups" to create aggregations of allocation
groups that all sat on the same physical RAID5 luns in a linear
concat volume. They were fixed size, because the (dozens of) luns
were all the same size. This construct was heavily tailored to
maximising the performance provided by the underlying storage
hardware architecture, so wasn't really a general policy solution.

To make it work, we also had to change how various other allocation
distribution algorithms worked (e.g. directory rotoring) so that
the load was distributed more evenly across the physical hardware
backing the filesystem address space.

I don't see anything like that in this patch set - there's no actual
control mechanism to select what AF an inode lands in.  how does an
applicaiton or user actually use this reliably to prevent all the
AFs being fragmented by the workload that is running?

> We did some tests verify it. You can verify it yourself
> by running the following the command:
> 
> 1. Create an 1g sized img file and formated it as xfs:
>   dd if=/dev/zero of=test.img bs=1M count=1024
>   mkfs.xfs -f test.img
>   sync
> 2. Make a mount directory:
>   mkdir mnt
> 3. Run the auto_frag.sh script, which will call another scripts
>   frag.sh. These scripts will be attached in the mail. 
>   To enable the AF, run:
>     ./auto_frag.sh 1
>   To disable the AF, run:
>     ./auto_frag.sh 0
> 
> Please feel free to communicate with us if you have any thoughts
> about these problems.

We already have inode/metadata preferred allocation groups that
are avoided for data allocation if at all possible. This is how we
keep space free below 1TB for inodes when the inode32 allocator has
been selected. See xfs_perag_prefers_metadata().

Perhaps being able to control this preference from userspace (e.g.
via xfs_spaceman commands through ioctls and/or sysfs knobs) would
acheive the same results with a minimum of code and/or policy
changes. i.e. if AG0 is preferred for metadata rather than data,
we won't allocate data in it until all higher AGs are largely full.

-Dave.

Dave Chinner Nov. 4, 2024, 3:34 a.m. UTC | #3

On Mon, Nov 04, 2024 at 09:49:32AM +0800, Stephen Zhang wrote:
> Hi all,
> 
> I just send the scripts to test these series here.

Please include the scripts as in line text like is done for patches
- base64 encoded attachments are not quotable nor readable in
archives.

-Dave.

Stephen Zhang Nov. 4, 2024, 6:55 a.m. UTC | #4

Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 11:34写道：
>
> On Mon, Nov 04, 2024 at 09:49:32AM +0800, Stephen Zhang wrote:
> > Hi all,
> >
> > I just send the scripts to test these series here.
>

Okay, I've resend them in these series.

Cheers,
Shida

> Please include the scripts as in line text like is done for patches
> - base64 encoded attachments are not quotable nor readable in
> archives.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Stephen Zhang Nov. 4, 2024, 9:25 a.m. UTC | #5

Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 11:32写道：
>
> On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> > From: Shida Zhang <zhangshida@kylinos.cn>
> >
> > Hi all,
> >
> > Recently, we've been encounter xfs problems from our two
> > major users continuously.
> > They are all manifested as the same phonomenon: a xfs
> > filesystem can't touch new file when there are nearly
> > half of the available space even with sparse inode enabled.
>
> What application is causing this, and does using extent size hints
> make the problem go away?
>

Both are database-like applications, like mysql. Their source code
isn't available to us. And I doubt if they have the ability to modify the
database source code...

> Also, xfs_info and xfs_spaceman free space histograms would be
> useful information.
>

There are two such cases.
In one case:
$ xfs_info disk.img
meta-data=disk.img               isize=512    agcount=344, agsize=1638400 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=563085312, imaxpct=25
         =                       sunit=64     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

$ xfs_db -c freesp disk.img
   from      to extents  blocks    pct
      1       1 43375262 43375262  22.32
      2       3 64068226 150899026  77.66
      4       7       1       5   0.00
     32      63       3     133   0.00
    256     511       1     315   0.00
   1024    2047       1    1917   0.00
   8192   16383       2   20477   0.01


Another was mentioned already by one of my teammates. See:
https://lore.kernel.org/linux-xfs/173053338963.1934091.14116776076321174850.b4-ty@kernel.org/T/#t

[root@localhost ~]# xfs_db -c freesp /dev/vdb
   from      to extents  blocks    pct
      1       1     215     215   0.01
      2       3  994476 1988952  99.99


> > It turns out that the filesystem is too fragmented to have
> > enough continuous free space to create a new file.
>
> > Life still has to goes on.
> > But from our users' perspective, worse than the situation
> > that xfs is hard to use is that xfs is non-able to use,
> > since even one single file can't be created now.
> >
> > So we try to introduce a new space allocation algorithm to
> > solve this.
> >
> > To achieve that, we try to propose a new concept:
> >    Allocation Fields, where its name is borrowed from the
> > mathmatical concepts(Groups,Rings,Fields), will be
>
> I have no idea what this means. We don't have rings or fields,
> and an allocation group is simply a linear address space range.
> Please explain this concept (pointers to definitions and algorithms
> appreciated!)
>
>
> > abbrivated as AF in the rest of the article.
> >
> > what is a AF?
> > An one-pic-to-say-it-all version of explaination:
> >
> > |<--------+ af 0 +-------->|<--+ af 1 +-->| af 2|
> > |------------------------------------------------+
> > | ag 0 | ag 1 | ag 2 | ag 3| ag 4 | ag 5 | ag 6 |
> > +------------------------------------------------+
> >
> > A text-based definition of AF:
> > 1.An AF is a incore-only concept comparing with the on-disk
> >   AG concept.
> > 2.An AF is consisted of a continuous series of AGs.
> > 3.Lower AFs will NEVER go to higher AFs for allocation if
> >   it can complete it in the current AF.
> >
> > Rule 3 can serve as a barrier between the AF to slow down
> > the over-speed extending of fragmented pieces.
>
> To a point, yes. But it's not really a reliable solution, because
> directories are rotored across all AGs. Hence if the workload is
> running across multiple AGs, then all of the AFs can be being
> fragmented at the same time.
>

You mean the inode of the directory is expected to be distributed
evenly over the entire system, and the file extent of that directory will be
distributed in the same way?

The ideal layout of af to be constructed is to limit the higher af
in the small part of the entire [0, agcount). Like:

|<-----+ af 0 +----->|<af 1>|
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------

So for much of the ags(0, 1, 2) in af 0, that will not be a problem.
And for the ag in the small part, like ag 3.
if there is inode in ag3, and there comes the space allocation of the
inode, it will not find space in ag 3 first. It will still search from the
af0 to af1, whose logic is reflected in the patch:

[PATCH 4/5] xfs: add infrastructure to support AF allocation algorithm

it says:

+ /* if start_agno is not in current AF range, make it be. */
+ if ((start_agno < start_af) || (start_agno > end_af))
+       start_agno = start_af;

which means, the start_agno will not be used to comply with locality
principle.

In general, the evenly distributed layout is slightly broken, but only for
the last small AG, if you choose the AF layout properly.

> Given that I don't know how an application controls what AF it's
> files are located in, I can't really say much more than that.
>
> > With these patches applied, the code logic will be exactly
> > the same as the original code logic, unless you run with the
> > extra mount opiton. For example:
> >    mount -o af1=1 $dev $mnt
> >
> > That will change the default AF layout:
> >
> > |<--------+ af 0 +--------->|
> > |----------------------------
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +----------------------------
> >
> > to :
> >
> > |<-----+ af 0 +----->|<af 1>|
> > |----------------------------
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +----------------------------
> >
> > So the 'af1=1' here means the start agno is one ag away from
> > the m_sb.agcount.
>
> Yup, so kinda what we did back in 2006 in a proprietary SGI NAS
> product with "concat groups" to create aggregations of allocation
> groups that all sat on the same physical RAID5 luns in a linear
> concat volume. They were fixed size, because the (dozens of) luns
> were all the same size. This construct was heavily tailored to
> maximising the performance provided by the underlying storage
> hardware architecture, so wasn't really a general policy solution.
>
> To make it work, we also had to change how various other allocation
> distribution algorithms worked (e.g. directory rotoring) so that
> the load was distributed more evenly across the physical hardware
> backing the filesystem address space.
>
> I don't see anything like that in this patch set - there's no actual
> control mechanism to select what AF an inode lands in.  how does an
> applicaiton or user actually use this reliably to prevent all the
> AFs being fragmented by the workload that is running?
>

> > 3.Lower AFs will NEVER go to higher AFs for allocation if
> >   it can complete it in the current AF.

From that rule, we can infer that,
     For any specific af, if len1 > len2, then,
     P(len1) <= P(len2)

where P(len) represents the probability of the success allocation for an
exact *len* length of extent.

To prove that, Imagine we have to allocate two extent at len 1 and 4 in af 0,
if we can allocate len 4 in af 0, then we can allocate len 1 in af 0.
but,
if we can allocate len 1 in af 1, we may not allocate len 4 in af 0.

So, P(len1) <= P(len2).

That means it will naturally form a layer of different len. like:

       +------------------------+
       |            8           |
af 2   |    1   8     8  1      |
       |       1   1            |
       +------------------------+
       |                        |
       |    4                   |
       |          4             |
af 1   |        4     1         |
       |    1       4           |
       |                  4     |
       +------------------------+
       |                        |
       |  1     1     1         |
       |                        |
       |           1            |
       |  1  1 4       1        |
af 0   |           1            |
       |      1                 |
       |                  1     |
       |          1             |
       |                        |
       +------------------------+

So there is no need so provide extra preference control info for
an allocation. It will naturally find where it should go.

So without the extra need of changing the application source code.




> > We did some tests verify it. You can verify it yourself
> > by running the following the command:
> >
> > 1. Create an 1g sized img file and formated it as xfs:
> >   dd if=/dev/zero of=test.img bs=1M count=1024
> >   mkfs.xfs -f test.img
> >   sync
> > 2. Make a mount directory:
> >   mkdir mnt
> > 3. Run the auto_frag.sh script, which will call another scripts
> >   frag.sh. These scripts will be attached in the mail.
> >   To enable the AF, run:
> >     ./auto_frag.sh 1
> >   To disable the AF, run:
> >     ./auto_frag.sh 0
> >
> > Please feel free to communicate with us if you have any thoughts
> > about these problems.
>
> We already have inode/metadata preferred allocation groups that
> are avoided for data allocation if at all possible. This is how we
> keep space free below 1TB for inodes when the inode32 allocator has
> been selected. See xfs_perag_prefers_metadata().
>
> Perhaps being able to control this preference from userspace (e.g.
> via xfs_spaceman commands through ioctls and/or sysfs knobs) would
> acheive the same results with a minimum of code and/or policy
> changes. i.e. if AG0 is preferred for metadata rather than data,
> we won't allocate data in it until all higher AGs are largely full.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Dave Chinner Nov. 4, 2024, 12:15 p.m. UTC | #6

On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 11:32写道：
> >
> > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> > > From: Shida Zhang <zhangshida@kylinos.cn>
> > >
> > > Hi all,
> > >
> > > Recently, we've been encounter xfs problems from our two
> > > major users continuously.
> > > They are all manifested as the same phonomenon: a xfs
> > > filesystem can't touch new file when there are nearly
> > > half of the available space even with sparse inode enabled.
> >
> > What application is causing this, and does using extent size hints
> > make the problem go away?
> >
> 
> Both are database-like applications, like mysql. Their source code
> isn't available to us. And I doubt if they have the ability to modify the
> database source code...

Ok, so I did a bit of research. It's the MySQL transparent page
compression algorithm that is the problem here. Essentially what
this does is this:

Write : Page -> Transform -> Write transformed page to disk -> Punch hole

Read  : Page from disk -> Transform -> Original Page

Essentially, every page is still indexed at the expected offset,
but the data is compressed and the space that is saved by the
compression is then punched out of the filesystem. Basically it
makes every database page region on disk sparse, and it does it via
brute force.

This is -awful- for most filesystems. We use a similar technique in
fstests to generate worst case file and free space fragmentation
('punch_alternating') for exercising this sort of behaviour, and it
really does screw up performance and functionality on all the major
Linux filesysetms. XFS is the canary because of the way it
dynamically allocates inodes.

It's also awful for performance. There can be no concurrency in
database IO when it is doing hole punching like this. Hole punching
completely serialises all operations on that file, so it cannot be
doing read(), write() or even IO whilst the hole punch is being
done. 

IOWs, page compression allows the database to store more data in the
filesystem, but it does so at a cost. The cost is that it degrades
the filesystem free space badly over time. Hence as the FS fills up,
stuff really starts to slow down and, in some cases, stop working...

As the old saying goes: TANSTAAFL.

(There Ain't No Such Thing As A Free Lunch)

If you can't turn off page compression via a database configuration
flag, you could always shim the fallocate() syscall to always return
-EOPNOTSUPP to fallocate(FALLOC_FL_PUNCH_HOLE)....

> > Also, xfs_info and xfs_spaceman free space histograms would be
> > useful information.
> >
> 
> There are two such cases.
> In one case:
> $ xfs_info disk.img
> meta-data=disk.img               isize=512    agcount=344, agsize=1638400 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1
> data     =                       bsize=4096   blocks=563085312, imaxpct=25
>          =                       sunit=64     swidth=64 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=12800, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

This filesystem has 344 6GB AGs. That is .... not good. This will
make the free space fragmentation problems worse than having
4-16AGs (default for a fs of that size) because the free space is
already fragmented into lots of small disjoint regions even when the
filesytem is empty.

> $ xfs_db -c freesp disk.img
>    from      to extents  blocks    pct
>       1       1 43375262 43375262  22.32
>       2       3 64068226 150899026  77.66
>       4       7       1       5   0.00
>      32      63       3     133   0.00
>     256     511       1     315   0.00
>    1024    2047       1    1917   0.00
>    8192   16383       2   20477   0.01

Yeah, 100 million free space extents, 600GB of space indexed in
them, and there's *nothing* the filesystem can do about it because
that's what exactly happens when you have a 16kB database page size
and every 16kB page write involves a 1-3 block hole punch in the
16kB region of the file.

> > > It turns out that the filesystem is too fragmented to have
> > > enough continuous free space to create a new file.
> >
> > > Life still has to goes on.
> > > But from our users' perspective, worse than the situation
> > > that xfs is hard to use is that xfs is non-able to use,
> > > since even one single file can't be created now.
> > >
> > > So we try to introduce a new space allocation algorithm to
> > > solve this.
> > >
> > > To achieve that, we try to propose a new concept:
> > >    Allocation Fields, where its name is borrowed from the
> > > mathmatical concepts(Groups,Rings,Fields), will be
> >
> > I have no idea what this means. We don't have rings or fields,
> > and an allocation group is simply a linear address space range.
> > Please explain this concept (pointers to definitions and algorithms
> > appreciated!)
> >
> >
> > > abbrivated as AF in the rest of the article.
> > >
> > > what is a AF?
> > > An one-pic-to-say-it-all version of explaination:
> > >
> > > |<--------+ af 0 +-------->|<--+ af 1 +-->| af 2|
> > > |------------------------------------------------+
> > > | ag 0 | ag 1 | ag 2 | ag 3| ag 4 | ag 5 | ag 6 |
> > > +------------------------------------------------+
> > >
> > > A text-based definition of AF:
> > > 1.An AF is a incore-only concept comparing with the on-disk
> > >   AG concept.
> > > 2.An AF is consisted of a continuous series of AGs.
> > > 3.Lower AFs will NEVER go to higher AFs for allocation if
> > >   it can complete it in the current AF.
> > >
> > > Rule 3 can serve as a barrier between the AF to slow down
> > > the over-speed extending of fragmented pieces.
> >
> > To a point, yes. But it's not really a reliable solution, because
> > directories are rotored across all AGs. Hence if the workload is
> > running across multiple AGs, then all of the AFs can be being
> > fragmented at the same time.
> >
> 
> You mean the inode of the directory is expected to be distributed
> evenly over the entire system, and the file extent of that directory will be
> distributed in the same way?

The directory -structure- is distributed across all AGs. Each newly
created directly is placed in a different AG to it's parent inode.
Each file created in a directory is located in the same AG as the
parent directory inode. Inodes then try to alllocate data as close
to the inode as possible (i.e. allocation targets the same AG). This
keeps related data and metadata close together (i.e. preserved
locality of reference) and allows for delayed allocation
optimisations like multi-file write clustering....

This means that when you have something like a kernel source tree,
it will be distributed across every AG in the filesystem because
it's a complex directory tree with many files in it.

Of course, this all changes when you mount with "inode32". Inodes
are all located in the <1TB region, and the initial data allocation
target for each inode is distributed across AGs in the >1TG region.
There is no locality between the inode and it's data, and the
behaviour of the filesystem from an IO perspective is completely
different.

> The ideal layout of af to be constructed is to limit the higher af
> in the small part of the entire [0, agcount). Like:
> 
> |<-----+ af 0 +----->|<af 1>|
> |----------------------------
> | ag 0 | ag 1 | ag 2 | ag 3 |
> +----------------------------
> 
> So for much of the ags(0, 1, 2) in af 0, that will not be a problem.
> And for the ag in the small part, like ag 3.
> if there is inode in ag3, and there comes the space allocation of the
> inode, it will not find space in ag 3 first. It will still search from the
> af0 to af1, whose logic is reflected in the patch:

That was not clear from your description. I'm not about to try to
reverse engineer your proposed allocation algorithm from the code
you have presented. Looking at the implementation doesn't inform me
about the design and intent of the allocation policy.

> [PATCH 4/5] xfs: add infrastructure to support AF allocation algorithm
> 
> it says:
> 
> + /* if start_agno is not in current AF range, make it be. */
> + if ((start_agno < start_af) || (start_agno > end_af))
> +       start_agno = start_af;
> which means, the start_agno will not be used to comply with locality
> principle.

Which is basically the inode32 algorithm done backwards.

> > > 3.Lower AFs will NEVER go to higher AFs for allocation if
> > >   it can complete it in the current AF.
> 
> From that rule, we can infer that,
>      For any specific af, if len1 > len2, then,
>      P(len1) <= P(len2)
> 
> where P(len) represents the probability of the success allocation for an
> exact *len* length of extent.
>
> To prove that, Imagine we have to allocate two extent at len 1 and 4 in af 0,
> if we can allocate len 4 in af 0, then we can allocate len 1 in af 0.
> but,
> if we can allocate len 1 in af 1, we may not allocate len 4 in af 0.
>
> So, P(len1) <= P(len2).
> 
> That means it will naturally form a layer of different len. like:
> 
>        +------------------------+
>        |            8           |
> af 2   |    1   8     8  1      |
>        |       1   1            |
>        +------------------------+
>        |                        |
>        |    4                   |
>        |          4             |
> af 1   |        4     1         |
>        |    1       4           |
>        |                  4     |
>        +------------------------+
>        |                        |
>        |  1     1     1         |
>        |                        |
>        |           1            |
>        |  1  1 4       1        |
> af 0   |           1            |
>        |      1                 |
>        |                  1     |
>        |          1             |
>        |                        |
>        +------------------------+
> 
> So there is no need so provide extra preference control info for
> an allocation. It will naturally find where it should go.

This appears to be a "first fit from index zero" selection routine.
It optimises for discontiguous, small IO hole filling over
locality preserving large contiguous allocation, concurrency and IO
load distribution. XFS optimises for the latter, not the former.

First fit allocation generally results in performance hotspots in
large storage arrays. e.g. with a linear concat of two luns, a
first-fit from zero algorithm will not direct any IO to the second
lun until the first lun is almost entirely full. IOWs, half the
performance of the storage hardware is not being used with such an
algorithm. The larger the storage array gets, the worse this
under-utilisation becomes, and hence we have never needed to
optimise for such an inefficient IO pattern as the innodb page
compression algorithm uses.

FWIW, as this appears to be a first-fit algorithm, why is there
a need for special "AF"s to control behaviour? I may be missing
something else, but if we treat each AG as an AF, then we
effectively get the same result, right?

The only issue would be AG contention would result in allocations in
higher AGs before the lower AGs are completely full, but the
filesystem would still always fill from one end to the other as this
AF construct is attempting to do. That leaves space for inodes to be
allocated right up until the last AG in the fs becomes too
fragmented to allocate inodes.

AFAICT, this "reserve AG space for inodes" behaviour that you are
trying to acheive is effectively what the inode32 allocator already
implements. By forcing inode allocation into the AGs below 1TB and
preventing data from being allocated in those AGs until allocation
in all the AGs above start failing, it effectively provides the same
functionality but without the constraints of a global first fit
allocation policy.

We can do this with any AG by setting it up to prefer metadata,
but given we already have the inode32 allocator we can run some
tests to see if setting the metadata-preferred flag makes the
existing allocation policies do what is needed.

That is, mkfs a new 2TB filesystem with the same 344AG geometry as
above, mount it with -o inode32 and run the workload that fragments
all the free space. What we should see is that AGs in the upper TB
of the filesystem should fill almost to full before any significant
amount of allocation occurs in the AGs in the first TB of space.

If that's the observed behaviour, then I think this problem can be
solved by adding a mechanism to control which AGs in the filesystem
are metadata preferred...

-Dave.

Stephen Zhang Nov. 8, 2024, 1:34 a.m. UTC | #7

Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 20:15写道：
>
> On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> > Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 11:32写道：
> > >
> > > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> > > > From: Shida Zhang <zhangshida@kylinos.cn>
> > > >
> > > > Hi all,
> > > >
> > > > Recently, we've been encounter xfs problems from our two
> > > > major users continuously.
> > > > They are all manifested as the same phonomenon: a xfs
> > > > filesystem can't touch new file when there are nearly
> > > > half of the available space even with sparse inode enabled.
> > >
> > > What application is causing this, and does using extent size hints
> > > make the problem go away?
> > >
> >
> > Both are database-like applications, like mysql. Their source code
> > isn't available to us. And I doubt if they have the ability to modify the
> > database source code...
>
> Ok, so I did a bit of research. It's the MySQL transparent page
> compression algorithm that is the problem here. Essentially what
> this does is this:
>
> Write : Page -> Transform -> Write transformed page to disk -> Punch hole
>
> Read  : Page from disk -> Transform -> Original Page
>
> Essentially, every page is still indexed at the expected offset,
> but the data is compressed and the space that is saved by the
> compression is then punched out of the filesystem. Basically it
> makes every database page region on disk sparse, and it does it via
> brute force.
>
> This is -awful- for most filesystems. We use a similar technique in
> fstests to generate worst case file and free space fragmentation
> ('punch_alternating') for exercising this sort of behaviour, and it
> really does screw up performance and functionality on all the major
> Linux filesysetms. XFS is the canary because of the way it
> dynamically allocates inodes.
>
> It's also awful for performance. There can be no concurrency in
> database IO when it is doing hole punching like this. Hole punching
> completely serialises all operations on that file, so it cannot be
> doing read(), write() or even IO whilst the hole punch is being
> done.
>
> IOWs, page compression allows the database to store more data in the
> filesystem, but it does so at a cost. The cost is that it degrades
> the filesystem free space badly over time. Hence as the FS fills up,
> stuff really starts to slow down and, in some cases, stop working...
>
> As the old saying goes: TANSTAAFL.
>
> (There Ain't No Such Thing As A Free Lunch)
>
> If you can't turn off page compression via a database configuration
> flag, you could always shim the fallocate() syscall to always return
> -EOPNOTSUPP to fallocate(FALLOC_FL_PUNCH_HOLE)....
>

Wow...thanks for the research. I didn't even think about going to
dig this far with respect to the MySQL application layer...

And I decide to do a research to the source code in:

https://github.com/mysql/mysql-server/blob/61a3a1d8ef15512396b4c2af46e922a19bf2b174/storage/innobase/os/os0file.cc#L4796

Here is the simplified version of the function os_file_io():
--------
os_file_io(src_len)
    if (is_compressed())
        len = os_file_compress_page(src_len);
    sync_file_io();
    if (len != src_len)
        (os_file_punch_hole(src_len - len));
--------
that means, on default case, the disk and the memory are indexed
like:
+------------+-------------+
|   16 KB    |     16 KB   |  memory
+--------------------------+
|            |             |
v            v             v
+--------------------------+
|   16 KB    |     16 KB   |  disk
+------------+-------------+

When the compression is enabled. It's like:
+------------+-------------+
|   16 KB    |     16 KB   |  memory
+-----------X+-------------X
|         X  |          X
v       X    v        X
+------+v------------v-----+
| 8 KB | hole|  8 KB|  hole|  disk
+------+-------------------+
So it trades the extra cpu time to compress/decompress for
saving half of storage cost.
In another words, it will double the storage cost when the config
is disabled.
How much will that cost? like a billion dollars?
Then I think it will not be a simple problem any more. It is a war
about money. :P
The filesystem who can fully take advantage of these saved storage
will earn millions or even billions of dollars from it. :p


> Of course, this all changes when you mount with "inode32". Inodes
> are all located in the <1TB region, and the initial data allocation
> target for each inode is distributed across AGs in the >1TG region.
> There is no locality between the inode and it's data, and the
> behaviour of the filesystem from an IO perspective is completely
> different.
>
> > The ideal layout of af to be constructed is to limit the higher af
> > in the small part of the entire [0, agcount). Like:
> >
> > |<-----+ af 0 +----->|<af 1>|
> > |----------------------------
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +----------------------------
> >
> > So for much of the ags(0, 1, 2) in af 0, that will not be a problem.
> > And for the ag in the small part, like ag 3.
> > if there is inode in ag3, and there comes the space allocation of the
> > inode, it will not find space in ag 3 first. It will still search from the
> > af0 to af1, whose logic is reflected in the patch:
>
> That was not clear from your description. I'm not about to try to
> reverse engineer your proposed allocation algorithm from the code
> you have presented. Looking at the implementation doesn't inform me
> about the design and intent of the allocation policy.
>
> > [PATCH 4/5] xfs: add infrastructure to support AF allocation algorithm
> >
> > it says:
> >
> > + /* if start_agno is not in current AF range, make it be. */
> > + if ((start_agno < start_af) || (start_agno > end_af))
> > +       start_agno = start_af;
> > which means, the start_agno will not be used to comply with locality
> > principle.
>
> Which is basically the inode32 algorithm done backwards.
>

Yep, to some degree, it can be equivalent to the inode32 algorithm.
Because firstly, I didn't know the inode32 algorithm.
And I want to design an algorithm that will reserve continuous space
for inodes only. The info of the preferred agno can be passed from the
user via ioctl() or something.

But then I think, now that we can reserve the space for inodes, can
we just enlarge this idea so that the space can be arranged in any
desired way?

The problem is, to achieve this goal, we have to pass various info from
the user to the kernel and add some code to manage the special AG
properly..., which will greatly increase the complexity of the system.

And one idea comes to my mind, if only we can add a rule:
    Lower AFs will NEVER go to higher AFs for allocation if it can
complete it in the current AF. [1]

This rule can serve as a way to slow down the over-speed extending of
space,
since without this rule[1], the space will be extending quickly by:
1.Ag contention. That means if an ag is busy, then it will search for the
  next ag.
2.Best length fit. That means if the current ag have no suitable length for
  the current allocation even if it has enough space, it will go to next ag.
3.Rotored directories across all AGs and locality rules? (That's what
  I learned from Dave.)

Using this single rule[1] to rule them all, then everything will be
manifested as a clear and beautiful form, which will not add extra
complexity to design construct that preference needed or add code to
manage the reserved AG or pass the extra info from user space.

So I'll explain how the inode32 algorithm can be seen as a special case/subset
of AF.
When you do:
    mount -o af1=1 $dev $mnt
the AF layout becomes:
|<-----+ af 0 +----->|<af 1>|
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------
And with the AF rule[1] above, no extra preference info is needed,
ag3 just becomes the reserved space of inodes.
Not only it can be used as the reserved space of inodes, but also it can
be used for any type of space that must need 4 continuous blocks.
More generally, with more levels of AFs, Higher AF can be used as a reserved
space of the lower AF that must be at length 4/5/../8/16/...whatever.

To illustrate that, Imagine that now AF 0 is badly fragmented:
       +------------------------+
       |                        |
af 2   |                        |
       +------------------------+
       |                        |
af 1   |                        |
       +------------------------+
       |  1     1     1         |
       |  1        1  1         |
       |  1  1 4       1        |
af 0   |           1     1      |
       |      1     1           |
       |    4  8          1     |
       +------------------------+

We want to allocate a space len of 4, but af 0 is so fragmented that there
is no such continuous space that we have only two choices:
1. Break down the 4 into many small pieces so as to get a success
   allocation in af 0.
2. Go to AF 1 for the allocation.

So keep in mind the rule[1],
for regular data extent allocation, it can break it down into small pieces, so
it must allocate in af 0.
for inode allocation, it cannot allocate in af 0 at any possibility. so it has
to go to af 1 for allocation.

That's how it became the reserved space for inodes. But not confined to
inodes only, it reserves space for any allocation that must be 4.
further,
       +------------------------+
       |                        |
af 2   |                        |
       +------------------------+
       |     4      4     4     |
af 1   |  4     4  1 1  4    4  |
       +------------------------+
       |  1     1     1         |
       |  1        1  1         |
       |  1  1 4       1        |
af 0   |           1     1      |
       |      1     1           |
       |    4  8          1     |
       +------------------------+
if the af 1 is badly fragmented too, now we want an allocation of continuous
len 8 exactly.
Because we want an exact 8 so we can't break it down. Thus we have to go to
af 2 for allocation.
Finally it will form layers of different len:
       +------------------------+
       |   8                    |
af 2   | 8  8    8              |
       +------------------------+
       |     4      4     4     |
af 1   |  4     4  1 1  4    4  |
       +------------------------+
       |  1     1     1         |
       |  1        1  1         |
       |  1  1 4       1        |
af 0   |           1     1      |
       |      1     1           |
       |    4  8          1     |
       +------------------------+

And what? you don't want it because it breaks the original AG rules too much.
Then you can adjust the af1 leftwards or rightwards:
|<-----+ af 0 +----->|<af 1>|
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------
let's say, you don't want it all. then you move af1 rightwards:
|<-----+ af 0 +------------>|
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------
The behavior will be totally the same as the original logic.
Or, you want it, but not that much. Then I believe there must a point in
the middle that will satisfy your demand.

To sum it up,
The rules used by AG are more about extending outwards.
whilst
The rules used by AF are more about restricting inwards.
like:
|                           |
+---->    af rule    <------+
|                           |
|        +          +       |
|    <-+ |  ag rule | +->   |
+        +          +       +
They are constructed in a symmetrical way so that we can strength/weaken
one rule or the other to reach a best balance, by adjusting the af location,
when combining the advantage and disadvantage of them.

> > > > 3.Lower AFs will NEVER go to higher AFs for allocation if
> > > >   it can complete it in the current AF.
> >
> > From that rule, we can infer that,
> >      For any specific af, if len1 > len2, then,
> >      P(len1) <= P(len2)
> >
> > where P(len) represents the probability of the success allocation for an
> > exact *len* length of extent.
> >
> > To prove that, Imagine we have to allocate two extent at len 1 and 4 in af 0,
> > if we can allocate len 4 in af 0, then we can allocate len 1 in af 0.
> > but,
> > if we can allocate len 1 in af 1, we may not allocate len 4 in af 0.
> >
> > So, P(len1) <= P(len2).
> >
> > That means it will naturally form a layer of different len. like:
> >
> >        +------------------------+
> >        |            8           |
> > af 2   |    1   8     8  1      |
> >        |       1   1            |
> >        +------------------------+
> >        |                        |
> >        |    4                   |
> >        |          4             |
> > af 1   |        4     1         |
> >        |    1       4           |
> >        |                  4     |
> >        +------------------------+
> >        |                        |
> >        |  1     1     1         |
> >        |                        |
> >        |           1            |
> >        |  1  1 4       1        |
> > af 0   |           1            |
> >        |      1                 |
> >        |                  1     |
> >        |          1             |
> >        |                        |
> >        +------------------------+
> >
> > So there is no need so provide extra preference control info for
> > an allocation. It will naturally find where it should go.
>
> This appears to be a "first fit from index zero" selection routine.
> It optimises for discontiguous, small IO hole filling over
> locality preserving large contiguous allocation, concurrency and IO
> load distribution. XFS optimises for the latter, not the former.
>
> First fit allocation generally results in performance hotspots in
> large storage arrays. e.g. with a linear concat of two luns, a
> first-fit from zero algorithm will not direct any IO to the second
> lun until the first lun is almost entirely full. IOWs, half the
> performance of the storage hardware is not being used with such an
> algorithm. The larger the storage array gets, the worse this
> under-utilisation becomes, and hence we have never needed to
> optimise for such an inefficient IO pattern as the innodb page
> compression algorithm uses.
>
> FWIW, as this appears to be a first-fit algorithm, why is there
> a need for special "AF"s to control behaviour? I may be missing
> something else, but if we treat each AG as an AF, then we
> effectively get the same result, right?
>
> The only issue would be AG contention would result in allocations in
> higher AGs before the lower AGs are completely full, but the
> filesystem would still always fill from one end to the other as this
> AF construct is attempting to do. That leaves space for inodes to be
> allocated right up until the last AG in the fs becomes too
> fragmented to allocate inodes.
>
> AFAICT, this "reserve AG space for inodes" behaviour that you are
> trying to acheive is effectively what the inode32 allocator already
> implements. By forcing inode allocation into the AGs below 1TB and
> preventing data from being allocated in those AGs until allocation
> in all the AGs above start failing, it effectively provides the same
> functionality but without the constraints of a global first fit
> allocation policy.
>
> We can do this with any AG by setting it up to prefer metadata,
> but given we already have the inode32 allocator we can run some
> tests to see if setting the metadata-preferred flag makes the
> existing allocation policies do what is needed.
>
> That is, mkfs a new 2TB filesystem with the same 344AG geometry as
> above, mount it with -o inode32 and run the workload that fragments
> all the free space. What we should see is that AGs in the upper TB
> of the filesystem should fill almost to full before any significant
> amount of allocation occurs in the AGs in the first TB of space.
>

So for the inode32 logarithm:
1. I need to specify a preferred ag, like ag 0:
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------
2. Someday space will be used up to 100%, Then we have to growfs to ag 7:
+------+------+------+------+------+------+------+------+
| full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
+------+------+------+------+------+------+------+------+
3. specify another ag for inodes again.
4. repeat 1-3.

for the AF logarithm:
    mount -o af1=1 $dev $mnt
and we are done.
|<-----+ af 0 +----->|<af 1>|
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------
because the af is a relative number to ag_count, so when growfs, it will
become:
|<-----+ af 0 +--------------------------------->|<af 1>|
+------+------+------+------+------+------+------+------+
| full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
+------+------+------+------+------+------+------+------+
So just set it once, and run forever.

Cheers,
Shida
> If that's the observed behaviour, then I think this problem can be
> solved by adding a mechanism to control which AGs in the filesystem
> are metadata preferred...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Dave Chinner Nov. 11, 2024, 2:04 a.m. UTC | #8

On Fri, Nov 08, 2024 at 09:34:17AM +0800, Stephen Zhang wrote:
> Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 20:15写道：
> > On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> > > Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 11:32写道：
> > > > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:

[snip unnecessary stereotyping, accusations and repeated information]

> > AFAICT, this "reserve AG space for inodes" behaviour that you are
> > trying to acheive is effectively what the inode32 allocator already
> > implements. By forcing inode allocation into the AGs below 1TB and
> > preventing data from being allocated in those AGs until allocation
> > in all the AGs above start failing, it effectively provides the same
> > functionality but without the constraints of a global first fit
> > allocation policy.
> >
> > We can do this with any AG by setting it up to prefer metadata,
> > but given we already have the inode32 allocator we can run some
> > tests to see if setting the metadata-preferred flag makes the
> > existing allocation policies do what is needed.
> >
> > That is, mkfs a new 2TB filesystem with the same 344AG geometry as
> > above, mount it with -o inode32 and run the workload that fragments
> > all the free space. What we should see is that AGs in the upper TB
> > of the filesystem should fill almost to full before any significant
> > amount of allocation occurs in the AGs in the first TB of space.

Have you performed this experiment yet?

I did not ask it idly, and I certainly did not ask it with the intent
that we might implement inode32 with AFs. It is fundamentally
impossible to implement inode32 with the proposed AF feature.

The inode32 policy -requires- top down data fill so that AG 0 is the
*last to fill* with user data. The AF first-fit proposal guarantees
bottom up fill where AG 0 is the *first to fill* with user data.

For example:

> So for the inode32 logarithm:
> 1. I need to specify a preferred ag, like ag 0:
> |----------------------------
> | ag 0 | ag 1 | ag 2 | ag 3 |
> +----------------------------
> 2. Someday space will be used up to 100%, Then we have to growfs to ag 7:
> +------+------+------+------+------+------+------+------+
> | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> +------+------+------+------+------+------+------+------+
> 3. specify another ag for inodes again.
> 4. repeat 1-3.

Lets's assume that AGs are 512GB each and so AGs 0 and 1 fill the
entire lower 1TB of the filesystem. Hence if we get to all AGs full
the entire inode32 inode allocation space is full.

Even if we grow the filesystem at this point, we still *cannot*
allocate more inodes in the inode32 space. That space (AGs 0-1) is
full even after the growfs.  Hence we will still give ENOSPC, and
that is -correct behaviour- because the inode32 policy requires this
behaviour.

IOWs, growfs and changing the AF bounds cannot fix ENOSPC on inode32
when the inode space is exhausted. Only physically moving data out
of the lower AGs can fix that problem...

> for the AF logarithm:
>     mount -o af1=1 $dev $mnt
> and we are done.
> |<-----+ af 0 +----->|<af 1>|
> |----------------------------
> | ag 0 | ag 1 | ag 2 | ag 3 |
> +----------------------------
> because the af is a relative number to ag_count, so when growfs, it will
> become:
> |<-----+ af 0 +--------------------------------->|<af 1>|
> +------+------+------+------+------+------+------+------+
> | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> +------+------+------+------+------+------+------+------+
> So just set it once, and run forever.

That is actually the general solution to the original problem being
reported. I realised this about half way through reading your
original proposal. This is why I pointed out inode32 and the
preferred metadata mechanism in the AG allocator policies.

That is, a general solution should only require the highest AG
to be marked as metadata preferred. Then -all- data allocation will
then skip over the highest AG until there is no space left in any of
the lower AGs. This behaviour will be enforced by the existing AG
iteration allocation algorithms without any change being needed.

Then when we grow the fs, we set the new highest AG to be metadata
preferred, and that space will now be reserved for inodes until all
other space is consumed.

Do you now understand why I asked you to test whether the inode32
mount option kept the data out of the lower AGs until the higher AGs
were completely filled? It's because I wanted confirmation that the
metadata preferred flag would do what we need to implement a
general solution for the problematic workload.

Remeber: free space fragmentation can happen for many reasons - this
mysql thing is just the latest one discovered.  The best solution is
having general mechanisms in the filesystem that automatically
mitigate the effects of free space fragmentation on inode
allocation. The worst solution is requiring users to tweak knobs...

-Dave.

Stephen Zhang Nov. 17, 2024, 1:34 a.m. UTC | #9

Dave Chinner <david@fromorbit.com> 于2024年11月11日周一 10:04写道：
>
> On Fri, Nov 08, 2024 at 09:34:17AM +0800, Stephen Zhang wrote:
> > Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 20:15写道：
> > > On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> > > > Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 11:32写道：
> > > > > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
>
> [snip unnecessary stereotyping, accusations and repeated information]
>
> > > AFAICT, this "reserve AG space for inodes" behaviour that you are
> > > trying to acheive is effectively what the inode32 allocator already
> > > implements. By forcing inode allocation into the AGs below 1TB and
> > > preventing data from being allocated in those AGs until allocation
> > > in all the AGs above start failing, it effectively provides the same
> > > functionality but without the constraints of a global first fit
> > > allocation policy.
> > >
> > > We can do this with any AG by setting it up to prefer metadata,
> > > but given we already have the inode32 allocator we can run some
> > > tests to see if setting the metadata-preferred flag makes the
> > > existing allocation policies do what is needed.
> > >
> > > That is, mkfs a new 2TB filesystem with the same 344AG geometry as
> > > above, mount it with -o inode32 and run the workload that fragments
> > > all the free space. What we should see is that AGs in the upper TB
> > > of the filesystem should fill almost to full before any significant
> > > amount of allocation occurs in the AGs in the first TB of space.
>
> Have you performed this experiment yet?
>
> I did not ask it idly, and I certainly did not ask it with the intent
> that we might implement inode32 with AFs. It is fundamentally
> impossible to implement inode32 with the proposed AF feature.
>
> The inode32 policy -requires- top down data fill so that AG 0 is the
> *last to fill* with user data. The AF first-fit proposal guarantees
> bottom up fill where AG 0 is the *first to fill* with user data.
>
> For example:
>
> > So for the inode32 logarithm:
> > 1. I need to specify a preferred ag, like ag 0:
> > |----------------------------
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +----------------------------
> > 2. Someday space will be used up to 100%, Then we have to growfs to ag 7:
> > +------+------+------+------+------+------+------+------+
> > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > +------+------+------+------+------+------+------+------+
> > 3. specify another ag for inodes again.
> > 4. repeat 1-3.
>
> Lets's assume that AGs are 512GB each and so AGs 0 and 1 fill the
> entire lower 1TB of the filesystem. Hence if we get to all AGs full
> the entire inode32 inode allocation space is full.
>
> Even if we grow the filesystem at this point, we still *cannot*
> allocate more inodes in the inode32 space. That space (AGs 0-1) is
> full even after the growfs.  Hence we will still give ENOSPC, and
> that is -correct behaviour- because the inode32 policy requires this
> behaviour.
>
> IOWs, growfs and changing the AF bounds cannot fix ENOSPC on inode32
> when the inode space is exhausted. Only physically moving data out
> of the lower AGs can fix that problem...
>
> > for the AF logarithm:
> >     mount -o af1=1 $dev $mnt
> > and we are done.
> > |<-----+ af 0 +----->|<af 1>|
> > |----------------------------
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +----------------------------
> > because the af is a relative number to ag_count, so when growfs, it will
> > become:
> > |<-----+ af 0 +--------------------------------->|<af 1>|
> > +------+------+------+------+------+------+------+------+
> > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > +------+------+------+------+------+------+------+------+
> > So just set it once, and run forever.
>
> That is actually the general solution to the original problem being
> reported. I realised this about half way through reading your
> original proposal. This is why I pointed out inode32 and the
> preferred metadata mechanism in the AG allocator policies.
>
> That is, a general solution should only require the highest AG
> to be marked as metadata preferred. Then -all- data allocation will
> then skip over the highest AG until there is no space left in any of
> the lower AGs. This behaviour will be enforced by the existing AG
> iteration allocation algorithms without any change being needed.
>
> Then when we grow the fs, we set the new highest AG to be metadata
> preferred, and that space will now be reserved for inodes until all
> other space is consumed.
>
> Do you now understand why I asked you to test whether the inode32
> mount option kept the data out of the lower AGs until the higher AGs
> were completely filled? It's because I wanted confirmation that the
> metadata preferred flag would do what we need to implement a
> general solution for the problematic workload.
>

Hi, I have tested the inode32 mount option. To my suprise, the inode32
or the metadata preferred structure (will be referred to as inode32 for the
rest reply) doesn't implement the desired behavior as the AF rule[1] does:
        Lower AFs/AGs will do anything they can for allocation before going
to HIGHER/RESERVED AFs/AGS. [1]

While the inode32 does:
        Lower AFs/AGs won't do anything they can for allocation before going
to HIGHER/RESERVED AFs/AGS.

To illustrate that, imagine that now AG 2 is badly fragmented and AG 0/1 are
reserved for inode32:
       +------------------------+
       |                        |
ag 0   |                        |
       +------------------------+
       |                        |
ag 1   |                        |
       +------------------------+
       |  1     1     1         |
       |  1        1  1         |
       |  1  1 4       1        |
ag 2   |           1     1      |
       |      1     1           |
       |    4  8          1     |
       +------------------------+
We want a allocate a space len of 4, but ag 2 is so fragmented that there
is no such continuous space that we have only two choices:
1. Break down the 4 into many small pieces for a success allocation in AG 2.
2. Go to the reserved AG 0/1 for the allocation.

But unlike the AF, the inode32 will choose option 2...

To understand the reason for that, we must understand the general allocation
phases:
1. Best length fit. Find the best length for the current allocation in
two loops.
    1.1. First loop with *_TRY_LOCK flags.
    1.2. Second loop without *_TRY_LOCK flags.
2. Low space algorithm. Break the allocation into small pieces and fit them into
   the free space one by one.

So for the AF, it will do anything it can before going to higher AFs. *anything*
means the allocation must completely go through the whole 1.1->1.2->2 phase and
then go to the next AF.
But for the inode32, it will only go through 1.1-> and then go to the
reserved AG.

Take a look at the core code snippet for inode32 in xfs_alloc_fix_freelist():


        /*
         * If this is a metadata preferred pag and we are user data then try
         * somewhere else if we are not being asked to try harder at this
         * point
         */
        if (xfs_perag_prefers_metadata(pag) &&
            (args->datatype & XFS_ALLOC_USERDATA) &&
            (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
                ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
                goto out_agbp_relse;
        }

That's exactly how the inode32 sees if it should go to the RESERVED AG for
allocation or not.

The inode32 will see if the current alloction is in *_TRY_LOCK mode
or not, if it isn't, then it can go to the RESERVED AG for allocation.
But at this moment, the allocation in unreserved ags only have gone
through 1.1->...

And seen from the code analysis for metadata preference algorithm, using the
preference info to comply with the rule[1](indicate the unreserved AGs already
having gone through 1.1->1.2->2) will greatly increase the
system complexity compared with the AF algorithm, or basically impossible...

To sum it up:
1. The inode32/metadata-preference doesn't comply with the rule[1]. So it has
   no ability to solve the reported problem.
2. Since the inode32 is kinda conficted with AF, maybe the AF should be disabled
   when inode32 gets enabled...


> Remeber: free space fragmentation can happen for many reasons - this
> mysql thing is just the latest one discovered.  The best solution is
> having general mechanisms in the filesystem that automatically
> mitigate the effects of free space fragmentation on inode
> allocation. The worst solution is requiring users to tweak knobs...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Dave Chinner Nov. 20, 2024, 10:53 p.m. UTC | #10

On Sun, Nov 17, 2024 at 09:34:53AM +0800, Stephen Zhang wrote:
> Dave Chinner <david@fromorbit.com> 于2024年11月11日周一 10:04写道：
> >
> > On Fri, Nov 08, 2024 at 09:34:17AM +0800, Stephen Zhang wrote:
> > > Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 20:15写道：
> > > > On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> > > > > Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 11:32写道：
> > > > > > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> >
> > [snip unnecessary stereotyping, accusations and repeated information]
> >
> > > > AFAICT, this "reserve AG space for inodes" behaviour that you are
> > > > trying to acheive is effectively what the inode32 allocator already
> > > > implements. By forcing inode allocation into the AGs below 1TB and
> > > > preventing data from being allocated in those AGs until allocation
> > > > in all the AGs above start failing, it effectively provides the same
> > > > functionality but without the constraints of a global first fit
> > > > allocation policy.
> > > >
> > > > We can do this with any AG by setting it up to prefer metadata,
> > > > but given we already have the inode32 allocator we can run some
> > > > tests to see if setting the metadata-preferred flag makes the
> > > > existing allocation policies do what is needed.
> > > >
> > > > That is, mkfs a new 2TB filesystem with the same 344AG geometry as
> > > > above, mount it with -o inode32 and run the workload that fragments
> > > > all the free space. What we should see is that AGs in the upper TB
> > > > of the filesystem should fill almost to full before any significant
> > > > amount of allocation occurs in the AGs in the first TB of space.
> >
> > Have you performed this experiment yet?
> >
> > I did not ask it idly, and I certainly did not ask it with the intent
> > that we might implement inode32 with AFs. It is fundamentally
> > impossible to implement inode32 with the proposed AF feature.
> >
> > The inode32 policy -requires- top down data fill so that AG 0 is the
> > *last to fill* with user data. The AF first-fit proposal guarantees
> > bottom up fill where AG 0 is the *first to fill* with user data.
> >
> > For example:
> >
> > > So for the inode32 logarithm:
> > > 1. I need to specify a preferred ag, like ag 0:
> > > |----------------------------
> > > | ag 0 | ag 1 | ag 2 | ag 3 |
> > > +----------------------------
> > > 2. Someday space will be used up to 100%, Then we have to growfs to ag 7:
> > > +------+------+------+------+------+------+------+------+
> > > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > > +------+------+------+------+------+------+------+------+
> > > 3. specify another ag for inodes again.
> > > 4. repeat 1-3.
> >
> > Lets's assume that AGs are 512GB each and so AGs 0 and 1 fill the
> > entire lower 1TB of the filesystem. Hence if we get to all AGs full
> > the entire inode32 inode allocation space is full.
> >
> > Even if we grow the filesystem at this point, we still *cannot*
> > allocate more inodes in the inode32 space. That space (AGs 0-1) is
> > full even after the growfs.  Hence we will still give ENOSPC, and
> > that is -correct behaviour- because the inode32 policy requires this
> > behaviour.
> >
> > IOWs, growfs and changing the AF bounds cannot fix ENOSPC on inode32
> > when the inode space is exhausted. Only physically moving data out
> > of the lower AGs can fix that problem...
> >
> > > for the AF logarithm:
> > >     mount -o af1=1 $dev $mnt
> > > and we are done.
> > > |<-----+ af 0 +----->|<af 1>|
> > > |----------------------------
> > > | ag 0 | ag 1 | ag 2 | ag 3 |
> > > +----------------------------
> > > because the af is a relative number to ag_count, so when growfs, it will
> > > become:
> > > |<-----+ af 0 +--------------------------------->|<af 1>|
> > > +------+------+------+------+------+------+------+------+
> > > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > > +------+------+------+------+------+------+------+------+
> > > So just set it once, and run forever.
> >
> > That is actually the general solution to the original problem being
> > reported. I realised this about half way through reading your
> > original proposal. This is why I pointed out inode32 and the
> > preferred metadata mechanism in the AG allocator policies.
> >
> > That is, a general solution should only require the highest AG
> > to be marked as metadata preferred. Then -all- data allocation will
> > then skip over the highest AG until there is no space left in any of
> > the lower AGs. This behaviour will be enforced by the existing AG
> > iteration allocation algorithms without any change being needed.
> >
> > Then when we grow the fs, we set the new highest AG to be metadata
> > preferred, and that space will now be reserved for inodes until all
> > other space is consumed.
> >
> > Do you now understand why I asked you to test whether the inode32
> > mount option kept the data out of the lower AGs until the higher AGs
> > were completely filled? It's because I wanted confirmation that the
> > metadata preferred flag would do what we need to implement a
> > general solution for the problematic workload.
> >
> 
> Hi, I have tested the inode32 mount option. To my suprise, the inode32
> or the metadata preferred structure (will be referred to as inode32 for the
> rest reply) doesn't implement the desired behavior as the AF rule[1] does:
>         Lower AFs/AGs will do anything they can for allocation before going
> to HIGHER/RESERVED AFs/AGS. [1]

This isn't important or relevant to the experiment I asked you to
perform and report the results of.

I asked you to observe and report the filesystem fill pattern in
your environment when metadata preferred AGs are enabled. It isn't
important whether inode32 exactly solves your problem, what I want
to know is whether the underlying mechanism has sufficient control
to provide a general solution that is always enabled.

This is foundational engineering process: check your hypothesis work
as you expect before building more stuff on top of them. i.e.
perform experiments to confirm your ideas will work before doing
anything else.

If you answer a request for an experiment to be run with "theory
tells me it won't work" then you haven't understood why you were
asked to run an experiment in the first place.

If you can't run requested experiments or don't understand why an
expert might be asking for that experiment to be run, then say so.
I can explain in more detail, but I don't like to waste time on
ideas that I can't confirm have a solid basis in reality...

-Dave.

Stephen Zhang Nov. 21, 2024, 7:17 a.m. UTC | #11

Dave Chinner <david@fromorbit.com> 于2024年11月21日周四 06:53写道：
>
> On Sun, Nov 17, 2024 at 09:34:53AM +0800, Stephen Zhang wrote:
> > Dave Chinner <david@fromorbit.com> 于2024年11月11日周一 10:04写道：
> > >
> > > On Fri, Nov 08, 2024 at 09:34:17AM +0800, Stephen Zhang wrote:
> > > > Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 20:15写道：
> > > > > On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> > > > > > Dave Chinner <david@fromorbit.com> 于2024年11月4日周一 11:32写道：
> > > > > > > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> > >
> > > [snip unnecessary stereotyping, accusations and repeated information]
> > >
> > > > > AFAICT, this "reserve AG space for inodes" behaviour that you are
> > > > > trying to acheive is effectively what the inode32 allocator already
> > > > > implements. By forcing inode allocation into the AGs below 1TB and
> > > > > preventing data from being allocated in those AGs until allocation
> > > > > in all the AGs above start failing, it effectively provides the same
> > > > > functionality but without the constraints of a global first fit
> > > > > allocation policy.
> > > > >
> > > > > We can do this with any AG by setting it up to prefer metadata,
> > > > > but given we already have the inode32 allocator we can run some
> > > > > tests to see if setting the metadata-preferred flag makes the
> > > > > existing allocation policies do what is needed.
> > > > >
> > > > > That is, mkfs a new 2TB filesystem with the same 344AG geometry as
> > > > > above, mount it with -o inode32 and run the workload that fragments
> > > > > all the free space. What we should see is that AGs in the upper TB
> > > > > of the filesystem should fill almost to full before any significant
> > > > > amount of allocation occurs in the AGs in the first TB of space.
> > >
> > > Have you performed this experiment yet?
> > >
> > > I did not ask it idly, and I certainly did not ask it with the intent
> > > that we might implement inode32 with AFs. It is fundamentally
> > > impossible to implement inode32 with the proposed AF feature.
> > >
> > > The inode32 policy -requires- top down data fill so that AG 0 is the
> > > *last to fill* with user data. The AF first-fit proposal guarantees
> > > bottom up fill where AG 0 is the *first to fill* with user data.
> > >
> > > For example:
> > >
> > > > So for the inode32 logarithm:
> > > > 1. I need to specify a preferred ag, like ag 0:
> > > > |----------------------------
> > > > | ag 0 | ag 1 | ag 2 | ag 3 |
> > > > +----------------------------
> > > > 2. Someday space will be used up to 100%, Then we have to growfs to ag 7:
> > > > +------+------+------+------+------+------+------+------+
> > > > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > > > +------+------+------+------+------+------+------+------+
> > > > 3. specify another ag for inodes again.
> > > > 4. repeat 1-3.
> > >
> > > Lets's assume that AGs are 512GB each and so AGs 0 and 1 fill the
> > > entire lower 1TB of the filesystem. Hence if we get to all AGs full
> > > the entire inode32 inode allocation space is full.
> > >
> > > Even if we grow the filesystem at this point, we still *cannot*
> > > allocate more inodes in the inode32 space. That space (AGs 0-1) is
> > > full even after the growfs.  Hence we will still give ENOSPC, and
> > > that is -correct behaviour- because the inode32 policy requires this
> > > behaviour.
> > >
> > > IOWs, growfs and changing the AF bounds cannot fix ENOSPC on inode32
> > > when the inode space is exhausted. Only physically moving data out
> > > of the lower AGs can fix that problem...
> > >
> > > > for the AF logarithm:
> > > >     mount -o af1=1 $dev $mnt
> > > > and we are done.
> > > > |<-----+ af 0 +----->|<af 1>|
> > > > |----------------------------
> > > > | ag 0 | ag 1 | ag 2 | ag 3 |
> > > > +----------------------------
> > > > because the af is a relative number to ag_count, so when growfs, it will
> > > > become:
> > > > |<-----+ af 0 +--------------------------------->|<af 1>|
> > > > +------+------+------+------+------+------+------+------+
> > > > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > > > +------+------+------+------+------+------+------+------+
> > > > So just set it once, and run forever.
> > >
> > > That is actually the general solution to the original problem being
> > > reported. I realised this about half way through reading your
> > > original proposal. This is why I pointed out inode32 and the
> > > preferred metadata mechanism in the AG allocator policies.
> > >
> > > That is, a general solution should only require the highest AG
> > > to be marked as metadata preferred. Then -all- data allocation will
> > > then skip over the highest AG until there is no space left in any of
> > > the lower AGs. This behaviour will be enforced by the existing AG
> > > iteration allocation algorithms without any change being needed.
> > >
> > > Then when we grow the fs, we set the new highest AG to be metadata
> > > preferred, and that space will now be reserved for inodes until all
> > > other space is consumed.
> > >
> > > Do you now understand why I asked you to test whether the inode32
> > > mount option kept the data out of the lower AGs until the higher AGs
> > > were completely filled? It's because I wanted confirmation that the
> > > metadata preferred flag would do what we need to implement a
> > > general solution for the problematic workload.
> > >
> >
> > Hi, I have tested the inode32 mount option. To my suprise, the inode32
> > or the metadata preferred structure (will be referred to as inode32 for the
> > rest reply) doesn't implement the desired behavior as the AF rule[1] does:
> >         Lower AFs/AGs will do anything they can for allocation before going
> > to HIGHER/RESERVED AFs/AGS. [1]
>
> This isn't important or relevant to the experiment I asked you to
> perform and report the results of.
>
> I asked you to observe and report the filesystem fill pattern in
> your environment when metadata preferred AGs are enabled. It isn't
> important whether inode32 exactly solves your problem, what I want
> to know is whether the underlying mechanism has sufficient control
> to provide a general solution that is always enabled.
>
> This is foundational engineering process: check your hypothesis work
> as you expect before building more stuff on top of them. i.e.
> perform experiments to confirm your ideas will work before doing
> anything else.
>
> If you answer a request for an experiment to be run with "theory
> tells me it won't work" then you haven't understood why you were
> asked to run an experiment in the first place.
>

If I understand your reply correctly, then maybe my expression is the
problem. What I replied before is:
1. I have tested the inode32 option with the metadata preferred AGs
enabled(Yeah, I do check if the AG is set with
XFS_AGSTATE_PREFERS_METADATA). And with the alternating-
punching pattern, I observed that the preferred AG will still get fragmented
quickly, but the AF will not.
(That's what I meant in the first sentence of my previous reply...)
2. Then I tried to explain why it doesn't work in theory.

Sorry for any misunderstanding because of my unclear reply.

Cheers,
Shida

> If you can't run requested experiments or don't understand why an
> expert might be asking for that experiment to be run, then say so.
> I can explain in more detail, but I don't like to waste time on
> ideas that I can't confirm have a solid basis in reality...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

[0/5] *** Introduce new space allocation algorithm ***

Message

Comments

[0/5] * Introduce new space allocation algorithm *