
[6/8] xfs: introduce xfs_fs_destroy_super()

Message ID 20230531095742.2480623-7-qi.zheng@linux.dev (mailing list archive)
State Under Review
Series make unregistration of super_block shrinker more faster

Commit Message

Qi Zheng May 31, 2023, 9:57 a.m. UTC
From: Kirill Tkhai <tkhai@ya.ru>

xfs_fs_nr_cached_objects() touches sb->s_fs_info, so this patch
defers its destruction to the new ->destroy_super() callback.

After this patch xfs_fs_nr_cached_objects() is safe against the
splitting of unregister_shrinker(): mp->m_perag_tree stays stable
until destroy_super_work(), while iteration over it is already
RCU-protected inside XFS.

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

Comments

Dave Chinner May 31, 2023, 11:48 p.m. UTC | #1
On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
> 
> xfs_fs_nr_cached_objects() touches sb->s_fs_info,
> and this patch makes it to be destructed later.
> 
> After this patch xfs_fs_nr_cached_objects() is safe
> for splitting unregister_shrinker(): mp->m_perag_tree
> is stable till destroy_super_work(), while iteration
> over it is already RCU-protected by internal XFS
> business.
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 7e706255f165..694616524c76 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -743,11 +743,18 @@ xfs_fs_drop_inode(
>  }
>  
>  static void
> -xfs_mount_free(
> +xfs_free_names(
>  	struct xfs_mount	*mp)
>  {
>  	kfree(mp->m_rtname);
>  	kfree(mp->m_logname);
> +}
> +
> +static void
> +xfs_mount_free(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_free_names(mp);
>  	kmem_free(mp);
>  }
>  
> @@ -1136,8 +1143,19 @@ xfs_fs_put_super(
>  	xfs_destroy_mount_workqueues(mp);
>  	xfs_close_devices(mp);
>  
> -	sb->s_fs_info = NULL;
> -	xfs_mount_free(mp);
> +	xfs_free_names(mp);
> +}
> +
> +static void
> +xfs_fs_destroy_super(
> +	struct super_block	*sb)
> +{
> +	if (sb->s_fs_info) {
> +		struct xfs_mount	*mp = XFS_M(sb);
> +
> +		kmem_free(mp);
> +		sb->s_fs_info = NULL;
> +	}
>  }
>  
>  static long
> @@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
>  	.dirty_inode		= xfs_fs_dirty_inode,
>  	.drop_inode		= xfs_fs_drop_inode,
>  	.put_super		= xfs_fs_put_super,
> +	.destroy_super		= xfs_fs_destroy_super,
>  	.sync_fs		= xfs_fs_sync_fs,
>  	.freeze_fs		= xfs_fs_freeze,
>  	.unfreeze_fs		= xfs_fs_unfreeze,

I don't really like this ->destroy_super() callback, especially as
it's completely undocumented as to why it exists. This is purely a
work-around for handling extended filesystem superblock shrinker
functionality, yet there's nothing that tells the reader this.

It also seems to imply that the superblock shrinker can continue to
run after the existing unregister_shrinker() call before ->kill_sb()
is called. This violates the assumption made in filesystems that the
superblock shrinkers have been stopped and will never run again
before ->kill_sb() is called. Hence ->kill_sb() implementations
assume there is nothing else accessing filesystem owned structures
and it can tear down internal structures safely.
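
For context, this guarantee comes from the teardown ordering in
deactivate_locked_super(); a paraphrase of the 6.3-era fs/super.c
code, with details elided:

```
void deactivate_locked_super(struct super_block *s)
{
	struct file_system_type *fs = s->s_type;

	if (atomic_dec_and_test(&s->s_active)) {
		/* Returns only once the sb shrinker has stopped and
		 * can never run again... */
		unregister_shrinker(&s->s_shrink);
		/* ...so ->kill_sb() can safely tear down fs-owned
		 * structures with no reclaim running against them. */
		fs->kill_sb(s);
		/* ... list_lru teardown, put_filesystem(), put_super() ... */
	} else {
		up_write(&s->s_umount);
	}
}
```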

Realistically, the days of XFS using this superblock shrinker
extension are numbered. We've got a lot of the infrastructure we
need in place to get rid of the background inode reclaim
infrastructure that requires this shrinker extension, and it's on my
list of things that need to be addressed in the near future. 

In fact, now that I look at it, I think the shmem usage of this
superblock shrinker interface is broken - it returns SHRINK_STOP to
->free_cached_objects(), but the only valid return value is the
number of objects freed (i.e. 0 is nothing freed). These special
superblock extension interfaces do not work like a normal
shrinker....
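
The contract is visible in super_cache_scan() in fs/super.c, which
(paraphrased, with the dentry/inode pruning elided) does roughly:

```
static unsigned long super_cache_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	struct super_block *sb = container_of(shrink, struct super_block,
					      s_shrink);
	long fs_objects = 0, total_objects, freed = 0;
	long dentries, inodes;

	if (sb->s_op->nr_cached_objects)
		fs_objects = sb->s_op->nr_cached_objects(sb, sc);

	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
	total_objects = dentries + inodes + fs_objects + 1;

	/* ... prune_dcache_sb()/prune_icache_sb() add into freed ... */

	if (fs_objects) {
		sc->nr_to_scan = mult_frac(sc->nr_scanned, fs_objects,
					   total_objects);
		/*
		 * The return value is accumulated directly, so
		 * SHRINK_STOP (~0UL) corrupts the freed count here
		 * instead of stopping the scan as it would for a
		 * normal shrinker's ->scan_objects().
		 */
		freed += sb->s_op->free_cached_objects(sb, sc);
	}
	return freed;
}
```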

Hence I think the shmem usage should be replaced with a separate
internal shmem shrinker that is managed by the filesystem itself
(similar to how XFS has multiple internal shrinkers).
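
As a rough illustration, such a filesystem-private shrinker might be
wired up like this (a sketch only: the structure, callbacks, and
counter below are made-up names, not existing shmem code):

```
/* Hypothetical per-sb state; shmem has no such fields today. */
struct shmem_private {
	struct shrinker	shrinker;
	atomic_long_t	nr_reclaimable;
};

static unsigned long shmem_private_count(struct shrinker *shrink,
					 struct shrink_control *sc)
{
	struct shmem_private *p =
		container_of(shrink, struct shmem_private, shrinker);

	/* a count, where 0 means "nothing to do"; never SHRINK_STOP */
	return atomic_long_read(&p->nr_reclaimable);
}

static unsigned long shmem_private_scan(struct shrinker *shrink,
					struct shrink_control *sc)
{
	unsigned long freed = 0;

	/* ... free up to sc->nr_to_scan internal objects ... */

	/* unlike ->free_cached_objects(), SHRINK_STOP is legal here */
	return freed ? freed : SHRINK_STOP;
}

/* registered at mount time, independent of the sb shrinker: */
p->shrinker.count_objects = shmem_private_count;
p->shrinker.scan_objects = shmem_private_scan;
p->shrinker.seeks = DEFAULT_SEEKS;
error = register_shrinker(&p->shrinker, "shmem-%s", sb->s_id);

/* and torn down by the filesystem before it frees its caches: */
unregister_shrinker(&p->shrinker);
```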

At this point, then the only user of this interface is (again) XFS.
Given this, adding new VFS methods for a single filesystem
for functionality that is planned to be removed is probably not the
best approach to solving the problem.

Cheers,

Dave.
Qi Zheng June 1, 2023, 8:43 a.m. UTC | #2
Hi Dave,

On 2023/6/1 07:48, Dave Chinner wrote:
> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>> From: Kirill Tkhai <tkhai@ya.ru>
>>
>> xfs_fs_nr_cached_objects() touches sb->s_fs_info,
>> and this patch makes it to be destructed later.
>>
>> After this patch xfs_fs_nr_cached_objects() is safe
>> for splitting unregister_shrinker(): mp->m_perag_tree
>> is stable till destroy_super_work(), while iteration
>> over it is already RCU-protected by internal XFS
>> business.
>>
>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
>>   1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 7e706255f165..694616524c76 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -743,11 +743,18 @@ xfs_fs_drop_inode(
>>   }
>>   
>>   static void
>> -xfs_mount_free(
>> +xfs_free_names(
>>   	struct xfs_mount	*mp)
>>   {
>>   	kfree(mp->m_rtname);
>>   	kfree(mp->m_logname);
>> +}
>> +
>> +static void
>> +xfs_mount_free(
>> +	struct xfs_mount	*mp)
>> +{
>> +	xfs_free_names(mp);
>>   	kmem_free(mp);
>>   }
>>   
>> @@ -1136,8 +1143,19 @@ xfs_fs_put_super(
>>   	xfs_destroy_mount_workqueues(mp);
>>   	xfs_close_devices(mp);
>>   
>> -	sb->s_fs_info = NULL;
>> -	xfs_mount_free(mp);
>> +	xfs_free_names(mp);
>> +}
>> +
>> +static void
>> +xfs_fs_destroy_super(
>> +	struct super_block	*sb)
>> +{
>> +	if (sb->s_fs_info) {
>> +		struct xfs_mount	*mp = XFS_M(sb);
>> +
>> +		kmem_free(mp);
>> +		sb->s_fs_info = NULL;
>> +	}
>>   }
>>   
>>   static long
>> @@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
>>   	.dirty_inode		= xfs_fs_dirty_inode,
>>   	.drop_inode		= xfs_fs_drop_inode,
>>   	.put_super		= xfs_fs_put_super,
>> +	.destroy_super		= xfs_fs_destroy_super,
>>   	.sync_fs		= xfs_fs_sync_fs,
>>   	.freeze_fs		= xfs_fs_freeze,
>>   	.unfreeze_fs		= xfs_fs_unfreeze,
> 
> I don't really like this ->destroy_super() callback, especially as
> it's completely undocumented as to why it exists. This is purely a
> work-around for handling extended filesystem superblock shrinker
> functionality, yet there's nothing that tells the reader this.
> 
> It also seems to imply that the superblock shrinker can continue to
> run after the existing unregister_shrinker() call before ->kill_sb()
> is called. This violates the assumption made in filesystems that the
> superblock shrinkers have been stopped and will never run again
> before ->kill_sb() is called. Hence ->kill_sb() implementations
> assume there is nothing else accessing filesystem owned structures
> and it can tear down internal structures safely.
> 
> Realistically, the days of XFS using this superblock shrinker
> extension are numbered. We've got a lot of the infrastructure we
> need in place to get rid of the background inode reclaim
> infrastructure that requires this shrinker extension, and it's on my
> list of things that need to be addressed in the near future.
> 
> In fact, now that I look at it, I think the shmem usage of this
> superblock shrinker interface is broken - it returns SHRINK_STOP to
> ->free_cached_objects(), but the only valid return value is the
> number of objects freed (i.e. 0 is nothing freed). These special
> superblock extension interfaces do not work like a normal
> shrinker....
> 
> Hence I think the shmem usage should be replaced with a separate
> internal shmem shrinker that is managed by the filesystem itself
> (similar to how XFS has multiple internal shrinkers).
> 
> At this point, then the only user of this interface is (again) XFS.
> Given this, adding new VFS methods for a single filesystem
> for functionality that is planned to be removed is probably not the
> best approach to solving the problem.

Thanks for such a detailed analysis. Kirill Tkhai just proposed a
new method[1], I cc'd you on the email.

[1]. https://lore.kernel.org/lkml/bab60fe4-964c-43a6-ecce-4cbd4981d875@ya.ru/

Thanks,
Qi

> 
> Cheers,
> 
> Dave.
Christian Brauner June 1, 2023, 9:58 a.m. UTC | #3
On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> Hi Dave,
> 
> On 2023/6/1 07:48, Dave Chinner wrote:
> > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > From: Kirill Tkhai <tkhai@ya.ru>
> > > 
> > > xfs_fs_nr_cached_objects() touches sb->s_fs_info,
> > > and this patch makes it to be destructed later.
> > > 
> > > After this patch xfs_fs_nr_cached_objects() is safe
> > > for splitting unregister_shrinker(): mp->m_perag_tree
> > > is stable till destroy_super_work(), while iteration
> > > over it is already RCU-protected by internal XFS
> > > business.
> > > 
> > > Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > > ---
> > >   fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
> > >   1 file changed, 22 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 7e706255f165..694616524c76 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -743,11 +743,18 @@ xfs_fs_drop_inode(
> > >   }
> > >   static void
> > > -xfs_mount_free(
> > > +xfs_free_names(
> > >   	struct xfs_mount	*mp)
> > >   {
> > >   	kfree(mp->m_rtname);
> > >   	kfree(mp->m_logname);
> > > +}
> > > +
> > > +static void
> > > +xfs_mount_free(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	xfs_free_names(mp);
> > >   	kmem_free(mp);
> > >   }
> > > @@ -1136,8 +1143,19 @@ xfs_fs_put_super(
> > >   	xfs_destroy_mount_workqueues(mp);
> > >   	xfs_close_devices(mp);
> > > -	sb->s_fs_info = NULL;
> > > -	xfs_mount_free(mp);
> > > +	xfs_free_names(mp);
> > > +}
> > > +
> > > +static void
> > > +xfs_fs_destroy_super(
> > > +	struct super_block	*sb)
> > > +{
> > > +	if (sb->s_fs_info) {
> > > +		struct xfs_mount	*mp = XFS_M(sb);
> > > +
> > > +		kmem_free(mp);
> > > +		sb->s_fs_info = NULL;
> > > +	}
> > >   }
> > >   static long
> > > @@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
> > >   	.dirty_inode		= xfs_fs_dirty_inode,
> > >   	.drop_inode		= xfs_fs_drop_inode,
> > >   	.put_super		= xfs_fs_put_super,
> > > +	.destroy_super		= xfs_fs_destroy_super,
> > >   	.sync_fs		= xfs_fs_sync_fs,
> > >   	.freeze_fs		= xfs_fs_freeze,
> > >   	.unfreeze_fs		= xfs_fs_unfreeze,
> > 
> > I don't really like this ->destroy_super() callback, especially as
> > it's completely undocumented as to why it exists. This is purely a
> > work-around for handling extended filesystem superblock shrinker
> > functionality, yet there's nothing that tells the reader this.
> > 
> > It also seems to imply that the superblock shrinker can continue to
> > run after the existing unregister_shrinker() call before ->kill_sb()
> > is called. This violates the assumption made in filesystems that the
> > superblock shrinkers have been stopped and will never run again
> > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > assume there is nothing else accessing filesystem owned structures
> > and it can tear down internal structures safely.
> > 
> > Realistically, the days of XFS using this superblock shrinker
> > extension are numbered. We've got a lot of the infrastructure we
> > need in place to get rid of the background inode reclaim
> > infrastructure that requires this shrinker extension, and it's on my
> > list of things that need to be addressed in the near future.
> > 
> > In fact, now that I look at it, I think the shmem usage of this
> > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > ->free_cached_objects(), but the only valid return value is the
> > number of objects freed (i.e. 0 is nothing freed). These special
> > superblock extension interfaces do not work like a normal
> > shrinker....
> > 
> > Hence I think the shmem usage should be replaced with a separate
> > internal shmem shrinker that is managed by the filesystem itself
> > (similar to how XFS has multiple internal shrinkers).
> > 
> > At this point, then the only user of this interface is (again) XFS.
> > Given this, adding new VFS methods for a single filesystem
> > for functionality that is planned to be removed is probably not the
> > best approach to solving the problem.
> 
> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> new method[1], I cc'd you on the email.
> 
> [1].
> https://lore.kernel.org/lkml/bab60fe4-964c-43a6-ecce-4cbd4981d875@ya.ru/

As long as we agree that we're not adding a new super operation,
that sounds like a better way forward.
Dave Chinner June 1, 2023, 11:06 p.m. UTC | #4
On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> Hi Dave,
> On 2023/6/1 07:48, Dave Chinner wrote:
> > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > From: Kirill Tkhai <tkhai@ya.ru>
> > I don't really like this ->destroy_super() callback, especially as
> > it's completely undocumented as to why it exists. This is purely a
> > work-around for handling extended filesystem superblock shrinker
> > functionality, yet there's nothing that tells the reader this.
> > 
> > It also seems to imply that the superblock shrinker can continue to
> > run after the existing unregister_shrinker() call before ->kill_sb()
> > is called. This violates the assumption made in filesystems that the
> > superblock shrinkers have been stopped and will never run again
> > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > assume there is nothing else accessing filesystem owned structures
> > and it can tear down internal structures safely.
> > 
> > Realistically, the days of XFS using this superblock shrinker
> > extension are numbered. We've got a lot of the infrastructure we
> > need in place to get rid of the background inode reclaim
> > infrastructure that requires this shrinker extension, and it's on my
> > list of things that need to be addressed in the near future.
> > 
> > In fact, now that I look at it, I think the shmem usage of this
> > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > ->free_cached_objects(), but the only valid return value is the
> > number of objects freed (i.e. 0 is nothing freed). These special
> > superblock extension interfaces do not work like a normal
> > shrinker....
> > 
> > Hence I think the shmem usage should be replaced with a separate
> > internal shmem shrinker that is managed by the filesystem itself
> > (similar to how XFS has multiple internal shrinkers).
> > 
> > At this point, then the only user of this interface is (again) XFS.
> > Given this, adding new VFS methods for a single filesystem
> > for functionality that is planned to be removed is probably not the
> > best approach to solving the problem.
> 
> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> new method[1], I cc'd you on the email.

I've just read through that thread, and I've looked at the original
patch that caused the regression.

I'm a bit annoyed right now. Nobody cc'd me on the original patches
nor were any of the subsystems that use shrinkers were cc'd on the
patches that changed shrinker behaviour. I only find out about this
because someone tries to fix something they broke by *breaking more
stuff* and not even realising how broken what they are proposing is.

The previous code was not broken and it provided specific guarantees
to subsystems via unregister_shrinker(). From the above discussion,
it appears that the original authors of these changes either did not
know about or did not understand them, so that casts doubt in my
mind about the attempted solution and all the proposed fixes for it.

I don't have the time right now unravel this mess and fully
understand the original problem, changes or the band-aids that are
being thrown around. We are also getting quite late in the cycle to
be doing major surgery to critical infrastructure, especially as it
gives so little time to review and regression test whatever new solution
is proposed.

Given this appears to be a change introduced in 6.4-rc1, I think the
right thing to do is to revert the change rather than make things
worse by trying to shove some "quick fix" into the kernel to address
it.

Andrew, could you please sort out a series to revert this shrinker
infrastructure change and all the dependent hacks that have been
added to try to fix it so far?

-Dave.
Qi Zheng June 2, 2023, 3:13 a.m. UTC | #5
Hi Dave,

On 2023/6/2 07:06, Dave Chinner wrote:
> On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
>> Hi Dave,
>> On 2023/6/1 07:48, Dave Chinner wrote:
>>> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>> I don't really like this ->destroy_super() callback, especially as
>>> it's completely undocumented as to why it exists. This is purely a
>>> work-around for handling extended filesystem superblock shrinker
>>> functionality, yet there's nothing that tells the reader this.
>>>
>>> It also seems to imply that the superblock shrinker can continue to
>>> run after the existing unregister_shrinker() call before ->kill_sb()
>>> is called. This violates the assumption made in filesystems that the
>>> superblock shrinkers have been stopped and will never run again
>>> before ->kill_sb() is called. Hence ->kill_sb() implementations
>>> assume there is nothing else accessing filesystem owned structures
>>> and it can tear down internal structures safely.
>>>
>>> Realistically, the days of XFS using this superblock shrinker
>>> extension are numbered. We've got a lot of the infrastructure we
>>> need in place to get rid of the background inode reclaim
>>> infrastructure that requires this shrinker extension, and it's on my
>>> list of things that need to be addressed in the near future.
>>>
>>> In fact, now that I look at it, I think the shmem usage of this
>>> superblock shrinker interface is broken - it returns SHRINK_STOP to
>>> ->free_cached_objects(), but the only valid return value is the
>>> number of objects freed (i.e. 0 is nothing freed). These special
>>> superblock extension interfaces do not work like a normal
>>> shrinker....
>>>
>>> Hence I think the shmem usage should be replaced with a separate
>>> internal shmem shrinker that is managed by the filesystem itself
>>> (similar to how XFS has multiple internal shrinkers).
>>>
>>> At this point, then the only user of this interface is (again) XFS.
>>> Given this, adding new VFS methods for a single filesystem
>>> for functionality that is planned to be removed is probably not the
>>> best approach to solving the problem.
>>
>> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
>> new method[1], I cc'd you on the email.
> 
> I've just read through that thread, and I've looked at the original
> patch that caused the regression.
> 
> I'm a bit annoyed right now. Nobody cc'd me on the original patches
> nor were any of the subsystems that use shrinkers were cc'd on the
> patches that changed shrinker behaviour. I only find out about this

Sorry about that, my mistake. I just followed the output of
scripts/get_maintainer.pl.

> because someone tries to fix something they broke by *breaking more
> stuff* and not even realising how broken what they are proposing is.

Yes, this slows down umount. But the benefit is that slab shrink
becomes lockless, the mount operation and slab shrink no longer
affect each other, and the IPC no longer drops significantly, etc.
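
For reference, the read side of the 6.4-rc1 change looks roughly like
the following (paraphrased from the lockless slab shrink patches),
which is where those benefits come from: reclaim no longer takes
shrinker_rwsem at all.

```
/* Paraphrased shrink_slab() walk from the 6.4-rc1 patches. */
struct shrinker *shrinker;
unsigned long freed = 0;
int srcu_idx;

srcu_idx = srcu_read_lock(&shrinker_srcu);
list_for_each_entry_srcu(shrinker, &shrinker_list, list,
			 srcu_read_lock_held(&shrinker_srcu)) {
	struct shrink_control sc = {
		.gfp_mask = gfp_mask,
		.nid = nid,
		.memcg = memcg,
	};

	freed += do_shrink_slab(&sc, shrinker, priority);
}
srcu_read_unlock(&shrinker_srcu, srcu_idx);

/* The cost moves to unregister_shrinker(), which must now wait in
 * synchronize_srcu() for all walkers -- hence the slower umount. */
```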

And I used bpftrace to measure the time consumption of
unregister_shrinker():

```
And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
8260 CPU @ 2.40GHz) and the results are as follows:

1) use synchronize_srcu():

@ns[umount]:
[8K, 16K)             83 |@@@@@@@       |
[16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)            78 |@@@@@@@       |
[64K, 128K)            6 |       |
[128K, 256K)           7 |       |
[256K, 512K)          29 |@@       |
[512K, 1M)            51 |@@@@      |
[1M, 2M)              90 |@@@@@@@@       |
[2M, 4M)              70 |@@@@@@      |
[4M, 8M)               8 |      |

2) use synchronize_srcu_expedited():

@ns[umount]:
[8K, 16K)             31 |@@       |
[16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)           158 |@@@@@@@@@@       |
[64K, 128K)            4 |       |
[128K, 256K)           2 |       |
[256K, 512K)           2 |       |
```

With synchronize_srcu(), most of the time consumption is between 16us
and 32us, with the worst case between 4ms and 8ms. Is this totally
unacceptable?

This performance regression report comes from a stress test. Would
umount really be executed this frequently under real workloads?

If that is really unacceptable, after applying the newly proposed
method, umount will be as fast as before (or even faster).

Thanks,
Qi

> 
> The previous code was not broken and it provided specific guarantees
> to subsystems via unregister_shrinker(). From the above discussion,
> it appears that the original authors of these changes either did not
> know about or did not understand them, so that casts doubt in my
> mind about the attempted solution and all the proposed fixes for it.
> 
> I don't have the time right now unravel this mess and fully
> understand the original problem, changes or the band-aids that are
> being thrown around. We are also getting quite late in the cycle to
> be doing major surgery to critical infrastructure, especially as it
> gives so little time to review and regression test whatever new solution
> is proposed.
> 
> Given this appears to be a change introduced in 6.4-rc1, I think the
> right thing to do is to revert the change rather than make things
> worse by trying to shove some "quick fix" into the kernel to address
> it.
> 
> Andrew, could you please sort out a series to revert this shrinker
> infrastructure change and all the dependent hacks that have been
> added to try to fix it so far?
> 
> -Dave.
Darrick J. Wong June 2, 2023, 3:15 p.m. UTC | #6
On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
> Hi Dave,
> 
> On 2023/6/2 07:06, Dave Chinner wrote:
> > On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> > > Hi Dave,
> > > On 2023/6/1 07:48, Dave Chinner wrote:
> > > > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > > > From: Kirill Tkhai <tkhai@ya.ru>
> > > > I don't really like this ->destroy_super() callback, especially as
> > > > it's completely undocumented as to why it exists. This is purely a
> > > > work-around for handling extended filesystem superblock shrinker
> > > > functionality, yet there's nothing that tells the reader this.
> > > > 
> > > > It also seems to imply that the superblock shrinker can continue to
> > > > run after the existing unregister_shrinker() call before ->kill_sb()
> > > > is called. This violates the assumption made in filesystems that the
> > > > superblock shrinkers have been stopped and will never run again
> > > > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > > > assume there is nothing else accessing filesystem owned structures
> > > > and it can tear down internal structures safely.
> > > > 
> > > > Realistically, the days of XFS using this superblock shrinker
> > > > extension are numbered. We've got a lot of the infrastructure we
> > > > need in place to get rid of the background inode reclaim
> > > > infrastructure that requires this shrinker extension, and it's on my
> > > > list of things that need to be addressed in the near future.
> > > > 
> > > > In fact, now that I look at it, I think the shmem usage of this
> > > > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > > > ->free_cached_objects(), but the only valid return value is the
> > > > number of objects freed (i.e. 0 is nothing freed). These special
> > > > superblock extension interfaces do not work like a normal
> > > > shrinker....
> > > > 
> > > > Hence I think the shmem usage should be replaced with a separate
> > > > internal shmem shrinker that is managed by the filesystem itself
> > > > (similar to how XFS has multiple internal shrinkers).
> > > > 
> > > > At this point, then the only user of this interface is (again) XFS.
> > > > Given this, adding new VFS methods for a single filesystem
> > > > for functionality that is planned to be removed is probably not the
> > > > best approach to solving the problem.
> > > 
> > > Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> > > new method[1], I cc'd you on the email.
> > 
> > I've just read through that thread, and I've looked at the original
> > patch that caused the regression.
> > 
> > I'm a bit annoyed right now. Nobody cc'd me on the original patches
> > nor were any of the subsystems that use shrinkers were cc'd on the
> > patches that changed shrinker behaviour. I only find out about this
> 
> Sorry about that, my mistake. I followed the results of
> scripts/get_maintainer.pl before.

Sometimes I wonder if people who contribute a lot to a subsystem should
be more aggressive about listing themselves explicitly in MAINTAINERS
but then I look at the ~600 emails that came in while I was on vacation
for 6 days over a long weekend and ... shut up. :P

> > because someone tries to fix something they broke by *breaking more
> > stuff* and not even realising how broken what they are proposing is.
> 
> Yes, this slows down the speed of umount. But the benefit is that
> slab shrink becomes lockless, the mount operation and slab shrink no
> longer affect each other, and the IPC no longer drops significantly,
> etc.

The lockless shrink seems like a good thing to have, but ... is it
really true that the superblock shrinker can still be running after
->kill_sb?  /That/ is surprising to me.

--D

> And I used bpftrace to measure the time consumption of
> unregister_shrinker():
> 
> ```
> And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
> 8260 CPU @ 2.40GHz) and the results are as follows:
> 
> 1) use synchronize_srcu():
> 
> @ns[umount]:
> [8K, 16K)             83 |@@@@@@@       |
> [16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)            78 |@@@@@@@       |
> [64K, 128K)            6 |       |
> [128K, 256K)           7 |       |
> [256K, 512K)          29 |@@       |
> [512K, 1M)            51 |@@@@      |
> [1M, 2M)              90 |@@@@@@@@       |
> [2M, 4M)              70 |@@@@@@      |
> [4M, 8M)               8 |      |
> 
> 2) use synchronize_srcu_expedited():
> 
> @ns[umount]:
> [8K, 16K)             31 |@@       |
> [16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)           158 |@@@@@@@@@@       |
> [64K, 128K)            4 |       |
> [128K, 256K)           2 |       |
> [256K, 512K)           2 |       |
> ```
> 
> With synchronize_srcu(), most of the time consumption is between 16us and
> 32us, the worst case between 4ms and 8ms. Is this totally
> unacceptable?
> 
> This performance regression report comes from a stress test. Will the
> umount action be executed so frequently under real workloads?
> 
> If that is really unacceptable, after applying the newly proposed
> method, umount will be as fast as before (or even faster).
> 
> Thanks,
> Qi
> 
> > 
> > The previous code was not broken and it provided specific guarantees
> > to subsystems via unregister_shrinker(). From the above discussion,
> > it appears that the original authors of these changes either did not
> > know about or did not understand them, so that casts doubt in my
> > mind about the attempted solution and all the proposed fixes for it.
> > 
> > I don't have the time right now unravel this mess and fully
> > understand the original problem, changes or the band-aids that are
> > being thrown around. We are also getting quite late in the cycle to
> > be doing major surgery to critical infrastructure, especially as it
> > > gives so little time to review and regression test whatever new solution
> > is proposed.
> > 
> > Given this appears to be a change introduced in 6.4-rc1, I think the
> > right thing to do is to revert the change rather than make things
> > worse by trying to shove some "quick fix" into the kernel to address
> > it.
> > 
> > Andrew, could you please sort out a series to revert this shrinker
> > infrastructure change and all the dependent hacks that have been
> > added to try to fix it so far?
> > 
> > -Dave.
> 
> -- 
> Thanks,
> Qi
Christian Brauner June 5, 2023, 11:50 a.m. UTC | #7
On Fri, Jun 02, 2023 at 08:15:32AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
> > Hi Dave,
> > 
> > On 2023/6/2 07:06, Dave Chinner wrote:
> > > On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> > > > Hi Dave,
> > > > On 2023/6/1 07:48, Dave Chinner wrote:
> > > > > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > > > > From: Kirill Tkhai <tkhai@ya.ru>
> > > > > I don't really like this ->destroy_super() callback, especially as
> > > > > it's completely undocumented as to why it exists. This is purely a
> > > > > work-around for handling extended filesystem superblock shrinker
> > > > > functionality, yet there's nothing that tells the reader this.
> > > > > 
> > > > > It also seems to imply that the superblock shrinker can continue to
> > > > > run after the existing unregister_shrinker() call before ->kill_sb()
> > > > > is called. This violates the assumption made in filesystems that the
> > > > > superblock shrinkers have been stopped and will never run again
> > > > > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > > > > assume there is nothing else accessing filesystem owned structures
> > > > > and it can tear down internal structures safely.
> > > > > 
> > > > > Realistically, the days of XFS using this superblock shrinker
> > > > > extension are numbered. We've got a lot of the infrastructure we
> > > > > need in place to get rid of the background inode reclaim
> > > > > infrastructure that requires this shrinker extension, and it's on my
> > > > > list of things that need to be addressed in the near future.
> > > > > 
> > > > > In fact, now that I look at it, I think the shmem usage of this
> > > > > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > > > > ->free_cached_objects(), but the only valid return value is the
> > > > > number of objects freed (i.e. 0 is nothing freed). These special
> > > > > superblock extension interfaces do not work like a normal
> > > > > shrinker....
> > > > > 
> > > > > Hence I think the shmem usage should be replaced with a separate
> > > > > internal shmem shrinker that is managed by the filesystem itself
> > > > > (similar to how XFS has multiple internal shrinkers).
> > > > > 
> > > > > At this point, then the only user of this interface is (again) XFS.
> > > > > Given this, adding new VFS methods for a single filesystem
> > > > > for functionality that is planned to be removed is probably not the
> > > > > best approach to solving the problem.
> > > > 
> > > > Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> > > > new method[1], I cc'd you on the email.
> > > 
> > > I've just read through that thread, and I've looked at the original
> > > patch that caused the regression.
> > > 
> > > I'm a bit annoyed right now. Nobody cc'd me on the original patches
> > > nor were any of the subsystems that use shrinkers were cc'd on the
> > > patches that changed shrinker behaviour. I only find out about this
> > 
> > Sorry about that, my mistake. I followed the results of
> > scripts/get_maintainer.pl before.
> 
> Sometimes I wonder if people who contribute a lot to a subsystem should
> be more aggressive about listing themselves explicitly in MAINTAINERS
> but then I look at the ~600 emails that came in while I was on vacation
> for 6 days over a long weekend and ... shut up. :P
> 
> > > because someone tries to fix something they broke by *breaking more
> > > stuff* and not even realising how broken what they are proposing is.
> > 
> > Yes, this slows down the speed of umount. But the benefit is that
> > slab shrink becomes lockless, the mount operation and slab shrink no
> > longer affect each other, and the IPC no longer drops significantly,
> > etc.
> 
> The lockless shrink seems like a good thing to have, but ... is it
> really true that the superblock shrinker can still be running after
> ->kill_sb?  /That/ is surprising to me.

So what's the plan here? If this causes issues for filesystems that rely
on specific guarantees that are broken by the patch then either it needs
a clean fix or a revert.

> 
> --D
> 
> > And I used bpftrace to measure the time consumption of
> > unregister_shrinker():
> > 
> > ```
> > And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
> > 8260 CPU @ 2.40GHz) and the results are as follows:
> > 
> > 1) use synchronize_srcu():
> > 
> > @ns[umount]:
> > [8K, 16K)             83 |@@@@@@@       |
> > [16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K)            78 |@@@@@@@       |
> > [64K, 128K)            6 |       |
> > [128K, 256K)           7 |       |
> > [256K, 512K)          29 |@@       |
> > [512K, 1M)            51 |@@@@      |
> > [1M, 2M)              90 |@@@@@@@@       |
> > [2M, 4M)              70 |@@@@@@      |
> > [4M, 8M)               8 |      |
> > 
> > 2) use synchronize_srcu_expedited():
> > 
> > @ns[umount]:
> > [8K, 16K)             31 |@@       |
> > [16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K)           158 |@@@@@@@@@@       |
> > [64K, 128K)            4 |       |
> > [128K, 256K)           2 |       |
> > [256K, 512K)           2 |       |
> > ```
> > 
> > With synchronize_srcu(), most of the time consumption is between 16us and
> > 32us, the worst case between 4ms and 8ms. Is this totally
> > unacceptable?
> > 
> > This performance regression report comes from a stress test. Will the
> > umount action be executed so frequently under real workloads?
> > 
> > If that is really unacceptable, after applying the newly proposed
> > method, umount will be as fast as before (or even faster).
> > 
> > Thanks,
> > Qi
> > 
> > > 
> > > The previous code was not broken and it provided specific guarantees
> > > to subsystems via unregister_shrinker(). From the above discussion,
> > > it appears that the original authors of these changes either did not
> > > know about or did not understand them, so that casts doubt in my
> > > mind about the attempted solution and all the proposed fixes for it.
> > > 
> > > I don't have the time right now unravel this mess and fully
> > > understand the original problem, changes or the band-aids that are
> > > being thrown around. We are also getting quite late in the cycle to
> > > be doing major surgery to critical infrastructure, especially as it
> > > > gives so little time to review and regression test whatever new solution
> > > is proposed.
> > > 
> > > Given this appears to be a change introduced in 6.4-rc1, I think the
> > > right thing to do is to revert the change rather than make things
> > > worse by trying to shove some "quick fix" into the kernel to address
> > > it.
> > > 
> > > Andrew, could you please sort out a series to revert this shrinker
> > > infrastructure change and all the dependent hacks that have been
> > > added to try to fix it so far?
> > > 
> > > -Dave.
> > 
> > -- 
> > Thanks,
> > Qi
Qi Zheng June 5, 2023, 12:16 p.m. UTC | #8
On 2023/6/5 19:50, Christian Brauner wrote:
> On Fri, Jun 02, 2023 at 08:15:32AM -0700, Darrick J. Wong wrote:
>> On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
>>> Hi Dave,
>>>
>>> On 2023/6/2 07:06, Dave Chinner wrote:
>>>> On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
>>>>> Hi Dave,
>>>>> On 2023/6/1 07:48, Dave Chinner wrote:
>>>>>> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>>>>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>>> I don't really like this ->destroy_super() callback, especially as
>>>>>> it's completely undocumented as to why it exists. This is purely a
>>>>>> work-around for handling extended filesystem superblock shrinker
>>>>>> functionality, yet there's nothing that tells the reader this.
>>>>>>
>>>>>> It also seems to imply that the superblock shrinker can continue to
>>>>>> run after the existing unregister_shrinker() call before ->kill_sb()
>>>>>> is called. This violates the assumption made in filesystems that the
>>>>>> superblock shrinkers have been stopped and will never run again
>>>>>> before ->kill_sb() is called. Hence ->kill_sb() implementations
>>>>>> assume there is nothing else accessing filesystem owned structures
>>>>>> and it can tear down internal structures safely.
>>>>>>
>>>>>> Realistically, the days of XFS using this superblock shrinker
>>>>>> extension are numbered. We've got a lot of the infrastructure we
>>>>>> need in place to get rid of the background inode reclaim
>>>>>> infrastructure that requires this shrinker extension, and it's on my
>>>>>> list of things that need to be addressed in the near future.
>>>>>>
>>>>>> In fact, now that I look at it, I think the shmem usage of this
>>>>>> superblock shrinker interface is broken - it returns SHRINK_STOP to
>>>>>> ->free_cached_objects(), but the only valid return value is the
>>>>>> number of objects freed (i.e. 0 is nothing freed). These special
>>>>>> superblock extension interfaces do not work like a normal
>>>>>> shrinker....
>>>>>>
>>>>>> Hence I think the shmem usage should be replaced with a separate
>>>>>> internal shmem shrinker that is managed by the filesystem itself
>>>>>> (similar to how XFS has multiple internal shrinkers).
>>>>>>
>>>>>> At this point, then the only user of this interface is (again) XFS.
>>>>>> Given this, adding new VFS methods for a single filesystem
>>>>>> for functionality that is planned to be removed is probably not the
>>>>>> best approach to solving the problem.
>>>>>
>>>>> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
>>>>> new method[1], I cc'd you on the email.
>>>>
>>>> I've just read through that thread, and I've looked at the original
>>>> patch that caused the regression.
>>>>
>>>> I'm a bit annoyed right now. Nobody cc'd me on the original patches
>>>> nor were any of the subsystems that use shrinkers were cc'd on the
>>>> patches that changed shrinker behaviour. I only find out about this
>>>
>>> Sorry about that, my mistake. I followed the results of
>>> scripts/get_maintainer.pl before.
>>
>> Sometimes I wonder if people who contribute a lot to a subsystem should
>> be more aggressive about listing themselves explicitly in MAINTAINERS
>> but then I look at the ~600 emails that came in while I was on vacation
>> for 6 days over a long weekend and ... shut up. :P
>>
>>>> because someone tries to fix something they broke by *breaking more
>>>> stuff* and not even realising how broken what they are proposing is.
>>>
>>> Yes, this slows down the speed of umount. But the benefit is that
>>> slab shrink becomes lockless, the mount operation and slab shrink no
>>> longer affect each other, and the IPC no longer drops significantly,
>>> etc.
>>
>> The lockless shrink seems like a good thing to have, but ... is it
>> really true that the superblock shrinker can still be running after
>> ->kill_sb?  /That/ is surprising to me.
> 
> So what's the plan here? If this causes issues for filesystems that rely
> on specific guarantees that are broken by the patch then either it needs
> a clean fix or a revert.

If the reduction in umount speed is really unacceptable, I think we can
try the patch[1] from Kirill Tkhai. At least the granularity of the
shrinker rwsem lock is reduced, and the file system code does not need
to be modified.

And IIUC, after clearing SHRINKER_REGISTERED under the write lock of
shrinker->rwsem, we can guarantee that the shrinker won't be running,
so synchronize_srcu() no longer needs to be called in
unregister_shrinker(), and we don't need to split unregister_shrinker().

[1]. https://lore.kernel.org/lkml/bab60fe4-964c-43a6-ecce-4cbd4981d875@ya.ru/
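
A sketch of that idea as I understand it (hypothetical; the actual
patch in [1] may differ in detail):

```
/*
 * Hypothetical sketch of [1]: shrink_slab() only calls into a
 * shrinker while holding shrinker->rwsem for read and seeing
 * SHRINKER_REGISTERED set, so clearing the flag under the write
 * lock excludes any concurrent invocation.
 */
void unregister_shrinker(struct shrinker *shrinker)
{
	if (!(shrinker->flags & SHRINKER_REGISTERED))
		return;

	down_write(&shrinker->rwsem);
	shrinker->flags &= ~SHRINKER_REGISTERED;
	up_write(&shrinker->rwsem);

	/*
	 * No synchronize_srcu() here: once the flag is clear, no
	 * shrink_slab() caller can still be inside this shrinker,
	 * so its memory can be freed immediately.
	 */
}
```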

> 
>>
>> --D
>>
>>> And I used bpftrace to measure the time consumption of
>>> unregister_shrinker():
>>>
>>> ```
>>> And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
>>> 8260 CPU @ 2.40GHz) and the results are as follows:
>>>
>>> 1) use synchronize_srcu():
>>>
>>> @ns[umount]:
>>> [8K, 16K)             83 |@@@@@@@       |
>>> [16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K)            78 |@@@@@@@       |
>>> [64K, 128K)            6 |       |
>>> [128K, 256K)           7 |       |
>>> [256K, 512K)          29 |@@       |
>>> [512K, 1M)            51 |@@@@      |
>>> [1M, 2M)              90 |@@@@@@@@       |
>>> [2M, 4M)              70 |@@@@@@      |
>>> [4M, 8M)               8 |      |
>>>
>>> 2) use synchronize_srcu_expedited():
>>>
>>> @ns[umount]:
>>> [8K, 16K)             31 |@@       |
>>> [16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K)           158 |@@@@@@@@@@       |
>>> [64K, 128K)            4 |       |
>>> [128K, 256K)           2 |       |
>>> [256K, 512K)           2 |       |
>>> ```
>>>
>>> With synchronize_srcu(), most of the time consumption is between 16us and
>>> 32us, the worst case between 4ms and 8ms. Is this totally
>>> unacceptable?
>>>
>>> This performance regression report comes from a stress test. Will the
>>> umount action be executed so frequently under real workloads?
>>>
>>> If that is really unacceptable, after applying the newly proposed
>>> method, umount will be as fast as before (or even faster).
>>>
>>> Thanks,
>>> Qi
>>>
>>>>
>>>> The previous code was not broken and it provided specific guarantees
>>>> to subsystems via unregister_shrinker(). From the above discussion,
>>>> it appears that the original authors of these changes either did not
>>>> know about or did not understand them, so that casts doubt in my
>>>> mind about the attempted solution and all the proposed fixes for it.
>>>>
>>>> I don't have the time right now unravel this mess and fully
>>>> understand the original problem, changes or the band-aids that are
>>>> being thrown around. We are also getting quite late in the cycle to
>>>> be doing major surgery to critical infrastructure, especially as it
>>>> gives so little time to review and regression test whatever new solution
>>>> is proposed.
>>>>
>>>> Given this appears to be a change introduced in 6.4-rc1, I think the
>>>> right thing to do is to revert the change rather than make things
>>>> worse by trying to shove some "quick fix" into the kernel to address
>>>> it.
>>>>
>>>> Andrew, could you please sort out a series to revert this shrinker
>>>> infrastructure change and all the dependent hacks that have been
>>>> added to try to fix it so far?
>>>>
>>>> -Dave.
>>>
>>> -- 
>>> Thanks,
>>> Qi

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 7e706255f165..694616524c76 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -743,11 +743,18 @@  xfs_fs_drop_inode(
 }
 
 static void
-xfs_mount_free(
+xfs_free_names(
 	struct xfs_mount	*mp)
 {
 	kfree(mp->m_rtname);
 	kfree(mp->m_logname);
+}
+
+static void
+xfs_mount_free(
+	struct xfs_mount	*mp)
+{
+	xfs_free_names(mp);
 	kmem_free(mp);
 }
 
@@ -1136,8 +1143,19 @@  xfs_fs_put_super(
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
 
-	sb->s_fs_info = NULL;
-	xfs_mount_free(mp);
+	xfs_free_names(mp);
+}
+
+static void
+xfs_fs_destroy_super(
+	struct super_block	*sb)
+{
+	if (sb->s_fs_info) {
+		struct xfs_mount	*mp = XFS_M(sb);
+
+		kmem_free(mp);
+		sb->s_fs_info = NULL;
+	}
 }
 
 static long
@@ -1165,6 +1183,7 @@  static const struct super_operations xfs_super_operations = {
 	.dirty_inode		= xfs_fs_dirty_inode,
 	.drop_inode		= xfs_fs_drop_inode,
 	.put_super		= xfs_fs_put_super,
+	.destroy_super		= xfs_fs_destroy_super,
 	.sync_fs		= xfs_fs_sync_fs,
 	.freeze_fs		= xfs_fs_freeze,
 	.unfreeze_fs		= xfs_fs_unfreeze,