ovl: skip overlayfs superblocks at global sync

Message ID	158642098777.5635.10501704178160375549.stgit@buzz (mailing list archive)
State	New, archived
Headers	Return-Path: <SRS0=x5/O=5Z=vger.kernel.org=linux-fsdevel-owner@kernel.org>
	Subject: [PATCH] ovl: skip overlayfs superblocks at global sync
	From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
	To: linux-fsdevel@vger.kernel.org, Miklos Szeredi <miklos@szeredi.hu>, linux-kernel@vger.kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>, linux-unionfs@vger.kernel.org
	Cc: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
	Date: Thu, 09 Apr 2020 11:29:47 +0300
	Message-ID: <158642098777.5635.10501704178160375549.stgit@buzz>
	User-Agent: StGit/0.17.1-dirty
	MIME-Version: 1.0
	Content-Type: text/plain; charset="utf-8"
	Content-Transfer-Encoding: 7bit
	Sender: linux-fsdevel-owner@vger.kernel.org
	Precedence: bulk
Series	ovl: skip overlayfs superblocks at global sync

Konstantin Khlebnikov April 9, 2020, 8:29 a.m. UTC

Stacked filesystems like overlayfs have no writeback of their own, but they
have to forward syncfs() requests to the backend to preserve data integrity.

During a global sync(), each overlayfs instance calls ->sync_fs() on its
backend, even though the backend is itself in the global list of superblocks.
As a result, a single sync() syscall can write one superblock several times
and send multiple disk barriers.

This patch adds the flag SB_I_SKIP_SYNC to sb->s_iflags to avoid that.

Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/overlayfs/super.c |    5 +++--
 fs/sync.c            |    3 ++-
 include/linux/fs.h   |    2 ++
 3 files changed, 7 insertions(+), 3 deletions(-)
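
The effect described in the patch text can be illustrated with a small toy model. This is only a sketch in Python, not kernel code: the class `SuperBlock`, the helpers `global_sync()` and `backend_syncs()`, and the bit value given to `SB_I_SKIP_SYNC` are assumptions made for illustration, and the several passes a real sync() makes over the superblock list are collapsed into one.

```python
# Toy model of the global sync() walk over superblocks; not kernel code.
SB_I_SKIP_SYNC = 0x100  # illustrative bit value, assumed for this sketch


class SuperBlock:
    def __init__(self, name, backend=None, iflags=0):
        self.name = name
        self.backend = backend      # upper fs for a stacked fs, else None
        self.iflags = iflags        # stands in for sb->s_iflags
        self.sync_fs_calls = 0      # how often this sb was really synced

    def sync_fs(self):
        if self.backend is not None:
            # Stacked fs: no writeback of its own, forward to the backend.
            self.backend.sync_fs()
        else:
            self.sync_fs_calls += 1


def global_sync(supers, honor_skip_flag):
    """Walk every superblock once, as a simplified sync() would."""
    for sb in supers:
        if honor_skip_flag and (sb.iflags & SB_I_SKIP_SYNC):
            continue  # the backend is in the list and is synced on its own
        sb.sync_fs()


def backend_syncs(n_overlays, honor_skip_flag):
    """Return how many times the shared backend is synced by one sync()."""
    backend = SuperBlock("backend")
    overlays = [SuperBlock(f"ovl{i}", backend, SB_I_SKIP_SYNC)
                for i in range(n_overlays)]
    global_sync([backend] + overlays, honor_skip_flag)
    return backend.sync_fs_calls
```

In this model, with three overlays on one backend, `backend_syncs(3, False)` counts four backend syncs and `backend_syncs(3, True)` counts one.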

Amir Goldstein April 9, 2020, 10:23 a.m. UTC | #1

On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> Stacked filesystems like overlayfs has no own writeback, but they have to
> forward syncfs() requests to backend for keeping data integrity.
>
> During global sync() each overlayfs instance calls method ->sync_fs()
> for backend although it itself is in global list of superblocks too.
> As a result one syscall sync() could write one superblock several times
> and send multiple disk barriers.
>
> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>
> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---

Seems reasonable.
You may add:
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

+CC: containers list

This brings up old memories.
I posted this way back to fix handling of emergency_remount() in the
presence of loop mounted fs:
https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/

But it seems to me that emergency_sync() and sync(2) are equally broken
for this use case.

I wonder if anyone cares enough about resilience of loop mounted fs to try
and change the iterate_* functions to iterate supers/bdevs in reverse order...

Thanks,
Amir.

Konstantin Khlebnikov April 9, 2020, 11:28 a.m. UTC | #2

On 09/04/2020 13.23, Amir Goldstein wrote:
> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>>
>> Stacked filesystems like overlayfs has no own writeback, but they have to
>> forward syncfs() requests to backend for keeping data integrity.
>>
>> During global sync() each overlayfs instance calls method ->sync_fs()
>> for backend although it itself is in global list of superblocks too.
>> As a result one syscall sync() could write one superblock several times
>> and send multiple disk barriers.
>>
>> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>>
>> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> ---
> 
> Seems reasonable.
> You may add:
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> 
> +CC: containers list

Thanks

> 
> This bring up old memories.
> I posted this way back to fix handling of emergency_remount() in the
> presence of loop mounted fs:
> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
> 
> But seems to me that emergency_sync() and sync(2) are equally broken
> for this use case.
> 
> I wonder if anyone cares enough about resilience of loop mounted fs to try
> and change the iterate_* functions to iterate supers/bdevs in reverse order...

Now I see the reason behind "sync; sync; sync; reboot" =)

Iterating from old to new means new items are not missed if the list is
modified during iteration. That might be important for some users.

bdev iteration already seems to be reversed: inode_sb_list_add() adds to the head.

> 
> Thanks,
> Amir.
>

Amir Goldstein April 9, 2020, 11:48 a.m. UTC | #3

On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> On 09/04/2020 13.23, Amir Goldstein wrote:
> > On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
> > <khlebnikov@yandex-team.ru> wrote:
> >>
> >> Stacked filesystems like overlayfs has no own writeback, but they have to
> >> forward syncfs() requests to backend for keeping data integrity.
> >>
> >> During global sync() each overlayfs instance calls method ->sync_fs()
> >> for backend although it itself is in global list of superblocks too.
> >> As a result one syscall sync() could write one superblock several times
> >> and send multiple disk barriers.
> >>
> >> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
> >>
> >> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
> >> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> >> ---
> >
> > Seems reasonable.
> > You may add:
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >
> > +CC: containers list
>
> Thanks
>
> >
> > This bring up old memories.
> > I posted this way back to fix handling of emergency_remount() in the
> > presence of loop mounted fs:
> > https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
> >
> > But seems to me that emergency_sync() and sync(2) are equally broken
> > for this use case.
> >
> > I wonder if anyone cares enough about resilience of loop mounted fs to try
> > and change the iterate_* functions to iterate supers/bdevs in reverse order...
>
> Now I see reason behind "sync; sync; sync; reboot" =)
>
> Order old -> new allows to not miss new items if list modifies.
> Might be important for some users.
>

That's not the reason I suggested reverse order.
The reason is that with loop mounted fs, the correct order of flushing is:
1. sync loop mounted fs inodes => writes to loop image file
2. sync loop mounted fs sb => fsyncs the loop image file
3. sync the loop image host fs sb

With forward sb iteration order, #3 happens before #1, so the
loop mounted fs changes are not really being made durable by
a single sync(2) call.

> bdev iteration seems already reversed: inode_sb_list_add adds to the head
>

I think bdev iteration order will not make a difference in this case.
Flushing /dev/loopX will not be needed, and it happens too late
anyway.

Thanks,
Amir.

Konstantin Khlebnikov April 9, 2020, 12:04 p.m. UTC | #4

On 09/04/2020 14.48, Amir Goldstein wrote:
> On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>>
>> On 09/04/2020 13.23, Amir Goldstein wrote:
>>> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
>>> <khlebnikov@yandex-team.ru> wrote:
>>>>
>>>> Stacked filesystems like overlayfs has no own writeback, but they have to
>>>> forward syncfs() requests to backend for keeping data integrity.
>>>>
>>>> During global sync() each overlayfs instance calls method ->sync_fs()
>>>> for backend although it itself is in global list of superblocks too.
>>>> As a result one syscall sync() could write one superblock several times
>>>> and send multiple disk barriers.
>>>>
>>>> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>>>>
>>>> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>> ---
>>>
>>> Seems reasonable.
>>> You may add:
>>> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>>>
>>> +CC: containers list
>>
>> Thanks
>>
>>>
>>> This bring up old memories.
>>> I posted this way back to fix handling of emergency_remount() in the
>>> presence of loop mounted fs:
>>> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
>>>
>>> But seems to me that emergency_sync() and sync(2) are equally broken
>>> for this use case.
>>>
>>> I wonder if anyone cares enough about resilience of loop mounted fs to try
>>> and change the iterate_* functions to iterate supers/bdevs in reverse order...
>>
>> Now I see reason behind "sync; sync; sync; reboot" =)
>>
>> Order old -> new allows to not miss new items if list modifies.
>> Might be important for some users.
>>
> 
> That's not the reason I suggested reverse order.
> The reason is that with loop mounted fs, the correct order of flushing is:
> 1. sync loop mounted fs inodes => writes to loop image file
> 2. sync loop mounted fs sb => fsyncs the loop image file
> 3. sync the loop image host fs sb
> 
> With forward sb iteration order, #3 happens before #1, so the
> loop mounted fs changes are not really being made durable by
> a single sync(2) call.

If the fs on the loop device is mounted with barriers, then sync_fs will issue
REQ_OP_FLUSH to the loop device and trigger fsync() for the image file.
sync() might write something twice, but the data should be safe.
Without barriers this scenario is broken for sure.

Emergency remount R/O is another matter. It really needs reverse order.
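
The barrier argument above can be sketched with a toy model. This is a rough Python illustration under simplifying assumptions, not kernel code: `HostFS`, `LoopFS` and `sync_forward_order()` are hypothetical names, a flush request to the loop device is modelled as a direct fsync() of the image file, and the host is always synced before the loop-mounted fs (forward order).

```python
# Toy model of a loop-mounted fs on top of a host fs; not kernel code.
class HostFS:
    def __init__(self):
        self.cached = set()    # image-file blocks dirty in host page cache
        self.on_disk = set()   # image-file blocks that reached stable storage
        self.flushes = 0

    def write_image(self, blocks):
        self.cached |= set(blocks)

    def flush(self):
        # Used both for the host's own sync_fs and for fsync() of the image.
        self.on_disk |= self.cached
        self.cached.clear()
        self.flushes += 1


class LoopFS:
    def __init__(self, host, barriers, dirty):
        self.host = host
        self.barriers = barriers
        self.dirty = set(dirty)

    def sync(self):
        # Writeback: dirty data lands in the image file (host page cache).
        self.host.write_image(self.dirty)
        self.dirty.clear()
        # sync_fs: with barriers, a flush request to the loop device
        # is modelled as fsync() of the image file.
        if self.barriers:
            self.host.flush()


def sync_forward_order(barriers):
    """Sync host first, loop fs second; return (durable blocks, flush count)."""
    host = HostFS()
    loop = LoopFS(host, barriers, dirty={1, 2, 3})
    host.flush()   # host superblock comes first in the list
    loop.sync()    # loop-mounted superblock comes later
    return sorted(host.on_disk), host.flushes
```

In this model, forward order with barriers still makes all three blocks durable at the cost of a second flush, while without barriers the blocks stay in the host page cache.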

> 
>> bdev iteration seems already reversed: inode_sb_list_add adds to the head
>>
> 
> I think bdev iteration order will not make a difference in this case.
> flushing /dev/loopX will not be needed and it happens too late
> anyway.
> 
> Thanks,
> Amir.
>

Amir Goldstein April 9, 2020, 1:22 p.m. UTC | #5

On Thu, Apr 9, 2020 at 3:04 PM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
>
>
> On 09/04/2020 14.48, Amir Goldstein wrote:
> > On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
> > <khlebnikov@yandex-team.ru> wrote:
> >>
> >> On 09/04/2020 13.23, Amir Goldstein wrote:
> >>> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
> >>> <khlebnikov@yandex-team.ru> wrote:
> >>>>
> >>>> Stacked filesystems like overlayfs has no own writeback, but they have to
> >>>> forward syncfs() requests to backend for keeping data integrity.
> >>>>
> >>>> During global sync() each overlayfs instance calls method ->sync_fs()
> >>>> for backend although it itself is in global list of superblocks too.
> >>>> As a result one syscall sync() could write one superblock several times
> >>>> and send multiple disk barriers.
> >>>>
> >>>> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
> >>>>
> >>>> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
> >>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> >>>> ---
> >>>
> >>> Seems reasonable.
> >>> You may add:
> >>> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >>>
> >>> +CC: containers list
> >>
> >> Thanks
> >>
> >>>
> >>> This bring up old memories.
> >>> I posted this way back to fix handling of emergency_remount() in the
> >>> presence of loop mounted fs:
> >>> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
> >>>
> >>> But seems to me that emergency_sync() and sync(2) are equally broken
> >>> for this use case.
> >>>
> >>> I wonder if anyone cares enough about resilience of loop mounted fs to try
> >>> and change the iterate_* functions to iterate supers/bdevs in reverse order...
> >>
> >> Now I see reason behind "sync; sync; sync; reboot" =)
> >>
> >> Order old -> new allows to not miss new items if list modifies.
> >> Might be important for some users.
> >>
> >
> > That's not the reason I suggested reverse order.
> > The reason is that with loop mounted fs, the correct order of flushing is:
> > 1. sync loop mounted fs inodes => writes to loop image file
> > 2. sync loop mounted fs sb => fsyncs the loop image file
> > 3. sync the loop image host fs sb
> >
> > With forward sb iteration order, #3 happens before #1, so the
> > loop mounted fs changes are not really being made durable by
> > a single sync(2) call.
>
> If fs in loop mounted with barriers then sync_fs will issue
> REQ_OP_FLUSH to loop device and trigger fsync() for image file.
> Sync() might write something twice but data should be safe.
> Without barriers this scenario is broken for sure.
>
> Emergency remount R/O is other thing. It really needs reverse order.
>

Correct. There is no problem with durability.
Although for some filesystems it would be more efficient to first
write and fsync the loop images and then call sync_fs().
It can potentially result in fewer disk barriers overall.

Thanks,
Amir.

Dave Chinner April 11, 2020, 10:28 p.m. UTC | #6

On Thu, Apr 09, 2020 at 11:29:47AM +0300, Konstantin Khlebnikov wrote:
> Stacked filesystems like overlayfs has no own writeback, but they have to
> forward syncfs() requests to backend for keeping data integrity.
> 
> During global sync() each overlayfs instance calls method ->sync_fs()
> for backend although it itself is in global list of superblocks too.
> As a result one syscall sync() could write one superblock several times
> and send multiple disk barriers.
> 
> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.

Why wouldn't you just remove the ->sync_fs method from overlay?

I mean, if you don't need the filesystem to do anything special for
one specific data integrity sync_fs call, you don't need it for any
of them, yes?

-Dave.

Amir Goldstein April 12, 2020, 6:46 a.m. UTC | #7

On Sun, Apr 12, 2020 at 1:29 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Apr 09, 2020 at 11:29:47AM +0300, Konstantin Khlebnikov wrote:
> > Stacked filesystems like overlayfs has no own writeback, but they have to
> > forward syncfs() requests to backend for keeping data integrity.
> >
> > During global sync() each overlayfs instance calls method ->sync_fs()
> > for backend although it itself is in global list of superblocks too.
> > As a result one syscall sync() could write one superblock several times
> > and send multiple disk barriers.
> >
> > This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>
> Why wouldn't you just remove the ->sync_fs method from overlay?
>
> I mean, if you don't need the filesystem to do anything special for
> one specific data integrity sync_fs call, you don't need it for any
> of them, yes?
>

No, but I understand the confusion.

Say you have 1000 overlay sb's, all of them using upper directories
from a single xfs sb (quite common for containers).

syncfs(2) on each overlay must call sync_fs of xfs (see ovl_sync_fs).
sync(2) will call xfs sync_fs anyway, so there is no point in calling
ovl_sync_fs => xfs sync_fs 1000 more times.
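
Why the ->sync_fs method cannot simply be dropped can be shown with a toy model of syncfs(2) on a single overlay. This is a Python sketch, not kernel code: `UpperFS`, `OverlayNoSyncFs`, `OverlayWithSyncFs`, `syncfs()` and `upper_dirty_after_syncfs()` are hypothetical names, and the overlay is assumed to hold no dirty data of its own.

```python
# Toy model of syncfs(2) on one overlay superblock; not kernel code.
class UpperFS:
    def __init__(self):
        self.dirty = True          # pending changes made through the overlay

    def sync_fs(self):
        self.dirty = False


class OverlayNoSyncFs:
    """Overlay with its ->sync_fs method removed."""
    def __init__(self, upper):
        self.upper = upper


class OverlayWithSyncFs(OverlayNoSyncFs):
    """Overlay that forwards ->sync_fs to its upper filesystem."""
    def sync_fs(self):
        self.upper.sync_fs()


def syncfs(sb):
    """Model of syncfs(2): call ->sync_fs only if the filesystem provides it."""
    op = getattr(sb, "sync_fs", None)
    if op is not None:
        op()


def upper_dirty_after_syncfs(overlay_cls):
    """Return whether the upper fs is still dirty after syncfs() on the overlay."""
    upper = UpperFS()
    syncfs(overlay_cls(upper))
    return upper.dirty
```

Here `upper_dirty_after_syncfs(OverlayNoSyncFs)` leaves the upper filesystem dirty, while `upper_dirty_after_syncfs(OverlayWithSyncFs)` does not, which is why the forwarding is kept for syncfs(2) and only skipped during the global walk.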

Thanks,
Amir.

Miklos Szeredi April 21, 2020, 9:36 a.m. UTC | #8

On Thu, Apr 9, 2020 at 10:29 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> Stacked filesystems like overlayfs has no own writeback, but they have to
> forward syncfs() requests to backend for keeping data integrity.
>
> During global sync() each overlayfs instance calls method ->sync_fs()
> for backend although it itself is in global list of superblocks too.
> As a result one syscall sync() could write one superblock several times
> and send multiple disk barriers.
>
> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>
> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Thanks, applied.

Miklos
