Message ID | 158642098777.5635.10501704178160375549.stgit@buzz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ovl: skip overlayfs superblocks at global sync | expand |
On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote: > > Stacked filesystems like overlayfs has no own writeback, but they have to > forward syncfs() requests to backend for keeping data integrity. > > During global sync() each overlayfs instance calls method ->sync_fs() > for backend although it itself is in global list of superblocks too. > As a result one syscall sync() could write one superblock several times > and send multiple disk barriers. > > This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. > > Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> > Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > --- Seems reasonable. You may add: Reviewed-by: Amir Goldstein <amir73il@gmail.com> +CC: containers list This bring up old memories. I posted this way back to fix handling of emergency_remount() in the presence of loop mounted fs: https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/ But seems to me that emergency_sync() and sync(2) are equally broken for this use case. I wonder if anyone cares enough about resilience of loop mounted fs to try and change the iterate_* functions to iterate supers/bdevs in reverse order... Thanks, Amir.
On 09/04/2020 13.23, Amir Goldstein wrote: > On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov > <khlebnikov@yandex-team.ru> wrote: >> >> Stacked filesystems like overlayfs has no own writeback, but they have to >> forward syncfs() requests to backend for keeping data integrity. >> >> During global sync() each overlayfs instance calls method ->sync_fs() >> for backend although it itself is in global list of superblocks too. >> As a result one syscall sync() could write one superblock several times >> and send multiple disk barriers. >> >> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. >> >> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> >> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> >> --- > > Seems reasonable. > You may add: > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > +CC: containers list Thanks > > This bring up old memories. > I posted this way back to fix handling of emergency_remount() in the > presence of loop mounted fs: > https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/ > > But seems to me that emergency_sync() and sync(2) are equally broken > for this use case. > > I wonder if anyone cares enough about resilience of loop mounted fs to try > and change the iterate_* functions to iterate supers/bdevs in reverse order... Now I see reason behind "sync; sync; sync; reboot" =) Order old -> new allows to not miss new items if list modifies. Might be important for some users. bdev iteration seems already reversed: inode_sb_list_add adds to the head > > Thanks, > Amir. >
On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote: > > On 09/04/2020 13.23, Amir Goldstein wrote: > > On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov > > <khlebnikov@yandex-team.ru> wrote: > >> > >> Stacked filesystems like overlayfs has no own writeback, but they have to > >> forward syncfs() requests to backend for keeping data integrity. > >> > >> During global sync() each overlayfs instance calls method ->sync_fs() > >> for backend although it itself is in global list of superblocks too. > >> As a result one syscall sync() could write one superblock several times > >> and send multiple disk barriers. > >> > >> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. > >> > >> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> > >> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > >> --- > > > > Seems reasonable. > > You may add: > > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > > > +CC: containers list > > Thanks > > > > > This bring up old memories. > > I posted this way back to fix handling of emergency_remount() in the > > presence of loop mounted fs: > > https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/ > > > > But seems to me that emergency_sync() and sync(2) are equally broken > > for this use case. > > > > I wonder if anyone cares enough about resilience of loop mounted fs to try > > and change the iterate_* functions to iterate supers/bdevs in reverse order... > > Now I see reason behind "sync; sync; sync; reboot" =) > > Order old -> new allows to not miss new items if list modifies. > Might be important for some users. > That's not the reason I suggested reverse order. The reason is that with loop mounted fs, the correct order of flushing is: 1. sync loop mounted fs inodes => writes to loop image file 2. sync loop mounted fs sb => fsyncs the loop image file 3. sync the loop image host fs sb With forward sb iteration order, #3 happens before #1, so the loop mounted fs changes are not really being made durable by a single sync(2) call. > bdev iteration seems already reversed: inode_sb_list_add adds to the head > I think bdev iteration order will not make a difference in this case. flushing /dev/loopX will not be needed and it happens too late anyway. Thanks, Amir.
On 09/04/2020 14.48, Amir Goldstein wrote: > On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov > <khlebnikov@yandex-team.ru> wrote: >> >> On 09/04/2020 13.23, Amir Goldstein wrote: >>> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov >>> <khlebnikov@yandex-team.ru> wrote: >>>> >>>> Stacked filesystems like overlayfs has no own writeback, but they have to >>>> forward syncfs() requests to backend for keeping data integrity. >>>> >>>> During global sync() each overlayfs instance calls method ->sync_fs() >>>> for backend although it itself is in global list of superblocks too. >>>> As a result one syscall sync() could write one superblock several times >>>> and send multiple disk barriers. >>>> >>>> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. >>>> >>>> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> >>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> >>>> --- >>> >>> Seems reasonable. >>> You may add: >>> Reviewed-by: Amir Goldstein <amir73il@gmail.com> >>> >>> +CC: containers list >> >> Thanks >> >>> >>> This bring up old memories. >>> I posted this way back to fix handling of emergency_remount() in the >>> presence of loop mounted fs: >>> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/ >>> >>> But seems to me that emergency_sync() and sync(2) are equally broken >>> for this use case. >>> >>> I wonder if anyone cares enough about resilience of loop mounted fs to try >>> and change the iterate_* functions to iterate supers/bdevs in reverse order... >> >> Now I see reason behind "sync; sync; sync; reboot" =) >> >> Order old -> new allows to not miss new items if list modifies. >> Might be important for some users. >> > > That's not the reason I suggested reverse order. > The reason is that with loop mounted fs, the correct order of flushing is: > 1. sync loop mounted fs inodes => writes to loop image file > 2. sync loop mounted fs sb => fsyncs the loop image file > 3. sync the loop image host fs sb > > With forward sb iteration order, #3 happens before #1, so the > loop mounted fs changes are not really being made durable by > a single sync(2) call. If fs in loop mounted with barriers then sync_fs will issue REQ_OP_FLUSH to loop device and trigger fsync() for image file. Sync() might write something twice but data should be safe. Without barriers this scenario is broken for sure. Emergency remount R/O is other thing. It really needs reverse order. > >> bdev iteration seems already reversed: inode_sb_list_add adds to the head >> > > I think bdev iteration order will not make a difference in this case. > flushing /dev/loopX will not be needed and it happens too late > anyway. > > Thanks, > Amir. >
On Thu, Apr 9, 2020 at 3:04 PM Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote: > > > > On 09/04/2020 14.48, Amir Goldstein wrote: > > On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov > > <khlebnikov@yandex-team.ru> wrote: > >> > >> On 09/04/2020 13.23, Amir Goldstein wrote: > >>> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov > >>> <khlebnikov@yandex-team.ru> wrote: > >>>> > >>>> Stacked filesystems like overlayfs has no own writeback, but they have to > >>>> forward syncfs() requests to backend for keeping data integrity. > >>>> > >>>> During global sync() each overlayfs instance calls method ->sync_fs() > >>>> for backend although it itself is in global list of superblocks too. > >>>> As a result one syscall sync() could write one superblock several times > >>>> and send multiple disk barriers. > >>>> > >>>> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. > >>>> > >>>> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> > >>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > >>>> --- > >>> > >>> Seems reasonable. > >>> You may add: > >>> Reviewed-by: Amir Goldstein <amir73il@gmail.com> > >>> > >>> +CC: containers list > >> > >> Thanks > >> > >>> > >>> This bring up old memories. > >>> I posted this way back to fix handling of emergency_remount() in the > >>> presence of loop mounted fs: > >>> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/ > >>> > >>> But seems to me that emergency_sync() and sync(2) are equally broken > >>> for this use case. > >>> > >>> I wonder if anyone cares enough about resilience of loop mounted fs to try > >>> and change the iterate_* functions to iterate supers/bdevs in reverse order... > >> > >> Now I see reason behind "sync; sync; sync; reboot" =) > >> > >> Order old -> new allows to not miss new items if list modifies. > >> Might be important for some users. > >> > > > > That's not the reason I suggested reverse order. > > The reason is that with loop mounted fs, the correct order of flushing is: > > 1. sync loop mounted fs inodes => writes to loop image file > > 2. sync loop mounted fs sb => fsyncs the loop image file > > 3. sync the loop image host fs sb > > > > With forward sb iteration order, #3 happens before #1, so the > > loop mounted fs changes are not really being made durable by > > a single sync(2) call. > > If fs in loop mounted with barriers then sync_fs will issue > REQ_OP_FLUSH to loop device and trigger fsync() for image file. > Sync() might write something twice but data should be safe. > Without barriers this scenario is broken for sure. > > Emergency remount R/O is other thing. It really needs reverse order. > Correct. There is no problem with durability. Although for some filesystems it would be more efficient to first write and fsync the loop images and then sync_fs(). I can potentially result in less overall disk barriers. Thanks, Amir.
On Thu, Apr 09, 2020 at 11:29:47AM +0300, Konstantin Khlebnikov wrote: > Stacked filesystems like overlayfs has no own writeback, but they have to > forward syncfs() requests to backend for keeping data integrity. > > During global sync() each overlayfs instance calls method ->sync_fs() > for backend although it itself is in global list of superblocks too. > As a result one syscall sync() could write one superblock several times > and send multiple disk barriers. > > This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. Why wouldn't you just remove the ->sync_fs method from overlay? I mean, if you don't need the filesystem to do anything special for one specific data integrity sync_fs call, you don't need it for any of them, yes? -Dave.
On Sun, Apr 12, 2020 at 1:29 AM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Apr 09, 2020 at 11:29:47AM +0300, Konstantin Khlebnikov wrote: > > Stacked filesystems like overlayfs has no own writeback, but they have to > > forward syncfs() requests to backend for keeping data integrity. > > > > During global sync() each overlayfs instance calls method ->sync_fs() > > for backend although it itself is in global list of superblocks too. > > As a result one syscall sync() could write one superblock several times > > and send multiple disk barriers. > > > > This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. > > Why wouldn't you just remove the ->sync_fs method from overlay? > > I mean, if you don't need the filesystem to do anything special for > one specific data integrity sync_fs call, you don't need it for any > of them, yes? > No, but I understand the confusion. Say you have 1000 overlay sb's all of them using upper directories from a single xfs sb (quite common for containers). syncfs(2) of each overlay, must call sync_fs of xfs (see ovl_sync_fs) sync(2) will call xfs sync_fs anyway, so there is no point in calling ovl_sync_fs => xfs sync_fs 1000 more times. Thanks, Amir.
On Thu, Apr 9, 2020 at 10:29 AM Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote: > > Stacked filesystems like overlayfs has no own writeback, but they have to > forward syncfs() requests to backend for keeping data integrity. > > During global sync() each overlayfs instance calls method ->sync_fs() > for backend although it itself is in global list of superblocks too. > As a result one syscall sync() could write one superblock several times > and send multiple disk barriers. > > This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. > > Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> > Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Thanks, applied. Miklos
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index 732ad5495c92..59df13d16280 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -261,8 +261,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait) return 0; /* - * If this is a sync(2) call or an emergency sync, all the super blocks - * will be iterated, including upper_sb, so no need to do anything. + * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC). + * All the super blocks will be iterated, including upper_sb. * * If this is a syncfs(2) call, then we do need to call * sync_filesystem() on upper_sb, but enough if we do it when being @@ -1818,6 +1818,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) sb->s_xattr = ovl_xattr_handlers; sb->s_fs_info = ofs; sb->s_flags |= SB_POSIXACL; + sb->s_iflags |= SB_I_SKIP_SYNC; err = -ENOMEM; root_dentry = ovl_get_root(sb, upperpath.dentry, oe); diff --git a/fs/sync.c b/fs/sync.c index 4d1ff010bc5a..16c2630ee4bf 100644 --- a/fs/sync.c +++ b/fs/sync.c @@ -76,7 +76,8 @@ static void sync_inodes_one_sb(struct super_block *sb, void *arg) static void sync_fs_one_sb(struct super_block *sb, void *arg) { - if (!sb_rdonly(sb) && sb->s_op->sync_fs) + if (!sb_rdonly(sb) && !(sb->s_iflags & SB_I_SKIP_SYNC) && + sb->s_op->sync_fs) sb->s_op->sync_fs(sb, *(int *)arg); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 4f6f59b4f22a..f186a966a36c 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1409,6 +1409,8 @@ extern int send_sigurg(struct fown_struct *fown); #define SB_I_IMA_UNVERIFIABLE_SIGNATURE 0x00000020 #define SB_I_UNTRUSTED_MOUNTER 0x00000040 +#define SB_I_SKIP_SYNC 0x00000100 /* Skip superblock at global sync */ + /* Possible states of 'frozen' field */ enum { SB_UNFROZEN = 0, /* FS is unfrozen */
Stacked filesystems like overlayfs has no own writeback, but they have to forward syncfs() requests to backend for keeping data integrity. During global sync() each overlayfs instance calls method ->sync_fs() for backend although it itself is in global list of superblocks too. As a result one syscall sync() could write one superblock several times and send multiple disk barriers. This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> --- fs/overlayfs/super.c | 5 +++-- fs/sync.c | 3 ++- include/linux/fs.h | 2 ++ 3 files changed, 7 insertions(+), 3 deletions(-)