Message ID | 20230504105624.9789-1-idryomov@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: always respect QUEUE_FLAG_STABLE_WRITES on the block device | expand |
On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote: > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue > and a sb flag") introduced a regression for the raw block device use > case. Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has > the effect of respecting it only when there is a filesystem mounted on > top of the block device. If a filesystem is not mounted, block devices > that do integrity checking return sporadic checksum errors. With "If a file system is not mounted" you want to say "when accessing a block device directly" here, right? The two are not exclusive.. > Additionally, this commit made the corresponding sysfs knob writeable > for debugging purposes. However, because QUEUE_FLAG_STABLE_WRITES flag > is captured when the filesystem is mounted and isn't consulted after > that anywhere outside of swap code, changing it doesn't take immediate > effect even though dumping the knob shows the new value. With no way > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing. But very much intentional. s_bdev often is not the only device in a file system, and we should never reference if from core helpers. So I think we should go with something like this: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index db794399900734..aa36cc2a4530c1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable); */ void folio_wait_stable(struct folio *folio) { - if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES) + struct inode *inode = folio_inode(folio); + struct super_block *sb = inode->i_sb; + + if ((sb->s_iflags & SB_I_STABLE_WRITES) || + (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode)))) folio_wait_writeback(folio); } EXPORT_SYMBOL_GPL(folio_wait_stable);
On Thu, May 04, 2023 at 03:55:15PM +0200, Christoph Hellwig wrote: > On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote: > > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue > > and a sb flag") introduced a regression for the raw block device use > > case. Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has > > the effect of respecting it only when there is a filesystem mounted on > > top of the block device. If a filesystem is not mounted, block devices > > that do integrity checking return sporadic checksum errors. > > With "If a file system is not mounted" you want to say "when accessing > a block device directly" here, right? The two are not exclusive.. > > > Additionally, this commit made the corresponding sysfs knob writeable > > for debugging purposes. However, because QUEUE_FLAG_STABLE_WRITES flag > > is captured when the filesystem is mounted and isn't consulted after > > that anywhere outside of swap code, changing it doesn't take immediate > > effect even though dumping the knob shows the new value. With no way > > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing. > > But very much intentional. s_bdev often is not the only device > in a file system, and we should never reference if from core > helpers. > > So I think we should go with something like this: > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index db794399900734..aa36cc2a4530c1 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable); > */ > void folio_wait_stable(struct folio *folio) > { > - if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES) > + struct inode *inode = folio_inode(folio); > + struct super_block *sb = inode->i_sb; > + > + if ((sb->s_iflags & SB_I_STABLE_WRITES) || > + (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode)))) > folio_wait_writeback(folio); > } > EXPORT_SYMBOL_GPL(folio_wait_stable); I hate both of these patches ;-) What we should do is add AS_STABLE_WRITES, have the appropriate places call mapping_set_stable_writes() and then folio_wait_stable() becomes if (mapping_test_stable_writes(folio->mapping)) folio_wait_writeback(folio); and we remove all the dereferences (mapping->host->i_sb->s_iflags, plus whatever else is going on there)
On Thu, May 4, 2023 at 3:55 PM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote: > > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue > > and a sb flag") introduced a regression for the raw block device use > > case. Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has > > the effect of respecting it only when there is a filesystem mounted on > > top of the block device. If a filesystem is not mounted, block devices > > that do integrity checking return sporadic checksum errors. > > With "If a file system is not mounted" you want to say "when accessing > a block device directly" here, right? The two are not exclusive.. Hi Christoph, Right, I meant to say "when accessing a block device directly". > > > Additionally, this commit made the corresponding sysfs knob writeable > > for debugging purposes. However, because QUEUE_FLAG_STABLE_WRITES flag > > is captured when the filesystem is mounted and isn't consulted after > > that anywhere outside of swap code, changing it doesn't take immediate > > effect even though dumping the knob shows the new value. With no way > > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing. > > But very much intentional. s_bdev often is not the only device > in a file system, and we should never reference if from core > helpers. > > So I think we should go with something like this: > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index db794399900734..aa36cc2a4530c1 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable); > */ > void folio_wait_stable(struct folio *folio) > { > - if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES) > + struct inode *inode = folio_inode(folio); > + struct super_block *sb = inode->i_sb; > + > + if ((sb->s_iflags & SB_I_STABLE_WRITES) || > + (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode)))) > folio_wait_writeback(folio); Heh, this is almost exactly what I came up with initially (|| arms were swapped in that version), but decided to improve upon after noticing the writeable thing. Thanks, Ilya
On Thu, May 4, 2023 at 4:16 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, May 04, 2023 at 03:55:15PM +0200, Christoph Hellwig wrote: > > On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote: > > > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue > > > and a sb flag") introduced a regression for the raw block device use > > > case. Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has > > > the effect of respecting it only when there is a filesystem mounted on > > > top of the block device. If a filesystem is not mounted, block devices > > > that do integrity checking return sporadic checksum errors. > > > > With "If a file system is not mounted" you want to say "when accessing > > a block device directly" here, right? The two are not exclusive.. > > > > > Additionally, this commit made the corresponding sysfs knob writeable > > > for debugging purposes. However, because QUEUE_FLAG_STABLE_WRITES flag > > > is captured when the filesystem is mounted and isn't consulted after > > > that anywhere outside of swap code, changing it doesn't take immediate > > > effect even though dumping the knob shows the new value. With no way > > > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing. > > > > But very much intentional. s_bdev often is not the only device > > in a file system, and we should never reference if from core > > helpers. > > > > So I think we should go with something like this: > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > index db794399900734..aa36cc2a4530c1 100644 > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable); > > */ > > void folio_wait_stable(struct folio *folio) > > { > > - if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES) > > + struct inode *inode = folio_inode(folio); > > + struct super_block *sb = inode->i_sb; > > + > > + if ((sb->s_iflags & SB_I_STABLE_WRITES) || > > + (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode)))) > > folio_wait_writeback(folio); > > } > > EXPORT_SYMBOL_GPL(folio_wait_stable); > > I hate both of these patches ;-) What we should do is add > AS_STABLE_WRITES, have the appropriate places call > mapping_set_stable_writes() and then folio_wait_stable() becomes > > if (mapping_test_stable_writes(folio->mapping)) > folio_wait_writeback(folio); > > and we remove all the dereferences (mapping->host->i_sb->s_iflags, plus > whatever else is going on there) Hi Matthew, We would still need something resembling Christoph's suggestion for 5.10 and 5.15 (at least). Since this fixes a regression, would you support merging the "ugly" version to facilitate backports or would you rather see the AS/mapping-based refactor first? Thanks, Ilya
On Thu, May 04, 2023 at 05:07:10PM +0200, Ilya Dryomov wrote: > On Thu, May 4, 2023 at 4:16 PM Matthew Wilcox <willy@infradead.org> wrote: > > I hate both of these patches ;-) What we should do is add > > AS_STABLE_WRITES, have the appropriate places call > > mapping_set_stable_writes() and then folio_wait_stable() becomes > > > > if (mapping_test_stable_writes(folio->mapping)) > > folio_wait_writeback(folio); > > > > and we remove all the dereferences (mapping->host->i_sb->s_iflags, plus > > whatever else is going on there) > > Hi Matthew, > > We would still need something resembling Christoph's suggestion for > 5.10 and 5.15 (at least). Since this fixes a regression, would you > support merging the "ugly" version to facilitate backports or would > you rather see the AS/mapping-based refactor first? That's a terrible way of developing for Linux. First, do it the right way for mainline. Then, see how easy the patch is to backport to relevant kernel versions; if it's too ugly to cherry-pick, do something equivalent. But never start out with the premise "This must be backported, so do it as simply as possible first".
On Thu 04-05-23 15:16:39, Matthew Wilcox wrote: > On Thu, May 04, 2023 at 03:55:15PM +0200, Christoph Hellwig wrote: > > On Thu, May 04, 2023 at 12:56:24PM +0200, Ilya Dryomov wrote: > > > Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue > > > and a sb flag") introduced a regression for the raw block device use > > > case. Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has > > > the effect of respecting it only when there is a filesystem mounted on > > > top of the block device. If a filesystem is not mounted, block devices > > > that do integrity checking return sporadic checksum errors. > > > > With "If a file system is not mounted" you want to say "when accessing > > a block device directly" here, right? The two are not exclusive.. > > > > > Additionally, this commit made the corresponding sysfs knob writeable > > > for debugging purposes. However, because QUEUE_FLAG_STABLE_WRITES flag > > > is captured when the filesystem is mounted and isn't consulted after > > > that anywhere outside of swap code, changing it doesn't take immediate > > > effect even though dumping the knob shows the new value. With no way > > > to dump SB_I_STABLE_WRITES flag, this is needlessly confusing. > > > > But very much intentional. s_bdev often is not the only device > > in a file system, and we should never reference if from core > > helpers. > > > > So I think we should go with something like this: > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > index db794399900734..aa36cc2a4530c1 100644 > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -3129,7 +3129,11 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable); > > */ > > void folio_wait_stable(struct folio *folio) > > { > > - if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES) > > + struct inode *inode = folio_inode(folio); > > + struct super_block *sb = inode->i_sb; > > + > > + if ((sb->s_iflags & SB_I_STABLE_WRITES) || > > + (sb_is_blkdev_sb(sb) && bdev_stable_writes(I_BDEV(inode)))) > > folio_wait_writeback(folio); > > } > > EXPORT_SYMBOL_GPL(folio_wait_stable); > > I hate both of these patches ;-) What we should do is add > AS_STABLE_WRITES, have the appropriate places call > mapping_set_stable_writes() and then folio_wait_stable() becomes > > if (mapping_test_stable_writes(folio->mapping)) > folio_wait_writeback(folio); > > and we remove all the dereferences (mapping->host->i_sb->s_iflags, plus > whatever else is going on there) For bdev address_space that's easy but what Ilya also mentioned is a problem when 'stable_write' flag gets toggled on the device and in that case having to propagate the flag update to all the address_space structures is a nightmare... Honza
On Thu, May 04, 2023 at 05:55:56PM +0200, Jan Kara wrote: > For bdev address_space that's easy but what Ilya also mentioned is a > problem when 'stable_write' flag gets toggled on the device and in that > case having to propagate the flag update to all the address_space > structures is a nightmare... We have a number of flags which don't take effect when modified on a block device with a mounted filesystem on it. For example, modifying the readahead settings do not change existing files, only new ones. Since this flag is only modifiable for debugging purposes, I think I'm OK with it not affecting already-mounted filesystems. It feels like a decision that reasonable people could disagree on, though.
On Thu, May 04, 2023 at 05:16:48PM +0100, Matthew Wilcox wrote: > On Thu, May 04, 2023 at 05:55:56PM +0200, Jan Kara wrote: > > For bdev address_space that's easy but what Ilya also mentioned is a > > problem when 'stable_write' flag gets toggled on the device and in that > > case having to propagate the flag update to all the address_space > > structures is a nightmare... > > We have a number of flags which don't take effect when modified on a > block device with a mounted filesystem on it. For example, modifying > the readahead settings do not change existing files, only new ones. > Since this flag is only modifiable for debugging purposes, I think I'm > OK with it not affecting already-mounted filesystems. It feels like a > decision that reasonable people could disagree on, though. I think an address space flag makes sense, because then we don't even have to care about the special bdev sb/inode thing - folio->mapping will already point at the bdev mapping and so do the right thing. That is, if the bdev changes stable_write state, it can toggle the AS_STABLE_WRITE flag on it's inode->i_mapping straight away and all the folios and files pointing to the bdev mapping will change behaviour immediately. Everything else retains the same behaviour we have now - the stable_write state is persistent on the superblock until the filesystem mount is cycled. Cheers, Dave.
On Fri 05-05-23 09:07:36, Dave Chinner wrote: > On Thu, May 04, 2023 at 05:16:48PM +0100, Matthew Wilcox wrote: > > On Thu, May 04, 2023 at 05:55:56PM +0200, Jan Kara wrote: > > > For bdev address_space that's easy but what Ilya also mentioned is a > > > problem when 'stable_write' flag gets toggled on the device and in that > > > case having to propagate the flag update to all the address_space > > > structures is a nightmare... > > > > We have a number of flags which don't take effect when modified on a > > block device with a mounted filesystem on it. For example, modifying > > the readahead settings do not change existing files, only new ones. > > Since this flag is only modifiable for debugging purposes, I think I'm > > OK with it not affecting already-mounted filesystems. It feels like a > > decision that reasonable people could disagree on, though. > > I think an address space flag makes sense, because then we don't > even have to care about the special bdev sb/inode thing - > folio->mapping will already point at the bdev mapping and so do the > right thing. > > That is, if the bdev changes stable_write state, it can toggle the > AS_STABLE_WRITE flag on it's inode->i_mapping straight away and all > the folios and files pointing to the bdev mapping will change > behaviour immediately. Everything else retains the same behaviour > we have now - the stable_write state is persistent on the superblock > until the filesystem mount is cycled. Yeah, I'm fine with this behavior. I just wasn't sure whether Ilya didn't need the sysfs change to be visible in the filesystem so that was why I pointed that out. But apparently he doesn't need it. Honza
diff --git a/fs/super.c b/fs/super.c index 04bc62ab7dfe..6705b3506ae8 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1213,8 +1213,6 @@ static int set_bdev_super(struct super_block *s, void *data) s->s_dev = s->s_bdev->bd_dev; s->s_bdi = bdi_get(s->s_bdev->bd_disk->bdi); - if (bdev_stable_writes(s->s_bdev)) - s->s_iflags |= SB_I_STABLE_WRITES; return 0; } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 516b1aa247e8..469bc57add8c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -3169,7 +3169,17 @@ EXPORT_SYMBOL_GPL(folio_wait_writeback_killable); */ void folio_wait_stable(struct folio *folio) { - if (folio_inode(folio)->i_sb->s_iflags & SB_I_STABLE_WRITES) + struct inode *inode = folio_inode(folio); + struct super_block *sb = inode->i_sb; + bool stable_writes; + + if (sb_is_blkdev_sb(sb)) + stable_writes = bdev_stable_writes(I_BDEV(inode)); + else + stable_writes = bdev_stable_writes(sb->s_bdev) || + (sb->s_iflags & SB_I_STABLE_WRITES); + + if (stable_writes) folio_wait_writeback(folio); } EXPORT_SYMBOL_GPL(folio_wait_stable);
Commit 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag") introduced a regression for the raw block device use case. Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has the effect of respecting it only when there is a filesystem mounted on top of the block device. If a filesystem is not mounted, block devices that do integrity checking return sporadic checksum errors. Additionally, this commit made the corresponding sysfs knob writeable for debugging purposes. However, because QUEUE_FLAG_STABLE_WRITES flag is captured when the filesystem is mounted and isn't consulted after that anywhere outside of swap code, changing it doesn't take immediate effect even though dumping the knob shows the new value. With no way to dump SB_I_STABLE_WRITES flag, this is needlessly confusing. Resurrect the original stable writes behavior by changing folio_wait_stable() to account for the case of a raw block device and also: - for the case of a filesystem, test QUEUE_FLAG_STABLE_WRITES flag each time instead of capturing it in the superblock so that changes are reflected immediately (thus aligning with the case of a raw block device) - retain SB_I_STABLE_WRITES flag for filesystems that need stable writes independent of the underlying block device (currently just NFS) Cc: stable@vger.kernel.org Fixes: 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> --- fs/super.c | 2 -- mm/page-writeback.c | 12 +++++++++++- 2 files changed, 11 insertions(+), 3 deletions(-)