diff mbox series

Revert "block: Do not discard buffers under a mounted filesystem"

Message ID 20210216133849.8244-1-jack@suse.cz (mailing list archive)
State New, archived
Headers show
Series Revert "block: Do not discard buffers under a mounted filesystem" | expand

Commit Message

Jan Kara Feb. 16, 2021, 1:38 p.m. UTC
Apparently there are several userspace programs that depend on being
able to call BLKDISCARD ioctl without the ability to grab bdev
exclusively - namely FUSE filesystems have the device open without
O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
discards ranges released from LV but that PV may be already open
exclusively by someone else (see bugzilla link below for more details).

This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=211167
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/ioctl.c          | 16 ++++++----------
 fs/block_dev.c         | 33 ++++-----------------------------
 include/linux/blkdev.h |  7 -------
 3 files changed, 10 insertions(+), 46 deletions(-)

Comments

Christoph Hellwig Feb. 16, 2021, 4:36 p.m. UTC | #1
On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
> Apparently there are several userspace programs that depend on being
> able to call BLKDISCARD ioctl without the ability to grab bdev
> exclusively - namely FUSE filesystems have the device open without
> O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
> fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
> discards ranges released from LV but that PV may be already open
> exclusively by someone else (see bugzilla link below for more details).
> 
> This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.

I think that is a bad idea. We fixed the problem for a reason.
I think the right fix is to just do nothing if the device hasn't been
opened with O_EXCL and can't be reopened with it, just don't do anything
but also don't return an error.  After all discard and thus
BLKDISCARD is purely advisory.
Jan Kara Feb. 16, 2021, 5:16 p.m. UTC | #2
On Tue 16-02-21 16:36:06, Christoph Hellwig wrote:
> On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
> > Apparently there are several userspace programs that depend on being
> > able to call BLKDISCARD ioctl without the ability to grab bdev
> > exclusively - namely FUSE filesystems have the device open without
> > O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
> > fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
> > discards ranges released from LV but that PV may be already open
> > exclusively by someone else (see bugzilla link below for more details).
> > 
> > This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
> 
> I think that is a bad idea. We fixed the problem for a reason.
> I think the right fix is to just do nothing if the device hasn't been
> opened with O_EXCL and can't be reopened with it, just don't do anything
> but also don't return an error.  After all discard and thus
> BLKDISCARD is purely advisory.

Yeah, certainly we'd have to fix the original problem in some other way.
Just silently ignoring BLKDISCARD if we cannot claim the device exclusively
is certainly an option to stop complaints from userspace. But note that
fstrim with fuse-based filesystem would still stay silent NOP which is
suboptimal. It could be fixed on FUSE side as I talked to Miklos but it
is not trivial. Similarly for the LVM regression...

I was wondering whether we could do something like:
	use truncate_inode_pages() if we can claim bdev exclusively
	use invalidate_inode_pages2_range() if we cannot claim bdev
          exclusively, possibly do nothing if that returns EBUSY?

The downside is that cases where we cannot claim bdev exclusively would
unnecessarily write dirty buffer cache before discard.

								Honza
Keith Busch Feb. 16, 2021, 5:49 p.m. UTC | #3
On Tue, Feb 16, 2021 at 04:36:06PM +0000, Christoph Hellwig wrote:
> On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
> > Apparently there are several userspace programs that depend on being
> > able to call BLKDISCARD ioctl without the ability to grab bdev
> > exclusively - namely FUSE filesystems have the device open without
> > O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
> > fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
> > discards ranges released from LV but that PV may be already open
> > exclusively by someone else (see bugzilla link below for more details).
> > 
> > This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
> 
> I think that is a bad idea. We fixed the problem for a reason.
> I think the right fix is to just do nothing if the device hasn't been
> opened with O_EXCL and can't be reopened with it, just don't do anything
> but also don't return an error.  After all discard and thus
> BLKDISCARD is purely advisory.

A discard is advisory, but BLKZEROOUT is not, so something different
should happen there. We were also planning to send a patch using this
same pattern for Zone Reset to fix stale page cache issues after the
reset, but we'll wait to see how this settles before sending that.
Damien Le Moal Feb. 16, 2021, 11:05 p.m. UTC | #4
On 2021/02/17 2:51, Keith Busch wrote:
> On Tue, Feb 16, 2021 at 04:36:06PM +0000, Christoph Hellwig wrote:
>> On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
>>> Apparently there are several userspace programs that depend on being
>>> able to call BLKDISCARD ioctl without the ability to grab bdev
>>> exclusively - namely FUSE filesystems have the device open without
>>> O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
>>> fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
>>> discards ranges released from LV but that PV may be already open
>>> exclusively by someone else (see bugzilla link below for more details).
>>>
>>> This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
>>
>> I think that is a bad idea. We fixed the problem for a reason.
>> I think the right fix is to just do nothing if the device hasn't been
>> opened with O_EXCL and can't be reopened with it, just don't do anything
>> but also don't return an error.  After all discard and thus
>> BLKDISCARD is purely advisory.
> 
> A discard is advisory, but BLKZEROOUT is not, so something different
> should happen there. We were also planning to send a patch using this
> same pattern for Zone Reset to fix stale page cache issues after the
> reset, but we'll wait to see how this settles before sending that.

There is also another problem: the truncate_bdev & operation following it
(discard, zeroout or zone reset) are not atomic vs read/write operations to the
bdev. Without mutual exclusion, that page invalidation is best effort only since
reads can snick in between the truncate and discard (or zeroout or zone reset).
With our zone reset stale page problem case, it is reads from udevd that we see
snicking in between the truncate bdev and zone reset and so we still get stale
pages after the zone reset is finished. No solution to propose for solving that,
yet...
Jan Kara Feb. 18, 2021, 11:17 a.m. UTC | #5
On Tue 16-02-21 18:16:09, Jan Kara wrote:
> On Tue 16-02-21 16:36:06, Christoph Hellwig wrote:
> > On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
> > > Apparently there are several userspace programs that depend on being
> > > able to call BLKDISCARD ioctl without the ability to grab bdev
> > > exclusively - namely FUSE filesystems have the device open without
> > > O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
> > > fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
> > > discards ranges released from LV but that PV may be already open
> > > exclusively by someone else (see bugzilla link below for more details).
> > > 
> > > This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
> > 
> > I think that is a bad idea. We fixed the problem for a reason.
> > I think the right fix is to just do nothing if the device hasn't been
> > opened with O_EXCL and can't be reopened with it, just don't do anything
> > but also don't return an error.  After all discard and thus
> > BLKDISCARD is purely advisory.
> 
> Yeah, certainly we'd have to fix the original problem in some other way.
> Just silently ignoring BLKDISCARD if we cannot claim the device exclusively
> is certainly an option to stop complaints from userspace. But note that
> fstrim with fuse-based filesystem would still stay silent NOP which is
> suboptimal. It could be fixed on FUSE side as I talked to Miklos but it
> is not trivial. Similarly for the LVM regression...
> 
> I was wondering whether we could do something like:
> 	use truncate_inode_pages() if we can claim bdev exclusively
> 	use invalidate_inode_pages2_range() if we cannot claim bdev
>           exclusively, possibly do nothing if that returns EBUSY?
> 
> The downside is that cases where we cannot claim bdev exclusively would
> unnecessarily write dirty buffer cache before discard.

OK, no more comments I guess so I'll post this in a form of a patch and
we'll see what people think.

								Honza
Jan Kara Feb. 18, 2021, 2:07 p.m. UTC | #6
On Tue 16-02-21 23:05:57, Damien Le Moal wrote:
> On 2021/02/17 2:51, Keith Busch wrote:
> > On Tue, Feb 16, 2021 at 04:36:06PM +0000, Christoph Hellwig wrote:
> >> On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
> >>> Apparently there are several userspace programs that depend on being
> >>> able to call BLKDISCARD ioctl without the ability to grab bdev
> >>> exclusively - namely FUSE filesystems have the device open without
> >>> O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
> >>> fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
> >>> discards ranges released from LV but that PV may be already open
> >>> exclusively by someone else (see bugzilla link below for more details).
> >>>
> >>> This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
> >>
> >> I think that is a bad idea. We fixed the problem for a reason.
> >> I think the right fix is to just do nothing if the device hasn't been
> >> opened with O_EXCL and can't be reopened with it, just don't do anything
> >> but also don't return an error.  After all discard and thus
> >> BLKDISCARD is purely advisory.
> > 
> > A discard is advisory, but BLKZEROOUT is not, so something different
> > should happen there. We were also planning to send a patch using this
> > same pattern for Zone Reset to fix stale page cache issues after the
> > reset, but we'll wait to see how this settles before sending that.
> 
> There is also another problem: the truncate_bdev & operation following it
> (discard, zeroout or zone reset) are not atomic vs read/write operations to the
> bdev. Without mutual exclusion, that page invalidation is best effort only since
> reads can snick in between the truncate and discard (or zeroout or zone reset).
> With our zone reset stale page problem case, it is reads from udevd that we see
> snicking in between the truncate bdev and zone reset and so we still get stale
> pages after the zone reset is finished. No solution to propose for solving that,
> yet...

Well, at least blkdev_fallocate() does:

	truncate_bdev_range();
	blkdev_issue_zeroout();
	invalidate_inode_pages2_range();

so racing reads should not result in stale page cache contents AFAICT.

									Honza
Damien Le Moal Feb. 18, 2021, 10:35 p.m. UTC | #7
On 2021/02/18 23:07, Jan Kara wrote:
> On Tue 16-02-21 23:05:57, Damien Le Moal wrote:
>> On 2021/02/17 2:51, Keith Busch wrote:
>>> On Tue, Feb 16, 2021 at 04:36:06PM +0000, Christoph Hellwig wrote:
>>>> On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
>>>>> Apparently there are several userspace programs that depend on being
>>>>> able to call BLKDISCARD ioctl without the ability to grab bdev
>>>>> exclusively - namely FUSE filesystems have the device open without
>>>>> O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
>>>>> fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
>>>>> discards ranges released from LV but that PV may be already open
>>>>> exclusively by someone else (see bugzilla link below for more details).
>>>>>
>>>>> This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
>>>>
>>>> I think that is a bad idea. We fixed the problem for a reason.
>>>> I think the right fix is to just do nothing if the device hasn't been
>>>> opened with O_EXCL and can't be reopened with it, just don't do anything
>>>> but also don't return an error.  After all discard and thus
>>>> BLKDISCARD is purely advisory.
>>>
>>> A discard is advisory, but BLKZEROOUT is not, so something different
>>> should happen there. We were also planning to send a patch using this
>>> same pattern for Zone Reset to fix stale page cache issues after the
>>> reset, but we'll wait to see how this settles before sending that.
>>
>> There is also another problem: the truncate_bdev & operation following it
>> (discard, zeroout or zone reset) are not atomic vs read/write operations to the
>> bdev. Without mutual exclusion, that page invalidation is best effort only since
>> reads can snick in between the truncate and discard (or zeroout or zone reset).
>> With our zone reset stale page problem case, it is reads from udevd that we see
>> snicking in between the truncate bdev and zone reset and so we still get stale
>> pages after the zone reset is finished. No solution to propose for solving that,
>> yet...
> 
> Well, at least blkdev_fallocate() does:
> 
> 	truncate_bdev_range();
> 	blkdev_issue_zeroout();
> 	invalidate_inode_pages2_range();
> 
> so racing reads should not result in stale page cache contents AFAICT.

Yes, but concurrent writes can then get in between the blkdev_issue_zeroout()
and invalidate_inode_pages2_range() and data discarded before hitting the
drive... Not very nice either. Granted, that would mean that userland has 2
concurrent writers that are not synchronized. So weird results are to be
expected. I guess it is probably safe to ignore that case ?

I guess the same pattern as above for zeroout would work for reset zone too.
Will try to see if that solves our test problem.
Jan Kara Feb. 19, 2021, 10:27 a.m. UTC | #8
On Thu 18-02-21 22:35:41, Damien Le Moal wrote:
> On 2021/02/18 23:07, Jan Kara wrote:
> > On Tue 16-02-21 23:05:57, Damien Le Moal wrote:
> >> On 2021/02/17 2:51, Keith Busch wrote:
> >>> On Tue, Feb 16, 2021 at 04:36:06PM +0000, Christoph Hellwig wrote:
> >>>> On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
> >>>>> Apparently there are several userspace programs that depend on being
> >>>>> able to call BLKDISCARD ioctl without the ability to grab bdev
> >>>>> exclusively - namely FUSE filesystems have the device open without
> >>>>> O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
> >>>>> fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
> >>>>> discards ranges released from LV but that PV may be already open
> >>>>> exclusively by someone else (see bugzilla link below for more details).
> >>>>>
> >>>>> This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
> >>>>
> >>>> I think that is a bad idea. We fixed the problem for a reason.
> >>>> I think the right fix is to just do nothing if the device hasn't been
> >>>> opened with O_EXCL and can't be reopened with it, just don't do anything
> >>>> but also don't return an error.  After all discard and thus
> >>>> BLKDISCARD is purely advisory.
> >>>
> >>> A discard is advisory, but BLKZEROOUT is not, so something different
> >>> should happen there. We were also planning to send a patch using this
> >>> same pattern for Zone Reset to fix stale page cache issues after the
> >>> reset, but we'll wait to see how this settles before sending that.
> >>
> >> There is also another problem: the truncate_bdev & operation following it
> >> (discard, zeroout or zone reset) are not atomic vs read/write operations to the
> >> bdev. Without mutual exclusion, that page invalidation is best effort only since
> >> reads can snick in between the truncate and discard (or zeroout or zone reset).
> >> With our zone reset stale page problem case, it is reads from udevd that we see
> >> snicking in between the truncate bdev and zone reset and so we still get stale
> >> pages after the zone reset is finished. No solution to propose for solving that,
> >> yet...
> > 
> > Well, at least blkdev_fallocate() does:
> > 
> > 	truncate_bdev_range();
> > 	blkdev_issue_zeroout();
> > 	invalidate_inode_pages2_range();
> > 
> > so racing reads should not result in stale page cache contents AFAICT.
> 
> Yes, but concurrent writes can then get in between the blkdev_issue_zeroout()
> and invalidate_inode_pages2_range() and data discarded before hitting the
> drive... Not very nice either. Granted, that would mean that userland has 2
> concurrent writers that are not synchronized. So weird results are to be
> expected. I guess it is probably safe to ignore that case ?

Yes. IMHO any result that doesn't crash the kernel (or burn the HW) is fine
in that case.

> I guess the same pattern as above for zeroout would work for reset zone too.
> Will try to see if that solves our test problem.

								Honza
diff mbox series

Patch

diff --git a/block/ioctl.c b/block/ioctl.c
index d61d652078f4..d523f134f7db 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -103,7 +103,8 @@  static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 	uint64_t range[2];
 	uint64_t start, len;
 	struct request_queue *q = bdev_get_queue(bdev);
-	int err;
+	struct address_space *mapping = bdev->bd_inode->i_mapping;
+
 
 	if (!(mode & FMODE_WRITE))
 		return -EBADF;
@@ -124,11 +125,7 @@  static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 
 	if (start + len > i_size_read(bdev->bd_inode))
 		return -EINVAL;
-
-	err = truncate_bdev_range(bdev, mode, start, start + len - 1);
-	if (err)
-		return err;
-
+	truncate_inode_pages_range(mapping, start, start + len - 1);
 	return blkdev_issue_discard(bdev, start >> 9, len >> 9,
 				    GFP_KERNEL, flags);
 }
@@ -137,8 +134,8 @@  static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 		unsigned long arg)
 {
 	uint64_t range[2];
+	struct address_space *mapping;
 	uint64_t start, end, len;
-	int err;
 
 	if (!(mode & FMODE_WRITE))
 		return -EBADF;
@@ -160,9 +157,8 @@  static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 		return -EINVAL;
 
 	/* Invalidate the page cache, including dirty pages */
-	err = truncate_bdev_range(bdev, mode, start, end);
-	if (err)
-		return err;
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
 
 	return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
 			BLKDEV_ZERO_NOUNMAP);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3b8963e228a1..bfd97eb4b5f5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -103,31 +103,6 @@  void invalidate_bdev(struct block_device *bdev)
 }
 EXPORT_SYMBOL(invalidate_bdev);
 
-/*
- * Drop all buffers & page cache for given bdev range. This function bails
- * with error if bdev has other exclusive owner (such as filesystem).
- */
-int truncate_bdev_range(struct block_device *bdev, fmode_t mode,
-			loff_t lstart, loff_t lend)
-{
-	/*
-	 * If we don't hold exclusive handle for the device, upgrade to it
-	 * while we discard the buffer cache to avoid discarding buffers
-	 * under live filesystem.
-	 */
-	if (!(mode & FMODE_EXCL)) {
-		int err = bd_prepare_to_claim(bdev, truncate_bdev_range);
-		if (err)
-			return err;
-	}
-
-	truncate_inode_pages_range(bdev->bd_inode->i_mapping, lstart, lend);
-	if (!(mode & FMODE_EXCL))
-		bd_abort_claiming(bdev, truncate_bdev_range);
-	return 0;
-}
-EXPORT_SYMBOL(truncate_bdev_range);
-
 static void set_init_blocksize(struct block_device *bdev)
 {
 	bdev->bd_inode->i_blkbits = blksize_bits(bdev_logical_block_size(bdev));
@@ -1748,6 +1723,7 @@  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
 {
 	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
+	struct address_space *mapping;
 	loff_t end = start + len - 1;
 	loff_t isize;
 	int error;
@@ -1775,9 +1751,8 @@  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		return -EINVAL;
 
 	/* Invalidate the page cache, including dirty pages. */
-	error = truncate_bdev_range(bdev, file->f_mode, start, end);
-	if (error)
-		return error;
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
 
 	switch (mode) {
 	case FALLOC_FL_ZERO_RANGE:
@@ -1804,7 +1779,7 @@  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	 * the caller will be given -EBUSY.  The third argument is
 	 * inclusive, so the rounding here is safe.
 	 */
-	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
+	return invalidate_inode_pages2_range(mapping,
 					     start >> PAGE_SHIFT,
 					     end >> PAGE_SHIFT);
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94ee3089e01..dcce654e8201 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -2015,18 +2015,11 @@  void bdput(struct block_device *);
 
 #ifdef CONFIG_BLOCK
 void invalidate_bdev(struct block_device *bdev);
-int truncate_bdev_range(struct block_device *bdev, fmode_t mode, loff_t lstart,
-			loff_t lend);
 int sync_blockdev(struct block_device *bdev);
 #else
 static inline void invalidate_bdev(struct block_device *bdev)
 {
 }
-static inline int truncate_bdev_range(struct block_device *bdev, fmode_t mode,
-				      loff_t lstart, loff_t lend)
-{
-	return 0;
-}
 static inline int sync_blockdev(struct block_device *bdev)
 {
 	return 0;