diff mbox series

[1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly

Message ID 20240323-zielbereich-mittragen-6fdf14876c3e@brauner (mailing list archive)
State New, archived
Headers show
Series [1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly | expand

Commit Message

Christian Brauner March 23, 2024, 4:11 p.m. UTC
Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
default this option is set. When it is set the long-standing behavior
of being able to write to mounted block devices is enabled.

But in order to guard against unintended corruption by writing to the
block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
off. In that case it isn't possible to write to mounted block devices
anymore.

A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
which disallows concurrent BLK_OPEN_WRITE access. When we still had the
bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
the mode was passed around. Since we managed to get rid of the bdev
handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
on whether the file was opened writable and writes to that block device
are blocked. That logic doesn't work because we do allow
BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.

So fix the detection logic. Use O_EXCL as an indicator that
BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
for pidfds where O_EXCL means that this is a pidfd that refers to a
thread. For userspace open paths O_EXCL will never be retained but for
internal opens where we open files that are never installed into a file
descriptor table this is fine.

Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
directly be raised by userspace. It is implicitly raised during
mounting.

Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
unset.

Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
Reported-by: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 block/bdev.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

Comments

Yu Kuai March 25, 2024, 11:51 a.m. UTC | #1
Hi,

在 2024/03/24 0:11, Christian Brauner 写道:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> the mode was passed around. Since we managed to get rid of the bdev
> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> on whether the file was opened writable and writes to that block device
> are blocked. That logic doesn't work because we do allow
> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.

I don't get it here, looks like there are no such use case. All users
passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.

Is the following root cause here?

1) t1 open with BLK_OPEN_WRITE
2) t2 open with BLK_OPEN_RESTRICT_WRITES, with bdev_block_writes(), yes
we don't wait for t1 to close;
3) t1 close, after the commit, bdev_unblock_writes() is called
unexpected.

Following openers will succeed although t2 doesn't close;
> 
> So fix the detection logic. Use O_EXCL as an indicator that
> BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> for pidfds where O_EXCL means that this is a pidfd that refers to a
> thread. For userspace open paths O_EXCL will never be retained but for
> internal opens where we open files that are never installed into a file
> descriptor table this is fine.

 From the path blkdev_open(), the file is from devtmpfs, and user can
pass in O_EXCL for that file, and that file will be used later in
blkdev_release() -> bdev_release() -> bdev_yield_write_access().

Thanks,
Kuai

> 
> Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> directly be raised by userspace. It is implicitly raised during
> mounting.
> 
> Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> unset.
> 
> Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> Reported-by: Matthew Wilcox <willy@infradead.org>
> Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>   block/bdev.c | 20 +++++++++++++-------
>   1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 7a5f611c3d2e..f819f3086905 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,13 +821,12 @@ static void bdev_yield_write_access(struct file *bdev_file)
>   		return;
>   
>   	bdev = file_bdev(bdev_file);
> -	/* Yield exclusive or shared write access. */
> -	if (bdev_file->f_mode & FMODE_WRITE) {
> -		if (bdev_writes_blocked(bdev))
> -			bdev_unblock_writes(bdev);
> -		else
> -			bdev->bd_writers--;
> -	}
> +
> +	/* O_EXCL is only set for internal BLK_OPEN_RESTRICT_WRITES. */
> +	if (bdev_file->f_flags & O_EXCL)
> +		bdev_unblock_writes(bdev);
> +	else if (bdev_file->f_mode & FMODE_WRITE)
> +		bdev->bd_writers--;
>   }
>   
>   /**
> @@ -946,6 +945,13 @@ static unsigned blk_to_file_flags(blk_mode_t mode)
>   	else
>   		WARN_ON_ONCE(true);
>   
> +	/*
> +	 * BLK_OPEN_RESTRICT_WRITES is never set from userspace and
> +	 * O_EXCL is stripped from userspace.
> +	 */
> +	if (mode & BLK_OPEN_RESTRICT_WRITES)
> +		flags |= O_EXCL;
> +
>   	if (mode & BLK_OPEN_NDELAY)
>   		flags |= O_NDELAY;
>   
>
Christian Brauner March 25, 2024, 12:04 p.m. UTC | #2
On Mon, Mar 25, 2024 at 07:51:27PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2024/03/24 0:11, Christian Brauner 写道:
> > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > default this option is set. When it is set the long-standing behavior
> > of being able to write to mounted block devices is enabled.
> > 
> > But in order to guard against unintended corruption by writing to the
> > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > off. In that case it isn't possible to write to mounted block devices
> > anymore.
> > 
> > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > the mode was passed around. Since we managed to get rid of the bdev
> > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > on whether the file was opened writable and writes to that block device
> > are blocked. That logic doesn't work because we do allow
> > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> I don't get it here, looks like there are no such use case. All users
> passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.

sb_open_mode() does

#define sb_open_mode(flags) \
        (BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \
         (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))
Yu Kuai March 25, 2024, 1:52 p.m. UTC | #3
Hi,

在 2024/03/25 20:04, Christian Brauner 写道:
> On Mon, Mar 25, 2024 at 07:51:27PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/03/24 0:11, Christian Brauner 写道:
>>> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
>>> default this option is set. When it is set the long-standing behavior
>>> of being able to write to mounted block devices is enabled.
>>>
>>> But in order to guard against unintended corruption by writing to the
>>> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
>>> off. In that case it isn't possible to write to mounted block devices
>>> anymore.
>>>
>>> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
>>> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
>>> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
>>> the mode was passed around. Since we managed to get rid of the bdev
>>> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
>>> on whether the file was opened writable and writes to that block device
>>> are blocked. That logic doesn't work because we do allow
>>> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
>>
>> I don't get it here, looks like there are no such use case. All users
>> passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.
> 
> sb_open_mode() does
> 
> #define sb_open_mode(flags) \
>          (BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \
>           (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))

Yes, you're right, thanks for the notice. And the problem that I
described should also exist.

BTW, do you agree that using O_EXCL is not correct here?

Thanks,
Kuai
> .
>
Christian Brauner March 25, 2024, 1:54 p.m. UTC | #4
On Mon, Mar 25, 2024 at 07:51:27PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2024/03/24 0:11, Christian Brauner 写道:
> > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > default this option is set. When it is set the long-standing behavior
> > of being able to write to mounted block devices is enabled.
> > 
> > But in order to guard against unintended corruption by writing to the
> > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > off. In that case it isn't possible to write to mounted block devices
> > anymore.
> > 
> > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > the mode was passed around. Since we managed to get rid of the bdev
> > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > on whether the file was opened writable and writes to that block device
> > are blocked. That logic doesn't work because we do allow
> > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> I don't get it here, looks like there are no such use case. All users
> passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.
> 
> Is the following root cause here?
> 
> 1) t1 open with BLK_OPEN_WRITE
> 2) t2 open with BLK_OPEN_RESTRICT_WRITES, with bdev_block_writes(), yes
> we don't wait for t1 to close;
> 3) t1 close, after the commit, bdev_unblock_writes() is called
> unexpected.
> 
> Following openers will succeed although t2 doesn't close;
> > 
> > So fix the detection logic. Use O_EXCL as an indicator that
> > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > thread. For userspace open paths O_EXCL will never be retained but for
> > internal opens where we open files that are never installed into a file
> > descriptor table this is fine.
> 
> From the path blkdev_open(), the file is from devtmpfs, and user can
> pass in O_EXCL for that file, and that file will be used later in
> blkdev_release() -> bdev_release() -> bdev_yield_write_access().

It can't because the VFS strips O_EXCL after the file has been opened.
Only internal opens can retain this flag. See do_dentry_open(). Or do
you mean something else?
Yu Kuai March 26, 2024, 1:32 a.m. UTC | #5
Hi,

在 2024/03/25 21:54, Christian Brauner 写道:
> On Mon, Mar 25, 2024 at 07:51:27PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/03/24 0:11, Christian Brauner 写道:
>>> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
>>> default this option is set. When it is set the long-standing behavior
>>> of being able to write to mounted block devices is enabled.
>>>
>>> But in order to guard against unintended corruption by writing to the
>>> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
>>> off. In that case it isn't possible to write to mounted block devices
>>> anymore.
>>>
>>> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
>>> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
>>> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
>>> the mode was passed around. Since we managed to get rid of the bdev
>>> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
>>> on whether the file was opened writable and writes to that block device
>>> are blocked. That logic doesn't work because we do allow
>>> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
>>
>> I don't get it here, looks like there are no such use case. All users
>> passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.
>>
>> Is the following root cause here?
>>
>> 1) t1 open with BLK_OPEN_WRITE
>> 2) t2 open with BLK_OPEN_RESTRICT_WRITES, with bdev_block_writes(), yes
>> we don't wait for t1 to close;
>> 3) t1 close, after the commit, bdev_unblock_writes() is called
>> unexpected.
>>
>> Following openers will succeed although t2 doesn't close;
>>>
>>> So fix the detection logic. Use O_EXCL as an indicator that
>>> BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
>>> for pidfds where O_EXCL means that this is a pidfd that refers to a
>>> thread. For userspace open paths O_EXCL will never be retained but for
>>> internal opens where we open files that are never installed into a file
>>> descriptor table this is fine.
>>
>>  From the path blkdev_open(), the file is from devtmpfs, and user can
>> pass in O_EXCL for that file, and that file will be used later in
>> blkdev_release() -> bdev_release() -> bdev_yield_write_access().
> 
> It can't because the VFS strips O_EXCL after the file has been opened.
> Only internal opens can retain this flag. See do_dentry_open(). Or do
> you mean something else?

Yes, I see that now, thanks for the explanation and forgive me that I'm
not that familiar with vfs code. :(

Now I think the patch can actually fix the problem, blkdev_open() and
blkdev_release() is not affected, and O_EXCL is not used from
bdev_file_open_by_dev() before. This is not straightforward, however I
can't find a better solution myself, so feel free to add:

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
> 
>
Jan Kara March 26, 2024, 12:57 p.m. UTC | #6
On Sat 23-03-24 17:11:19, Christian Brauner wrote:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> the mode was passed around. Since we managed to get rid of the bdev
> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> on whether the file was opened writable and writes to that block device
> are blocked. That logic doesn't work because we do allow
> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> So fix the detection logic. Use O_EXCL as an indicator that
> BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> for pidfds where O_EXCL means that this is a pidfd that refers to a
> thread. For userspace open paths O_EXCL will never be retained but for
> internal opens where we open files that are never installed into a file
> descriptor table this is fine.
> 
> Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> directly be raised by userspace. It is implicitly raised during
> mounting.
> 
> Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> unset.
> 
> Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> Reported-by: Matthew Wilcox <willy@infradead.org>
> Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>

The fix looks correct but admittedly it looks a bit hacky. I'd prefer
storing the needed information in some other flag, preferably one that does
not already have a special meaning with block devices. But FMODE_ space is
exhausted and don't see another easy solution. So I guess:

Reviewed-by: Jan Kara <jack@suse.cz>

Thanks for looking into this!

								Honza

> ---
>  block/bdev.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 7a5f611c3d2e..f819f3086905 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,13 +821,12 @@ static void bdev_yield_write_access(struct file *bdev_file)
>  		return;
>  
>  	bdev = file_bdev(bdev_file);
> -	/* Yield exclusive or shared write access. */
> -	if (bdev_file->f_mode & FMODE_WRITE) {
> -		if (bdev_writes_blocked(bdev))
> -			bdev_unblock_writes(bdev);
> -		else
> -			bdev->bd_writers--;
> -	}
> +
> +	/* O_EXCL is only set for internal BLK_OPEN_RESTRICT_WRITES. */
> +	if (bdev_file->f_flags & O_EXCL)
> +		bdev_unblock_writes(bdev);
> +	else if (bdev_file->f_mode & FMODE_WRITE)
> +		bdev->bd_writers--;
>  }
>  
>  /**
> @@ -946,6 +945,13 @@ static unsigned blk_to_file_flags(blk_mode_t mode)
>  	else
>  		WARN_ON_ONCE(true);
>  
> +	/*
> +	 * BLK_OPEN_RESTRICT_WRITES is never set from userspace and
> +	 * O_EXCL is stripped from userspace.
> +	 */
> +	if (mode & BLK_OPEN_RESTRICT_WRITES)
> +		flags |= O_EXCL;
> +
>  	if (mode & BLK_OPEN_NDELAY)
>  		flags |= O_NDELAY;
>  
> -- 
> 2.43.0
>
Christian Brauner March 26, 2024, 1:17 p.m. UTC | #7
On Tue, Mar 26, 2024 at 01:57:15PM +0100, Jan Kara wrote:
> On Sat 23-03-24 17:11:19, Christian Brauner wrote:
> > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > default this option is set. When it is set the long-standing behavior
> > of being able to write to mounted block devices is enabled.
> > 
> > But in order to guard against unintended corruption by writing to the
> > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > off. In that case it isn't possible to write to mounted block devices
> > anymore.
> > 
> > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > the mode was passed around. Since we managed to get rid of the bdev
> > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > on whether the file was opened writable and writes to that block device
> > are blocked. That logic doesn't work because we do allow
> > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > 
> > So fix the detection logic. Use O_EXCL as an indicator that
> > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > thread. For userspace open paths O_EXCL will never be retained but for
> > internal opens where we open files that are never installed into a file
> > descriptor table this is fine.
> > 
> > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > directly be raised by userspace. It is implicitly raised during
> > mounting.
> > 
> > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > unset.
> > 
> > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > Reported-by: Matthew Wilcox <willy@infradead.org>
> > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> The fix looks correct but admittedly it looks a bit hacky. I'd prefer
> storing the needed information in some other flag, preferably one that does
> not already have a special meaning with block devices. But FMODE_ space is
> exhausted and don't see another easy solution. So I guess:

Admittedly, it's not pretty but we're abusing O_EXCL already for block
devices. We do have FMODE_* available. We could add
FMODE_WRITE_RESTRICTED because we have two bits left.

One other solution apart from FMODE_* I had come up with was even
nastier which would've been using the struct fd approach of using the
two available bits in the pointer. But that doesn't work because we have
stuff like dm that passes in strings which are byte aligned. And it's
arguably uglier.
Jan Kara March 26, 2024, 1:31 p.m. UTC | #8
On Tue 26-03-24 14:17:20, Christian Brauner wrote:
> On Tue, Mar 26, 2024 at 01:57:15PM +0100, Jan Kara wrote:
> > On Sat 23-03-24 17:11:19, Christian Brauner wrote:
> > > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > > default this option is set. When it is set the long-standing behavior
> > > of being able to write to mounted block devices is enabled.
> > > 
> > > But in order to guard against unintended corruption by writing to the
> > > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > > off. In that case it isn't possible to write to mounted block devices
> > > anymore.
> > > 
> > > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > > the mode was passed around. Since we managed to get rid of the bdev
> > > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > > on whether the file was opened writable and writes to that block device
> > > are blocked. That logic doesn't work because we do allow
> > > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > > 
> > > So fix the detection logic. Use O_EXCL as an indicator that
> > > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > > thread. For userspace open paths O_EXCL will never be retained but for
> > > internal opens where we open files that are never installed into a file
> > > descriptor table this is fine.
> > > 
> > > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > > directly be raised by userspace. It is implicitly raised during
> > > mounting.
> > > 
> > > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > > unset.
> > > 
> > > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > > Reported-by: Matthew Wilcox <willy@infradead.org>
> > > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > 
> > The fix looks correct but admittedly it looks a bit hacky. I'd prefer
> > storing the needed information in some other flag, preferably one that does
> > not already have a special meaning with block devices. But FMODE_ space is
> > exhausted and don't see another easy solution. So I guess:
> 
> Admittedly, it's not pretty but we're abusing O_EXCL already for block
> devices. We do have FMODE_* available. We could add
> FMODE_WRITE_RESTRICTED because we have two bits left.

Yeah, I'm mostly afraid that a few years down the road someone comes and
adds code thinking that file->f_flags & O_EXCL means this is exclusive bdev
open. Which will be kind of natural assumption but subtly wrong... So to
make the code more robust, I'd prefer burning a fmode flag for this rather
than abusing O_EXCL.

								Honza
Christian Brauner March 26, 2024, 3:47 p.m. UTC | #9
On Tue, Mar 26, 2024 at 02:31:07PM +0100, Jan Kara wrote:
> On Tue 26-03-24 14:17:20, Christian Brauner wrote:
> > On Tue, Mar 26, 2024 at 01:57:15PM +0100, Jan Kara wrote:
> > > On Sat 23-03-24 17:11:19, Christian Brauner wrote:
> > > > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > > > default this option is set. When it is set the long-standing behavior
> > > > of being able to write to mounted block devices is enabled.
> > > > 
> > > > But in order to guard against unintended corruption by writing to the
> > > > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > > > off. In that case it isn't possible to write to mounted block devices
> > > > anymore.
> > > > 
> > > > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > > > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > > > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > > > the mode was passed around. Since we managed to get rid of the bdev
> > > > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > > > on whether the file was opened writable and writes to that block device
> > > > are blocked. That logic doesn't work because we do allow
> > > > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > > > 
> > > > So fix the detection logic. Use O_EXCL as an indicator that
> > > > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > > > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > > > thread. For userspace open paths O_EXCL will never be retained but for
> > > > internal opens where we open files that are never installed into a file
> > > > descriptor table this is fine.
> > > > 
> > > > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > > > directly be raised by userspace. It is implicitly raised during
> > > > mounting.
> > > > 
> > > > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > > > unset.
> > > > 
> > > > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > > > Reported-by: Matthew Wilcox <willy@infradead.org>
> > > > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > > 
> > > The fix looks correct but admittedly it looks a bit hacky. I'd prefer
> > > storing the needed information in some other flag, preferably one that does
> > > not already have a special meaning with block devices. But FMODE_ space is
> > > exhausted and don't see another easy solution. So I guess:
> > 
> > Admittedly, it's not pretty but we're abusing O_EXCL already for block
> > devices. We do have FMODE_* available. We could add
> > FMODE_WRITE_RESTRICTED because we have two bits left.
> 
> Yeah, I'm mostly afraid that a few years down the road someone comes and
> adds code thinking that file->f_flags & O_EXCL means this is exclusive bdev
> open. Which will be kind of natural assumption but subtly wrong... So to
> make the code more robust, I'd prefer burning a fmode flag for this rather
> than abusing O_EXCL.

Ok, done and resend. Thanks for the feedback! I was rather focussed on
not using a bit but I agree it's likely the right thing to do!
Christian Brauner March 27, 2024, 12:01 p.m. UTC | #10
On Sat, 23 Mar 2024 17:11:19 +0100, Christian Brauner wrote:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> [...]

Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes

[1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
      https://git.kernel.org/vfs/vfs/c/ddd65e19c601
[2/2] block: count BLK_OPEN_RESTRICT_WRITES openers
      https://git.kernel.org/vfs/vfs/c/3ff56e285de5
Matthew Wilcox (Oracle) March 29, 2024, 4:56 a.m. UTC | #11
On Sat, Mar 23, 2024 at 05:11:19PM +0100, Christian Brauner wrote:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> the mode was passed around. Since we managed to get rid of the bdev
> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> on whether the file was opened writable and writes to that block device
> are blocked. That logic doesn't work because we do allow
> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> So fix the detection logic. Use O_EXCL as an indicator that
> BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> for pidfds where O_EXCL means that this is a pidfd that refers to a
> thread. For userspace open paths O_EXCL will never be retained but for
> internal opens where we open files that are never installed into a file
> descriptor table this is fine.
> 
> Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> directly be raised by userspace. It is implicitly raised during
> mounting.
> 
> Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> unset.
> 
> Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> Reported-by: Matthew Wilcox <willy@infradead.org>
> Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>

So v1 of this patch works fine.  I just got round to testing v2, and it
does not.  Indeed, applying 2/2 causes root to fail to mount:

/dev/root: Can't open blockdev
List of all bdev filesystems:
 ext3
 ext2
 ext4
 xfs

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)

Applying only 1/2 boots but fails to fix the bug.
Christian Brauner March 29, 2024, 12:10 p.m. UTC | #12
On Fri, Mar 29, 2024 at 04:56:07AM +0000, Matthew Wilcox wrote:
> On Sat, Mar 23, 2024 at 05:11:19PM +0100, Christian Brauner wrote:
> > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > default this option is set. When it is set the long-standing behavior
> > of being able to write to mounted block devices is enabled.
> > 
> > But in order to guard against unintended corruption by writing to the
> > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > off. In that case it isn't possible to write to mounted block devices
> > anymore.
> > 
> > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > the mode was passed around. Since we managed to get rid of the bdev
> > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > on whether the file was opened writable and writes to that block device
> > are blocked. That logic doesn't work because we do allow
> > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > 
> > So fix the detection logic. Use O_EXCL as an indicator that
> > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > thread. For userspace open paths O_EXCL will never be retained but for
> > internal opens where we open files that are never installed into a file
> > descriptor table this is fine.
> > 
> > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > directly be raised by userspace. It is implicitly raised during
> > mounting.
> > 
> > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > unset.
> > 
> > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > Reported-by: Matthew Wilcox <willy@infradead.org>
> > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> So v1 of this patch works fine.  I just got round to testing v2, and it
> does not.  Indeed, applying 2/2 causes root to fail to mount:
> 
> /dev/root: Can't open blockdev
> List of all bdev filesystems:
>  ext3
>  ext2
>  ext4
>  xfs
> 
> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)
> 
> Applying only 1/2 boots but fails to fix the bug.

Thanks for testing this. This is odd because I tested with the setup you
provided.

I used the kernel config you sent to me in [2] with an xfs root device
with direct kernel boot and the following xfstests config in [3]. I'm
booting the vm with:

qemu-system-x86_64 -machine type=q35 -smp 1 -m 4G -accel kvm -cpu max -nographic -nodefaults \
        -chardev stdio,mux=on,id=console,signal=off -serial chardev:console -mon console \
        -kernel /home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.vmlinuz \
        -drive file=/home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.raw,format=raw,if=virtio \
        -append "console=ttyS0 root=/dev/vda2 module_blacklist=vmw_vmci systemd.tty.term.ttyS0=screen-256color systemd.tty.columns.ttyS0=96 systemd.tty.rows.ttyS0=46 debug loglevel=4 SYSTEMD_"

Note that the config you gave me in [2] didn't include
CONFIG_SCSI_VIRTIO=y which means I got the splat you did. I added this
missing config option and everything worked fine for me.

Can you please test what's in the vfs.fixes branch on
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git so we're
sure that we're testing the same thing?

The failures that I see are:

Failures: generic/042 generic/645 generic/682 generic/689 xfs/014
xfs/017 xfs/049 xfs/129 xfs/176 xfs/206 xfs/216 xfs/234 xfs/250 xfs/289
xfs/558 xfs/559
Failed 16 of 930 tests

* generic/645 fails because it requires an unrelated fix to fstests
  because we changed idmapped mounts to not not allow empty idmappings.
* generic/689 fails because the providec config doesn't compile tmpfs with POSIX ACL support
* xfs/558 and xfs/559 complain about missing logging
  about iomap validation and are unrelated
* All other failures are caused by loop devices which is expected unil
  a util-linux is released that contains Jan's fix in [1] so that
  mount(8) doesn't hold a writable fd to the loop device anymore and
  instead simply uses a read-only one.

[1]: https://github.com/util-linux/util-linux/commit/1cde32f323e0970f6c7f35940dcc0aea97b821e5
[2]: https://lore.kernel.org/r/Zf18I2UOGQxeN-Z1@casper.infradead.org
[3]:
#! /bin/bash

set -x

cd ~/src/git/xfstests-dev/
FIRST_DEV=/dev/vda3
SECOND_DEV=/dev/vda4
THIRD_DEV=/dev/vda5

echo "Testing xfs"
cat <<EOF >local.config
FSTYP=xfs
export TEST_DEV=${FIRST_DEV}
export SCRATCH_DEV=${SECOND_DEV}
export LOGWRITE_DEV=${THIRD_DEV}
export TEST_DIR=/mnt/test
export SCRATCH_MNT=/mnt/scratch
EOF

sudo mkfs.xfs -f ${FIRST_DEV}
sudo mkfs.xfs -f ${SECOND_DEV}
sudo mkfs.xfs -f ${THIRD_DEV}
sudo ./check -g quick
Christian Brauner March 29, 2024, 3:11 p.m. UTC | #13
On Fri, Mar 29, 2024 at 01:10:57PM +0100, Christian Brauner wrote:
> On Fri, Mar 29, 2024 at 04:56:07AM +0000, Matthew Wilcox wrote:
> > On Sat, Mar 23, 2024 at 05:11:19PM +0100, Christian Brauner wrote:
> > > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > > default this option is set. When it is set the long-standing behavior
> > > of being able to write to mounted block devices is enabled.
> > > 
> > > But in order to guard against unintended corruption by writing to the
> > > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > > off. In that case it isn't possible to write to mounted block devices
> > > anymore.
> > > 
> > > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > > the mode was passed around. Since we managed to get rid of the bdev
> > > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > > on whether the file was opened writable and writes to that block device
> > > are blocked. That logic doesn't work because we do allow
> > > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > > 
> > > So fix the detection logic. Use O_EXCL as an indicator that
> > > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > > thread. For userspace open paths O_EXCL will never be retained but for
> > > internal opens where we open files that are never installed into a file
> > > descriptor table this is fine.
> > > 
> > > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > > directly be raised by userspace. It is implicitly raised during
> > > mounting.
> > > 
> > > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > > unset.
> > > 
> > > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > > Reported-by: Matthew Wilcox <willy@infradead.org>
> > > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > 
> > So v1 of this patch works fine.  I just got round to testing v2, and it
> > does not.  Indeed, applying 2/2 causes root to fail to mount:
> > 
> > /dev/root: Can't open blockdev
> > List of all bdev filesystems:
> >  ext3
> >  ext2
> >  ext4
> >  xfs
> > 
> > Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)

One other observation. This line right here is suspicious.
What I expect to see here is something like this:

Kernel panic - not syncing: VFS: Unable to mount root fs on "/dev/vda2" or unknown-block(254,2) ]---

IOW, the name of the device that failed to mount should be printed.
Since that doesn't show up this indicates that the kernel booted here
didn't have commit 84d2b696236c ("init/mount: print pretty name of root
device when panics") which I've merged for v6.7.

So that output looks like it's from a pre v6.7 kernel. But Jan's write
restriction changes only made it into v6.8. So even if the patches
somehow were to apply to this kernel they'd be entirely ineffective
since CONFIG_BLK_DEV_WRITE_MOUNTED is not available in this kernel.

> > 
> > Applying only 1/2 boots but fails to fix the bug.
> 
> Thanks for testing this. This is odd because I tested with the setup you
> provided.
> 
> I used the kernel config you sent to me in [2] with an xfs root device
> with direct kernel boot and the following xfstests config in [3]. I'm
> booting the vm with:
> 
> qemu-system-x86_64 -machine type=q35 -smp 1 -m 4G -accel kvm -cpu max -nographic -nodefaults \
>         -chardev stdio,mux=on,id=console,signal=off -serial chardev:console -mon console \
>         -kernel /home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.vmlinuz \
>         -drive file=/home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.raw,format=raw,if=virtio \
>         -append "console=ttyS0 root=/dev/vda2 module_blacklist=vmw_vmci systemd.tty.term.ttyS0=screen-256color systemd.tty.columns.ttyS0=96 systemd.tty.rows.ttyS0=46 debug loglevel=4 SYSTEMD_"
> 
> Note that the config you gave me in [2] didn't include
> CONFIG_SCSI_VIRTIO=y which means I got the splat you did. I added this
> missing config option and everything worked fine for me.
> 
> Can you please test what's in the vfs.fixes branch on
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git so we're
> sure that we're testing the same thing?
> 
> The failures that I see are:
> 
> Failures: generic/042 generic/645 generic/682 generic/689 xfs/014
> xfs/017 xfs/049 xfs/129 xfs/176 xfs/206 xfs/216 xfs/234 xfs/250 xfs/289
> xfs/558 xfs/559
> Failed 16 of 930 tests
> 
> * generic/645 fails because it requires an unrelated fix to fstests
>   because we changed idmapped mounts to not not allow empty idmappings.
> * generic/689 fails because the providec config doesn't compile tmpfs with POSIX ACL support
> * xfs/558 and xfs/559 complain about missing logging
>   about iomap validation and are unrelated
> * All other failures are caused by loop devices which is expected unil
>   a util-linux is released that contains Jan's fix in [1] so that
>   mount(8) doesn't hold a writable fd to the loop device anymore and
>   instead simply uses a read-only one.
> 
> [1]: https://github.com/util-linux/util-linux/commit/1cde32f323e0970f6c7f35940dcc0aea97b821e5
> [2]: https://lore.kernel.org/r/Zf18I2UOGQxeN-Z1@casper.infradead.org
> [3]:
> #! /bin/bash
> 
> set -x
> 
> cd ~/src/git/xfstests-dev/
> FIRST_DEV=/dev/vda3
> SECOND_DEV=/dev/vda4
> THIRD_DEV=/dev/vda5
> 
> echo "Testing xfs"
> cat <<EOF >local.config
> FSTYP=xfs
> export TEST_DEV=${FIRST_DEV}
> export SCRATCH_DEV=${SECOND_DEV}
> export LOGWRITE_DEV=${THIRD_DEV}
> export TEST_DIR=/mnt/test
> export SCRATCH_MNT=/mnt/scratch
> EOF
> 
> sudo mkfs.xfs -f ${FIRST_DEV}
> sudo mkfs.xfs -f ${SECOND_DEV}
> sudo mkfs.xfs -f ${THIRD_DEV}
> sudo ./check -g quick
Christian Brauner March 29, 2024, 3:24 p.m. UTC | #14
> One other observation. This line right here is suspicious.
> What I expect to see here is something like this:
> 
> Kernel panic - not syncing: VFS: Unable to mount root fs on "/dev/vda2" or unknown-block(254,2) ]---

Sorry, I looked at the wrong codepath here.
Christian Brauner April 3, 2024, 6:04 a.m. UTC | #15
On Fri, Mar 29, 2024 at 01:10:57PM +0100, Christian Brauner wrote:
> On Fri, Mar 29, 2024 at 04:56:07AM +0000, Matthew Wilcox wrote:
> > On Sat, Mar 23, 2024 at 05:11:19PM +0100, Christian Brauner wrote:
> > > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > > default this option is set. When it is set the long-standing behavior
> > > of being able to write to mounted block devices is enabled.
> > > 
> > > But in order to guard against unintended corruption by writing to the
> > > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > > off. In that case it isn't possible to write to mounted block devices
> > > anymore.
> > > 
> > > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > > the mode was passed around. Since we managed to get rid of the bdev
> > > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > > on whether the file was opened writable and writes to that block device
> > > are blocked. That logic doesn't work because we do allow
> > > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > > 
> > > So fix the detection logic. Use O_EXCL as an indicator that
> > > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > > thread. For userspace open paths O_EXCL will never be retained but for
> > > internal opens where we open files that are never installed into a file
> > > descriptor table this is fine.
> > > 
> > > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > > directly be raised by userspace. It is implicitly raised during
> > > mounting.
> > > 
> > > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > > unset.
> > > 
> > > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > > Reported-by: Matthew Wilcox <willy@infradead.org>
> > > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > 
> > So v1 of this patch works fine.  I just got round to testing v2, and it
> > does not.  Indeed, applying 2/2 causes root to fail to mount:
> > 
> > /dev/root: Can't open blockdev
> > List of all bdev filesystems:
> >  ext3
> >  ext2
> >  ext4
> >  xfs
> > 
> > Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)
> > 
> > Applying only 1/2 boots but fails to fix the bug.
> 
> Thanks for testing this. This is odd because I tested with the setup you
> provided.
> 
> I used the kernel config you sent to me in [2] with an xfs root device
> with direct kernel boot and the following xfstests config in [3]. I'm
> booting the vm with:
> 
> qemu-system-x86_64 -machine type=q35 -smp 1 -m 4G -accel kvm -cpu max -nographic -nodefaults \
>         -chardev stdio,mux=on,id=console,signal=off -serial chardev:console -mon console \
>         -kernel /home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.vmlinuz \
>         -drive file=/home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.raw,format=raw,if=virtio \
>         -append "console=ttyS0 root=/dev/vda2 module_blacklist=vmw_vmci systemd.tty.term.ttyS0=screen-256color systemd.tty.columns.ttyS0=96 systemd.tty.rows.ttyS0=46 debug loglevel=4 SYSTEMD_"
> 
> Note that the config you gave me in [2] didn't include
> CONFIG_SCSI_VIRTIO=y which means I got the splat you did. I added this
> missing config option and everything worked fine for me.
> 
> Can you please test what's in the vfs.fixes branch on
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git so we're
> sure that we're testing the same thing?

Willy, can you still reproduce this? I've been delaying the pull request
to give you time to verify this but I would really like to send it
before Friday. So it'd be really great if you could get back to me on
this.

> 
> The failures that I see are:
> 
> Failures: generic/042 generic/645 generic/682 generic/689 xfs/014
> xfs/017 xfs/049 xfs/129 xfs/176 xfs/206 xfs/216 xfs/234 xfs/250 xfs/289
> xfs/558 xfs/559
> Failed 16 of 930 tests
> 
> * generic/645 fails because it requires an unrelated fix to fstests
>   because we changed idmapped mounts to not not allow empty idmappings.
> * generic/689 fails because the providec config doesn't compile tmpfs with POSIX ACL support
> * xfs/558 and xfs/559 complain about missing logging
>   about iomap validation and are unrelated
> * All other failures are caused by loop devices which is expected unil
>   a util-linux is released that contains Jan's fix in [1] so that
>   mount(8) doesn't hold a writable fd to the loop device anymore and
>   instead simply uses a read-only one.
> 
> [1]: https://github.com/util-linux/util-linux/commit/1cde32f323e0970f6c7f35940dcc0aea97b821e5
> [2]: https://lore.kernel.org/r/Zf18I2UOGQxeN-Z1@casper.infradead.org
> [3]:
> #! /bin/bash
> 
> set -x
> 
> cd ~/src/git/xfstests-dev/
> FIRST_DEV=/dev/vda3
> SECOND_DEV=/dev/vda4
> THIRD_DEV=/dev/vda5
> 
> echo "Testing xfs"
> cat <<EOF >local.config
> FSTYP=xfs
> export TEST_DEV=${FIRST_DEV}
> export SCRATCH_DEV=${SECOND_DEV}
> export LOGWRITE_DEV=${THIRD_DEV}
> export TEST_DIR=/mnt/test
> export SCRATCH_MNT=/mnt/scratch
> EOF
> 
> sudo mkfs.xfs -f ${FIRST_DEV}
> sudo mkfs.xfs -f ${SECOND_DEV}
> sudo mkfs.xfs -f ${THIRD_DEV}
> sudo ./check -g quick
Matthew Wilcox (Oracle) April 3, 2024, 7:22 p.m. UTC | #16
On Wed, Apr 03, 2024 at 08:04:16AM +0200, Christian Brauner wrote:
> Willy, can you still reproduce this? I've been delaying the pull request
> to give you time to verify this but I would really like to send it
> before Friday. So it'd be really great if you could get back to me on
> this.

Looks like the latest version fixed it.  Just built & booted next-20240403
with no patches and it works fine.  Thanks for fixing.  (I took a few
days off over Easter, so didn't look at this before now)
diff mbox series

Patch

diff --git a/block/bdev.c b/block/bdev.c
index 7a5f611c3d2e..f819f3086905 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -821,13 +821,12 @@  static void bdev_yield_write_access(struct file *bdev_file)
 		return;
 
 	bdev = file_bdev(bdev_file);
-	/* Yield exclusive or shared write access. */
-	if (bdev_file->f_mode & FMODE_WRITE) {
-		if (bdev_writes_blocked(bdev))
-			bdev_unblock_writes(bdev);
-		else
-			bdev->bd_writers--;
-	}
+
+	/* O_EXCL is only set for internal BLK_OPEN_RESTRICT_WRITES. */
+	if (bdev_file->f_flags & O_EXCL)
+		bdev_unblock_writes(bdev);
+	else if (bdev_file->f_mode & FMODE_WRITE)
+		bdev->bd_writers--;
 }
 
 /**
@@ -946,6 +945,13 @@  static unsigned blk_to_file_flags(blk_mode_t mode)
 	else
 		WARN_ON_ONCE(true);
 
+	/*
+	 * BLK_OPEN_RESTRICT_WRITES is never set from userspace and
+	 * O_EXCL is stripped from userspace.
+	 */
+	if (mode & BLK_OPEN_RESTRICT_WRITES)
+		flags |= O_EXCL;
+
 	if (mode & BLK_OPEN_NDELAY)
 		flags |= O_NDELAY;