diff mbox series

block: Add config option to not allow writing to mounted devices

Message ID 20230612161614.10302-1-jack@suse.cz (mailing list archive)
State New, archived
Headers show
Series block: Add config option to not allow writing to mounted devices | expand

Commit Message

Jan Kara June 12, 2023, 4:16 p.m. UTC
Writing to mounted devices is dangerous and can lead to filesystem
corruption as well as crashes. Furthermore syzbot comes with more and
more involved examples how to corrupt block device under a mounted
filesystem leading to kernel crashes and reports we can do nothing
about. Add config option to disallow writing to mounted (exclusively
open) block devices. Syzbot can use this option to avoid uninteresting
crashes. Also users whose userspace setup does not need writing to
mounted block devices can set this config option for hardening.

Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/Kconfig             | 12 ++++++++++++
 block/bdev.c              | 10 ++++++++++
 include/linux/blk_types.h |  3 +++
 3 files changed, 25 insertions(+)

FWIW I've tested this and my test VM with ext4 root fs boots fine and fstests
on ext4 seem to be also running fine with BLK_DEV_WRITE_HARDENING enabled.
OTOH my old VM setup which is not using initrd fails to boot with
BLK_DEV_WRITE_HARDENING enabled because fsck cannot open the root device
because the root is already mounted (read-only). Anyway this should be useful
for syzbot (Dmitry indicated interest in this option in the past) and maybe
other well controlled setups.

Comments

Jan Kara June 12, 2023, 4:25 p.m. UTC | #1
On Mon 12-06-23 18:16:14, Jan Kara wrote:
> Writing to mounted devices is dangerous and can lead to filesystem
> corruption as well as crashes. Furthermore syzbot comes with more and
> more involved examples how to corrupt block device under a mounted
> filesystem leading to kernel crashes and reports we can do nothing
> about. Add config option to disallow writing to mounted (exclusively
> open) block devices. Syzbot can use this option to avoid uninteresting
> crashes. Also users whose userspace setup does not need writing to
> mounted block devices can set this config option for hardening.
> 
> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
> Signed-off-by: Jan Kara <jack@suse.cz>

Please disregard this patch. I had uncommited fixups in my tree. I'll send
fixed version shortly. I'm sorry for the noise.

								Honza
Bart Van Assche June 12, 2023, 5:39 p.m. UTC | #2
On 6/12/23 09:25, Jan Kara wrote:
> On Mon 12-06-23 18:16:14, Jan Kara wrote:
>> Writing to mounted devices is dangerous and can lead to filesystem
>> corruption as well as crashes. Furthermore syzbot comes with more and
>> more involved examples how to corrupt block device under a mounted
>> filesystem leading to kernel crashes and reports we can do nothing
>> about. Add config option to disallow writing to mounted (exclusively
>> open) block devices. Syzbot can use this option to avoid uninteresting
>> crashes. Also users whose userspace setup does not need writing to
>> mounted block devices can set this config option for hardening.
>>
>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
>> Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Please disregard this patch. I had uncommited fixups in my tree. I'll send
> fixed version shortly. I'm sorry for the noise.

Have alternatives been configured to making this functionality configurable
at build time only? How about a kernel command line parameter instead of a
config option?

Thanks,

Bart.
Theodore Ts'o June 12, 2023, 5:47 p.m. UTC | #3
On Mon, Jun 12, 2023 at 10:39:51AM -0700, Bart Van Assche wrote:
> > > Writing to mounted devices is dangerous and can lead to filesystem
> > > corruption as well as crashes. Furthermore syzbot comes with more and
> > > more involved examples how to corrupt block device under a mounted
> > > filesystem leading to kernel crashes and reports we can do nothing
> > > about. Add config option to disallow writing to mounted (exclusively
> > > open) block devices. Syzbot can use this option to avoid uninteresting
> > > crashes. Also users whose userspace setup does not need writing to
> > > mounted block devices can set this config option for hardening.
> 
> Have alternatives been configured to making this functionality
> configurable at build time only? How about a kernel command line
> parameter instead of a config option?

I could imagine wanting a config option which changes the default, as
well as a way of setting the parameter on the command line so that
users of distro kernel can change the parameter value.  That's
especially since it might be useful for more than just reining in
syzbot reports.

						- Ted
Colin Walters June 12, 2023, 6:52 p.m. UTC | #4
On Mon, Jun 12, 2023, at 1:39 PM, Bart Van Assche wrote:
> On 6/12/23 09:25, Jan Kara wrote:
>> On Mon 12-06-23 18:16:14, Jan Kara wrote:
>>> Writing to mounted devices is dangerous and can lead to filesystem
>>> corruption as well as crashes. Furthermore syzbot comes with more and
>>> more involved examples how to corrupt block device under a mounted
>>> filesystem leading to kernel crashes and reports we can do nothing
>>> about. Add config option to disallow writing to mounted (exclusively
>>> open) block devices. Syzbot can use this option to avoid uninteresting
>>> crashes. Also users whose userspace setup does not need writing to
>>> mounted block devices can set this config option for hardening.
>>>
>>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
>>> Signed-off-by: Jan Kara <jack@suse.cz>
>> 
>> Please disregard this patch. I had uncommited fixups in my tree. I'll send
>> fixed version shortly. I'm sorry for the noise.
>
> Have alternatives been configured to making this functionality configurable
> at build time only? How about a kernel command line parameter instead of a
> config option?

It's not just syzbot here; at least once in my life I accidentally did `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk and not the target USB device.  I know I'm not alone =)

There's a lot of similar accidental-damage protection from this.  Another stronger argument here is that if one has a security policy that restricts access to filesystem level objects, if a process can somehow write to a mounted block device, it effectively subverts all of those controls. 

Right now it looks to me we're invoking devcgroup_check_permission pretty early on; maybe we could extend the device cgroup stuff to have a new check for write-mounted, like

```
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c994ff5b157c..f2af33c5acc1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6797,6 +6797,7 @@ enum {
 	BPF_DEVCG_ACC_MKNOD	= (1ULL << 0),
 	BPF_DEVCG_ACC_READ	= (1ULL << 1),
 	BPF_DEVCG_ACC_WRITE	= (1ULL << 2),
+	BPF_DEVCG_ACC_WRITE_MOUNTED	= (1ULL << 3),
 };
 
 enum {
```

?  But probably this would need to be some kind of opt-in flag to avoid breaking existing bpf progs?

If it was configurable via the device cgroup, then it's completely flexible from userspace; most specifically including supporting some specially privileged processes from doing it if necessary.

Also, I wonder if we should also support restricting *reads* from mounted block devices?
kernel test robot June 13, 2023, 4:56 a.m. UTC | #5
Hi Jan,

kernel test robot noticed the following build errors:

[auto build test ERROR on v6.4-rc6]
[also build test ERROR on linus/master next-20230609]
[cannot apply to axboe-block/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/block-Add-config-option-to-not-allow-writing-to-mounted-devices/20230613-001910
base:   v6.4-rc6
patch link:    https://lore.kernel.org/r/20230612161614.10302-1-jack%40suse.cz
patch subject: [PATCH] block: Add config option to not allow writing to mounted devices
config: arc-randconfig-r043-20230612 (https://download.01.org/0day-ci/archive/20230613/202306131212.dPssLmeY-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 12.3.0
reproduce (this is a W=1 build):
        mkdir -p ~/bin
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout v6.4-rc6
        b4 shazam https://lore.kernel.org/r/20230612161614.10302-1-jack@suse.cz
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=arc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306131212.dPssLmeY-lkp@intel.com/

All errors (new ones prefixed by >>):

   block/bdev.c: In function 'blkdev_get_whole':
>> block/bdev.c:606:59: error: 'struct block_device' has no member named 'bd_writers'
     606 |                 if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0)
         |                                                           ^~
   block/bdev.c:627:33: error: 'struct block_device' has no member named 'bd_writers'
     627 |                 atomic_inc(&bdev->bd_writers);
         |                                 ^~
   block/bdev.c: In function 'blkdev_put_whole':
   block/bdev.c:637:33: error: 'struct block_device' has no member named 'bd_writers'
     637 |                 atomic_dec(&bdev->bd_writers);
         |                                 ^~


vim +606 block/bdev.c

   599	
   600	static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
   601	{
   602		struct gendisk *disk = bdev->bd_disk;
   603		int ret;
   604	
   605		if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) {
 > 606			if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0)
   607				return -EBUSY;
   608			if (mode & FMODE_WRITE && bdev->bd_holders > 0)
   609				return -EBUSY;
   610		}
   611		if (disk->fops->open) {
   612			ret = disk->fops->open(bdev, mode);
   613			if (ret) {
   614				/* avoid ghost partitions on a removed medium */
   615				if (ret == -ENOMEDIUM &&
   616				     test_bit(GD_NEED_PART_SCAN, &disk->state))
   617					bdev_disk_changed(disk, true);
   618				return ret;
   619			}
   620		}
   621	
   622		if (!atomic_read(&bdev->bd_openers))
   623			set_init_blocksize(bdev);
   624		if (test_bit(GD_NEED_PART_SCAN, &disk->state))
   625			bdev_disk_changed(disk, false);
   626		if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
   627			atomic_inc(&bdev->bd_writers);
   628		atomic_inc(&bdev->bd_openers);
   629		return 0;
   630	}
   631
Christoph Hellwig June 13, 2023, 5:10 a.m. UTC | #6
> +config BLK_DEV_WRITE_HARDENING
> +	bool "Do not allow writing to mounted devices"
> +	help
> +	When a block device is mounted, writing to its buffer cache very likely
> +	going to cause filesystem corruption. It is also rather easy to crash
> +	the kernel in this way since the filesystem has no practical way of
> +	detecting these writes to buffer cache and verifying its metadata
> +	integrity. Select this option to disallow writing to mounted devices.
> +	This should be mostly fine but some filesystems (e.g. ext4) rely on
> +	the ability of filesystem tools to write to mounted filesystems to
> +	set e.g. UUID or run fsck on the root filesystem in some setups.

I'm not sure a config option is really the right thing.

I'd much prefer a BLK_OPEN_ flag to prohibit any other writer.
Except for etN and maybe fat all file systems can set that
unconditionally.  And for those file systems that have historically
allowed writes to mounted file systems they can find a local way
to decide on when and when not to set it.
Dmitry Vyukov June 13, 2023, 6:09 a.m. UTC | #7
On Tue, 13 Jun 2023 at 07:10, Christoph Hellwig <hch@infradead.org> wrote:
>
> > +config BLK_DEV_WRITE_HARDENING
> > +     bool "Do not allow writing to mounted devices"
> > +     help
> > +     When a block device is mounted, writing to its buffer cache very likely
> > +     going to cause filesystem corruption. It is also rather easy to crash
> > +     the kernel in this way since the filesystem has no practical way of
> > +     detecting these writes to buffer cache and verifying its metadata
> > +     integrity. Select this option to disallow writing to mounted devices.
> > +     This should be mostly fine but some filesystems (e.g. ext4) rely on
> > +     the ability of filesystem tools to write to mounted filesystems to
> > +     set e.g. UUID or run fsck on the root filesystem in some setups.
>
> I'm not sure a config option is really the right thing.
>
> I'd much prefer a BLK_OPEN_ flag to prohibit any other writer.
> Except for etN and maybe fat all file systems can set that
> unconditionally.  And for those file systems that have historically
> allowed writes to mounted file systems they can find a local way
> to decide on when and when not to set it.

I don't question there are use cases for the flag, but there are use
cases for the config as well.

Some distros may want a guarantee that this does not happen as it
compromises lockdown and kernel integrity (on par with unsigned module
loading).
For fuzzing systems it also may be hard to ensure fine-grained
argument constraints, it's much easier and more reliable to prohibit
it on config level.
Dmitry Vyukov June 13, 2023, 6:49 a.m. UTC | #8
On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote:
>
> Writing to mounted devices is dangerous and can lead to filesystem
> corruption as well as crashes. Furthermore syzbot comes with more and
> more involved examples how to corrupt block device under a mounted
> filesystem leading to kernel crashes and reports we can do nothing
> about. Add config option to disallow writing to mounted (exclusively
> open) block devices. Syzbot can use this option to avoid uninteresting
> crashes. Also users whose userspace setup does not need writing to
> mounted block devices can set this config option for hardening.

+syzkaller, Kees, Alexander, Eric

We can enable this on syzbot, however I have the same concerns as with
disabling of XFS_SUPPORT_V4:
https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278

It's useful to know the actual situation with respect to
bugs/vulnerabilities and one of the goals of syzbot is surfacing this
situation.
For some areas there is mismatch between upstream kernel and
downstream distros. Upstream says "this is buggy and deprecated",
which is fine in itself if not the other part: downstream distros
simply ignore that (maybe not even aware) and keep things enabled for
as long as possible. Stopping testing this is moving more in this
direction: silencing warnings and pretending that everything is fine,
when it's not.

I wonder if there is a way to at least somehow bridge this gap.

There is CONFIG_BROKEN, but not sure if it's the right thing here.
Maybe we add something like CONFIG_INSECURE. And such insecure config
settings (not setting BLK_DEV_WRITE_HARDENING, setting XFS_SUPPORT_V4)
will say:

depends on INSECURE

So that distros will need to at least acknowledge this to users by saying:

CONFIG_INSECURE=y

They are then motivated to work on actually removing dependencies on
these deprecated things.

CONFIG_INSECURE description can say something along the lines of "this
kernel includes subsystems with known bugs that may cause security and
data integrity issues". When a subsystem adds "depends on INSECURE",
the commit should list some of the known issues.

Then I see how trading disabling things on syzbot in exchange for
"depends on INSECURE" becomes reasonable and satisfies all parties to
some degree.

Btw, if we do this it can make sense to invert this config (enable
concurrent writes), default to 'y' and recommend 'n'.

Does it make any sense? Any other suggestions?

P.S. Alex, if this lands this may be a candidate for addition to:
https://github.com/a13xp0p0v/kconfig-hardened-check
(and XFS_SUPPORT_V4 as well).



> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  block/Kconfig             | 12 ++++++++++++
>  block/bdev.c              | 10 ++++++++++
>  include/linux/blk_types.h |  3 +++
>  3 files changed, 25 insertions(+)
>
> FWIW I've tested this and my test VM with ext4 root fs boots fine and fstests
> on ext4 seem to be also running fine with BLK_DEV_WRITE_HARDENING enabled.
> OTOH my old VM setup which is not using initrd fails to boot with
> BLK_DEV_WRITE_HARDENING enabled because fsck cannot open the root device
> because the root is already mounted (read-only). Anyway this should be useful
> for syzbot (Dmitry indicated interest in this option in the past) and maybe
> other well controlled setups.
>
> diff --git a/block/Kconfig b/block/Kconfig
> index 86122e459fe0..c44e2238e18d 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -77,6 +77,18 @@ config BLK_DEV_INTEGRITY_T10
>         select CRC_T10DIF
>         select CRC64_ROCKSOFT
>
> +config BLK_DEV_WRITE_HARDENING
> +       bool "Do not allow writing to mounted devices"
> +       help
> +       When a block device is mounted, writing to its buffer cache very likely
> +       going to cause filesystem corruption. It is also rather easy to crash
> +       the kernel in this way since the filesystem has no practical way of
> +       detecting these writes to buffer cache and verifying its metadata
> +       integrity. Select this option to disallow writing to mounted devices.
> +       This should be mostly fine but some filesystems (e.g. ext4) rely on
> +       the ability of filesystem tools to write to mounted filesystems to
> +       set e.g. UUID or run fsck on the root filesystem in some setups.
> +
>  config BLK_DEV_ZONED
>         bool "Zoned block device support"
>         select MQ_IOSCHED_DEADLINE
> diff --git a/block/bdev.c b/block/bdev.c
> index 21c63bfef323..ad01f0a6af0d 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -602,6 +602,12 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
>         struct gendisk *disk = bdev->bd_disk;
>         int ret;
>
> +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) {
> +               if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0)
> +                       return -EBUSY;
> +               if (mode & FMODE_WRITE && bdev->bd_holders > 0)
> +                       return -EBUSY;
> +       }
>         if (disk->fops->open) {
>                 ret = disk->fops->open(bdev, mode);
>                 if (ret) {
> @@ -617,6 +623,8 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
>                 set_init_blocksize(bdev);
>         if (test_bit(GD_NEED_PART_SCAN, &disk->state))
>                 bdev_disk_changed(disk, false);
> +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
> +               atomic_inc(&bdev->bd_writers);
>         atomic_inc(&bdev->bd_openers);
>         return 0;
>  }
> @@ -625,6 +633,8 @@ static void blkdev_put_whole(struct block_device *bdev, fmode_t mode)
>  {
>         if (atomic_dec_and_test(&bdev->bd_openers))
>                 blkdev_flush_mapping(bdev);
> +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
> +               atomic_dec(&bdev->bd_writers);
>         if (bdev->bd_disk->fops->release)
>                 bdev->bd_disk->fops->release(bdev->bd_disk, mode);
>  }
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 740afe80f297..25af3340f316 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -67,6 +67,9 @@ struct block_device {
>         struct partition_meta_info *bd_meta_info;
>  #ifdef CONFIG_FAIL_MAKE_REQUEST
>         bool                    bd_make_it_fail;
> +#endif
> +#ifdef CONFIG_BLK_DEV_WRITE_HARDENING
> +       atomic_t                bd_writers;
>  #endif
>         /*
>          * keep this out-of-line as it's both big and not needed in the fast
> --
> 2.35.3
>
Jan Kara June 13, 2023, 11:34 a.m. UTC | #9
On Mon 12-06-23 14:52:54, Colin Walters wrote:
> On Mon, Jun 12, 2023, at 1:39 PM, Bart Van Assche wrote:
> > On 6/12/23 09:25, Jan Kara wrote:
> >> On Mon 12-06-23 18:16:14, Jan Kara wrote:
> >>> Writing to mounted devices is dangerous and can lead to filesystem
> >>> corruption as well as crashes. Furthermore syzbot comes with more and
> >>> more involved examples how to corrupt block device under a mounted
> >>> filesystem leading to kernel crashes and reports we can do nothing
> >>> about. Add config option to disallow writing to mounted (exclusively
> >>> open) block devices. Syzbot can use this option to avoid uninteresting
> >>> crashes. Also users whose userspace setup does not need writing to
> >>> mounted block devices can set this config option for hardening.
> >>>
> >>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
> >>> Signed-off-by: Jan Kara <jack@suse.cz>
> >> 
> >> Please disregard this patch. I had uncommited fixups in my tree. I'll send
> >> fixed version shortly. I'm sorry for the noise.
> >
> > Have alternatives been configured to making this functionality configurable
> > at build time only? How about a kernel command line parameter instead of a
> > config option?
> 
> It's not just syzbot here; at least once in my life I accidentally did
> `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk
> and not the target USB device.  I know I'm not alone =)

Yeah, so I'm not sure we are going to protect against this particular case.
I mean it is not *that* uncommon to alter partition table of /dev/sda while
/dev/sda1 is mounted. And for the kernel it is difficult to distinguish
this and your mishap.

> There's a lot of similar accidental-damage protection from this.  Another
> stronger argument here is that if one has a security policy that
> restricts access to filesystem level objects, if a process can somehow
> write to a mounted block device, it effectively subverts all of those
> controls. 

Well, there are multiple levels of protection that I can think of:

1) If user can write some image and make kernel mount it.
2) If user can modify device content while mounted (but not buffer cache
of the device).
3) If user can modify buffer cache of the device while mounted.

3) is the most problematic and effectively equivalent to full machine
control (executing arbitrary code in kernel mode) these days.  For 1) and
2) there are reasonable protection measures the filesystem driver can take
(and effectively you cannot escape these problems if you allow attaching
untrusted devices such as USB sticks) so they can cause DoS but we should
be able to prevent full machine takeover in the filesystem code.

So this patch is mainly aimed at forbiding 3).

> Right now it looks to me we're invoking devcgroup_check_permission pretty
> early on; maybe we could extend the device cgroup stuff to have a new
> check for write-mounted, like
> 
> ```
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index c994ff5b157c..f2af33c5acc1 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -6797,6 +6797,7 @@ enum {
>  	BPF_DEVCG_ACC_MKNOD	= (1ULL << 0),
>  	BPF_DEVCG_ACC_READ	= (1ULL << 1),
>  	BPF_DEVCG_ACC_WRITE	= (1ULL << 2),
> +	BPF_DEVCG_ACC_WRITE_MOUNTED	= (1ULL << 3),
>  };
>  
>  enum {
> ```
> 
> ?  But probably this would need to be some kind of opt-in flag to avoid
> breaking existing bpf progs?
> 
> If it was configurable via the device cgroup, then it's completely
> flexible from userspace; most specifically including supporting some
> specially privileged processes from doing it if necessary.

I kind of like the flexibility of device cgroups but it does not seem to
fit well with my "stop unactionable syzbot reports" usecase and doing the
protection properly would mean that we now need to create way to approve
access for all the tools that need this. So I'm not against this but I'd
consider this "future extension possibility" :).

> Also, I wonder if we should also support restricting *reads* from mounted
> block devices?

I don't see a strong usecase for this. Why would mounted vs unmounted
matter here?

								Honza
Theodore Ts'o June 13, 2023, 7:22 p.m. UTC | #10
On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote:
> On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote:
> >
> > Writing to mounted devices is dangerous and can lead to filesystem
> > corruption as well as crashes. Furthermore syzbot comes with more and
> > more involved examples how to corrupt block device under a mounted
> > filesystem leading to kernel crashes and reports we can do nothing
> > about. Add config option to disallow writing to mounted (exclusively
> > open) block devices. Syzbot can use this option to avoid uninteresting
> > crashes. Also users whose userspace setup does not need writing to
> > mounted block devices can set this config option for hardening.
> 
> +syzkaller, Kees, Alexander, Eric
> 
> We can enable this on syzbot, however I have the same concerns as with
> disabling of XFS_SUPPORT_V4:
> https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278
> 
> It's useful to know the actual situation with respect to
> bugs/vulnerabilities and one of the goals of syzbot is surfacing this
> situation.

So from my perspective, it's not a "vulernability".  It's another
example of "syzbot is porting that root can screw you".  The question
about whether the attacker has write access to the block device is a
threat model question.  If you have write access to the block device,
you can set the setuid bit on a copy of /bin/bash; but is that a
"vulernability"?  Not really....

Yes, from the lockdown perspective, what thight might mean is that
"root can run arbitrary code in ring0".  But for most distributions,
given that they allow users to provide custom modules (for exanple,
for Nvidia or other GPU support) that were not built by the
distribution, they can run arbitrary code in ring 0 because root can
provide an arbitrary kernel module.  If you are 100% locked down,
perhaps that's not the case.  But this is a very specialized use case,
and more to the point, asking upstream to worry about this is
effectively an unfunded mandate.


> For some areas there is mismatch between upstream kernel and
> downstream distros. Upstream says "this is buggy and deprecated",
> which is fine in itself if not the other part: downstream distros
> simply ignore that (maybe not even aware) and keep things enabled for
> as long as possible. Stopping testing this is moving more in this
> direction: silencing warnings and pretending that everything is fine,
> when it's not.
> 
> I wonder if there is a way to at least somehow bridge this gap.
> 
> There is CONFIG_BROKEN, but not sure if it's the right thing here.
> Maybe we add something like CONFIG_INSECURE.

"INSECURE" is not really accurate, because it presumes a certain treat
model, and it's not neccessarily one that everyone has signed off as
being one that they need to worry about.

So I'd put it differently.  We need to have a way of filtering out
those syzbot reports which are caused by allowing a privileged user to
be able to dynamically nodify the block device for a mounted file
system.  One way to do that is to simply surpress them.  For example,
we did that when we taught syzbot not to try to set the real-time
priority for a userspace task to MAX_RT_PRIO, which starves kernel
threads and causes the system to malfunction.  That's not a "kernel
bug", that's a userspace bug, and teaching syzbot not to do the stupid
thing made sense.

If you think there are some subset of people who are about syzbot
reports that are caused by dynamically modifying the underlying block
device while it is mounted, what if we can somehow attach a label to
the syzbot report, indicating that it was caused by modifying a
moutned block device?  That way, upstream can ignore it as a stupid
syzbot thing, and you can keep it as a "theoretical vulnerability".
And we don't have to debate the question of which threat model is the
more reasonable one.

						- Ted
Jan Kara June 13, 2023, 8:56 p.m. UTC | #11
On Mon 12-06-23 22:10:30, Christoph Hellwig wrote:
> > +config BLK_DEV_WRITE_HARDENING
> > +	bool "Do not allow writing to mounted devices"
> > +	help
> > +	When a block device is mounted, writing to its buffer cache very likely
> > +	going to cause filesystem corruption. It is also rather easy to crash
> > +	the kernel in this way since the filesystem has no practical way of
> > +	detecting these writes to buffer cache and verifying its metadata
> > +	integrity. Select this option to disallow writing to mounted devices.
> > +	This should be mostly fine but some filesystems (e.g. ext4) rely on
> > +	the ability of filesystem tools to write to mounted filesystems to
> > +	set e.g. UUID or run fsck on the root filesystem in some setups.
> 
> I'm not sure a config option is really the right thing.
> 
> I'd much prefer a BLK_OPEN_ flag to prohibit any other writer.
> Except for etN and maybe fat all file systems can set that
> unconditionally.  And for those file systems that have historically
> allowed writes to mounted file systems they can find a local way
> to decide on when and when not to set it.

Well, as I've mentioned in the changelog there are old setups (without
initrd) that run fsck on root filesystem mounted read-only and fsck
programs tend to open the device with O_RDWR. These would be broken by this
change (for the filesystems that would use BLK_OPEN_ flag). Similarly some
boot loaders can write to first sectors of the root partition while the
filesystem is mounted. So I don't think controlling the behavior by the
in-kernel user that is having the bdev exclusively open really works. It
seems to be more a property of the system setup than a property of the
in-kernel bdev user. Am I mistaken?

So I think kconfig option or sysfs tunable (maybe a per-device one?) will
be more appropriate choice? With default behavior configurable by kernel
parameter? And once set to write-protect on mount, do we allow flipping it
back? Both have advantages and disadvantages so the tunable might be
tri-state in the end (no protection, write-protect but can be turned off,
write-protect that cannot be turned off)? But maybe I'm overcomplicating
this so please share your thoughts :)

								Honza
Dave Chinner June 14, 2023, 12:26 a.m. UTC | #12
On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote:
> On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote:
> >
> > Writing to mounted devices is dangerous and can lead to filesystem
> > corruption as well as crashes. Furthermore syzbot comes with more and
> > more involved examples how to corrupt block device under a mounted
> > filesystem leading to kernel crashes and reports we can do nothing
> > about. Add config option to disallow writing to mounted (exclusively
> > open) block devices. Syzbot can use this option to avoid uninteresting
> > crashes. Also users whose userspace setup does not need writing to
> > mounted block devices can set this config option for hardening.
> 
> +syzkaller, Kees, Alexander, Eric
> 
> We can enable this on syzbot, however I have the same concerns as with
> disabling of XFS_SUPPORT_V4:
> https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278

Really?

This is exactly what I *detest most* about syzbot: the recalcitrant
maintainer who thinks their ideology is more important than any
other engineering consideration that might exist.

We want *better quality bug reports* on *current code* that we have
to *support for the foreseeable future*, not get buried under
repeated shitty reports containing yet more variants of problems we
fixed over a decade ago.

I'll repeat what Eric has already pointed out in the above GH issue
in the vain hope you'll listen this time rather than making even more
extreme ideological demands on us.

The XFS engineers put in place a planned, well documented
deprecation and removal process for V4 format support back in 2020.
We are well into that plan - we are not that far from turning it off
v4 support by default. The V4 format is legacy code and users are
already migrating away from it.

As such, it has much lower priority and relevance to us compared to
supporting v5 filesystems. The syzbot maintainers don't get to
decide how important XFS v4 format support is - that's the job of
the XFS engineers responsibile for developing XFS code and
supporting XFS users.

Because V4 has been deprecated and support is slowing down as people
migrate off it, we don't need as extensive test coverage as we once
did. i.e.  we are ramping down the validation in accordance with
it's lower priority and approaching disabling in 2025. We are
placing much more importance on validation of v5 format features and
functionality.

As such, we really don't need syzbot to be exercising v4 formats any
more - it's much more important to our users that we exercise v5
formats as extensively as possible. That is what we are asking the
syzkaller runners (and syzbot) do as a service for us.

If your ideology demands that "the only way to stop syzbot testing
XFS v4 filesytsems is to remove the code entirely" (paraphrasing
your comments from the above github issue), then the entire problem
here is your ideology.

That is, your ideology is getting in the way of practical, well
thought out, long running end-of-life engineering processes. It is
creating unnecessary load on limited resources. Further, your
demands that we place syzbot coverage (if syzbot doesn't test it, it
must depend on CONFIG_INSECURE!) above our direct responsibilities
to distro maintainers and other downstream users is, at best,
terribly misguided.

Syzbot is a *tool*. It's not even a critical tool we depend on - we
can live without it just fine. We'd really like syzbot to be a
better tool, but the ideology behind syzbot is the biggest
impediment to making it a better, more effective tool for the
community.

If syzbot maintainers won't listen to simple requests to improve
test coverage according to subsystem requirements, then it's clear
that syzbot is being driven by ideology rather than engineering
requirements. This needs to change.

-Dave.
Darrick J. Wong June 14, 2023, 1:55 a.m. UTC | #13
On Tue, Jun 13, 2023 at 01:34:48PM +0200, Jan Kara wrote:
> On Mon 12-06-23 14:52:54, Colin Walters wrote:
> > On Mon, Jun 12, 2023, at 1:39 PM, Bart Van Assche wrote:
> > > On 6/12/23 09:25, Jan Kara wrote:
> > >> On Mon 12-06-23 18:16:14, Jan Kara wrote:
> > >>> Writing to mounted devices is dangerous and can lead to filesystem
> > >>> corruption as well as crashes. Furthermore syzbot comes with more and
> > >>> more involved examples how to corrupt block device under a mounted
> > >>> filesystem leading to kernel crashes and reports we can do nothing
> > >>> about. Add config option to disallow writing to mounted (exclusively
> > >>> open) block devices. Syzbot can use this option to avoid uninteresting
> > >>> crashes. Also users whose userspace setup does not need writing to
> > >>> mounted block devices can set this config option for hardening.
> > >>>
> > >>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
> > >>> Signed-off-by: Jan Kara <jack@suse.cz>
> > >> 
> > >> Please disregard this patch. I had uncommited fixups in my tree. I'll send
> > >> fixed version shortly. I'm sorry for the noise.
> > >
> > > Have alternatives been configured to making this functionality configurable
> > > at build time only? How about a kernel command line parameter instead of a
> > > config option?
> > 
> > It's not just syzbot here; at least once in my life I accidentally did
> > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk
> > and not the target USB device.  I know I'm not alone =)
> 
> Yeah, so I'm not sure we are going to protect against this particular case.
> I mean it is not *that* uncommon to alter partition table of /dev/sda while
> /dev/sda1 is mounted. And for the kernel it is difficult to distinguish
> this and your mishap.

Honestly?

I'd love it if filesystems actually /could/ lock down the parts of block
devices they're using.  They could hand out write privileges to the open
bdev fds at the same time that a block layout lease is created, and
retract them when the lease terminates.  Areas before the fs (e.g. BIOS
boot sector) could actually be left writable by filesystems that don't
use that area; and anything beyond EOFS would still be writable (hello
lvm).  Then xfs actually /could/ prevent you from blowing away mounted
xfs filesystem.

ext4 could even still allow primary superblock writes to avoid breaking
tune2fs, or they could detect secureboot lockdown and prohibit that.

In the past, I was told to go write an LSM if I wanted XFS to protect
itself from getting nuked, but I've been too busy to learn how to do
that.  The other nastier question is blocking writes to sda when sda1 is
mounted; for that I have no response. :(

--D

> > There's a lot of similar accidental-damage protection from this.  Another
> > stronger argument here is that if one has a security policy that
> > restricts access to filesystem level objects, if a process can somehow
> > write to a mounted block device, it effectively subverts all of those
> > controls. 
> 
> Well, there are multiple levels of protection that I can think of:
> 
> 1) If user can write some image and make kernel mount it.
> 2) If user can modify device content while mounted (but not buffer cache
> of the device).
> 3) If user can modify buffer cache of the device while mounted.
> 
> 3) is the most problematic and effectively equivalent to full machine
> control (executing arbitrary code in kernel mode) these days.  For 1) and
> 2) there are reasonable protection measures the filesystem driver can take
> (and effectively you cannot escape these problems if you allow attaching
> untrusted devices such as USB sticks) so they can cause DoS but we should
> be able to prevent full machine takeover in the filesystem code.
> 
> So this patch is mainly aimed at forbiding 3).
> 
> > Right now it looks to me we're invoking devcgroup_check_permission pretty
> > early on; maybe we could extend the device cgroup stuff to have a new
> > check for write-mounted, like
> > 
> > ```
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index c994ff5b157c..f2af33c5acc1 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -6797,6 +6797,7 @@ enum {
> >  	BPF_DEVCG_ACC_MKNOD	= (1ULL << 0),
> >  	BPF_DEVCG_ACC_READ	= (1ULL << 1),
> >  	BPF_DEVCG_ACC_WRITE	= (1ULL << 2),
> > +	BPF_DEVCG_ACC_WRITE_MOUNTED	= (1ULL << 3),
> >  };
> >  
> >  enum {
> > ```
> > 
> > ?  But probably this would need to be some kind of opt-in flag to avoid
> > breaking existing bpf progs?
> > 
> > If it was configurable via the device cgroup, then it's completely
> > flexible from userspace; most specifically including supporting some
> > specially privileged processes from doing it if necessary.
> 
> I kind of like the flexibility of device cgroups but it does not seem to
> fit well with my "stop unactionable syzbot reports" usecase and doing the
> protection properly would mean that we now need to create way to approve
> access for all the tools that need this. So I'm not against this but I'd
> consider this "future extension possibility" :).
> 
> > Also, I wonder if we should also support restricting *reads* from mounted
> > block devices?
> 
> I don't see a strong usecase for this. Why would mounted vs unmounted
> matter here?
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Darrick J. Wong June 14, 2023, 2:04 a.m. UTC | #14
On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote:
> On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote:
> >
> > Writing to mounted devices is dangerous and can lead to filesystem
> > corruption as well as crashes. Furthermore syzbot comes with more and
> > more involved examples how to corrupt block device under a mounted
> > filesystem leading to kernel crashes and reports we can do nothing
> > about. Add config option to disallow writing to mounted (exclusively
> > open) block devices. Syzbot can use this option to avoid uninteresting
> > crashes. Also users whose userspace setup does not need writing to
> > mounted block devices can set this config option for hardening.
> 
> +syzkaller, Kees, Alexander, Eric
> 
> We can enable this on syzbot, however I have the same concerns as with
> disabling of XFS_SUPPORT_V4:
> https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278
> 
> It's useful to know the actual situation with respect to
> bugs/vulnerabilities and one of the goals of syzbot is surfacing this
> situation.
> For some areas there is mismatch between upstream kernel and
> downstream distros. Upstream says "this is buggy and deprecated",
> which is fine in itself if not the other part: downstream distros
> simply ignore that (maybe not even aware) and keep things enabled for
> as long as possible. Stopping testing this is moving more in this
> direction: silencing warnings and pretending that everything is fine,
> when it's not.
> 
> I wonder if there is a way to at least somehow bridge this gap.
> 
> There is CONFIG_BROKEN, but not sure if it's the right thing here.
> Maybe we add something like CONFIG_INSECURE. And such insecure config
> settings (not setting BLK_DEV_WRITE_HARDENING, setting XFS_SUPPORT_V4)
> will say:
> 
> depends on INSECURE
> 
> So that distros will need to at least acknowledge this to users by saying:
> 
> CONFIG_INSECURE=y
> 
> They are then motivated to work on actually removing dependencies on
> these deprecated things.
> 
> CONFIG_INSECURE description can say something along the lines of "this
> kernel includes subsystems with known bugs that may cause security and
> data integrity issues". When a subsystem adds "depends on INSECURE",
> the commit should list some of the known issues.
> 
> Then I see how trading disabling things on syzbot in exchange for
> "depends on INSECURE" becomes reasonable and satisfies all parties to
> some degree.

Well in that case, post a patchset adding "depends on INSECURE" for
every subsystem that syzbot files bugs against, if the maintainers do
not immediately drop what they're doing to resolve the bug.

Google extracts a bunch more unpaid labor from society to make its
owners richer, and everyone else on the planet suffers for it, just like
you all have done for the past 25 years.  That's the definition of
Googley!!

--D

> 
> Btw, if we do this it can make sense to invert this config (enable
> concurrent writes), default to 'y' and recommend 'n'.
> 
> Does it make any sense? Any other suggestions?
> 
> P.S. Alex, if this lands this may be a candidate for addition to:
> https://github.com/a13xp0p0v/kconfig-hardened-check
> (and XFS_SUPPORT_V4 as well).
> 
> 
> > Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  block/Kconfig             | 12 ++++++++++++
> >  block/bdev.c              | 10 ++++++++++
> >  include/linux/blk_types.h |  3 +++
> >  3 files changed, 25 insertions(+)
> >
> > FWIW I've tested this and my test VM with ext4 root fs boots fine and fstests
> > on ext4 seem to be also running fine with BLK_DEV_WRITE_HARDENING enabled.
> > OTOH my old VM setup which is not using initrd fails to boot with
> > BLK_DEV_WRITE_HARDENING enabled because fsck cannot open the root device
> > because the root is already mounted (read-only). Anyway this should be useful
> > for syzbot (Dmitry indicated interest in this option in the past) and maybe
> > other well controlled setups.
> >
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 86122e459fe0..c44e2238e18d 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -77,6 +77,18 @@ config BLK_DEV_INTEGRITY_T10
> >         select CRC_T10DIF
> >         select CRC64_ROCKSOFT
> >
> > +config BLK_DEV_WRITE_HARDENING
> > +       bool "Do not allow writing to mounted devices"
> > +       help
> > +       When a block device is mounted, writing to its buffer cache very likely
> > +       going to cause filesystem corruption. It is also rather easy to crash
> > +       the kernel in this way since the filesystem has no practical way of
> > +       detecting these writes to buffer cache and verifying its metadata
> > +       integrity. Select this option to disallow writing to mounted devices.
> > +       This should be mostly fine but some filesystems (e.g. ext4) rely on
> > +       the ability of filesystem tools to write to mounted filesystems to
> > +       set e.g. UUID or run fsck on the root filesystem in some setups.
> > +
> >  config BLK_DEV_ZONED
> >         bool "Zoned block device support"
> >         select MQ_IOSCHED_DEADLINE
> > diff --git a/block/bdev.c b/block/bdev.c
> > index 21c63bfef323..ad01f0a6af0d 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -602,6 +602,12 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
> >         struct gendisk *disk = bdev->bd_disk;
> >         int ret;
> >
> > +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) {
> > +               if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0)
> > +                       return -EBUSY;
> > +               if (mode & FMODE_WRITE && bdev->bd_holders > 0)
> > +                       return -EBUSY;
> > +       }
> >         if (disk->fops->open) {
> >                 ret = disk->fops->open(bdev, mode);
> >                 if (ret) {
> > @@ -617,6 +623,8 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
> >                 set_init_blocksize(bdev);
> >         if (test_bit(GD_NEED_PART_SCAN, &disk->state))
> >                 bdev_disk_changed(disk, false);
> > +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
> > +               atomic_inc(&bdev->bd_writers);
> >         atomic_inc(&bdev->bd_openers);
> >         return 0;
> >  }
> > @@ -625,6 +633,8 @@ static void blkdev_put_whole(struct block_device *bdev, fmode_t mode)
> >  {
> >         if (atomic_dec_and_test(&bdev->bd_openers))
> >                 blkdev_flush_mapping(bdev);
> > +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
> > +               atomic_dec(&bdev->bd_writers);
> >         if (bdev->bd_disk->fops->release)
> >                 bdev->bd_disk->fops->release(bdev->bd_disk, mode);
> >  }
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 740afe80f297..25af3340f316 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -67,6 +67,9 @@ struct block_device {
> >         struct partition_meta_info *bd_meta_info;
> >  #ifdef CONFIG_FAIL_MAKE_REQUEST
> >         bool                    bd_make_it_fail;
> > +#endif
> > +#ifdef CONFIG_BLK_DEV_WRITE_HARDENING
> > +       atomic_t                bd_writers;
> >  #endif
> >         /*
> >          * keep this out-of-line as it's both big and not needed in the fast
> > --
> > 2.35.3
> >
Theodore Ts'o June 14, 2023, 2:57 a.m. UTC | #15
On Tue, Jun 13, 2023 at 07:04:12PM -0700, Darrick J. Wong wrote:
> 
> Well in that case, post a patchset adding "depends on INSECURE" for
> every subsystem that syzbot files bugs against, if the maintainers do
> not immediately drop what they're doing to resolve the bug.
> 
> Google extracts a bunch more unpaid labor from society to make its
> owners richer, and everyone else on the planet suffers for it, just like
> you all have done for the past 25 years.  That's the definition of
> Googley!!

To be fair, I don't think this is the official position of Google, but
rather Dmitry's personal security ideology (as Dave put it).

Dmitry, tell you what.  If you can find a vice president inside Google
who thinks this that preventing an attacker who has the ability to
modify a block device while it is mounted, while running code under
the control of the attacker, from being to potentially trigger the
ability to run ring 0 code --- and who believes it enough to actually
**fund** a headcount to actually work these syzbot reports --- I'll
gladly help to supervise that person and mentor their ability to work
these ext4 syzbot reports.

But I think you will find that the VP's will believe that this is not
a threat that has a genuine business case which is important enough
that they are willing to fund it.  And I'm saying as an upstream
developer, *other* syzbot reports are higher priority, because in my
judgement, they are much more willing to impact real users, and are
more likely to be issues that management chain would consider higher
priority.  (Never mind that *all* of my syzbot work has been done on
my own time.)

For those of us who are working with limited resources, and doing this
work out of the kindness of our hearts, it would be nice to filter out
those syzbot reports that in our best judgement, constitute **noise**.
If there is not a good way to filter out the noise, it is likely that
upstream developers will choose to use their time working with other
tools that are better suited to getting our job done as we understand
it.

So far, there is been a lot work done by folks on your team which has
made syzbot easier for us to use, and for that, I thank you.  But your
position on forcing your ideology of which security bugs I should fix
on my own time is.... annoying.  And if others feel the same way, your
attitude is going to be counter-productive towards the goals you have
towards making Linux more secure.

Sometimes, the "best" is the enemy is the "good enough".  And in this
era of Google's "sharpened focus" or Facebook's "year of efficiency",
very often, "good enough" is all the vice presidents are willing to
fund.

Best regards,

						- Ted
Christian Brauner June 14, 2023, 7:05 a.m. UTC | #16
On Tue, Jun 13, 2023 at 01:34:48PM +0200, Jan Kara wrote:
> On Mon 12-06-23 14:52:54, Colin Walters wrote:
> > On Mon, Jun 12, 2023, at 1:39 PM, Bart Van Assche wrote:
> > > On 6/12/23 09:25, Jan Kara wrote:
> > >> On Mon 12-06-23 18:16:14, Jan Kara wrote:
> > >>> Writing to mounted devices is dangerous and can lead to filesystem
> > >>> corruption as well as crashes. Furthermore syzbot comes with more and
> > >>> more involved examples how to corrupt block device under a mounted
> > >>> filesystem leading to kernel crashes and reports we can do nothing
> > >>> about. Add config option to disallow writing to mounted (exclusively
> > >>> open) block devices. Syzbot can use this option to avoid uninteresting
> > >>> crashes. Also users whose userspace setup does not need writing to
> > >>> mounted block devices can set this config option for hardening.
> > >>>
> > >>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
> > >>> Signed-off-by: Jan Kara <jack@suse.cz>
> > >> 
> > >> Please disregard this patch. I had uncommited fixups in my tree. I'll send
> > >> fixed version shortly. I'm sorry for the noise.
> > >
> > > Have alternatives been configured to making this functionality configurable
> > > at build time only? How about a kernel command line parameter instead of a
> > > config option?
> > 
> > It's not just syzbot here; at least once in my life I accidentally did
> > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk
> > and not the target USB device.  I know I'm not alone =)
> 
> Yeah, so I'm not sure we are going to protect against this particular case.
> I mean it is not *that* uncommon to alter partition table of /dev/sda while
> /dev/sda1 is mounted. And for the kernel it is difficult to distinguish
> this and your mishap.
> 
> > There's a lot of similar accidental-damage protection from this.  Another
> > stronger argument here is that if one has a security policy that
> > restricts access to filesystem level objects, if a process can somehow
> > write to a mounted block device, it effectively subverts all of those
> > controls. 
> 
> Well, there are multiple levels of protection that I can think of:
> 
> 1) If user can write some image and make kernel mount it.
> 2) If user can modify device content while mounted (but not buffer cache
> of the device).
> 3) If user can modify buffer cache of the device while mounted.
> 
> 3) is the most problematic and effectively equivalent to full machine
> control (executing arbitrary code in kernel mode) these days.  For 1) and
> 2) there are reasonable protection measures the filesystem driver can take
> (and effectively you cannot escape these problems if you allow attaching
> untrusted devices such as USB sticks) so they can cause DoS but we should
> be able to prevent full machine takeover in the filesystem code.
> 
> So this patch is mainly aimed at forbiding 3).
> 
> > Right now it looks to me we're invoking devcgroup_check_permission pretty
> > early on; maybe we could extend the device cgroup stuff to have a new
> > check for write-mounted, like
> > 
> > ```
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index c994ff5b157c..f2af33c5acc1 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -6797,6 +6797,7 @@ enum {
> >  	BPF_DEVCG_ACC_MKNOD	= (1ULL << 0),
> >  	BPF_DEVCG_ACC_READ	= (1ULL << 1),
> >  	BPF_DEVCG_ACC_WRITE	= (1ULL << 2),
> > +	BPF_DEVCG_ACC_WRITE_MOUNTED	= (1ULL << 3),
> >  };
> >  
> >  enum {
> > ```
> > 
> > ?  But probably this would need to be some kind of opt-in flag to avoid
> > breaking existing bpf progs?
> > 
> > If it was configurable via the device cgroup, then it's completely
> > flexible from userspace; most specifically including supporting some
> > specially privileged processes from doing it if necessary.
> 
> I kind of like the flexibility of device cgroups but it does not seem to

Let's not bring in device cgroups here just yet. They're an optional LSM
security measure while your change is more fundamental which is the
right thing to do imho.
Christoph Hellwig June 14, 2023, 7:07 a.m. UTC | #17
On Wed, Jun 14, 2023 at 09:05:41AM +0200, Christian Brauner wrote:
> > I kind of like the flexibility of device cgroups but it does not seem to
> 
> Let's not bring in device cgroups here just yet. They're an optional LSM
> security measure while your change is more fundamental which is the
> right thing to do imho.

Yes.  That last thing we need is hiding fundamentally security tradeoffs
in weird optional corners of the kernel.
Christoph Hellwig June 14, 2023, 7:10 a.m. UTC | #18
On Tue, Jun 13, 2023 at 01:34:48PM +0200, Jan Kara wrote:
> > It's not just syzbot here; at least once in my life I accidentally did
> > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk
> > and not the target USB device.  I know I'm not alone =)
> 
> Yeah, so I'm not sure we are going to protect against this particular case.
> I mean it is not *that* uncommon to alter partition table of /dev/sda while
> /dev/sda1 is mounted. And for the kernel it is difficult to distinguish
> this and your mishap.

I think it is actually very easy to distinguish, because the partition
table is not mapped to any partition and certainly not an exclusively
opened one.

> 1) If user can write some image and make kernel mount it.
> 2) If user can modify device content while mounted (but not buffer cache
> of the device).
> 3) If user can modify buffer cache of the device while mounted.
> 
> 3) is the most problematic and effectively equivalent to full machine
> control (executing arbitrary code in kernel mode) these days.

If a corrupted image can trigger arbitrary code execution that also
means the file system code does not do proper input validation.

This isn't meant as an argument against protecting the write access
(which I think is good and important), but we shoudn't make this worse
than it is.
Christoph Hellwig June 14, 2023, 7:14 a.m. UTC | #19
On Tue, Jun 13, 2023 at 06:55:50PM -0700, Darrick J. Wong wrote:
> 
> I'd love it if filesystems actually /could/ lock down the parts of block
> devices they're using.  They could hand out write privileges to the open
> bdev fds at the same time that a block layout lease is created, and
> retract them when the lease terminates.  Areas before the fs (e.g. BIOS
> boot sector) could actually be left writable by filesystems that don't
> use that area; and anything beyond EOFS would still be writable (hello
> lvm).  Then xfs actually /could/ prevent you from blowing away mounted
> xfs filesystem.
> 
> ext4 could even still allow primary superblock writes to avoid breaking
> tune2fs, or they could detect secureboot lockdown and prohibit that.

Let's not overcomplicate things.  As said not allowing writes to
partitions through the whole block device is pretty trivial.  The
allowing to write into some areas of an otherwise fs owned device
(partition or whole) is just bogus.  We might have to support it for
extN (and maybe some other things) for legacy setups, but we really
need to add proper APIs for that and just disallow it for modern
setups instead of creating complex infrastructure to cater to this
fundamentally broken use case.
Christoph Hellwig June 14, 2023, 7:17 a.m. UTC | #20
On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote:
> I don't question there are use cases for the flag, but there are use
> cases for the config as well.
> 
> Some distros may want a guarantee that this does not happen as it
> compromises lockdown and kernel integrity (on par with unsigned module
> loading).
> For fuzzing systems it also may be hard to ensure fine-grained
> argument constraints, it's much easier and more reliable to prohibit
> it on config level.

I'm fine with a config option enforcing write blocking for any
BLK_OPEN_EXCL open.  Maybe the way to it is to:

 a) have an option to prevent any writes to exclusive openers, including
    a run-time version to enable it
 b) allow to also block writes without that option.  And maybe an
    opt-in to allow writes might be the better way than doing it
    the other way around.
Christoph Hellwig June 14, 2023, 7:20 a.m. UTC | #21
On Tue, Jun 13, 2023 at 10:56:14PM +0200, Jan Kara wrote:
> Well, as I've mentioned in the changelog there are old setups (without
> initrd) that run fsck on root filesystem mounted read-only and fsck
> programs tend to open the device with O_RDWR. These would be broken by this
> change (for the filesystems that would use BLK_OPEN_ flag).

But that's also a really broken setup that will corrupt data in many
cases.  So yes, maybe we need a way to allow it, but it probably would
have to be per-file system.

> Similarly some
> boot loaders can write to first sectors of the root partition while the
> filesystem is mounted. So I don't think controlling the behavior by the
> in-kernel user that is having the bdev exclusively open really works. It
> seems to be more a property of the system setup than a property of the
> in-kernel bdev user. Am I mistaken?

For many file systems this would already be completely broken (e.g.
XFS).  So allowing this is needed for legacy use cases, but again should
be limited to just file systems where this even makes sense.  And
strictly limited to legacy setups, we do need proper kernel APIs for
all of that in the future so that modern distros don't have to allow
sideband writes at all.
Christian Brauner June 14, 2023, 7:35 a.m. UTC | #22
On Tue, Jun 13, 2023 at 10:56:14PM +0200, Jan Kara wrote:
> On Mon 12-06-23 22:10:30, Christoph Hellwig wrote:
> > > +config BLK_DEV_WRITE_HARDENING
> > > +	bool "Do not allow writing to mounted devices"
> > > +	help
> > > +	When a block device is mounted, writing to its buffer cache very likely
> > > +	going to cause filesystem corruption. It is also rather easy to crash
> > > +	the kernel in this way since the filesystem has no practical way of
> > > +	detecting these writes to buffer cache and verifying its metadata
> > > +	integrity. Select this option to disallow writing to mounted devices.
> > > +	This should be mostly fine but some filesystems (e.g. ext4) rely on
> > > +	the ability of filesystem tools to write to mounted filesystems to
> > > +	set e.g. UUID or run fsck on the root filesystem in some setups.
> > 
> > I'm not sure a config option is really the right thing.
> > 
> > I'd much prefer a BLK_OPEN_ flag to prohibit any other writer.
> > Except for etN and maybe fat all file systems can set that
> > unconditionally.  And for those file systems that have historically
> > allowed writes to mounted file systems they can find a local way
> > to decide on when and when not to set it.
> 
> Well, as I've mentioned in the changelog there are old setups (without

Before going into the details: Let's please take syzbot out of the
picture as a justification or required reviewer for this patch. This is
a complete distraction imho. If the patch has the side effect that it
somehow makes for less noisy syzbot reports then so be it but it's
really not why this patch is a good idea.

For userspace this patch is immediately useful and a security mechanism
that everyone familiar with block device/filesystem attack surfaces will
want to make use of. And we should encourage this be the default
whenever possible imho. That's all the justification that we need.

> initrd) that run fsck on root filesystem mounted read-only and fsck
> programs tend to open the device with O_RDWR. These would be broken by this
> change (for the filesystems that would use BLK_OPEN_ flag). Similarly some
> boot loaders can write to first sectors of the root partition while the
> filesystem is mounted. So I don't think controlling the behavior by the
> in-kernel user that is having the bdev exclusively open really works. It
> seems to be more a property of the system setup than a property of the
> in-kernel bdev user. Am I mistaken?
> 
> So I think kconfig option or sysfs tunable (maybe a per-device one?) will
> be more appropriate choice? With default behavior configurable by kernel
> parameter? And once set to write-protect on mount, do we allow flipping it
> back? Both have advantages and disadvantages so the tunable might be
> tri-state in the end (no protection, write-protect but can be turned off,
> write-protect that cannot be turned off)? But maybe I'm overcomplicating
> this so please share your thoughts :)

A simple bool Kconfig overridable by kernel command line option is what
we want. Fundamental security relevant properties such as this should
never be runtime configurable. They should be boot and build time
configurable and then they should be off limits.
Christian Brauner June 14, 2023, 8:18 a.m. UTC | #23
On Wed, Jun 14, 2023 at 12:17:26AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote:
> > I don't question there are use cases for the flag, but there are use
> > cases for the config as well.
> > 
> > Some distros may want a guarantee that this does not happen as it
> > compromises lockdown and kernel integrity (on par with unsigned module
> > loading).
> > For fuzzing systems it also may be hard to ensure fine-grained
> > argument constraints, it's much easier and more reliable to prohibit
> > it on config level.
> 
> I'm fine with a config option enforcing write blocking for any
> BLK_OPEN_EXCL open.  Maybe the way to it is to:
> 
>  a) have an option to prevent any writes to exclusive openers, including
>     a run-time version to enable it

I really would wish we don't make this runtime configurable. Build time
and boot time yes but toggling it at runtime makes this already a lot
less interesting.
Jan Kara June 14, 2023, 10:12 a.m. UTC | #24
On Wed 14-06-23 00:10:26, Christoph Hellwig wrote:
> On Tue, Jun 13, 2023 at 01:34:48PM +0200, Jan Kara wrote:
> > > It's not just syzbot here; at least once in my life I accidentally did
> > > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk
> > > and not the target USB device.  I know I'm not alone =)
> > 
> > Yeah, so I'm not sure we are going to protect against this particular case.
> > I mean it is not *that* uncommon to alter partition table of /dev/sda while
> > /dev/sda1 is mounted. And for the kernel it is difficult to distinguish
> > this and your mishap.
> 
> I think it is actually very easy to distinguish, because the partition
> table is not mapped to any partition and certainly not an exclusively
> opened one.

Well, OK, I have not been precise :). Modifying a partition table (or LVM
description block) is impossible to distinguish from clobbering a
filesystem on open(2) time. Once we decide we implement arbitration of each
individual write(2), we can obviously stop writes to area covered by some
exclusively open partition. But then you are getting at the complexity
level of tracking used ranges of block devices which Darrick has suggested
and you didn't seem to like that (and neither do I). Furthermore the
protection is never going to be perfect as soon as loopback devices, device
mapper, and similar come into the mix (or it gets really really complex).
So I'd really prefer to stick with whatever arbitration we can perform on
open(2).

> > 1) If user can write some image and make kernel mount it.
> > 2) If user can modify device content while mounted (but not buffer cache
> > of the device).
> > 3) If user can modify buffer cache of the device while mounted.
> > 
> > 3) is the most problematic and effectively equivalent to full machine
> > control (executing arbitrary code in kernel mode) these days.
> 
> If a corrupted image can trigger arbitrary code execution that also
> means the file system code does not do proper input validation.

I agree. But case 3) is not about corrupted image - it is about userspace's
ability to corrupt data stored in the buffer cache *after* it has been
loaded from the image and verified. This is not a problem for XFS which has
its private block device cache incoherent with the buffer cache you access
when opening the bdev but basically every other filesystem suffers from
this problem.

								Honza
Jan Kara June 14, 2023, 10:36 a.m. UTC | #25
On Wed 14-06-23 10:18:16, Christian Brauner wrote:
> On Wed, Jun 14, 2023 at 12:17:26AM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote:
> > > I don't question there are use cases for the flag, but there are use
> > > cases for the config as well.
> > > 
> > > Some distros may want a guarantee that this does not happen as it
> > > compromises lockdown and kernel integrity (on par with unsigned module
> > > loading).
> > > For fuzzing systems it also may be hard to ensure fine-grained
> > > argument constraints, it's much easier and more reliable to prohibit
> > > it on config level.
> > 
> > I'm fine with a config option enforcing write blocking for any
> > BLK_OPEN_EXCL open.  Maybe the way to it is to:
> > 
> >  a) have an option to prevent any writes to exclusive openers, including
> >     a run-time version to enable it
> 
> I really would wish we don't make this runtime configurable. Build time
> and boot time yes but toggling it at runtime makes this already a lot
> less interesting.

I see your point from security POV. But if you are say a desktop (or even
server) user you may need to say resize your LVM or add partition to your
disk or install grub2 into boot sector of your partition. In all these
cases you need write access to a block device that is exclusively claimed
by someone else. Do you mandate reboot in permissive mode for all these
cases? Realistically that means such users just won't bother with the
feature and leave it disabled all the time. I'm OK with such outcome but I
wanted to point out this "no protection change after boot" policy noticably
restricts number of systems where this is applicable.

								Honza
Dmitry Vyukov June 14, 2023, 12:27 p.m. UTC | #26
On Wed, 14 Jun 2023 at 04:04, Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote:
> > On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote:
> > >
> > > Writing to mounted devices is dangerous and can lead to filesystem
> > > corruption as well as crashes. Furthermore syzbot comes with more and
> > > more involved examples how to corrupt block device under a mounted
> > > filesystem leading to kernel crashes and reports we can do nothing
> > > about. Add config option to disallow writing to mounted (exclusively
> > > open) block devices. Syzbot can use this option to avoid uninteresting
> > > crashes. Also users whose userspace setup does not need writing to
> > > mounted block devices can set this config option for hardening.
> >
> > +syzkaller, Kees, Alexander, Eric
> >
> > We can enable this on syzbot, however I have the same concerns as with
> > disabling of XFS_SUPPORT_V4:
> > https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278
> >
> > It's useful to know the actual situation with respect to
> > bugs/vulnerabilities and one of the goals of syzbot is surfacing this
> > situation.
> > For some areas there is mismatch between upstream kernel and
> > downstream distros. Upstream says "this is buggy and deprecated",
> > which is fine in itself if not the other part: downstream distros
> > simply ignore that (maybe not even aware) and keep things enabled for
> > as long as possible. Stopping testing this is moving more in this
> > direction: silencing warnings and pretending that everything is fine,
> > when it's not.
> >
> > I wonder if there is a way to at least somehow bridge this gap.
> >
> > There is CONFIG_BROKEN, but not sure if it's the right thing here.
> > Maybe we add something like CONFIG_INSECURE. And such insecure config
> > settings (not setting BLK_DEV_WRITE_HARDENING, setting XFS_SUPPORT_V4)
> > will say:
> >
> > depends on INSECURE
> >
> > So that distros will need to at least acknowledge this to users by saying:
> >
> > CONFIG_INSECURE=y
> >
> > They are then motivated to work on actually removing dependencies on
> > these deprecated things.
> >
> > CONFIG_INSECURE description can say something along the lines of "this
> > kernel includes subsystems with known bugs that may cause security and
> > data integrity issues". When a subsystem adds "depends on INSECURE",
> > the commit should list some of the known issues.
> >
> > Then I see how trading disabling things on syzbot in exchange for
> > "depends on INSECURE" becomes reasonable and satisfies all parties to
> > some degree.
>
> Well in that case, post a patchset adding "depends on INSECURE" for
> every subsystem that syzbot files bugs against, if the maintainers do
> not immediately drop what they're doing to resolve the bug.

Hi Darrick,

Open unfixed bugs are fine (for some definition of fine).
What's discussed here is different. It's not having any filed bugs at
all due to not testing a thing and then not having any visibility into
the state of things.

> Google extracts a bunch more unpaid labor from society to make its
> owners richer, and everyone else on the planet suffers for it, just like
> you all have done for the past 25 years.  That's the definition of
> Googley!!
>
> --D
>
> >
> > Btw, if we do this it can make sense to invert this config (enable
> > concurrent writes), default to 'y' and recommend 'n'.
> >
> > Does it make any sense? Any other suggestions?
> >
> > P.S. Alex, if this lands this may be a candidate for addition to:
> > https://github.com/a13xp0p0v/kconfig-hardened-check
> > (and XFS_SUPPORT_V4 as well).
> >
> >
> > > Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  block/Kconfig             | 12 ++++++++++++
> > >  block/bdev.c              | 10 ++++++++++
> > >  include/linux/blk_types.h |  3 +++
> > >  3 files changed, 25 insertions(+)
> > >
> > > FWIW I've tested this and my test VM with ext4 root fs boots fine and fstests
> > > on ext4 seem to be also running fine with BLK_DEV_WRITE_HARDENING enabled.
> > > OTOH my old VM setup which is not using initrd fails to boot with
> > > BLK_DEV_WRITE_HARDENING enabled because fsck cannot open the root device
> > > because the root is already mounted (read-only). Anyway this should be useful
> > > for syzbot (Dmitry indicated interest in this option in the past) and maybe
> > > other well controlled setups.
> > >
> > > diff --git a/block/Kconfig b/block/Kconfig
> > > index 86122e459fe0..c44e2238e18d 100644
> > > --- a/block/Kconfig
> > > +++ b/block/Kconfig
> > > @@ -77,6 +77,18 @@ config BLK_DEV_INTEGRITY_T10
> > >         select CRC_T10DIF
> > >         select CRC64_ROCKSOFT
> > >
> > > +config BLK_DEV_WRITE_HARDENING
> > > +       bool "Do not allow writing to mounted devices"
> > > +       help
> > > +       When a block device is mounted, writing to its buffer cache very likely
> > > +       going to cause filesystem corruption. It is also rather easy to crash
> > > +       the kernel in this way since the filesystem has no practical way of
> > > +       detecting these writes to buffer cache and verifying its metadata
> > > +       integrity. Select this option to disallow writing to mounted devices.
> > > +       This should be mostly fine but some filesystems (e.g. ext4) rely on
> > > +       the ability of filesystem tools to write to mounted filesystems to
> > > +       set e.g. UUID or run fsck on the root filesystem in some setups.
> > > +
> > >  config BLK_DEV_ZONED
> > >         bool "Zoned block device support"
> > >         select MQ_IOSCHED_DEADLINE
> > > diff --git a/block/bdev.c b/block/bdev.c
> > > index 21c63bfef323..ad01f0a6af0d 100644
> > > --- a/block/bdev.c
> > > +++ b/block/bdev.c
> > > @@ -602,6 +602,12 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
> > >         struct gendisk *disk = bdev->bd_disk;
> > >         int ret;
> > >
> > > +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) {
> > > +               if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0)
> > > +                       return -EBUSY;
> > > +               if (mode & FMODE_WRITE && bdev->bd_holders > 0)
> > > +                       return -EBUSY;
> > > +       }
> > >         if (disk->fops->open) {
> > >                 ret = disk->fops->open(bdev, mode);
> > >                 if (ret) {
> > > @@ -617,6 +623,8 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
> > >                 set_init_blocksize(bdev);
> > >         if (test_bit(GD_NEED_PART_SCAN, &disk->state))
> > >                 bdev_disk_changed(disk, false);
> > > +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
> > > +               atomic_inc(&bdev->bd_writers);
> > >         atomic_inc(&bdev->bd_openers);
> > >         return 0;
> > >  }
> > > @@ -625,6 +633,8 @@ static void blkdev_put_whole(struct block_device *bdev, fmode_t mode)
> > >  {
> > >         if (atomic_dec_and_test(&bdev->bd_openers))
> > >                 blkdev_flush_mapping(bdev);
> > > +       if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
> > > +               atomic_dec(&bdev->bd_writers);
> > >         if (bdev->bd_disk->fops->release)
> > >                 bdev->bd_disk->fops->release(bdev->bd_disk, mode);
> > >  }
> > > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > > index 740afe80f297..25af3340f316 100644
> > > --- a/include/linux/blk_types.h
> > > +++ b/include/linux/blk_types.h
> > > @@ -67,6 +67,9 @@ struct block_device {
> > >         struct partition_meta_info *bd_meta_info;
> > >  #ifdef CONFIG_FAIL_MAKE_REQUEST
> > >         bool                    bd_make_it_fail;
> > > +#endif
> > > +#ifdef CONFIG_BLK_DEV_WRITE_HARDENING
> > > +       atomic_t                bd_writers;
> > >  #endif
> > >         /*
> > >          * keep this out-of-line as it's both big and not needed in the fast
> > > --
> > > 2.35.3
> > >
Christian Brauner June 14, 2023, 12:48 p.m. UTC | #27
On Wed, Jun 14, 2023 at 12:36:54PM +0200, Jan Kara wrote:
> On Wed 14-06-23 10:18:16, Christian Brauner wrote:
> > On Wed, Jun 14, 2023 at 12:17:26AM -0700, Christoph Hellwig wrote:
> > > On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote:
> > > > I don't question there are use cases for the flag, but there are use
> > > > cases for the config as well.
> > > > 
> > > > Some distros may want a guarantee that this does not happen as it
> > > > compromises lockdown and kernel integrity (on par with unsigned module
> > > > loading).
> > > > For fuzzing systems it also may be hard to ensure fine-grained
> > > > argument constraints, it's much easier and more reliable to prohibit
> > > > it on config level.
> > > 
> > > I'm fine with a config option enforcing write blocking for any
> > > BLK_OPEN_EXCL open.  Maybe the way to it is to:
> > > 
> > >  a) have an option to prevent any writes to exclusive openers, including
> > >     a run-time version to enable it
> > 
> > I really would wish we don't make this runtime configurable. Build time
> > and boot time yes but toggling it at runtime makes this already a lot
> > less interesting.
> 
> I see your point from security POV. But if you are say a desktop (or even
> server) user you may need to say resize your LVM or add partition to your
> disk or install grub2 into boot sector of your partition. In all these
> cases you need write access to a block device that is exclusively claimed
> by someone else. Do you mandate reboot in permissive mode for all these
> cases? Realistically that means such users just won't bother with the
> feature and leave it disabled all the time. I'm OK with such outcome but I
> wanted to point out this "no protection change after boot" policy noticably
> restricts number of systems where this is applicable.

You're asking the hard/right questions.

Installing the boot loader into a boot sector seems like an archaic
scenario. With UEFI this isn't necessary and systems that do want this
they should turn the Kconfig off or boot with it turned off.

I'm trying to understand the partition and lvm resize issue. I've
chatted a bit about this and it seems that in this protected mode we
should ensure that we cannot write to the main block device's sectors
that are mapped to a partition block device. If you write to the main
block device of a partitioned device one should only be able to modify
the footer and header but nothing where you have a partition block
device on. That should mean you can resize an LVM partition afaict.

I've been told that the partition block devices and the main block
devices have different buffer caches. But that means you cannot mix
accesses to them because writes to one will not show up on the other
unless caches are flushed on both devices all the time.

So it'd be neat if the writes to the whole block device would simply be
not allowed at all to areas which are also exposed as partition block
devices.
Christoph Hellwig June 14, 2023, 2:30 p.m. UTC | #28
On Wed, Jun 14, 2023 at 12:12:56PM +0200, Jan Kara wrote:
> Well, OK, I have not been precise :). Modifying a partition table (or LVM
> description block) is impossible to distinguish from clobbering a
> filesystem on open(2) time. Once we decide we implement arbitration of each
> individual write(2), we can obviously stop writes to area covered by some
> exclusively open partition. But then you are getting at the complexity
> level of tracking used ranges of block devices which Darrick has suggested
> and you didn't seem to like that (and neither do I).

Well, we track these ranges in the block_devices hanging off the gendisk
anyway, so this is a totally different league.  But in the end parsing
partition tables is a little easier than parsing file system metadata
but not fundamentally different.  So if we really want to lock down
broken sideband manipulations we can't allow that either and need
in-kernel support for manipulating partition tables if that is required
at run time.
Christoph Hellwig June 14, 2023, 2:31 p.m. UTC | #29
On Wed, Jun 14, 2023 at 10:18:16AM +0200, Christian Brauner wrote:
> > I'm fine with a config option enforcing write blocking for any
> > BLK_OPEN_EXCL open.  Maybe the way to it is to:
> > 
> >  a) have an option to prevent any writes to exclusive openers, including
> >     a run-time version to enable it
> 
> I really would wish we don't make this runtime configurable. Build time
> and boot time yes but toggling it at runtime makes this already a lot
> less interesting.

With run time I really mean not compile time aka boot time.  Sorry for
not being precise.
Matthew Wilcox June 14, 2023, 2:46 p.m. UTC | #30
On Wed, Jun 14, 2023 at 12:12:56PM +0200, Jan Kara wrote:
> Well, OK, I have not been precise :). Modifying a partition table (or LVM
> description block) is impossible to distinguish from clobbering a
> filesystem on open(2) time. Once we decide we implement arbitration of each
> individual write(2), we can obviously stop writes to area covered by some
> exclusively open partition. But then you are getting at the complexity
> level of tracking used ranges of block devices which Darrick has suggested
> and you didn't seem to like that (and neither do I). Furthermore the
> protection is never going to be perfect as soon as loopback devices, device
> mapper, and similar come into the mix (or it gets really really complex).
> So I'd really prefer to stick with whatever arbitration we can perform on
> open(2).

What a shame we got rid of mandatory file locks in f7e33bdbd6d1 ;-)
Dave Chinner June 14, 2023, 11:38 p.m. UTC | #31
On Wed, Jun 14, 2023 at 02:27:46PM +0200, Dmitry Vyukov wrote:
> On Wed, 14 Jun 2023 at 04:04, Darrick J. Wong <djwong@kernel.org> wrote:
> > On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote:
> > > On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote:
> > > CONFIG_INSECURE description can say something along the lines of "this
> > > kernel includes subsystems with known bugs that may cause security and
> > > data integrity issues". When a subsystem adds "depends on INSECURE",
> > > the commit should list some of the known issues.
> > >
> > > Then I see how trading disabling things on syzbot in exchange for
> > > "depends on INSECURE" becomes reasonable and satisfies all parties to
> > > some degree.
> >
> > Well in that case, post a patchset adding "depends on INSECURE" for
> > every subsystem that syzbot files bugs against, if the maintainers do
> > not immediately drop what they're doing to resolve the bug.
> 
> Hi Darrick,
> 
> Open unfixed bugs are fine (for some definition of fine).
> What's discussed here is different. It's not having any filed bugs at
> all due to not testing a thing and then not having any visibility into
> the state of things.

Just because syzbot doesn't test something, it does not mean the
code is not tested, nor does it mean the developers who are
responsible for the code have no visibility into the state of their
code.

The reason they want to avoid this sort of corruption injection
testing in syzbot is that it *does not provide a net benefit* to
anyone. The number (and value) of real bugs it might find are vastly
outweighed by the cost of filtering out the many, many false
positives the testing methodology raises.

Keep in mind that syzbot does not provide useful unit and functional
test coverage. We have to run tests suites like fstests to provide
this sort of functionality and visibility into the *correct
operation of the code*.

However, alongside all the unit/functional tests in fstests, we also
have non-deterministic stress and fuzzer tests that are similar in
nature to syzbot. They often flush out weird integration level bugs
before we even get to merging the code. These non-deterministic
stress tests in fstests have found *hundreds* of bugs over the
*couple of decades* we have been running them, and they also have a
history of uncovering entire new classes of bugs we've had to
address.

At this point, syzbot is yet to do prove it is more than a one-trick
pony - it typically only finds a single class of filesystem bug.
That is, it only finds bugs that are related to undetected physical
structure corruption of the filesystem that result in macro level
failures (crash, warn, hang).

Syzbot does nothing to ensure correct behaviour is occuring, that
data integrity is maintained by the filesystem, that crash recovery
after failures works correctly, etc. These things are *by far* the
most important things we have to ensure during filesystem
development.

IOWs, the sorts of problems that syzbot finds in filesystems are way
down the list of important things we need to validate.  Yes,
structural validation testing is something we should be
running, and it's clear that is does get run (both from fstests and
syzbot).

Hence the claim that "because syzbot doesn't run we don't have
visibility of code bugs" is naive, conceited, incredibly
narcissistic and demonstratable false. It also indicates a very
poor understanding of where syzbot actually fits into the overall
engineering processes.

-Dave.
Dmitry Vyukov June 15, 2023, 9:14 a.m. UTC | #32
On Thu, 15 Jun 2023 at 01:38, Dave Chinner <david@fromorbit.com> wrote:
> > > > CONFIG_INSECURE description can say something along the lines of "this
> > > > kernel includes subsystems with known bugs that may cause security and
> > > > data integrity issues". When a subsystem adds "depends on INSECURE",
> > > > the commit should list some of the known issues.
> > > >
> > > > Then I see how trading disabling things on syzbot in exchange for
> > > > "depends on INSECURE" becomes reasonable and satisfies all parties to
> > > > some degree.
> > >
> > > Well in that case, post a patchset adding "depends on INSECURE" for
> > > every subsystem that syzbot files bugs against, if the maintainers do
> > > not immediately drop what they're doing to resolve the bug.
> >
> > Hi Darrick,
> >
> > Open unfixed bugs are fine (for some definition of fine).
> > What's discussed here is different. It's not having any filed bugs at
> > all due to not testing a thing and then not having any visibility into
> > the state of things.
>
> Just because syzbot doesn't test something, it does not mean the
> code is not tested, nor does it mean the developers who are
> responsible for the code have no visibility into the state of their
> code.
>
> The reason they want to avoid this sort of corruption injection
> testing in syzbot is that it *does not provide a net benefit* to
> anyone. The number (and value) of real bugs it might find are vastly
> outweighed by the cost of filtering out the many, many false
> positives the testing methodology raises.
>
> Keep in mind that syzbot does not provide useful unit and functional
> test coverage. We have to run tests suites like fstests to provide
> this sort of functionality and visibility into the *correct
> operation of the code*.
>
> However, alongside all the unit/functional tests in fstests, we also
> have non-deterministic stress and fuzzer tests that are similar in
> nature to syzbot. They often flush out weird integration level bugs
> before we even get to merging the code. These non-deterministic
> stress tests in fstests have found *hundreds* of bugs over the
> *couple of decades* we have been running them, and they also have a
> history of uncovering entire new classes of bugs we've had to
> address.
>
> At this point, syzbot is yet to do prove it is more than a one-trick
> pony - it typically only finds a single class of filesystem bug.
> That is, it only finds bugs that are related to undetected physical
> structure corruption of the filesystem that result in macro level
> failures (crash, warn, hang).
>
> Syzbot does nothing to ensure correct behaviour is occuring, that
> data integrity is maintained by the filesystem, that crash recovery
> after failures works correctly, etc. These things are *by far* the
> most important things we have to ensure during filesystem
> development.
>
> IOWs, the sorts of problems that syzbot finds in filesystems are way
> down the list of important things we need to validate.  Yes,
> structural validation testing is something we should be
> running, and it's clear that is does get run (both from fstests and
> syzbot).
>
> Hence the claim that "because syzbot doesn't run we don't have
> visibility of code bugs" is naive, conceited, incredibly
> narcissistic and demonstratable false. It also indicates a very
> poor understanding of where syzbot actually fits into the overall
> engineering processes.

Hi Dave, Ted,

We are currently looking into options of how to satisfy all parties.

I am not saying that all of these bugs need to be fixed, nor that they
are more important than bugs in supported parts. And we are very much
interested in testing supported parts as well as we can do.

By CONFIG_INSECURE I just meant something similar to kernel taint
bits. A user is free to continue after any bad thing has happened/they
did, but some warranties are void. And if a kernel developer receives
a bug report on a tainted kernel, they will take it with a grain of
salt. So it's important to note the fact and inform about it.
Something similar here: bugs in deprecated parts do not need to be
fixed, and distros are still free to enable them, but this fact is
acknowledged by distros and made visible to users.

But we are looking into other options that won't require even CONFIG_INSECURE.
Jan Kara June 15, 2023, 2:39 p.m. UTC | #33
On Wed 14-06-23 14:48:17, Christian Brauner wrote:
> On Wed, Jun 14, 2023 at 12:36:54PM +0200, Jan Kara wrote:
> > On Wed 14-06-23 10:18:16, Christian Brauner wrote:
> > > On Wed, Jun 14, 2023 at 12:17:26AM -0700, Christoph Hellwig wrote:
> > > > On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote:
> > > > > I don't question there are use cases for the flag, but there are use
> > > > > cases for the config as well.
> > > > > 
> > > > > Some distros may want a guarantee that this does not happen as it
> > > > > compromises lockdown and kernel integrity (on par with unsigned module
> > > > > loading).
> > > > > For fuzzing systems it also may be hard to ensure fine-grained
> > > > > argument constraints, it's much easier and more reliable to prohibit
> > > > > it on config level.
> > > > 
> > > > I'm fine with a config option enforcing write blocking for any
> > > > BLK_OPEN_EXCL open.  Maybe the way to it is to:
> > > > 
> > > >  a) have an option to prevent any writes to exclusive openers, including
> > > >     a run-time version to enable it
> > > 
> > > I really would wish we don't make this runtime configurable. Build time
> > > and boot time yes but toggling it at runtime makes this already a lot
> > > less interesting.
> > 
> > I see your point from security POV. But if you are say a desktop (or even
> > server) user you may need to say resize your LVM or add partition to your
> > disk or install grub2 into boot sector of your partition. In all these
> > cases you need write access to a block device that is exclusively claimed
> > by someone else. Do you mandate reboot in permissive mode for all these
> > cases? Realistically that means such users just won't bother with the
> > feature and leave it disabled all the time. I'm OK with such outcome but I
> > wanted to point out this "no protection change after boot" policy noticably
> > restricts number of systems where this is applicable.
> 
> You're asking the hard/right questions.
> 
> Installing the boot loader into a boot sector seems like an archaic
> scenario. With UEFI this isn't necessary and systems that do want this
> they should turn the Kconfig off or boot with it turned off.

OK.
 
> I'm trying to understand the partition and lvm resize issue. I've
> chatted a bit about this and it seems that in this protected mode we
> should ensure that we cannot write to the main block device's sectors
> that are mapped to a partition block device. If you write to the main
> block device of a partitioned device one should only be able to modify
> the footer and header but nothing where you have a partition block
> device on. That should mean you can resize an LVM partition afaict.
> 
> I've been told that the partition block devices and the main block
> devices have different buffer caches. But that means you cannot mix
> accesses to them because writes to one will not show up on the other
> unless caches are flushed on both devices all the time.
> 
> So it'd be neat if the writes to the whole block device would simply be
> not allowed at all to areas which are also exposed as partition block
> devices.

So you're touching very similar areas I've replied to elsewhere [1] in this
thread. But let me go into more details here:

1) Disallowing modifications to area covered by some partition through the
main block device is doable but it means arbitration needs to happen on
each write (or page fault ???) which has performance implications. 

2) With LVM, device mapper exclusively claims the underlying device, places
header / footer to it and then exposes the rest of the device as another
block device (possibly after further block remapping games). Loop device
can do the same. With LVM you can also have multiple block devices mapping
to the same underlying block device and possibly overlapping ranges
(userspace fully controls this), in fact partitioning on top of RAID or
multipath devices is done this way AFAIK.

So tracking all these ranges, how they are remapped and what is actually
used by whom seems like a really complex task that would need to propagate
the information through multiple layers.

So I would backtrack a bit and maybe ask what is the threat model and what
protection you'd like to achieve? Because as you mentioned above, each
block device has a separate (incoherent) buffer cache. That is also a good
thing because protecting "filesystem's" buffer cache from corruption means
we need to just protect that one block device when the filesystem is using
it to avoid the worst problems. Writes to the same physical location
through different block devices will corrupt the data "at rest" but the
filesystem should do enough verification to catch this when loading the
data into the buffer cache. So we could make blkdev_get_by_dev() calls when
mounting filesystems disallow any other FMODE_WRITE opens of the block
device and that should still allow modifying partition tables or LVM setup.
It will break legacy boot setups & tuning ext4 parameters while the
filesystem is mounted but I'm fine with just saying "don't use this feature
if you do this". What do you think?

								Honza

[1] https://lore.kernel.org/all/20230614101256.kgnui242k72lmp4e@quack3
Dave Chinner June 18, 2023, 11:35 p.m. UTC | #34
On Thu, Jun 15, 2023 at 11:14:53AM +0200, Dmitry Vyukov wrote:
> On Thu, 15 Jun 2023 at 01:38, Dave Chinner <david@fromorbit.com> wrote:
> > Hence the claim that "because syzbot doesn't run we don't have
> > visibility of code bugs" is naive, conceited, incredibly
> > narcissistic and demonstratable false. It also indicates a very
> > poor understanding of where syzbot actually fits into the overall
> > engineering processes.
> 
> Hi Dave, Ted,
> 
> We are currently looking into options of how to satisfy all parties.
> 
> I am not saying that all of these bugs need to be fixed, nor that they
> are more important than bugs in supported parts. And we are very much
> interested in testing supported parts as well as we can do.
> 
> By CONFIG_INSECURE I just meant something similar to kernel taint
> bits.

How is that any better?  Who gets to decide what sets
this taint? Subsystem maintainers?

> A user is free to continue after any bad thing has happened/they
> did, but some warranties are void. And if a kernel developer receives
> a bug report on a tainted kernel, they will take it with a grain of
> salt. So it's important to note the fact and inform about it.
> Something similar here: bugs in deprecated parts do not need to be
> fixed, and distros are still free to enable them, but this fact is
> acknowledged by distros and made visible to users.

"Deprecated" does not mean *unmaintained*. They are two completely
different things, and conflating the two does not help anyone.

You are talking about marking unmaintained code with a taint, not
deprecated code. reiserfs is unmaintained code. ntfs3 is
unmaintained code. hfs is unmaintained code. There are several other
unmaintained filesystems that don't have active maintainers.  Bug
reports never, ever get looked at, etc. Sure, there's an argument to
taint them.

But whilst XFS v4 is deprecated, it is still very much maintained.
We encourage people to move off V4, we focus less on it, but if bug
reports from users come in, we still fix them. So even if you
introduce some "unmaintained" taint for the kernel, you aren't going
to see it get set for the XFS V4 format.

> But we are looking into other options that won't require even CONFIG_INSECURE.

Who is this nebulous "we"?

-Dave.
Jan Kara June 20, 2023, 10:41 a.m. UTC | #35
Hello Christoph!

On Wed 14-06-23 00:20:12, Christoph Hellwig wrote:
> On Tue, Jun 13, 2023 at 10:56:14PM +0200, Jan Kara wrote:
> > Well, as I've mentioned in the changelog there are old setups (without
> > initrd) that run fsck on root filesystem mounted read-only and fsck
> > programs tend to open the device with O_RDWR. These would be broken by this
> > change (for the filesystems that would use BLK_OPEN_ flag).
> 
> But that's also a really broken setup that will corrupt data in many
> cases.  So yes, maybe we need a way to allow it, but it probably would
> have to be per-file system.

I was looking into implementing the write hardening support and I've come
across the following obstacle: Your patch series that is in linux-block.git
removes the 'mode' argument from blkdev_put() which makes it impossible to
track how many writers there are for the block device. This is needed so
that we can check whether the filesystem is safe when mounting the device.

I can see several solutions but since you've just reworked the code and I'm
not 100% certain about the motivation, I figured I'll ask you first before
spending significant time on something you won't like:

1) Just return the mode argument to blkdev_put().

2) Only pass to blkdev_put() whether we have write access or not as a
separate argument.

3) Don't track number of opens for writing, instead check whether writes
are blocked on each write access. I think this has a number of downsides
but I mention it for completeness. One problem is we have to add checks to
multiple places (buffered IO, direct IO) and existing mmap in particular
will be very hard to deal with (need to add page_mkwrite() handler). All
these checks add performance overhead. It is practically impossible
(without significant performance overhead or new percpu datastructures) to
properly synchronize open that wants to block writers against already
running writes.

So what would you prefer? Thanks in advance for your input.

								Honza
Christoph Hellwig June 20, 2023, 11:29 a.m. UTC | #36
On Tue, Jun 20, 2023 at 12:41:11PM +0200, Jan Kara wrote:
> I can see several solutions but since you've just reworked the code and I'm
> not 100% certain about the motivation, I figured I'll ask you first before
> spending significant time on something you won't like:
> 
> 1) Just return the mode argument to blkdev_put().
> 
> 2) Only pass to blkdev_put() whether we have write access or not as a
> separate argument.
> 
> 3) Don't track number of opens for writing, instead check whether writes
> are blocked on each write access. I think this has a number of downsides
> but I mention it for completeness. One problem is we have to add checks to
> multiple places (buffered IO, direct IO) and existing mmap in particular
> will be very hard to deal with (need to add page_mkwrite() handler). All
> these checks add performance overhead. It is practically impossible
> (without significant performance overhead or new percpu datastructures) to
> properly synchronize open that wants to block writers against already
> running writes.

I think 3 is out for the reasons you mention.  2 feels a bit ugly,
but so does 1 given that we only really need the write flag.  One thing
I through about earlier but decided was overkill at that time is to
return a cookie from blkdev_get_* that needs to be passed back to
blkdev_put.  That could either be opaque to the caller, or replace the
bdev ala:

struct bdev_handle {
	struct block_device *bdev;
	void *holder;
	bool for_write;
};

Given that fixups we needed to pass the right holder back in, it feels
like that would be the least error prone API, even if it is a fair
amount of churn.
diff mbox series

Patch

diff --git a/block/Kconfig b/block/Kconfig
index 86122e459fe0..c44e2238e18d 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -77,6 +77,18 @@  config BLK_DEV_INTEGRITY_T10
 	select CRC_T10DIF
 	select CRC64_ROCKSOFT
 
+config BLK_DEV_WRITE_HARDENING
+	bool "Do not allow writing to mounted devices"
+	help
+	When a block device is mounted, writing to its buffer cache very likely
+	going to cause filesystem corruption. It is also rather easy to crash
+	the kernel in this way since the filesystem has no practical way of
+	detecting these writes to buffer cache and verifying its metadata
+	integrity. Select this option to disallow writing to mounted devices.
+	This should be mostly fine but some filesystems (e.g. ext4) rely on
+	the ability of filesystem tools to write to mounted filesystems to
+	set e.g. UUID or run fsck on the root filesystem in some setups.
+
 config BLK_DEV_ZONED
 	bool "Zoned block device support"
 	select MQ_IOSCHED_DEADLINE
diff --git a/block/bdev.c b/block/bdev.c
index 21c63bfef323..ad01f0a6af0d 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -602,6 +602,12 @@  static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
 	struct gendisk *disk = bdev->bd_disk;
 	int ret;
 
+	if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) {
+		if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0)
+			return -EBUSY;
+		if (mode & FMODE_WRITE && bdev->bd_holders > 0)
+			return -EBUSY;
+	}
 	if (disk->fops->open) {
 		ret = disk->fops->open(bdev, mode);
 		if (ret) {
@@ -617,6 +623,8 @@  static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
 		set_init_blocksize(bdev);
 	if (test_bit(GD_NEED_PART_SCAN, &disk->state))
 		bdev_disk_changed(disk, false);
+	if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
+		atomic_inc(&bdev->bd_writers);
 	atomic_inc(&bdev->bd_openers);
 	return 0;
 }
@@ -625,6 +633,8 @@  static void blkdev_put_whole(struct block_device *bdev, fmode_t mode)
 {
 	if (atomic_dec_and_test(&bdev->bd_openers))
 		blkdev_flush_mapping(bdev);
+	if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE)
+		atomic_dec(&bdev->bd_writers);
 	if (bdev->bd_disk->fops->release)
 		bdev->bd_disk->fops->release(bdev->bd_disk, mode);
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 740afe80f297..25af3340f316 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -67,6 +67,9 @@  struct block_device {
 	struct partition_meta_info *bd_meta_info;
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	bool			bd_make_it_fail;
+#endif
+#ifdef CONFIG_BLK_DEV_WRITE_HARDENING
+	atomic_t		bd_writers;
 #endif
 	/*
 	 * keep this out-of-line as it's both big and not needed in the fast