Message ID | 20230612161614.10302-1-jack@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block: Add config option to not allow writing to mounted devices | expand |
On Mon 12-06-23 18:16:14, Jan Kara wrote: > Writing to mounted devices is dangerous and can lead to filesystem > corruption as well as crashes. Furthermore syzbot comes with more and > more involved examples how to corrupt block device under a mounted > filesystem leading to kernel crashes and reports we can do nothing > about. Add config option to disallow writing to mounted (exclusively > open) block devices. Syzbot can use this option to avoid uninteresting > crashes. Also users whose userspace setup does not need writing to > mounted block devices can set this config option for hardening. > > Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org > Signed-off-by: Jan Kara <jack@suse.cz> Please disregard this patch. I had uncommited fixups in my tree. I'll send fixed version shortly. I'm sorry for the noise. Honza
On 6/12/23 09:25, Jan Kara wrote: > On Mon 12-06-23 18:16:14, Jan Kara wrote: >> Writing to mounted devices is dangerous and can lead to filesystem >> corruption as well as crashes. Furthermore syzbot comes with more and >> more involved examples how to corrupt block device under a mounted >> filesystem leading to kernel crashes and reports we can do nothing >> about. Add config option to disallow writing to mounted (exclusively >> open) block devices. Syzbot can use this option to avoid uninteresting >> crashes. Also users whose userspace setup does not need writing to >> mounted block devices can set this config option for hardening. >> >> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org >> Signed-off-by: Jan Kara <jack@suse.cz> > > Please disregard this patch. I had uncommited fixups in my tree. I'll send > fixed version shortly. I'm sorry for the noise. Have alternatives been configured to making this functionality configurable at build time only? How about a kernel command line parameter instead of a config option? Thanks, Bart.
On Mon, Jun 12, 2023 at 10:39:51AM -0700, Bart Van Assche wrote: > > > Writing to mounted devices is dangerous and can lead to filesystem > > > corruption as well as crashes. Furthermore syzbot comes with more and > > > more involved examples how to corrupt block device under a mounted > > > filesystem leading to kernel crashes and reports we can do nothing > > > about. Add config option to disallow writing to mounted (exclusively > > > open) block devices. Syzbot can use this option to avoid uninteresting > > > crashes. Also users whose userspace setup does not need writing to > > > mounted block devices can set this config option for hardening. > > Have alternatives been configured to making this functionality > configurable at build time only? How about a kernel command line > parameter instead of a config option? I could imagine wanting a config option which changes the default, as well as a way of setting the parameter on the command line so that users of distro kernel can change the parameter value. That's especially since it might be useful for more than just reining in syzbot reports. - Ted
On Mon, Jun 12, 2023, at 1:39 PM, Bart Van Assche wrote: > On 6/12/23 09:25, Jan Kara wrote: >> On Mon 12-06-23 18:16:14, Jan Kara wrote: >>> Writing to mounted devices is dangerous and can lead to filesystem >>> corruption as well as crashes. Furthermore syzbot comes with more and >>> more involved examples how to corrupt block device under a mounted >>> filesystem leading to kernel crashes and reports we can do nothing >>> about. Add config option to disallow writing to mounted (exclusively >>> open) block devices. Syzbot can use this option to avoid uninteresting >>> crashes. Also users whose userspace setup does not need writing to >>> mounted block devices can set this config option for hardening. >>> >>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org >>> Signed-off-by: Jan Kara <jack@suse.cz> >> >> Please disregard this patch. I had uncommited fixups in my tree. I'll send >> fixed version shortly. I'm sorry for the noise. > > Have alternatives been configured to making this functionality configurable > at build time only? How about a kernel command line parameter instead of a > config option? It's not just syzbot here; at least once in my life I accidentally did `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk and not the target USB device. I know I'm not alone =) There's a lot of similar accidental-damage protection from this. Another stronger argument here is that if one has a security policy that restricts access to filesystem level objects, if a process can somehow write to a mounted block device, it effectively subverts all of those controls. Right now it looks to me we're invoking devcgroup_check_permission pretty early on; maybe we could extend the device cgroup stuff to have a new check for write-mounted, like ``` diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index c994ff5b157c..f2af33c5acc1 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -6797,6 +6797,7 @@ enum { BPF_DEVCG_ACC_MKNOD = (1ULL << 0), BPF_DEVCG_ACC_READ = (1ULL << 1), BPF_DEVCG_ACC_WRITE = (1ULL << 2), + BPF_DEVCG_ACC_WRITE_MOUNTED = (1ULL << 3), }; enum { ``` ? But probably this would need to be some kind of opt-in flag to avoid breaking existing bpf progs? If it was configurable via the device cgroup, then it's completely flexible from userspace; most specifically including supporting some specially privileged processes from doing it if necessary. Also, I wonder if we should also support restricting *reads* from mounted block devices?
Hi Jan, kernel test robot noticed the following build errors: [auto build test ERROR on v6.4-rc6] [also build test ERROR on linus/master next-20230609] [cannot apply to axboe-block/for-next] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/block-Add-config-option-to-not-allow-writing-to-mounted-devices/20230613-001910 base: v6.4-rc6 patch link: https://lore.kernel.org/r/20230612161614.10302-1-jack%40suse.cz patch subject: [PATCH] block: Add config option to not allow writing to mounted devices config: arc-randconfig-r043-20230612 (https://download.01.org/0day-ci/archive/20230613/202306131212.dPssLmeY-lkp@intel.com/config) compiler: arc-elf-gcc (GCC) 12.3.0 reproduce (this is a W=1 build): mkdir -p ~/bin wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout v6.4-rc6 b4 shazam https://lore.kernel.org/r/20230612161614.10302-1-jack@suse.cz # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=arc olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202306131212.dPssLmeY-lkp@intel.com/ All errors (new ones prefixed by >>): block/bdev.c: In function 'blkdev_get_whole': >> block/bdev.c:606:59: error: 'struct block_device' has no member named 'bd_writers' 606 | if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0) | ^~ block/bdev.c:627:33: error: 'struct block_device' has no member named 'bd_writers' 627 | atomic_inc(&bdev->bd_writers); | ^~ block/bdev.c: In function 'blkdev_put_whole': block/bdev.c:637:33: error: 'struct block_device' has no member named 'bd_writers' 637 | atomic_dec(&bdev->bd_writers); | ^~ vim +606 block/bdev.c 599 600 static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) 601 { 602 struct gendisk *disk = bdev->bd_disk; 603 int ret; 604 605 if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) { > 606 if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0) 607 return -EBUSY; 608 if (mode & FMODE_WRITE && bdev->bd_holders > 0) 609 return -EBUSY; 610 } 611 if (disk->fops->open) { 612 ret = disk->fops->open(bdev, mode); 613 if (ret) { 614 /* avoid ghost partitions on a removed medium */ 615 if (ret == -ENOMEDIUM && 616 test_bit(GD_NEED_PART_SCAN, &disk->state)) 617 bdev_disk_changed(disk, true); 618 return ret; 619 } 620 } 621 622 if (!atomic_read(&bdev->bd_openers)) 623 set_init_blocksize(bdev); 624 if (test_bit(GD_NEED_PART_SCAN, &disk->state)) 625 bdev_disk_changed(disk, false); 626 if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) 627 atomic_inc(&bdev->bd_writers); 628 atomic_inc(&bdev->bd_openers); 629 return 0; 630 } 631
> +config BLK_DEV_WRITE_HARDENING > + bool "Do not allow writing to mounted devices" > + help > + When a block device is mounted, writing to its buffer cache very likely > + going to cause filesystem corruption. It is also rather easy to crash > + the kernel in this way since the filesystem has no practical way of > + detecting these writes to buffer cache and verifying its metadata > + integrity. Select this option to disallow writing to mounted devices. > + This should be mostly fine but some filesystems (e.g. ext4) rely on > + the ability of filesystem tools to write to mounted filesystems to > + set e.g. UUID or run fsck on the root filesystem in some setups. I'm not sure a config option is really the right thing. I'd much prefer a BLK_OPEN_ flag to prohibit any other writer. Except for etN and maybe fat all file systems can set that unconditionally. And for those file systems that have historically allowed writes to mounted file systems they can find a local way to decide on when and when not to set it.
On Tue, 13 Jun 2023 at 07:10, Christoph Hellwig <hch@infradead.org> wrote: > > > +config BLK_DEV_WRITE_HARDENING > > + bool "Do not allow writing to mounted devices" > > + help > > + When a block device is mounted, writing to its buffer cache very likely > > + going to cause filesystem corruption. It is also rather easy to crash > > + the kernel in this way since the filesystem has no practical way of > > + detecting these writes to buffer cache and verifying its metadata > > + integrity. Select this option to disallow writing to mounted devices. > > + This should be mostly fine but some filesystems (e.g. ext4) rely on > > + the ability of filesystem tools to write to mounted filesystems to > > + set e.g. UUID or run fsck on the root filesystem in some setups. > > I'm not sure a config option is really the right thing. > > I'd much prefer a BLK_OPEN_ flag to prohibit any other writer. > Except for etN and maybe fat all file systems can set that > unconditionally. And for those file systems that have historically > allowed writes to mounted file systems they can find a local way > to decide on when and when not to set it. I don't question there are use cases for the flag, but there are use cases for the config as well. Some distros may want a guarantee that this does not happen as it compromises lockdown and kernel integrity (on par with unsigned module loading). For fuzzing systems it also may be hard to ensure fine-grained argument constraints, it's much easier and more reliable to prohibit it on config level.
On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote: > > Writing to mounted devices is dangerous and can lead to filesystem > corruption as well as crashes. Furthermore syzbot comes with more and > more involved examples how to corrupt block device under a mounted > filesystem leading to kernel crashes and reports we can do nothing > about. Add config option to disallow writing to mounted (exclusively > open) block devices. Syzbot can use this option to avoid uninteresting > crashes. Also users whose userspace setup does not need writing to > mounted block devices can set this config option for hardening. +syzkaller, Kees, Alexander, Eric We can enable this on syzbot, however I have the same concerns as with disabling of XFS_SUPPORT_V4: https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278 It's useful to know the actual situation with respect to bugs/vulnerabilities and one of the goals of syzbot is surfacing this situation. For some areas there is mismatch between upstream kernel and downstream distros. Upstream says "this is buggy and deprecated", which is fine in itself if not the other part: downstream distros simply ignore that (maybe not even aware) and keep things enabled for as long as possible. Stopping testing this is moving more in this direction: silencing warnings and pretending that everything is fine, when it's not. I wonder if there is a way to at least somehow bridge this gap. There is CONFIG_BROKEN, but not sure if it's the right thing here. Maybe we add something like CONFIG_INSECURE. And such insecure config settings (not setting BLK_DEV_WRITE_HARDENING, setting XFS_SUPPORT_V4) will say: depends on INSECURE So that distros will need to at least acknowledge this to users by saying: CONFIG_INSECURE=y They are then motivated to work on actually removing dependencies on these deprecated things. CONFIG_INSECURE description can say something along the lines of "this kernel includes subsystems with known bugs that may cause security and data integrity issues". When a subsystem adds "depends on INSECURE", the commit should list some of the known issues. Then I see how trading disabling things on syzbot in exchange for "depends on INSECURE" becomes reasonable and satisfies all parties to some degree. Btw, if we do this it can make sense to invert this config (enable concurrent writes), default to 'y' and recommend 'n'. Does it make any sense? Any other suggestions? P.S. Alex, if this lands this may be a candidate for addition to: https://github.com/a13xp0p0v/kconfig-hardened-check (and XFS_SUPPORT_V4 as well). > Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org > Signed-off-by: Jan Kara <jack@suse.cz> > --- > block/Kconfig | 12 ++++++++++++ > block/bdev.c | 10 ++++++++++ > include/linux/blk_types.h | 3 +++ > 3 files changed, 25 insertions(+) > > FWIW I've tested this and my test VM with ext4 root fs boots fine and fstests > on ext4 seem to be also running fine with BLK_DEV_WRITE_HARDENING enabled. > OTOH my old VM setup which is not using initrd fails to boot with > BLK_DEV_WRITE_HARDENING enabled because fsck cannot open the root device > because the root is already mounted (read-only). Anyway this should be useful > for syzbot (Dmitry indicated interest in this option in the past) and maybe > other well controlled setups. > > diff --git a/block/Kconfig b/block/Kconfig > index 86122e459fe0..c44e2238e18d 100644 > --- a/block/Kconfig > +++ b/block/Kconfig > @@ -77,6 +77,18 @@ config BLK_DEV_INTEGRITY_T10 > select CRC_T10DIF > select CRC64_ROCKSOFT > > +config BLK_DEV_WRITE_HARDENING > + bool "Do not allow writing to mounted devices" > + help > + When a block device is mounted, writing to its buffer cache very likely > + going to cause filesystem corruption. It is also rather easy to crash > + the kernel in this way since the filesystem has no practical way of > + detecting these writes to buffer cache and verifying its metadata > + integrity. Select this option to disallow writing to mounted devices. > + This should be mostly fine but some filesystems (e.g. ext4) rely on > + the ability of filesystem tools to write to mounted filesystems to > + set e.g. UUID or run fsck on the root filesystem in some setups. > + > config BLK_DEV_ZONED > bool "Zoned block device support" > select MQ_IOSCHED_DEADLINE > diff --git a/block/bdev.c b/block/bdev.c > index 21c63bfef323..ad01f0a6af0d 100644 > --- a/block/bdev.c > +++ b/block/bdev.c > @@ -602,6 +602,12 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) > struct gendisk *disk = bdev->bd_disk; > int ret; > > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) { > + if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0) > + return -EBUSY; > + if (mode & FMODE_WRITE && bdev->bd_holders > 0) > + return -EBUSY; > + } > if (disk->fops->open) { > ret = disk->fops->open(bdev, mode); > if (ret) { > @@ -617,6 +623,8 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) > set_init_blocksize(bdev); > if (test_bit(GD_NEED_PART_SCAN, &disk->state)) > bdev_disk_changed(disk, false); > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) > + atomic_inc(&bdev->bd_writers); > atomic_inc(&bdev->bd_openers); > return 0; > } > @@ -625,6 +633,8 @@ static void blkdev_put_whole(struct block_device *bdev, fmode_t mode) > { > if (atomic_dec_and_test(&bdev->bd_openers)) > blkdev_flush_mapping(bdev); > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) > + atomic_dec(&bdev->bd_writers); > if (bdev->bd_disk->fops->release) > bdev->bd_disk->fops->release(bdev->bd_disk, mode); > } > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > index 740afe80f297..25af3340f316 100644 > --- a/include/linux/blk_types.h > +++ b/include/linux/blk_types.h > @@ -67,6 +67,9 @@ struct block_device { > struct partition_meta_info *bd_meta_info; > #ifdef CONFIG_FAIL_MAKE_REQUEST > bool bd_make_it_fail; > +#endif > +#ifdef CONFIG_BLK_DEV_WRITE_HARDENING > + atomic_t bd_writers; > #endif > /* > * keep this out-of-line as it's both big and not needed in the fast > -- > 2.35.3 >
On Mon 12-06-23 14:52:54, Colin Walters wrote: > On Mon, Jun 12, 2023, at 1:39 PM, Bart Van Assche wrote: > > On 6/12/23 09:25, Jan Kara wrote: > >> On Mon 12-06-23 18:16:14, Jan Kara wrote: > >>> Writing to mounted devices is dangerous and can lead to filesystem > >>> corruption as well as crashes. Furthermore syzbot comes with more and > >>> more involved examples how to corrupt block device under a mounted > >>> filesystem leading to kernel crashes and reports we can do nothing > >>> about. Add config option to disallow writing to mounted (exclusively > >>> open) block devices. Syzbot can use this option to avoid uninteresting > >>> crashes. Also users whose userspace setup does not need writing to > >>> mounted block devices can set this config option for hardening. > >>> > >>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org > >>> Signed-off-by: Jan Kara <jack@suse.cz> > >> > >> Please disregard this patch. I had uncommited fixups in my tree. I'll send > >> fixed version shortly. I'm sorry for the noise. > > > > Have alternatives been configured to making this functionality configurable > > at build time only? How about a kernel command line parameter instead of a > > config option? > > It's not just syzbot here; at least once in my life I accidentally did > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk > and not the target USB device. I know I'm not alone =) Yeah, so I'm not sure we are going to protect against this particular case. I mean it is not *that* uncommon to alter partition table of /dev/sda while /dev/sda1 is mounted. And for the kernel it is difficult to distinguish this and your mishap. > There's a lot of similar accidental-damage protection from this. Another > stronger argument here is that if one has a security policy that > restricts access to filesystem level objects, if a process can somehow > write to a mounted block device, it effectively subverts all of those > controls. Well, there are multiple levels of protection that I can think of: 1) If user can write some image and make kernel mount it. 2) If user can modify device content while mounted (but not buffer cache of the device). 3) If user can modify buffer cache of the device while mounted. 3) is the most problematic and effectively equivalent to full machine control (executing arbitrary code in kernel mode) these days. For 1) and 2) there are reasonable protection measures the filesystem driver can take (and effectively you cannot escape these problems if you allow attaching untrusted devices such as USB sticks) so they can cause DoS but we should be able to prevent full machine takeover in the filesystem code. So this patch is mainly aimed at forbiding 3). > Right now it looks to me we're invoking devcgroup_check_permission pretty > early on; maybe we could extend the device cgroup stuff to have a new > check for write-mounted, like > > ``` > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index c994ff5b157c..f2af33c5acc1 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -6797,6 +6797,7 @@ enum { > BPF_DEVCG_ACC_MKNOD = (1ULL << 0), > BPF_DEVCG_ACC_READ = (1ULL << 1), > BPF_DEVCG_ACC_WRITE = (1ULL << 2), > + BPF_DEVCG_ACC_WRITE_MOUNTED = (1ULL << 3), > }; > > enum { > ``` > > ? But probably this would need to be some kind of opt-in flag to avoid > breaking existing bpf progs? > > If it was configurable via the device cgroup, then it's completely > flexible from userspace; most specifically including supporting some > specially privileged processes from doing it if necessary. I kind of like the flexibility of device cgroups but it does not seem to fit well with my "stop unactionable syzbot reports" usecase and doing the protection properly would mean that we now need to create way to approve access for all the tools that need this. So I'm not against this but I'd consider this "future extension possibility" :). > Also, I wonder if we should also support restricting *reads* from mounted > block devices? I don't see a strong usecase for this. Why would mounted vs unmounted matter here? Honza
On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote: > On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote: > > > > Writing to mounted devices is dangerous and can lead to filesystem > > corruption as well as crashes. Furthermore syzbot comes with more and > > more involved examples how to corrupt block device under a mounted > > filesystem leading to kernel crashes and reports we can do nothing > > about. Add config option to disallow writing to mounted (exclusively > > open) block devices. Syzbot can use this option to avoid uninteresting > > crashes. Also users whose userspace setup does not need writing to > > mounted block devices can set this config option for hardening. > > +syzkaller, Kees, Alexander, Eric > > We can enable this on syzbot, however I have the same concerns as with > disabling of XFS_SUPPORT_V4: > https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278 > > It's useful to know the actual situation with respect to > bugs/vulnerabilities and one of the goals of syzbot is surfacing this > situation. So from my perspective, it's not a "vulernability". It's another example of "syzbot is porting that root can screw you". The question about whether the attacker has write access to the block device is a threat model question. If you have write access to the block device, you can set the setuid bit on a copy of /bin/bash; but is that a "vulernability"? Not really.... Yes, from the lockdown perspective, what thight might mean is that "root can run arbitrary code in ring0". But for most distributions, given that they allow users to provide custom modules (for exanple, for Nvidia or other GPU support) that were not built by the distribution, they can run arbitrary code in ring 0 because root can provide an arbitrary kernel module. If you are 100% locked down, perhaps that's not the case. But this is a very specialized use case, and more to the point, asking upstream to worry about this is effectively an unfunded mandate. > For some areas there is mismatch between upstream kernel and > downstream distros. Upstream says "this is buggy and deprecated", > which is fine in itself if not the other part: downstream distros > simply ignore that (maybe not even aware) and keep things enabled for > as long as possible. Stopping testing this is moving more in this > direction: silencing warnings and pretending that everything is fine, > when it's not. > > I wonder if there is a way to at least somehow bridge this gap. > > There is CONFIG_BROKEN, but not sure if it's the right thing here. > Maybe we add something like CONFIG_INSECURE. "INSECURE" is not really accurate, because it presumes a certain treat model, and it's not neccessarily one that everyone has signed off as being one that they need to worry about. So I'd put it differently. We need to have a way of filtering out those syzbot reports which are caused by allowing a privileged user to be able to dynamically nodify the block device for a mounted file system. One way to do that is to simply surpress them. For example, we did that when we taught syzbot not to try to set the real-time priority for a userspace task to MAX_RT_PRIO, which starves kernel threads and causes the system to malfunction. That's not a "kernel bug", that's a userspace bug, and teaching syzbot not to do the stupid thing made sense. If you think there are some subset of people who are about syzbot reports that are caused by dynamically modifying the underlying block device while it is mounted, what if we can somehow attach a label to the syzbot report, indicating that it was caused by modifying a moutned block device? That way, upstream can ignore it as a stupid syzbot thing, and you can keep it as a "theoretical vulnerability". And we don't have to debate the question of which threat model is the more reasonable one. - Ted
On Mon 12-06-23 22:10:30, Christoph Hellwig wrote: > > +config BLK_DEV_WRITE_HARDENING > > + bool "Do not allow writing to mounted devices" > > + help > > + When a block device is mounted, writing to its buffer cache very likely > > + going to cause filesystem corruption. It is also rather easy to crash > > + the kernel in this way since the filesystem has no practical way of > > + detecting these writes to buffer cache and verifying its metadata > > + integrity. Select this option to disallow writing to mounted devices. > > + This should be mostly fine but some filesystems (e.g. ext4) rely on > > + the ability of filesystem tools to write to mounted filesystems to > > + set e.g. UUID or run fsck on the root filesystem in some setups. > > I'm not sure a config option is really the right thing. > > I'd much prefer a BLK_OPEN_ flag to prohibit any other writer. > Except for etN and maybe fat all file systems can set that > unconditionally. And for those file systems that have historically > allowed writes to mounted file systems they can find a local way > to decide on when and when not to set it. Well, as I've mentioned in the changelog there are old setups (without initrd) that run fsck on root filesystem mounted read-only and fsck programs tend to open the device with O_RDWR. These would be broken by this change (for the filesystems that would use BLK_OPEN_ flag). Similarly some boot loaders can write to first sectors of the root partition while the filesystem is mounted. So I don't think controlling the behavior by the in-kernel user that is having the bdev exclusively open really works. It seems to be more a property of the system setup than a property of the in-kernel bdev user. Am I mistaken? So I think kconfig option or sysfs tunable (maybe a per-device one?) will be more appropriate choice? With default behavior configurable by kernel parameter? And once set to write-protect on mount, do we allow flipping it back? Both have advantages and disadvantages so the tunable might be tri-state in the end (no protection, write-protect but can be turned off, write-protect that cannot be turned off)? But maybe I'm overcomplicating this so please share your thoughts :) Honza
On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote: > On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote: > > > > Writing to mounted devices is dangerous and can lead to filesystem > > corruption as well as crashes. Furthermore syzbot comes with more and > > more involved examples how to corrupt block device under a mounted > > filesystem leading to kernel crashes and reports we can do nothing > > about. Add config option to disallow writing to mounted (exclusively > > open) block devices. Syzbot can use this option to avoid uninteresting > > crashes. Also users whose userspace setup does not need writing to > > mounted block devices can set this config option for hardening. > > +syzkaller, Kees, Alexander, Eric > > We can enable this on syzbot, however I have the same concerns as with > disabling of XFS_SUPPORT_V4: > https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278 Really? This is exactly what I *detest most* about syzbot: the recalcitrant maintainer who thinks their ideology is more important than any other engineering consideration that might exist. We want *better quality bug reports* on *current code* that we have to *support for the foreseeable future*, not get buried under repeated shitty reports containing yet more variants of problems we fixed over a decade ago. I'll repeat what Eric has already pointed out in the above GH issue in the vain hope you'll listen this time rather than making even more extreme ideological demands on us. The XFS engineers put in place a planned, well documented deprecation and removal process for V4 format support back in 2020. We are well into that plan - we are not that far from turning it off v4 support by default. The V4 format is legacy code and users are already migrating away from it. As such, it has much lower priority and relevance to us compared to supporting v5 filesystems. The syzbot maintainers don't get to decide how important XFS v4 format support is - that's the job of the XFS engineers responsibile for developing XFS code and supporting XFS users. Because V4 has been deprecated and support is slowing down as people migrate off it, we don't need as extensive test coverage as we once did. i.e. we are ramping down the validation in accordance with it's lower priority and approaching disabling in 2025. We are placing much more importance on validation of v5 format features and functionality. As such, we really don't need syzbot to be exercising v4 formats any more - it's much more important to our users that we exercise v5 formats as extensively as possible. That is what we are asking the syzkaller runners (and syzbot) do as a service for us. If your ideology demands that "the only way to stop syzbot testing XFS v4 filesytsems is to remove the code entirely" (paraphrasing your comments from the above github issue), then the entire problem here is your ideology. That is, your ideology is getting in the way of practical, well thought out, long running end-of-life engineering processes. It is creating unnecessary load on limited resources. Further, your demands that we place syzbot coverage (if syzbot doesn't test it, it must depend on CONFIG_INSECURE!) above our direct responsibilities to distro maintainers and other downstream users is, at best, terribly misguided. Syzbot is a *tool*. It's not even a critical tool we depend on - we can live without it just fine. We'd really like syzbot to be a better tool, but the ideology behind syzbot is the biggest impediment to making it a better, more effective tool for the community. If syzbot maintainers won't listen to simple requests to improve test coverage according to subsystem requirements, then it's clear that syzbot is being driven by ideology rather than engineering requirements. This needs to change. -Dave.
On Tue, Jun 13, 2023 at 01:34:48PM +0200, Jan Kara wrote: > On Mon 12-06-23 14:52:54, Colin Walters wrote: > > On Mon, Jun 12, 2023, at 1:39 PM, Bart Van Assche wrote: > > > On 6/12/23 09:25, Jan Kara wrote: > > >> On Mon 12-06-23 18:16:14, Jan Kara wrote: > > >>> Writing to mounted devices is dangerous and can lead to filesystem > > >>> corruption as well as crashes. Furthermore syzbot comes with more and > > >>> more involved examples how to corrupt block device under a mounted > > >>> filesystem leading to kernel crashes and reports we can do nothing > > >>> about. Add config option to disallow writing to mounted (exclusively > > >>> open) block devices. Syzbot can use this option to avoid uninteresting > > >>> crashes. Also users whose userspace setup does not need writing to > > >>> mounted block devices can set this config option for hardening. > > >>> > > >>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org > > >>> Signed-off-by: Jan Kara <jack@suse.cz> > > >> > > >> Please disregard this patch. I had uncommited fixups in my tree. I'll send > > >> fixed version shortly. I'm sorry for the noise. > > > > > > Have alternatives been configured to making this functionality configurable > > > at build time only? How about a kernel command line parameter instead of a > > > config option? > > > > It's not just syzbot here; at least once in my life I accidentally did > > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk > > and not the target USB device. I know I'm not alone =) > > Yeah, so I'm not sure we are going to protect against this particular case. > I mean it is not *that* uncommon to alter partition table of /dev/sda while > /dev/sda1 is mounted. And for the kernel it is difficult to distinguish > this and your mishap. Honestly? I'd love it if filesystems actually /could/ lock down the parts of block devices they're using. They could hand out write privileges to the open bdev fds at the same time that a block layout lease is created, and retract them when the lease terminates. Areas before the fs (e.g. BIOS boot sector) could actually be left writable by filesystems that don't use that area; and anything beyond EOFS would still be writable (hello lvm). Then xfs actually /could/ prevent you from blowing away mounted xfs filesystem. ext4 could even still allow primary superblock writes to avoid breaking tune2fs, or they could detect secureboot lockdown and prohibit that. In the past, I was told to go write an LSM if I wanted XFS to protect itself from getting nuked, but I've been too busy to learn how to do that. The other nastier question is blocking writes to sda when sda1 is mounted; for that I have no response. :( --D > > There's a lot of similar accidental-damage protection from this. Another > > stronger argument here is that if one has a security policy that > > restricts access to filesystem level objects, if a process can somehow > > write to a mounted block device, it effectively subverts all of those > > controls. > > Well, there are multiple levels of protection that I can think of: > > 1) If user can write some image and make kernel mount it. > 2) If user can modify device content while mounted (but not buffer cache > of the device). > 3) If user can modify buffer cache of the device while mounted. > > 3) is the most problematic and effectively equivalent to full machine > control (executing arbitrary code in kernel mode) these days. For 1) and > 2) there are reasonable protection measures the filesystem driver can take > (and effectively you cannot escape these problems if you allow attaching > untrusted devices such as USB sticks) so they can cause DoS but we should > be able to prevent full machine takeover in the filesystem code. > > So this patch is mainly aimed at forbiding 3). > > > Right now it looks to me we're invoking devcgroup_check_permission pretty > > early on; maybe we could extend the device cgroup stuff to have a new > > check for write-mounted, like > > > > ``` > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > > index c994ff5b157c..f2af33c5acc1 100644 > > --- a/tools/include/uapi/linux/bpf.h > > +++ b/tools/include/uapi/linux/bpf.h > > @@ -6797,6 +6797,7 @@ enum { > > BPF_DEVCG_ACC_MKNOD = (1ULL << 0), > > BPF_DEVCG_ACC_READ = (1ULL << 1), > > BPF_DEVCG_ACC_WRITE = (1ULL << 2), > > + BPF_DEVCG_ACC_WRITE_MOUNTED = (1ULL << 3), > > }; > > > > enum { > > ``` > > > > ? But probably this would need to be some kind of opt-in flag to avoid > > breaking existing bpf progs? > > > > If it was configurable via the device cgroup, then it's completely > > flexible from userspace; most specifically including supporting some > > specially privileged processes from doing it if necessary. > > I kind of like the flexibility of device cgroups but it does not seem to > fit well with my "stop unactionable syzbot reports" usecase and doing the > protection properly would mean that we now need to create way to approve > access for all the tools that need this. So I'm not against this but I'd > consider this "future extension possibility" :). > > > Also, I wonder if we should also support restricting *reads* from mounted > > block devices? > > I don't see a strong usecase for this. Why would mounted vs unmounted > matter here? > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote: > On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote: > > > > Writing to mounted devices is dangerous and can lead to filesystem > > corruption as well as crashes. Furthermore syzbot comes with more and > > more involved examples how to corrupt block device under a mounted > > filesystem leading to kernel crashes and reports we can do nothing > > about. Add config option to disallow writing to mounted (exclusively > > open) block devices. Syzbot can use this option to avoid uninteresting > > crashes. Also users whose userspace setup does not need writing to > > mounted block devices can set this config option for hardening. > > +syzkaller, Kees, Alexander, Eric > > We can enable this on syzbot, however I have the same concerns as with > disabling of XFS_SUPPORT_V4: > https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278 > > It's useful to know the actual situation with respect to > bugs/vulnerabilities and one of the goals of syzbot is surfacing this > situation. > For some areas there is mismatch between upstream kernel and > downstream distros. Upstream says "this is buggy and deprecated", > which is fine in itself if not the other part: downstream distros > simply ignore that (maybe not even aware) and keep things enabled for > as long as possible. Stopping testing this is moving more in this > direction: silencing warnings and pretending that everything is fine, > when it's not. > > I wonder if there is a way to at least somehow bridge this gap. > > There is CONFIG_BROKEN, but not sure if it's the right thing here. > Maybe we add something like CONFIG_INSECURE. And such insecure config > settings (not setting BLK_DEV_WRITE_HARDENING, setting XFS_SUPPORT_V4) > will say: > > depends on INSECURE > > So that distros will need to at least acknowledge this to users by saying: > > CONFIG_INSECURE=y > > They are then motivated to work on actually removing dependencies on > these deprecated things. > > CONFIG_INSECURE description can say something along the lines of "this > kernel includes subsystems with known bugs that may cause security and > data integrity issues". When a subsystem adds "depends on INSECURE", > the commit should list some of the known issues. > > Then I see how trading disabling things on syzbot in exchange for > "depends on INSECURE" becomes reasonable and satisfies all parties to > some degree. Well in that case, post a patchset adding "depends on INSECURE" for every subsystem that syzbot files bugs against, if the maintainers do not immediately drop what they're doing to resolve the bug. Google extracts a bunch more unpaid labor from society to make its owners richer, and everyone else on the planet suffers for it, just like you all have done for the past 25 years. That's the definition of Googley!! --D > > Btw, if we do this it can make sense to invert this config (enable > concurrent writes), default to 'y' and recommend 'n'. > > Does it make any sense? Any other suggestions? > > P.S. Alex, if this lands this may be a candidate for addition to: > https://github.com/a13xp0p0v/kconfig-hardened-check > (and XFS_SUPPORT_V4 as well). > > > > Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > block/Kconfig | 12 ++++++++++++ > > block/bdev.c | 10 ++++++++++ > > include/linux/blk_types.h | 3 +++ > > 3 files changed, 25 insertions(+) > > > > FWIW I've tested this and my test VM with ext4 root fs boots fine and fstests > > on ext4 seem to be also running fine with BLK_DEV_WRITE_HARDENING enabled. > > OTOH my old VM setup which is not using initrd fails to boot with > > BLK_DEV_WRITE_HARDENING enabled because fsck cannot open the root device > > because the root is already mounted (read-only). Anyway this should be useful > > for syzbot (Dmitry indicated interest in this option in the past) and maybe > > other well controlled setups. > > > > diff --git a/block/Kconfig b/block/Kconfig > > index 86122e459fe0..c44e2238e18d 100644 > > --- a/block/Kconfig > > +++ b/block/Kconfig > > @@ -77,6 +77,18 @@ config BLK_DEV_INTEGRITY_T10 > > select CRC_T10DIF > > select CRC64_ROCKSOFT > > > > +config BLK_DEV_WRITE_HARDENING > > + bool "Do not allow writing to mounted devices" > > + help > > + When a block device is mounted, writing to its buffer cache very likely > > + going to cause filesystem corruption. It is also rather easy to crash > > + the kernel in this way since the filesystem has no practical way of > > + detecting these writes to buffer cache and verifying its metadata > > + integrity. Select this option to disallow writing to mounted devices. > > + This should be mostly fine but some filesystems (e.g. ext4) rely on > > + the ability of filesystem tools to write to mounted filesystems to > > + set e.g. UUID or run fsck on the root filesystem in some setups. > > + > > config BLK_DEV_ZONED > > bool "Zoned block device support" > > select MQ_IOSCHED_DEADLINE > > diff --git a/block/bdev.c b/block/bdev.c > > index 21c63bfef323..ad01f0a6af0d 100644 > > --- a/block/bdev.c > > +++ b/block/bdev.c > > @@ -602,6 +602,12 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) > > struct gendisk *disk = bdev->bd_disk; > > int ret; > > > > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) { > > + if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0) > > + return -EBUSY; > > + if (mode & FMODE_WRITE && bdev->bd_holders > 0) > > + return -EBUSY; > > + } > > if (disk->fops->open) { > > ret = disk->fops->open(bdev, mode); > > if (ret) { > > @@ -617,6 +623,8 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) > > set_init_blocksize(bdev); > > if (test_bit(GD_NEED_PART_SCAN, &disk->state)) > > bdev_disk_changed(disk, false); > > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) > > + atomic_inc(&bdev->bd_writers); > > atomic_inc(&bdev->bd_openers); > > return 0; > > } > > @@ -625,6 +633,8 @@ static void blkdev_put_whole(struct block_device *bdev, fmode_t mode) > > { > > if (atomic_dec_and_test(&bdev->bd_openers)) > > blkdev_flush_mapping(bdev); > > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) > > + atomic_dec(&bdev->bd_writers); > > if (bdev->bd_disk->fops->release) > > bdev->bd_disk->fops->release(bdev->bd_disk, mode); > > } > > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > > index 740afe80f297..25af3340f316 100644 > > --- a/include/linux/blk_types.h > > +++ b/include/linux/blk_types.h > > @@ -67,6 +67,9 @@ struct block_device { > > struct partition_meta_info *bd_meta_info; > > #ifdef CONFIG_FAIL_MAKE_REQUEST > > bool bd_make_it_fail; > > +#endif > > +#ifdef CONFIG_BLK_DEV_WRITE_HARDENING > > + atomic_t bd_writers; > > #endif > > /* > > * keep this out-of-line as it's both big and not needed in the fast > > -- > > 2.35.3 > >
On Tue, Jun 13, 2023 at 07:04:12PM -0700, Darrick J. Wong wrote: > > Well in that case, post a patchset adding "depends on INSECURE" for > every subsystem that syzbot files bugs against, if the maintainers do > not immediately drop what they're doing to resolve the bug. > > Google extracts a bunch more unpaid labor from society to make its > owners richer, and everyone else on the planet suffers for it, just like > you all have done for the past 25 years. That's the definition of > Googley!! To be fair, I don't think this is the official position of Google, but rather Dmitry's personal security ideology (as Dave put it). Dmitry, tell you what. If you can find a vice president inside Google who thinks this that preventing an attacker who has the ability to modify a block device while it is mounted, while running code under the control of the attacker, from being to potentially trigger the ability to run ring 0 code --- and who believes it enough to actually **fund** a headcount to actually work these syzbot reports --- I'll gladly help to supervise that person and mentor their ability to work these ext4 syzbot reports. But I think you will find that the VP's will believe that this is not a threat that has a genuine business case which is important enough that they are willing to fund it. And I'm saying as an upstream developer, *other* syzbot reports are higher priority, because in my judgement, they are much more willing to impact real users, and are more likely to be issues that management chain would consider higher priority. (Never mind that *all* of my syzbot work has been done on my own time.) For those of us who are working with limited resources, and doing this work out of the kindness of our hearts, it would be nice to filter out those syzbot reports that in our best judgement, constitute **noise**. If there is not a good way to filter out the noise, it is likely that upstream developers will choose to use their time working with other tools that are better suited to getting our job done as we understand it. So far, there is been a lot work done by folks on your team which has made syzbot easier for us to use, and for that, I thank you. But your position on forcing your ideology of which security bugs I should fix on my own time is.... annoying. And if others feel the same way, your attitude is going to be counter-productive towards the goals you have towards making Linux more secure. Sometimes, the "best" is the enemy is the "good enough". And in this era of Google's "sharpened focus" or Facebook's "year of efficiency", very often, "good enough" is all the vice presidents are willing to fund. Best regards, - Ted
On Tue, Jun 13, 2023 at 01:34:48PM +0200, Jan Kara wrote: > On Mon 12-06-23 14:52:54, Colin Walters wrote: > > On Mon, Jun 12, 2023, at 1:39 PM, Bart Van Assche wrote: > > > On 6/12/23 09:25, Jan Kara wrote: > > >> On Mon 12-06-23 18:16:14, Jan Kara wrote: > > >>> Writing to mounted devices is dangerous and can lead to filesystem > > >>> corruption as well as crashes. Furthermore syzbot comes with more and > > >>> more involved examples how to corrupt block device under a mounted > > >>> filesystem leading to kernel crashes and reports we can do nothing > > >>> about. Add config option to disallow writing to mounted (exclusively > > >>> open) block devices. Syzbot can use this option to avoid uninteresting > > >>> crashes. Also users whose userspace setup does not need writing to > > >>> mounted block devices can set this config option for hardening. > > >>> > > >>> Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org > > >>> Signed-off-by: Jan Kara <jack@suse.cz> > > >> > > >> Please disregard this patch. I had uncommited fixups in my tree. I'll send > > >> fixed version shortly. I'm sorry for the noise. > > > > > > Have alternatives been configured to making this functionality configurable > > > at build time only? How about a kernel command line parameter instead of a > > > config option? > > > > It's not just syzbot here; at least once in my life I accidentally did > > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk > > and not the target USB device. I know I'm not alone =) > > Yeah, so I'm not sure we are going to protect against this particular case. > I mean it is not *that* uncommon to alter partition table of /dev/sda while > /dev/sda1 is mounted. And for the kernel it is difficult to distinguish > this and your mishap. > > > There's a lot of similar accidental-damage protection from this. Another > > stronger argument here is that if one has a security policy that > > restricts access to filesystem level objects, if a process can somehow > > write to a mounted block device, it effectively subverts all of those > > controls. > > Well, there are multiple levels of protection that I can think of: > > 1) If user can write some image and make kernel mount it. > 2) If user can modify device content while mounted (but not buffer cache > of the device). > 3) If user can modify buffer cache of the device while mounted. > > 3) is the most problematic and effectively equivalent to full machine > control (executing arbitrary code in kernel mode) these days. For 1) and > 2) there are reasonable protection measures the filesystem driver can take > (and effectively you cannot escape these problems if you allow attaching > untrusted devices such as USB sticks) so they can cause DoS but we should > be able to prevent full machine takeover in the filesystem code. > > So this patch is mainly aimed at forbiding 3). > > > Right now it looks to me we're invoking devcgroup_check_permission pretty > > early on; maybe we could extend the device cgroup stuff to have a new > > check for write-mounted, like > > > > ``` > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > > index c994ff5b157c..f2af33c5acc1 100644 > > --- a/tools/include/uapi/linux/bpf.h > > +++ b/tools/include/uapi/linux/bpf.h > > @@ -6797,6 +6797,7 @@ enum { > > BPF_DEVCG_ACC_MKNOD = (1ULL << 0), > > BPF_DEVCG_ACC_READ = (1ULL << 1), > > BPF_DEVCG_ACC_WRITE = (1ULL << 2), > > + BPF_DEVCG_ACC_WRITE_MOUNTED = (1ULL << 3), > > }; > > > > enum { > > ``` > > > > ? But probably this would need to be some kind of opt-in flag to avoid > > breaking existing bpf progs? > > > > If it was configurable via the device cgroup, then it's completely > > flexible from userspace; most specifically including supporting some > > specially privileged processes from doing it if necessary. > > I kind of like the flexibility of device cgroups but it does not seem to Let's not bring in device cgroups here just yet. They're an optional LSM security measure while your change is more fundamental which is the right thing to do imho.
On Wed, Jun 14, 2023 at 09:05:41AM +0200, Christian Brauner wrote: > > I kind of like the flexibility of device cgroups but it does not seem to > > Let's not bring in device cgroups here just yet. They're an optional LSM > security measure while your change is more fundamental which is the > right thing to do imho. Yes. That last thing we need is hiding fundamentally security tradeoffs in weird optional corners of the kernel.
On Tue, Jun 13, 2023 at 01:34:48PM +0200, Jan Kara wrote: > > It's not just syzbot here; at least once in my life I accidentally did > > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk > > and not the target USB device. I know I'm not alone =) > > Yeah, so I'm not sure we are going to protect against this particular case. > I mean it is not *that* uncommon to alter partition table of /dev/sda while > /dev/sda1 is mounted. And for the kernel it is difficult to distinguish > this and your mishap. I think it is actually very easy to distinguish, because the partition table is not mapped to any partition and certainly not an exclusively opened one. > 1) If user can write some image and make kernel mount it. > 2) If user can modify device content while mounted (but not buffer cache > of the device). > 3) If user can modify buffer cache of the device while mounted. > > 3) is the most problematic and effectively equivalent to full machine > control (executing arbitrary code in kernel mode) these days. If a corrupted image can trigger arbitrary code execution that also means the file system code does not do proper input validation. This isn't meant as an argument against protecting the write access (which I think is good and important), but we shoudn't make this worse than it is.
On Tue, Jun 13, 2023 at 06:55:50PM -0700, Darrick J. Wong wrote: > > I'd love it if filesystems actually /could/ lock down the parts of block > devices they're using. They could hand out write privileges to the open > bdev fds at the same time that a block layout lease is created, and > retract them when the lease terminates. Areas before the fs (e.g. BIOS > boot sector) could actually be left writable by filesystems that don't > use that area; and anything beyond EOFS would still be writable (hello > lvm). Then xfs actually /could/ prevent you from blowing away mounted > xfs filesystem. > > ext4 could even still allow primary superblock writes to avoid breaking > tune2fs, or they could detect secureboot lockdown and prohibit that. Let's not overcomplicate things. As said not allowing writes to partitions through the whole block device is pretty trivial. The allowing to write into some areas of an otherwise fs owned device (partition or whole) is just bogus. We might have to support it for extN (and maybe some other things) for legacy setups, but we really need to add proper APIs for that and just disallow it for modern setups instead of creating complex infrastructure to cater to this fundamentally broken use case.
On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote: > I don't question there are use cases for the flag, but there are use > cases for the config as well. > > Some distros may want a guarantee that this does not happen as it > compromises lockdown and kernel integrity (on par with unsigned module > loading). > For fuzzing systems it also may be hard to ensure fine-grained > argument constraints, it's much easier and more reliable to prohibit > it on config level. I'm fine with a config option enforcing write blocking for any BLK_OPEN_EXCL open. Maybe the way to it is to: a) have an option to prevent any writes to exclusive openers, including a run-time version to enable it b) allow to also block writes without that option. And maybe an opt-in to allow writes might be the better way than doing it the other way around.
On Tue, Jun 13, 2023 at 10:56:14PM +0200, Jan Kara wrote: > Well, as I've mentioned in the changelog there are old setups (without > initrd) that run fsck on root filesystem mounted read-only and fsck > programs tend to open the device with O_RDWR. These would be broken by this > change (for the filesystems that would use BLK_OPEN_ flag). But that's also a really broken setup that will corrupt data in many cases. So yes, maybe we need a way to allow it, but it probably would have to be per-file system. > Similarly some > boot loaders can write to first sectors of the root partition while the > filesystem is mounted. So I don't think controlling the behavior by the > in-kernel user that is having the bdev exclusively open really works. It > seems to be more a property of the system setup than a property of the > in-kernel bdev user. Am I mistaken? For many file systems this would already be completely broken (e.g. XFS). So allowing this is needed for legacy use cases, but again should be limited to just file systems where this even makes sense. And strictly limited to legacy setups, we do need proper kernel APIs for all of that in the future so that modern distros don't have to allow sideband writes at all.
On Tue, Jun 13, 2023 at 10:56:14PM +0200, Jan Kara wrote: > On Mon 12-06-23 22:10:30, Christoph Hellwig wrote: > > > +config BLK_DEV_WRITE_HARDENING > > > + bool "Do not allow writing to mounted devices" > > > + help > > > + When a block device is mounted, writing to its buffer cache very likely > > > + going to cause filesystem corruption. It is also rather easy to crash > > > + the kernel in this way since the filesystem has no practical way of > > > + detecting these writes to buffer cache and verifying its metadata > > > + integrity. Select this option to disallow writing to mounted devices. > > > + This should be mostly fine but some filesystems (e.g. ext4) rely on > > > + the ability of filesystem tools to write to mounted filesystems to > > > + set e.g. UUID or run fsck on the root filesystem in some setups. > > > > I'm not sure a config option is really the right thing. > > > > I'd much prefer a BLK_OPEN_ flag to prohibit any other writer. > > Except for etN and maybe fat all file systems can set that > > unconditionally. And for those file systems that have historically > > allowed writes to mounted file systems they can find a local way > > to decide on when and when not to set it. > > Well, as I've mentioned in the changelog there are old setups (without Before going into the details: Let's please take syzbot out of the picture as a justification or required reviewer for this patch. This is a complete distraction imho. If the patch has the side effect that it somehow makes for less noisy syzbot reports then so be it but it's really not why this patch is a good idea. For userspace this patch is immediately useful and a security mechanism that everyone familiar with block device/filesystem attack surfaces will want to make use of. And we should encourage this be the default whenever possible imho. That's all the justification that we need. > initrd) that run fsck on root filesystem mounted read-only and fsck > programs tend to open the device with O_RDWR. These would be broken by this > change (for the filesystems that would use BLK_OPEN_ flag). Similarly some > boot loaders can write to first sectors of the root partition while the > filesystem is mounted. So I don't think controlling the behavior by the > in-kernel user that is having the bdev exclusively open really works. It > seems to be more a property of the system setup than a property of the > in-kernel bdev user. Am I mistaken? > > So I think kconfig option or sysfs tunable (maybe a per-device one?) will > be more appropriate choice? With default behavior configurable by kernel > parameter? And once set to write-protect on mount, do we allow flipping it > back? Both have advantages and disadvantages so the tunable might be > tri-state in the end (no protection, write-protect but can be turned off, > write-protect that cannot be turned off)? But maybe I'm overcomplicating > this so please share your thoughts :) A simple bool Kconfig overridable by kernel command line option is what we want. Fundamental security relevant properties such as this should never be runtime configurable. They should be boot and build time configurable and then they should be off limits.
On Wed, Jun 14, 2023 at 12:17:26AM -0700, Christoph Hellwig wrote: > On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote: > > I don't question there are use cases for the flag, but there are use > > cases for the config as well. > > > > Some distros may want a guarantee that this does not happen as it > > compromises lockdown and kernel integrity (on par with unsigned module > > loading). > > For fuzzing systems it also may be hard to ensure fine-grained > > argument constraints, it's much easier and more reliable to prohibit > > it on config level. > > I'm fine with a config option enforcing write blocking for any > BLK_OPEN_EXCL open. Maybe the way to it is to: > > a) have an option to prevent any writes to exclusive openers, including > a run-time version to enable it I really would wish we don't make this runtime configurable. Build time and boot time yes but toggling it at runtime makes this already a lot less interesting.
On Wed 14-06-23 00:10:26, Christoph Hellwig wrote: > On Tue, Jun 13, 2023 at 01:34:48PM +0200, Jan Kara wrote: > > > It's not just syzbot here; at least once in my life I accidentally did > > > `dd if=/path/to/foo.iso of=/dev/sda` when `/dev/sda` was my booted disk > > > and not the target USB device. I know I'm not alone =) > > > > Yeah, so I'm not sure we are going to protect against this particular case. > > I mean it is not *that* uncommon to alter partition table of /dev/sda while > > /dev/sda1 is mounted. And for the kernel it is difficult to distinguish > > this and your mishap. > > I think it is actually very easy to distinguish, because the partition > table is not mapped to any partition and certainly not an exclusively > opened one. Well, OK, I have not been precise :). Modifying a partition table (or LVM description block) is impossible to distinguish from clobbering a filesystem on open(2) time. Once we decide we implement arbitration of each individual write(2), we can obviously stop writes to area covered by some exclusively open partition. But then you are getting at the complexity level of tracking used ranges of block devices which Darrick has suggested and you didn't seem to like that (and neither do I). Furthermore the protection is never going to be perfect as soon as loopback devices, device mapper, and similar come into the mix (or it gets really really complex). So I'd really prefer to stick with whatever arbitration we can perform on open(2). > > 1) If user can write some image and make kernel mount it. > > 2) If user can modify device content while mounted (but not buffer cache > > of the device). > > 3) If user can modify buffer cache of the device while mounted. > > > > 3) is the most problematic and effectively equivalent to full machine > > control (executing arbitrary code in kernel mode) these days. > > If a corrupted image can trigger arbitrary code execution that also > means the file system code does not do proper input validation. I agree. But case 3) is not about corrupted image - it is about userspace's ability to corrupt data stored in the buffer cache *after* it has been loaded from the image and verified. This is not a problem for XFS which has its private block device cache incoherent with the buffer cache you access when opening the bdev but basically every other filesystem suffers from this problem. Honza
On Wed 14-06-23 10:18:16, Christian Brauner wrote: > On Wed, Jun 14, 2023 at 12:17:26AM -0700, Christoph Hellwig wrote: > > On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote: > > > I don't question there are use cases for the flag, but there are use > > > cases for the config as well. > > > > > > Some distros may want a guarantee that this does not happen as it > > > compromises lockdown and kernel integrity (on par with unsigned module > > > loading). > > > For fuzzing systems it also may be hard to ensure fine-grained > > > argument constraints, it's much easier and more reliable to prohibit > > > it on config level. > > > > I'm fine with a config option enforcing write blocking for any > > BLK_OPEN_EXCL open. Maybe the way to it is to: > > > > a) have an option to prevent any writes to exclusive openers, including > > a run-time version to enable it > > I really would wish we don't make this runtime configurable. Build time > and boot time yes but toggling it at runtime makes this already a lot > less interesting. I see your point from security POV. But if you are say a desktop (or even server) user you may need to say resize your LVM or add partition to your disk or install grub2 into boot sector of your partition. In all these cases you need write access to a block device that is exclusively claimed by someone else. Do you mandate reboot in permissive mode for all these cases? Realistically that means such users just won't bother with the feature and leave it disabled all the time. I'm OK with such outcome but I wanted to point out this "no protection change after boot" policy noticably restricts number of systems where this is applicable. Honza
On Wed, 14 Jun 2023 at 04:04, Darrick J. Wong <djwong@kernel.org> wrote: > > On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote: > > On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote: > > > > > > Writing to mounted devices is dangerous and can lead to filesystem > > > corruption as well as crashes. Furthermore syzbot comes with more and > > > more involved examples how to corrupt block device under a mounted > > > filesystem leading to kernel crashes and reports we can do nothing > > > about. Add config option to disallow writing to mounted (exclusively > > > open) block devices. Syzbot can use this option to avoid uninteresting > > > crashes. Also users whose userspace setup does not need writing to > > > mounted block devices can set this config option for hardening. > > > > +syzkaller, Kees, Alexander, Eric > > > > We can enable this on syzbot, however I have the same concerns as with > > disabling of XFS_SUPPORT_V4: > > https://github.com/google/syzkaller/issues/3918#issuecomment-1560624278 > > > > It's useful to know the actual situation with respect to > > bugs/vulnerabilities and one of the goals of syzbot is surfacing this > > situation. > > For some areas there is mismatch between upstream kernel and > > downstream distros. Upstream says "this is buggy and deprecated", > > which is fine in itself if not the other part: downstream distros > > simply ignore that (maybe not even aware) and keep things enabled for > > as long as possible. Stopping testing this is moving more in this > > direction: silencing warnings and pretending that everything is fine, > > when it's not. > > > > I wonder if there is a way to at least somehow bridge this gap. > > > > There is CONFIG_BROKEN, but not sure if it's the right thing here. > > Maybe we add something like CONFIG_INSECURE. And such insecure config > > settings (not setting BLK_DEV_WRITE_HARDENING, setting XFS_SUPPORT_V4) > > will say: > > > > depends on INSECURE > > > > So that distros will need to at least acknowledge this to users by saying: > > > > CONFIG_INSECURE=y > > > > They are then motivated to work on actually removing dependencies on > > these deprecated things. > > > > CONFIG_INSECURE description can say something along the lines of "this > > kernel includes subsystems with known bugs that may cause security and > > data integrity issues". When a subsystem adds "depends on INSECURE", > > the commit should list some of the known issues. > > > > Then I see how trading disabling things on syzbot in exchange for > > "depends on INSECURE" becomes reasonable and satisfies all parties to > > some degree. > > Well in that case, post a patchset adding "depends on INSECURE" for > every subsystem that syzbot files bugs against, if the maintainers do > not immediately drop what they're doing to resolve the bug. Hi Darrick, Open unfixed bugs are fine (for some definition of fine). What's discussed here is different. It's not having any filed bugs at all due to not testing a thing and then not having any visibility into the state of things. > Google extracts a bunch more unpaid labor from society to make its > owners richer, and everyone else on the planet suffers for it, just like > you all have done for the past 25 years. That's the definition of > Googley!! > > --D > > > > > Btw, if we do this it can make sense to invert this config (enable > > concurrent writes), default to 'y' and recommend 'n'. > > > > Does it make any sense? Any other suggestions? > > > > P.S. Alex, if this lands this may be a candidate for addition to: > > https://github.com/a13xp0p0v/kconfig-hardened-check > > (and XFS_SUPPORT_V4 as well). > > > > > > > Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org > > > Signed-off-by: Jan Kara <jack@suse.cz> > > > --- > > > block/Kconfig | 12 ++++++++++++ > > > block/bdev.c | 10 ++++++++++ > > > include/linux/blk_types.h | 3 +++ > > > 3 files changed, 25 insertions(+) > > > > > > FWIW I've tested this and my test VM with ext4 root fs boots fine and fstests > > > on ext4 seem to be also running fine with BLK_DEV_WRITE_HARDENING enabled. > > > OTOH my old VM setup which is not using initrd fails to boot with > > > BLK_DEV_WRITE_HARDENING enabled because fsck cannot open the root device > > > because the root is already mounted (read-only). Anyway this should be useful > > > for syzbot (Dmitry indicated interest in this option in the past) and maybe > > > other well controlled setups. > > > > > > diff --git a/block/Kconfig b/block/Kconfig > > > index 86122e459fe0..c44e2238e18d 100644 > > > --- a/block/Kconfig > > > +++ b/block/Kconfig > > > @@ -77,6 +77,18 @@ config BLK_DEV_INTEGRITY_T10 > > > select CRC_T10DIF > > > select CRC64_ROCKSOFT > > > > > > +config BLK_DEV_WRITE_HARDENING > > > + bool "Do not allow writing to mounted devices" > > > + help > > > + When a block device is mounted, writing to its buffer cache very likely > > > + going to cause filesystem corruption. It is also rather easy to crash > > > + the kernel in this way since the filesystem has no practical way of > > > + detecting these writes to buffer cache and verifying its metadata > > > + integrity. Select this option to disallow writing to mounted devices. > > > + This should be mostly fine but some filesystems (e.g. ext4) rely on > > > + the ability of filesystem tools to write to mounted filesystems to > > > + set e.g. UUID or run fsck on the root filesystem in some setups. > > > + > > > config BLK_DEV_ZONED > > > bool "Zoned block device support" > > > select MQ_IOSCHED_DEADLINE > > > diff --git a/block/bdev.c b/block/bdev.c > > > index 21c63bfef323..ad01f0a6af0d 100644 > > > --- a/block/bdev.c > > > +++ b/block/bdev.c > > > @@ -602,6 +602,12 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) > > > struct gendisk *disk = bdev->bd_disk; > > > int ret; > > > > > > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) { > > > + if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0) > > > + return -EBUSY; > > > + if (mode & FMODE_WRITE && bdev->bd_holders > 0) > > > + return -EBUSY; > > > + } > > > if (disk->fops->open) { > > > ret = disk->fops->open(bdev, mode); > > > if (ret) { > > > @@ -617,6 +623,8 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) > > > set_init_blocksize(bdev); > > > if (test_bit(GD_NEED_PART_SCAN, &disk->state)) > > > bdev_disk_changed(disk, false); > > > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) > > > + atomic_inc(&bdev->bd_writers); > > > atomic_inc(&bdev->bd_openers); > > > return 0; > > > } > > > @@ -625,6 +633,8 @@ static void blkdev_put_whole(struct block_device *bdev, fmode_t mode) > > > { > > > if (atomic_dec_and_test(&bdev->bd_openers)) > > > blkdev_flush_mapping(bdev); > > > + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) > > > + atomic_dec(&bdev->bd_writers); > > > if (bdev->bd_disk->fops->release) > > > bdev->bd_disk->fops->release(bdev->bd_disk, mode); > > > } > > > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > > > index 740afe80f297..25af3340f316 100644 > > > --- a/include/linux/blk_types.h > > > +++ b/include/linux/blk_types.h > > > @@ -67,6 +67,9 @@ struct block_device { > > > struct partition_meta_info *bd_meta_info; > > > #ifdef CONFIG_FAIL_MAKE_REQUEST > > > bool bd_make_it_fail; > > > +#endif > > > +#ifdef CONFIG_BLK_DEV_WRITE_HARDENING > > > + atomic_t bd_writers; > > > #endif > > > /* > > > * keep this out-of-line as it's both big and not needed in the fast > > > -- > > > 2.35.3 > > >
On Wed, Jun 14, 2023 at 12:36:54PM +0200, Jan Kara wrote: > On Wed 14-06-23 10:18:16, Christian Brauner wrote: > > On Wed, Jun 14, 2023 at 12:17:26AM -0700, Christoph Hellwig wrote: > > > On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote: > > > > I don't question there are use cases for the flag, but there are use > > > > cases for the config as well. > > > > > > > > Some distros may want a guarantee that this does not happen as it > > > > compromises lockdown and kernel integrity (on par with unsigned module > > > > loading). > > > > For fuzzing systems it also may be hard to ensure fine-grained > > > > argument constraints, it's much easier and more reliable to prohibit > > > > it on config level. > > > > > > I'm fine with a config option enforcing write blocking for any > > > BLK_OPEN_EXCL open. Maybe the way to it is to: > > > > > > a) have an option to prevent any writes to exclusive openers, including > > > a run-time version to enable it > > > > I really would wish we don't make this runtime configurable. Build time > > and boot time yes but toggling it at runtime makes this already a lot > > less interesting. > > I see your point from security POV. But if you are say a desktop (or even > server) user you may need to say resize your LVM or add partition to your > disk or install grub2 into boot sector of your partition. In all these > cases you need write access to a block device that is exclusively claimed > by someone else. Do you mandate reboot in permissive mode for all these > cases? Realistically that means such users just won't bother with the > feature and leave it disabled all the time. I'm OK with such outcome but I > wanted to point out this "no protection change after boot" policy noticably > restricts number of systems where this is applicable. You're asking the hard/right questions. Installing the boot loader into a boot sector seems like an archaic scenario. With UEFI this isn't necessary and systems that do want this they should turn the Kconfig off or boot with it turned off. I'm trying to understand the partition and lvm resize issue. I've chatted a bit about this and it seems that in this protected mode we should ensure that we cannot write to the main block device's sectors that are mapped to a partition block device. If you write to the main block device of a partitioned device one should only be able to modify the footer and header but nothing where you have a partition block device on. That should mean you can resize an LVM partition afaict. I've been told that the partition block devices and the main block devices have different buffer caches. But that means you cannot mix accesses to them because writes to one will not show up on the other unless caches are flushed on both devices all the time. So it'd be neat if the writes to the whole block device would simply be not allowed at all to areas which are also exposed as partition block devices.
On Wed, Jun 14, 2023 at 12:12:56PM +0200, Jan Kara wrote: > Well, OK, I have not been precise :). Modifying a partition table (or LVM > description block) is impossible to distinguish from clobbering a > filesystem on open(2) time. Once we decide we implement arbitration of each > individual write(2), we can obviously stop writes to area covered by some > exclusively open partition. But then you are getting at the complexity > level of tracking used ranges of block devices which Darrick has suggested > and you didn't seem to like that (and neither do I). Well, we track these ranges in the block_devices hanging off the gendisk anyway, so this is a totally different league. But in the end parsing partition tables is a little easier than parsing file system metadata but not fundamentally different. So if we really want to lock down broken sideband manipulations we can't allow that either and need in-kernel support for manipulating partition tables if that is required at run time.
On Wed, Jun 14, 2023 at 10:18:16AM +0200, Christian Brauner wrote: > > I'm fine with a config option enforcing write blocking for any > > BLK_OPEN_EXCL open. Maybe the way to it is to: > > > > a) have an option to prevent any writes to exclusive openers, including > > a run-time version to enable it > > I really would wish we don't make this runtime configurable. Build time > and boot time yes but toggling it at runtime makes this already a lot > less interesting. With run time I really mean not compile time aka boot time. Sorry for not being precise.
On Wed, Jun 14, 2023 at 12:12:56PM +0200, Jan Kara wrote: > Well, OK, I have not been precise :). Modifying a partition table (or LVM > description block) is impossible to distinguish from clobbering a > filesystem on open(2) time. Once we decide we implement arbitration of each > individual write(2), we can obviously stop writes to area covered by some > exclusively open partition. But then you are getting at the complexity > level of tracking used ranges of block devices which Darrick has suggested > and you didn't seem to like that (and neither do I). Furthermore the > protection is never going to be perfect as soon as loopback devices, device > mapper, and similar come into the mix (or it gets really really complex). > So I'd really prefer to stick with whatever arbitration we can perform on > open(2). What a shame we got rid of mandatory file locks in f7e33bdbd6d1 ;-)
On Wed, Jun 14, 2023 at 02:27:46PM +0200, Dmitry Vyukov wrote: > On Wed, 14 Jun 2023 at 04:04, Darrick J. Wong <djwong@kernel.org> wrote: > > On Tue, Jun 13, 2023 at 08:49:38AM +0200, Dmitry Vyukov wrote: > > > On Mon, 12 Jun 2023 at 18:16, Jan Kara <jack@suse.cz> wrote: > > > CONFIG_INSECURE description can say something along the lines of "this > > > kernel includes subsystems with known bugs that may cause security and > > > data integrity issues". When a subsystem adds "depends on INSECURE", > > > the commit should list some of the known issues. > > > > > > Then I see how trading disabling things on syzbot in exchange for > > > "depends on INSECURE" becomes reasonable and satisfies all parties to > > > some degree. > > > > Well in that case, post a patchset adding "depends on INSECURE" for > > every subsystem that syzbot files bugs against, if the maintainers do > > not immediately drop what they're doing to resolve the bug. > > Hi Darrick, > > Open unfixed bugs are fine (for some definition of fine). > What's discussed here is different. It's not having any filed bugs at > all due to not testing a thing and then not having any visibility into > the state of things. Just because syzbot doesn't test something, it does not mean the code is not tested, nor does it mean the developers who are responsible for the code have no visibility into the state of their code. The reason they want to avoid this sort of corruption injection testing in syzbot is that it *does not provide a net benefit* to anyone. The number (and value) of real bugs it might find are vastly outweighed by the cost of filtering out the many, many false positives the testing methodology raises. Keep in mind that syzbot does not provide useful unit and functional test coverage. We have to run tests suites like fstests to provide this sort of functionality and visibility into the *correct operation of the code*. However, alongside all the unit/functional tests in fstests, we also have non-deterministic stress and fuzzer tests that are similar in nature to syzbot. They often flush out weird integration level bugs before we even get to merging the code. These non-deterministic stress tests in fstests have found *hundreds* of bugs over the *couple of decades* we have been running them, and they also have a history of uncovering entire new classes of bugs we've had to address. At this point, syzbot is yet to do prove it is more than a one-trick pony - it typically only finds a single class of filesystem bug. That is, it only finds bugs that are related to undetected physical structure corruption of the filesystem that result in macro level failures (crash, warn, hang). Syzbot does nothing to ensure correct behaviour is occuring, that data integrity is maintained by the filesystem, that crash recovery after failures works correctly, etc. These things are *by far* the most important things we have to ensure during filesystem development. IOWs, the sorts of problems that syzbot finds in filesystems are way down the list of important things we need to validate. Yes, structural validation testing is something we should be running, and it's clear that is does get run (both from fstests and syzbot). Hence the claim that "because syzbot doesn't run we don't have visibility of code bugs" is naive, conceited, incredibly narcissistic and demonstratable false. It also indicates a very poor understanding of where syzbot actually fits into the overall engineering processes. -Dave.
On Thu, 15 Jun 2023 at 01:38, Dave Chinner <david@fromorbit.com> wrote: > > > > CONFIG_INSECURE description can say something along the lines of "this > > > > kernel includes subsystems with known bugs that may cause security and > > > > data integrity issues". When a subsystem adds "depends on INSECURE", > > > > the commit should list some of the known issues. > > > > > > > > Then I see how trading disabling things on syzbot in exchange for > > > > "depends on INSECURE" becomes reasonable and satisfies all parties to > > > > some degree. > > > > > > Well in that case, post a patchset adding "depends on INSECURE" for > > > every subsystem that syzbot files bugs against, if the maintainers do > > > not immediately drop what they're doing to resolve the bug. > > > > Hi Darrick, > > > > Open unfixed bugs are fine (for some definition of fine). > > What's discussed here is different. It's not having any filed bugs at > > all due to not testing a thing and then not having any visibility into > > the state of things. > > Just because syzbot doesn't test something, it does not mean the > code is not tested, nor does it mean the developers who are > responsible for the code have no visibility into the state of their > code. > > The reason they want to avoid this sort of corruption injection > testing in syzbot is that it *does not provide a net benefit* to > anyone. The number (and value) of real bugs it might find are vastly > outweighed by the cost of filtering out the many, many false > positives the testing methodology raises. > > Keep in mind that syzbot does not provide useful unit and functional > test coverage. We have to run tests suites like fstests to provide > this sort of functionality and visibility into the *correct > operation of the code*. > > However, alongside all the unit/functional tests in fstests, we also > have non-deterministic stress and fuzzer tests that are similar in > nature to syzbot. They often flush out weird integration level bugs > before we even get to merging the code. These non-deterministic > stress tests in fstests have found *hundreds* of bugs over the > *couple of decades* we have been running them, and they also have a > history of uncovering entire new classes of bugs we've had to > address. > > At this point, syzbot is yet to do prove it is more than a one-trick > pony - it typically only finds a single class of filesystem bug. > That is, it only finds bugs that are related to undetected physical > structure corruption of the filesystem that result in macro level > failures (crash, warn, hang). > > Syzbot does nothing to ensure correct behaviour is occuring, that > data integrity is maintained by the filesystem, that crash recovery > after failures works correctly, etc. These things are *by far* the > most important things we have to ensure during filesystem > development. > > IOWs, the sorts of problems that syzbot finds in filesystems are way > down the list of important things we need to validate. Yes, > structural validation testing is something we should be > running, and it's clear that is does get run (both from fstests and > syzbot). > > Hence the claim that "because syzbot doesn't run we don't have > visibility of code bugs" is naive, conceited, incredibly > narcissistic and demonstratable false. It also indicates a very > poor understanding of where syzbot actually fits into the overall > engineering processes. Hi Dave, Ted, We are currently looking into options of how to satisfy all parties. I am not saying that all of these bugs need to be fixed, nor that they are more important than bugs in supported parts. And we are very much interested in testing supported parts as well as we can do. By CONFIG_INSECURE I just meant something similar to kernel taint bits. A user is free to continue after any bad thing has happened/they did, but some warranties are void. And if a kernel developer receives a bug report on a tainted kernel, they will take it with a grain of salt. So it's important to note the fact and inform about it. Something similar here: bugs in deprecated parts do not need to be fixed, and distros are still free to enable them, but this fact is acknowledged by distros and made visible to users. But we are looking into other options that won't require even CONFIG_INSECURE.
On Wed 14-06-23 14:48:17, Christian Brauner wrote: > On Wed, Jun 14, 2023 at 12:36:54PM +0200, Jan Kara wrote: > > On Wed 14-06-23 10:18:16, Christian Brauner wrote: > > > On Wed, Jun 14, 2023 at 12:17:26AM -0700, Christoph Hellwig wrote: > > > > On Tue, Jun 13, 2023 at 08:09:14AM +0200, Dmitry Vyukov wrote: > > > > > I don't question there are use cases for the flag, but there are use > > > > > cases for the config as well. > > > > > > > > > > Some distros may want a guarantee that this does not happen as it > > > > > compromises lockdown and kernel integrity (on par with unsigned module > > > > > loading). > > > > > For fuzzing systems it also may be hard to ensure fine-grained > > > > > argument constraints, it's much easier and more reliable to prohibit > > > > > it on config level. > > > > > > > > I'm fine with a config option enforcing write blocking for any > > > > BLK_OPEN_EXCL open. Maybe the way to it is to: > > > > > > > > a) have an option to prevent any writes to exclusive openers, including > > > > a run-time version to enable it > > > > > > I really would wish we don't make this runtime configurable. Build time > > > and boot time yes but toggling it at runtime makes this already a lot > > > less interesting. > > > > I see your point from security POV. But if you are say a desktop (or even > > server) user you may need to say resize your LVM or add partition to your > > disk or install grub2 into boot sector of your partition. In all these > > cases you need write access to a block device that is exclusively claimed > > by someone else. Do you mandate reboot in permissive mode for all these > > cases? Realistically that means such users just won't bother with the > > feature and leave it disabled all the time. I'm OK with such outcome but I > > wanted to point out this "no protection change after boot" policy noticably > > restricts number of systems where this is applicable. > > You're asking the hard/right questions. > > Installing the boot loader into a boot sector seems like an archaic > scenario. With UEFI this isn't necessary and systems that do want this > they should turn the Kconfig off or boot with it turned off. OK. > I'm trying to understand the partition and lvm resize issue. I've > chatted a bit about this and it seems that in this protected mode we > should ensure that we cannot write to the main block device's sectors > that are mapped to a partition block device. If you write to the main > block device of a partitioned device one should only be able to modify > the footer and header but nothing where you have a partition block > device on. That should mean you can resize an LVM partition afaict. > > I've been told that the partition block devices and the main block > devices have different buffer caches. But that means you cannot mix > accesses to them because writes to one will not show up on the other > unless caches are flushed on both devices all the time. > > So it'd be neat if the writes to the whole block device would simply be > not allowed at all to areas which are also exposed as partition block > devices. So you're touching very similar areas I've replied to elsewhere [1] in this thread. But let me go into more details here: 1) Disallowing modifications to area covered by some partition through the main block device is doable but it means arbitration needs to happen on each write (or page fault ???) which has performance implications. 2) With LVM, device mapper exclusively claims the underlying device, places header / footer to it and then exposes the rest of the device as another block device (possibly after further block remapping games). Loop device can do the same. With LVM you can also have multiple block devices mapping to the same underlying block device and possibly overlapping ranges (userspace fully controls this), in fact partitioning on top of RAID or multipath devices is done this way AFAIK. So tracking all these ranges, how they are remapped and what is actually used by whom seems like a really complex task that would need to propagate the information through multiple layers. So I would backtrack a bit and maybe ask what is the threat model and what protection you'd like to achieve? Because as you mentioned above, each block device has a separate (incoherent) buffer cache. That is also a good thing because protecting "filesystem's" buffer cache from corruption means we need to just protect that one block device when the filesystem is using it to avoid the worst problems. Writes to the same physical location through different block devices will corrupt the data "at rest" but the filesystem should do enough verification to catch this when loading the data into the buffer cache. So we could make blkdev_get_by_dev() calls when mounting filesystems disallow any other FMODE_WRITE opens of the block device and that should still allow modifying partition tables or LVM setup. It will break legacy boot setups & tuning ext4 parameters while the filesystem is mounted but I'm fine with just saying "don't use this feature if you do this". What do you think? Honza [1] https://lore.kernel.org/all/20230614101256.kgnui242k72lmp4e@quack3
On Thu, Jun 15, 2023 at 11:14:53AM +0200, Dmitry Vyukov wrote: > On Thu, 15 Jun 2023 at 01:38, Dave Chinner <david@fromorbit.com> wrote: > > Hence the claim that "because syzbot doesn't run we don't have > > visibility of code bugs" is naive, conceited, incredibly > > narcissistic and demonstratable false. It also indicates a very > > poor understanding of where syzbot actually fits into the overall > > engineering processes. > > Hi Dave, Ted, > > We are currently looking into options of how to satisfy all parties. > > I am not saying that all of these bugs need to be fixed, nor that they > are more important than bugs in supported parts. And we are very much > interested in testing supported parts as well as we can do. > > By CONFIG_INSECURE I just meant something similar to kernel taint > bits. How is that any better? Who gets to decide what sets this taint? Subsystem maintainers? > A user is free to continue after any bad thing has happened/they > did, but some warranties are void. And if a kernel developer receives > a bug report on a tainted kernel, they will take it with a grain of > salt. So it's important to note the fact and inform about it. > Something similar here: bugs in deprecated parts do not need to be > fixed, and distros are still free to enable them, but this fact is > acknowledged by distros and made visible to users. "Deprecated" does not mean *unmaintained*. They are two completely different things, and conflating the two does not help anyone. You are talking about marking unmaintained code with a taint, not deprecated code. reiserfs is unmaintained code. ntfs3 is unmaintained code. hfs is unmaintained code. There are several other unmaintained filesystems that don't have active maintainers. Bug reports never, ever get looked at, etc. Sure, there's an argument to taint them. But whilst XFS v4 is deprecated, it is still very much maintained. We encourage people to move off V4, we focus less on it, but if bug reports from users come in, we still fix them. So even if you introduce some "unmaintained" taint for the kernel, you aren't going to see it get set for the XFS V4 format. > But we are looking into other options that won't require even CONFIG_INSECURE. Who is this nebulous "we"? -Dave.
Hello Christoph! On Wed 14-06-23 00:20:12, Christoph Hellwig wrote: > On Tue, Jun 13, 2023 at 10:56:14PM +0200, Jan Kara wrote: > > Well, as I've mentioned in the changelog there are old setups (without > > initrd) that run fsck on root filesystem mounted read-only and fsck > > programs tend to open the device with O_RDWR. These would be broken by this > > change (for the filesystems that would use BLK_OPEN_ flag). > > But that's also a really broken setup that will corrupt data in many > cases. So yes, maybe we need a way to allow it, but it probably would > have to be per-file system. I was looking into implementing the write hardening support and I've come across the following obstacle: Your patch series that is in linux-block.git removes the 'mode' argument from blkdev_put() which makes it impossible to track how many writers there are for the block device. This is needed so that we can check whether the filesystem is safe when mounting the device. I can see several solutions but since you've just reworked the code and I'm not 100% certain about the motivation, I figured I'll ask you first before spending significant time on something you won't like: 1) Just return the mode argument to blkdev_put(). 2) Only pass to blkdev_put() whether we have write access or not as a separate argument. 3) Don't track number of opens for writing, instead check whether writes are blocked on each write access. I think this has a number of downsides but I mention it for completeness. One problem is we have to add checks to multiple places (buffered IO, direct IO) and existing mmap in particular will be very hard to deal with (need to add page_mkwrite() handler). All these checks add performance overhead. It is practically impossible (without significant performance overhead or new percpu datastructures) to properly synchronize open that wants to block writers against already running writes. So what would you prefer? Thanks in advance for your input. Honza
On Tue, Jun 20, 2023 at 12:41:11PM +0200, Jan Kara wrote: > I can see several solutions but since you've just reworked the code and I'm > not 100% certain about the motivation, I figured I'll ask you first before > spending significant time on something you won't like: > > 1) Just return the mode argument to blkdev_put(). > > 2) Only pass to blkdev_put() whether we have write access or not as a > separate argument. > > 3) Don't track number of opens for writing, instead check whether writes > are blocked on each write access. I think this has a number of downsides > but I mention it for completeness. One problem is we have to add checks to > multiple places (buffered IO, direct IO) and existing mmap in particular > will be very hard to deal with (need to add page_mkwrite() handler). All > these checks add performance overhead. It is practically impossible > (without significant performance overhead or new percpu datastructures) to > properly synchronize open that wants to block writers against already > running writes. I think 3 is out for the reasons you mention. 2 feels a bit ugly, but so does 1 given that we only really need the write flag. One thing I through about earlier but decided was overkill at that time is to return a cookie from blkdev_get_* that needs to be passed back to blkdev_put. That could either be opaque to the caller, or replace the bdev ala: struct bdev_handle { struct block_device *bdev; void *holder; bool for_write; }; Given that fixups we needed to pass the right holder back in, it feels like that would be the least error prone API, even if it is a fair amount of churn.
diff --git a/block/Kconfig b/block/Kconfig index 86122e459fe0..c44e2238e18d 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -77,6 +77,18 @@ config BLK_DEV_INTEGRITY_T10 select CRC_T10DIF select CRC64_ROCKSOFT +config BLK_DEV_WRITE_HARDENING + bool "Do not allow writing to mounted devices" + help + When a block device is mounted, writing to its buffer cache very likely + going to cause filesystem corruption. It is also rather easy to crash + the kernel in this way since the filesystem has no practical way of + detecting these writes to buffer cache and verifying its metadata + integrity. Select this option to disallow writing to mounted devices. + This should be mostly fine but some filesystems (e.g. ext4) rely on + the ability of filesystem tools to write to mounted filesystems to + set e.g. UUID or run fsck on the root filesystem in some setups. + config BLK_DEV_ZONED bool "Zoned block device support" select MQ_IOSCHED_DEADLINE diff --git a/block/bdev.c b/block/bdev.c index 21c63bfef323..ad01f0a6af0d 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -602,6 +602,12 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) struct gendisk *disk = bdev->bd_disk; int ret; + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING)) { + if (mode & FMODE_EXCL && atomic_read(&bdev->bd_writers) > 0) + return -EBUSY; + if (mode & FMODE_WRITE && bdev->bd_holders > 0) + return -EBUSY; + } if (disk->fops->open) { ret = disk->fops->open(bdev, mode); if (ret) { @@ -617,6 +623,8 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode) set_init_blocksize(bdev); if (test_bit(GD_NEED_PART_SCAN, &disk->state)) bdev_disk_changed(disk, false); + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) + atomic_inc(&bdev->bd_writers); atomic_inc(&bdev->bd_openers); return 0; } @@ -625,6 +633,8 @@ static void blkdev_put_whole(struct block_device *bdev, fmode_t mode) { if (atomic_dec_and_test(&bdev->bd_openers)) blkdev_flush_mapping(bdev); + if (IS_ENABLED(BLK_DEV_WRITE_HARDENING) && mode & FMODE_WRITE) + atomic_dec(&bdev->bd_writers); if (bdev->bd_disk->fops->release) bdev->bd_disk->fops->release(bdev->bd_disk, mode); } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 740afe80f297..25af3340f316 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -67,6 +67,9 @@ struct block_device { struct partition_meta_info *bd_meta_info; #ifdef CONFIG_FAIL_MAKE_REQUEST bool bd_make_it_fail; +#endif +#ifdef CONFIG_BLK_DEV_WRITE_HARDENING + atomic_t bd_writers; #endif /* * keep this out-of-line as it's both big and not needed in the fast
Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add config option to disallow writing to mounted (exclusively open) block devices. Syzbot can use this option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this config option for hardening. Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org Signed-off-by: Jan Kara <jack@suse.cz> --- block/Kconfig | 12 ++++++++++++ block/bdev.c | 10 ++++++++++ include/linux/blk_types.h | 3 +++ 3 files changed, 25 insertions(+) FWIW I've tested this and my test VM with ext4 root fs boots fine and fstests on ext4 seem to be also running fine with BLK_DEV_WRITE_HARDENING enabled. OTOH my old VM setup which is not using initrd fails to boot with BLK_DEV_WRITE_HARDENING enabled because fsck cannot open the root device because the root is already mounted (read-only). Anyway this should be useful for syzbot (Dmitry indicated interest in this option in the past) and maybe other well controlled setups.