Message ID | 20230925095133.311224-1-hch@lst.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | iomap: add a workaround for racy i_size updates on block devices | expand |
On Mon, Sep 25, 2023 at 11:51:33AM +0200, Christoph Hellwig wrote: > A szybot reproducer that does write I/O while truncating the size of a > block device can end up in clean_bdev_aliases, which tries to clean the > bdev aliases that it uses. This is because iomap_to_bh automatically > sets the BH_New flag when outside of i_size. For block devices updates > to i_size are racy and we can hit this case in a tiny race window, > leading to the eventual clean_bdev_aliases call. Fix this by erroring > out of > i_size I/O on block devices. > > Reported-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com > Signed-off-by: Christoph Hellwig <hch@lst.de> > Tested-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com > --- > fs/buffer.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index a6785cd07081cb..12e9a71c693d74 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2058,8 +2058,17 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh, > fallthrough; > case IOMAP_MAPPED: > if ((iomap->flags & IOMAP_F_NEW) || > - offset >= i_size_read(inode)) > + offset >= i_size_read(inode)) { > + /* > + * This can happen if truncating the block device races > + * with the check in the caller as i_size updates on > + * block devices aren't synchronized by i_rwsem for > + * block devices. Why /are/ bdevs special like this (not holding i_rwsem during a truncate) anyway? Is it because we require the sysadmin to coordinate device shrink vs. running programs? --D > + */ > + if (S_ISBLK(inode->i_mode)) > + return -EIO; > set_buffer_new(bh); > + } > bh->b_blocknr = (iomap->addr + offset - iomap->offset) >> > inode->i_blkbits; > set_buffer_mapped(bh); > -- > 2.39.2 >
On Mon, Sep 25, 2023 at 08:09:02AM -0700, Darrick J. Wong wrote: > > + /* > > + * This can happen if truncating the block device races > > + * with the check in the caller as i_size updates on > > + * block devices aren't synchronized by i_rwsem for > > + * block devices. > > Why /are/ bdevs special like this (not holding i_rwsem during a > truncate) anyway? Is it because we require the sysadmin to coordinate > device shrink vs. running programs? It's not just truncate, they also don't hold a lock on write. I think the reason is that there is no such things as the block allocator and block truncation that happens for block devices, they historically had a fixed size, and at some point we allowed to change that size by various crude means that are only slowly becoming more standardized and formal. Real block device size changes are about 100% growing of the device, as that is an actually useful feature. Shrinks OTOH are usuall a "cute" hack: block drivers set the size to 0 stop I/O when they are shut down. I've been wanting to replace that with an actual check in the bdev fd I/O path for a while, but that would also mean the shrinking case would still be around, just exercised a lot less.
On Mon, Sep 25, 2023 at 05:18:16PM +0200, Christoph Hellwig wrote: > On Mon, Sep 25, 2023 at 08:09:02AM -0700, Darrick J. Wong wrote: > > > + /* > > > + * This can happen if truncating the block device races > > > + * with the check in the caller as i_size updates on > > > + * block devices aren't synchronized by i_rwsem for > > > + * block devices. > > > > Why /are/ bdevs special like this (not holding i_rwsem during a > > truncate) anyway? Is it because we require the sysadmin to coordinate > > device shrink vs. running programs? > > It's not just truncate, they also don't hold a lock on write. Oh! So they don't. Heh. > I think the reason is that there is no such things as the block allocator > and block truncation that happens for block devices, they historically > had a fixed size, and at some point we allowed to change that size > by various crude means that are only slowly becoming more standardized > and formal. Real block device size changes are about 100% growing of > the device, as that is an actually useful feature. Shrinks OTOH are > usuall a "cute" hack: block drivers set the size to 0 stop I/O when they > are shut down. I've been wanting to replace that with an actual check > in the bdev fd I/O path for a while, but that would also mean the > shrinking case would still be around, just exercised a lot less. You call bdev shrink a cute hack, cloud tenants call it a cost-reducing activity, and cloud vendors call it a revenue opportunity because shrinking filesystems is un***** expensive in terms of CPU time and IO usage. ;) Anyway, I'm not going to argue with longstanding blockdev precedent. Reviewed-by: Darrick J. Wong <djwong@kernel.org> --D
diff --git a/fs/buffer.c b/fs/buffer.c index a6785cd07081cb..12e9a71c693d74 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2058,8 +2058,17 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh, fallthrough; case IOMAP_MAPPED: if ((iomap->flags & IOMAP_F_NEW) || - offset >= i_size_read(inode)) + offset >= i_size_read(inode)) { + /* + * This can happen if truncating the block device races + * with the check in the caller as i_size updates on + * block devices aren't synchronized by i_rwsem for + * block devices. + */ + if (S_ISBLK(inode->i_mode)) + return -EIO; set_buffer_new(bh); + } bh->b_blocknr = (iomap->addr + offset - iomap->offset) >> inode->i_blkbits; set_buffer_mapped(bh);