diff mbox series

[4/9] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap

Message ID 20190731141245.7230-5-cmaiolino@redhat.com (mailing list archive)
State Superseded, archived
Headers show
Series New ->fiemap infrastructure and ->bmap removal | expand

Commit Message

Carlos Maiolino July 31, 2019, 2:12 p.m. UTC
Now we have the possibility of proper error return in bmap, use bmap()
function in ioctl_fibmap() instead of calling ->bmap method directly.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---

Changelog:

V4:
	- Ensure ioctl_fibmap() returns 0 in case of error returned from
	  bmap(). Otherwise we'll be changing the user interface (which
	  returns 0 in case of error)
V3:
	- Rename usr_blk to ur_block

V2:
	- Use a local sector_t variable to asign the block number
	  instead of using direct casting.

 fs/ioctl.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

Comments

Darrick J. Wong July 31, 2019, 11:12 p.m. UTC | #1
On Wed, Jul 31, 2019 at 04:12:40PM +0200, Carlos Maiolino wrote:
> Now we have the possibility of proper error return in bmap, use bmap()
> function in ioctl_fibmap() instead of calling ->bmap method directly.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
> 
> Changelog:
> 
> V4:
> 	- Ensure ioctl_fibmap() returns 0 in case of error returned from
> 	  bmap(). Otherwise we'll be changing the user interface (which
> 	  returns 0 in case of error)
> V3:
> 	- Rename usr_blk to ur_block
> 
> V2:
> 	- Use a local sector_t variable to asign the block number
> 	  instead of using direct casting.
> 
>  fs/ioctl.c | 29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index fef3a6bf7c78..6b589c873bc2 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -53,19 +53,28 @@ EXPORT_SYMBOL(vfs_ioctl);
>  
>  static int ioctl_fibmap(struct file *filp, int __user *p)
>  {
> -	struct address_space *mapping = filp->f_mapping;
> -	int res, block;
> +	struct inode *inode = file_inode(filp);
> +	int error, ur_block;
> +	sector_t block;
>  
> -	/* do we support this mess? */
> -	if (!mapping->a_ops->bmap)
> -		return -EINVAL;
>  	if (!capable(CAP_SYS_RAWIO))
>  		return -EPERM;
> -	res = get_user(block, p);
> -	if (res)
> -		return res;
> -	res = mapping->a_ops->bmap(mapping, block);
> -	return put_user(res, p);
> +
> +	error = get_user(ur_block, p);
> +	if (error)
> +		return error;
> +
> +	block = ur_block;
> +	error = bmap(inode, &block);
> +
> +	if (error)
> +		ur_block = 0;
> +	else
> +		ur_block = block;

What happens if ur_block > INT_MAX?  Shouldn't we return zero (i.e.
error) instead of truncating the value?  Maybe the code does this
somewhere else?  Here seemed like the obvious place for an overflow
check as we go from sector_t to int.

--D

> +
> +	error = put_user(ur_block, p);
> +
> +	return error;
>  }
>  
>  /**
> -- 
> 2.20.1
>
Carlos Maiolino Aug. 2, 2019, 9:19 a.m. UTC | #2
Hi Darrick.

> > +		return error;
> > +
> > +	block = ur_block;
> > +	error = bmap(inode, &block);
> > +
> > +	if (error)
> > +		ur_block = 0;
> > +	else
> > +		ur_block = block;
> 
> What happens if ur_block > INT_MAX?  Shouldn't we return zero (i.e.
> error) instead of truncating the value?  Maybe the code does this
> somewhere else?  Here seemed like the obvious place for an overflow
> check as we go from sector_t to int.
> 

The behavior should still be the same. It will get truncated, unfortunately. I
don't think we can actually change this behavior and return zero instead of
truncating it.

> --D
> 
> > +
> > +	error = put_user(ur_block, p);
> > +
> > +	return error;
> >  }
> >  
> >  /**
> > -- 
> > 2.20.1
> >
Darrick J. Wong Aug. 2, 2019, 3:14 p.m. UTC | #3
On Fri, Aug 02, 2019 at 11:19:39AM +0200, Carlos Maiolino wrote:
> Hi Darrick.
> 
> > > +		return error;
> > > +
> > > +	block = ur_block;
> > > +	error = bmap(inode, &block);
> > > +
> > > +	if (error)
> > > +		ur_block = 0;
> > > +	else
> > > +		ur_block = block;
> > 
> > What happens if ur_block > INT_MAX?  Shouldn't we return zero (i.e.
> > error) instead of truncating the value?  Maybe the code does this
> > somewhere else?  Here seemed like the obvious place for an overflow
> > check as we go from sector_t to int.
> > 
> 
> The behavior should still be the same. It will get truncated, unfortunately. I
> don't think we can actually change this behavior and return zero instead of
> truncating it.

But that's even worse, because the programs that rely on FIBMAP will now
receive *incorrect* results that may point at a different file and
definitely do not point at the correct file block.

Note also that the iomap (and therefore xfs) implementation WARNs on
integer overflow and returns 0 (error) to prevent an incorrect access.

--D

> > --D
> > 
> > > +
> > > +	error = put_user(ur_block, p);
> > > +
> > > +	return error;
> > >  }
> > >  
> > >  /**
> > > -- 
> > > 2.20.1
> > > 
> 
> -- 
> Carlos
Carlos Maiolino Aug. 5, 2019, 10:27 a.m. UTC | #4
On Fri, Aug 02, 2019 at 08:14:00AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 02, 2019 at 11:19:39AM +0200, Carlos Maiolino wrote:
> > Hi Darrick.
> > 
> > > > +		return error;
> > > > +
> > > > +	block = ur_block;
> > > > +	error = bmap(inode, &block);
> > > > +
> > > > +	if (error)
> > > > +		ur_block = 0;
> > > > +	else
> > > > +		ur_block = block;
> > > 
> > > What happens if ur_block > INT_MAX?  Shouldn't we return zero (i.e.
> > > error) instead of truncating the value?  Maybe the code does this
> > > somewhere else?  Here seemed like the obvious place for an overflow
> > > check as we go from sector_t to int.
> > > 
> > 
> > The behavior should still be the same. It will get truncated, unfortunately. I
> > don't think we can actually change this behavior and return zero instead of
> > truncating it.
> 
> But that's even worse, because the programs that rely on FIBMAP will now
> receive *incorrect* results that may point at a different file and
> definitely do not point at the correct file block.

How is this worse? This is exactly what happens today, on the original FIBMAP
implementation.

Maybe I am not seeing something or having a different thinking you have, but
this is the behavior we have now, without my patches. And we can't really change
it; the user view of this implementation.
That's why I didn't try to change the result, so the truncation still happens.
> 
> Note also that the iomap (and therefore xfs) implementation WARNs on
> integer overflow and returns 0 (error) to prevent an incorrect access.

It does not really prevent anything. It just issue a warning saying the result
will be truncated, in an attempt to notify the FIBMAP interface user that he/she
can't trust the result, but it does not prevent a truncated result to be
returned. And IIRC, iomap is the only interface now that cares about issuing a
warning.

I think the *best* we could do here, is to make the new bmap() to issue the same
kind of WARN() iomap does, but we can't really change the end result.


> 
> --D
> 
> > > --D
> > > 
> > > > +
> > > > +	error = put_user(ur_block, p);
> > > > +
> > > > +	return error;
> > > >  }
> > > >  
> > > >  /**
> > > > -- 
> > > > 2.20.1
> > > > 
> > 
> > -- 
> > Carlos
Darrick J. Wong Aug. 5, 2019, 3:12 p.m. UTC | #5
On Mon, Aug 05, 2019 at 12:27:30PM +0200, Carlos Maiolino wrote:
> On Fri, Aug 02, 2019 at 08:14:00AM -0700, Darrick J. Wong wrote:
> > On Fri, Aug 02, 2019 at 11:19:39AM +0200, Carlos Maiolino wrote:
> > > Hi Darrick.
> > > 
> > > > > +		return error;
> > > > > +
> > > > > +	block = ur_block;
> > > > > +	error = bmap(inode, &block);
> > > > > +
> > > > > +	if (error)
> > > > > +		ur_block = 0;
> > > > > +	else
> > > > > +		ur_block = block;
> > > > 
> > > > What happens if ur_block > INT_MAX?  Shouldn't we return zero (i.e.
> > > > error) instead of truncating the value?  Maybe the code does this
> > > > somewhere else?  Here seemed like the obvious place for an overflow
> > > > check as we go from sector_t to int.
> > > > 
> > > 
> > > The behavior should still be the same. It will get truncated, unfortunately. I
> > > don't think we can actually change this behavior and return zero instead of
> > > truncating it.
> > 
> > But that's even worse, because the programs that rely on FIBMAP will now
> > receive *incorrect* results that may point at a different file and
> > definitely do not point at the correct file block.
> 
> How is this worse? This is exactly what happens today, on the original FIBMAP
> implementation.

Ok, I wasn't being 110% careful with my words.  Delete "will now" from
the sentence above.

> Maybe I am not seeing something or having a different thinking you have, but
> this is the behavior we have now, without my patches. And we can't really change
> it; the user view of this implementation.
> That's why I didn't try to change the result, so the truncation still happens.

I understand that we're not generally supposed to change existing
userspace interfaces, but the fact remains that allowing truncated
responses causes *filesystem corruption*.

We know that the most well known FIBMAP callers are bootloaders, and we
know what they do with the information they get -- they use it to record
the block map of boot files.  So if the IPL/grub/whatever installer
queries the boot file and the boot file is at block 12345678901 (a
34-bit number), this interface truncates that to 3755744309 (a 32-bit
number) and that's where the bootloader will think its boot files are.
The installation succeeds, the user reboots and *kaboom* the system no
longer boots because the contents of block 3755744309 is not a bootloader.

Worse yet, grub1 used FIBMAP data to record the location of the grub
environment file and installed itself between the MBR and the start of
partition 1.  If the environment file is at offset 1234578901, grub will
write status data to its environment file (which it thinks is at
3755744309) and *KABOOM* we've just destroyed whatever was in that
block.

Far better for the bootloader installation script to hit an error and
force the admin to deal with the situation than for the system to become
unbootable.  That's *why* the (newer) iomap bmap implementation does not
return truncated mappings, even though the classic implementation does.

The classic code returning truncated results is a broken behavior.

> > Note also that the iomap (and therefore xfs) implementation WARNs on
> > integer overflow and returns 0 (error) to prevent an incorrect access.
> 
> It does not really prevent anything. It just issue a warning saying the result
> will be truncated, in an attempt to notify the FIBMAP interface user that he/she
> can't trust the result, but it does not prevent a truncated result to be

I disagree; the iomap bmap implementation /does/ prevent truncated responses:

: static loff_t
: iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
: void *data, struct iomap *iomap)
: {
: 	sector_t *bno = data, addr;
: 
: 	if (iomap->type == IOMAP_MAPPED) {
: 		addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
: 		if (addr > INT_MAX)
: 			WARN(1, "would truncate bmap result\n");

Notice how we don't set *bno here?

: 		else
: 			*bno = addr;

And only set it in the case that there isn't an integer overflow?

: 	}
: 	return 0;
: }
:
: /* legacy ->bmap interface.  0 is the error return (!) */
: sector_t
: iomap_bmap(struct address_space *mapping, sector_t bno,
: 		const struct iomap_ops *ops)
: {
: 	struct inode *inode = mapping->host;
: 	loff_t pos = bno << inode->i_blkbits;
: 	unsigned blocksize = i_blocksize(inode);
: 
: 	if (filemap_write_and_wait(mapping))
: 		return 0;
: 
: 	bno = 0;

We initialize bno to zero here...

: 	iomap_apply(inode, pos, blocksize, 0, ops, &bno, iomap_bmap_actor);

...then pass bno's address to the apply function to pass to
iomap_bmap_actor, so either the _actor function set bno or in the case
of overflow it left it set to zero.

: 	return bno;
: }

> returned. And IIRC, iomap is the only interface now that cares about issuing a
> warning.
>
> I think the *best* we could do here, is to make the new bmap() to issue the same
> kind of WARN() iomap does, but we can't really change the end result.

I'd rather we break legacy code than corrupt filesystems.

--D

> 
> > 
> > --D
> > 
> > > > --D
> > > > 
> > > > > +
> > > > > +	error = put_user(ur_block, p);
> > > > > +
> > > > > +	return error;
> > > > >  }
> > > > >  
> > > > >  /**
> > > > > -- 
> > > > > 2.20.1
> > > > > 
> > > 
> > > -- 
> > > Carlos
> 
> -- 
> Carlos
Christoph Hellwig Aug. 6, 2019, 5:38 a.m. UTC | #6
On Mon, Aug 05, 2019 at 08:12:58AM -0700, Darrick J. Wong wrote:
> > returned. And IIRC, iomap is the only interface now that cares about issuing a
> > warning.
> >
> > I think the *best* we could do here, is to make the new bmap() to issue the same
> > kind of WARN() iomap does, but we can't really change the end result.
> 
> I'd rather we break legacy code than corrupt filesystems.

This particular patch should keep existing behavior as is, as the intent
is to not change functionality.  Throwing in another patch to have saner
error behavior now that we have a saner in-kernel interface that cleary
documents what it is breaking and why on the other hand sounds like a
very good idea.
Carlos Maiolino Aug. 6, 2019, 12:02 p.m. UTC | #7
I'll keep my reply to your email short here, because hch's follow-up email
summarizes what I wanted to say, so I'll reply to his email directly.

> : 	return bno;
> : }
> 
> > returned. And IIRC, iomap is the only interface now that cares about issuing a
> > warning.
> >
> > I think the *best* we could do here, is to make the new bmap() to issue the same
> > kind of WARN() iomap does, but we can't really change the end result.
> 
> I'd rather we break legacy code than corrupt filesystems.

By now, I wish to apologize Darrick, it was my fault to not pay enough attention
to iomap's bmap implementation as you mentioned. Will keep the discussion topic
on next e-mail.
Carlos Maiolino Aug. 6, 2019, 12:07 p.m. UTC | #8
On Tue, Aug 06, 2019 at 07:38:40AM +0200, Christoph Hellwig wrote:
> On Mon, Aug 05, 2019 at 08:12:58AM -0700, Darrick J. Wong wrote:
> > > returned. And IIRC, iomap is the only interface now that cares about issuing a
> > > warning.
> > >
> > > I think the *best* we could do here, is to make the new bmap() to issue the same
> > > kind of WARN() iomap does, but we can't really change the end result.
> > 
> > I'd rather we break legacy code than corrupt filesystems.
> 

Yes, I have the same feeling, but this patchset does not have the goal to fix
the broken api.

> This particular patch should keep existing behavior as is, as the intent
> is to not change functionality.  Throwing in another patch to have saner
> error behavior now that we have a saner in-kernel interface that cleary
> documents what it is breaking and why on the other hand sounds like a
> very good idea.

I totally agree here, and to be honest, I think such change should be in a
different patchset rather than a new patch in this series. I can do it for sure,
but this discussion IMHO should be done not only here in linux-fsdevel, but also
in linux-api, which well, I don't think cc'ing this whole patchset there will do
any good other than keep the change discussion more complicated than it should
be. I'd rather finish the design and implementation of this patchset, and I'll
follow-up it, once it's all set, with a new patch to change the truncation
behavior, it will make the discussion way easier than mixing up subjects. What
you guys think?

Cheers
Darrick J. Wong Aug. 6, 2019, 2:48 p.m. UTC | #9
On Tue, Aug 06, 2019 at 02:07:24PM +0200, Carlos Maiolino wrote:
> On Tue, Aug 06, 2019 at 07:38:40AM +0200, Christoph Hellwig wrote:
> > On Mon, Aug 05, 2019 at 08:12:58AM -0700, Darrick J. Wong wrote:
> > > > returned. And IIRC, iomap is the only interface now that cares about issuing a
> > > > warning.
> > > >
> > > > I think the *best* we could do here, is to make the new bmap() to issue the same
> > > > kind of WARN() iomap does, but we can't really change the end result.
> > > 
> > > I'd rather we break legacy code than corrupt filesystems.
> > 
> 
> Yes, I have the same feeling, but this patchset does not have the goal to fix
> the broken api.
> 
> > This particular patch should keep existing behavior as is, as the intent
> > is to not change functionality.  Throwing in another patch to have saner
> > error behavior now that we have a saner in-kernel interface that cleary
> > documents what it is breaking and why on the other hand sounds like a
> > very good idea.
> 
> I totally agree here, and to be honest, I think such change should be in a
> different patchset rather than a new patch in this series. I can do it for sure,
> but this discussion IMHO should be done not only here in linux-fsdevel, but also
> in linux-api, which well, I don't think cc'ing this whole patchset there will do
> any good other than keep the change discussion more complicated than it should
> be. I'd rather finish the design and implementation of this patchset, and I'll
> follow-up it, once it's all set, with a new patch to change the truncation
> behavior, it will make the discussion way easier than mixing up subjects. What
> you guys think?

I probably would've fixed the truncation behavior in the old code and
based the fiemap-fibmap conversion on that so that anyone who wants to
backport the behavior change to an old kernel has an easier time of it.

But afterwards probably works just as well since I don't feel like tying
ourselves in more knots over an old interface. ;)

--D

> Cheers
> 
> 
> -- 
> Carlos
Luis Chamberlain Aug. 6, 2019, 10:41 p.m. UTC | #10
On Mon, Aug 05, 2019 at 08:12:58AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 05, 2019 at 12:27:30PM +0200, Carlos Maiolino wrote:
> > On Fri, Aug 02, 2019 at 08:14:00AM -0700, Darrick J. Wong wrote:
> > > On Fri, Aug 02, 2019 at 11:19:39AM +0200, Carlos Maiolino wrote:
> > > > Hi Darrick.
> > > > 
> > > > > > +		return error;
> > > > > > +
> > > > > > +	block = ur_block;
> > > > > > +	error = bmap(inode, &block);
> > > > > > +
> > > > > > +	if (error)
> > > > > > +		ur_block = 0;
> > > > > > +	else
> > > > > > +		ur_block = block;
> > > > > 
> > > > > What happens if ur_block > INT_MAX?  Shouldn't we return zero (i.e.
> > > > > error) instead of truncating the value?  Maybe the code does this
> > > > > somewhere else?  Here seemed like the obvious place for an overflow
> > > > > check as we go from sector_t to int.
> > > > > 
> > > > 
> > > > The behavior should still be the same. It will get truncated, unfortunately. I
> > > > don't think we can actually change this behavior and return zero instead of
> > > > truncating it.
> > > 
> > > But that's even worse, because the programs that rely on FIBMAP will now
> > > receive *incorrect* results that may point at a different file and
> > > definitely do not point at the correct file block.
> > 
> > How is this worse? This is exactly what happens today, on the original FIBMAP
> > implementation.
> 
> Ok, I wasn't being 110% careful with my words.  Delete "will now" from
> the sentence above.
> 
> > Maybe I am not seeing something or having a different thinking you have, but
> > this is the behavior we have now, without my patches. And we can't really change
> > it; the user view of this implementation.
> > That's why I didn't try to change the result, so the truncation still happens.
> 
> I understand that we're not generally supposed to change existing
> userspace interfaces, but the fact remains that allowing truncated
> responses causes *filesystem corruption*.
> 
> We know that the most well known FIBMAP callers are bootloaders, and we
> know what they do with the information they get -- they use it to record
> the block map of boot files.  So if the IPL/grub/whatever installer
> queries the boot file and the boot file is at block 12345678901 (a
> 34-bit number), this interface truncates that to 3755744309 (a 32-bit
> number) and that's where the bootloader will think its boot files are.
> The installation succeeds, the user reboots and *kaboom* the system no
> longer boots because the contents of block 3755744309 is not a bootloader.
> 
> Worse yet, grub1 used FIBMAP data to record the location of the grub
> environment file and installed itself between the MBR and the start of
> partition 1.  If the environment file is at offset 1234578901, grub will
> write status data to its environment file (which it thinks is at
> 3755744309) and *KABOOM* we've just destroyed whatever was in that
> block.
> 
> Far better for the bootloader installation script to hit an error and
> force the admin to deal with the situation than for the system to become
> unbootable.  That's *why* the (newer) iomap bmap implementation does not
> return truncated mappings, even though the classic implementation does.
> 
> The classic code returning truncated results is a broken behavior.

How long as it been broken for? And if we do fix it, I'd just like for
a nice commit lot describing potential risks of not applying it. *If*
the issue exists as-is today, the above contains a lot of information
for addressing potential issues, even if theoretical.

  Luis
Darrick J. Wong Aug. 7, 2019, 2:42 p.m. UTC | #11
On Tue, Aug 06, 2019 at 10:41:38PM +0000, Luis Chamberlain wrote:
> On Mon, Aug 05, 2019 at 08:12:58AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 05, 2019 at 12:27:30PM +0200, Carlos Maiolino wrote:
> > > On Fri, Aug 02, 2019 at 08:14:00AM -0700, Darrick J. Wong wrote:
> > > > On Fri, Aug 02, 2019 at 11:19:39AM +0200, Carlos Maiolino wrote:
> > > > > Hi Darrick.
> > > > > 
> > > > > > > +		return error;
> > > > > > > +
> > > > > > > +	block = ur_block;
> > > > > > > +	error = bmap(inode, &block);
> > > > > > > +
> > > > > > > +	if (error)
> > > > > > > +		ur_block = 0;
> > > > > > > +	else
> > > > > > > +		ur_block = block;
> > > > > > 
> > > > > > What happens if ur_block > INT_MAX?  Shouldn't we return zero (i.e.
> > > > > > error) instead of truncating the value?  Maybe the code does this
> > > > > > somewhere else?  Here seemed like the obvious place for an overflow
> > > > > > check as we go from sector_t to int.
> > > > > > 
> > > > > 
> > > > > The behavior should still be the same. It will get truncated, unfortunately. I
> > > > > don't think we can actually change this behavior and return zero instead of
> > > > > truncating it.
> > > > 
> > > > But that's even worse, because the programs that rely on FIBMAP will now
> > > > receive *incorrect* results that may point at a different file and
> > > > definitely do not point at the correct file block.
> > > 
> > > How is this worse? This is exactly what happens today, on the original FIBMAP
> > > implementation.
> > 
> > Ok, I wasn't being 110% careful with my words.  Delete "will now" from
> > the sentence above.
> > 
> > > Maybe I am not seeing something or having a different thinking you have, but
> > > this is the behavior we have now, without my patches. And we can't really change
> > > it; the user view of this implementation.
> > > That's why I didn't try to change the result, so the truncation still happens.
> > 
> > I understand that we're not generally supposed to change existing
> > userspace interfaces, but the fact remains that allowing truncated
> > responses causes *filesystem corruption*.
> > 
> > We know that the most well known FIBMAP callers are bootloaders, and we
> > know what they do with the information they get -- they use it to record
> > the block map of boot files.  So if the IPL/grub/whatever installer
> > queries the boot file and the boot file is at block 12345678901 (a
> > 34-bit number), this interface truncates that to 3755744309 (a 32-bit
> > number) and that's where the bootloader will think its boot files are.
> > The installation succeeds, the user reboots and *kaboom* the system no
> > longer boots because the contents of block 3755744309 is not a bootloader.
> > 
> > Worse yet, grub1 used FIBMAP data to record the location of the grub
> > environment file and installed itself between the MBR and the start of
> > partition 1.  If the environment file is at offset 1234578901, grub will
> > write status data to its environment file (which it thinks is at
> > 3755744309) and *KABOOM* we've just destroyed whatever was in that
> > block.
> > 
> > Far better for the bootloader installation script to hit an error and
> > force the admin to deal with the situation than for the system to become
> > unbootable.  That's *why* the (newer) iomap bmap implementation does not
> > return truncated mappings, even though the classic implementation does.
> > 
> > The classic code returning truncated results is a broken behavior.
> 
> How long as it been broken for?

Probably since the beginning (ext2).

> And if we do fix it, I'd just like for
> a nice commit lot describing potential risks of not applying it. *If*
> the issue exists as-is today, the above contains a lot of information
> for addressing potential issues, even if theoretical.

I think a lot of the filesystems avoid the problem either by not
supporting > INT_MAX blocks in the first place or by detecting the
truncation in the fs-specific ->bmap method, so that might be why we
haven't been deluged by corruption reports.

--D

>   Luis
Carlos Maiolino Aug. 8, 2019, 7:12 a.m. UTC | #12
> > 
> > > Maybe I am not seeing something or having a different thinking you have, but
> > > this is the behavior we have now, without my patches. And we can't really change
> > > it; the user view of this implementation.
> > > That's why I didn't try to change the result, so the truncation still happens.
> > 
> > I understand that we're not generally supposed to change existing
> > userspace interfaces, but the fact remains that allowing truncated
> > responses causes *filesystem corruption*.
> > 
> > We know that the most well known FIBMAP callers are bootloaders, and we
> > know what they do with the information they get -- they use it to record
> > the block map of boot files.  So if the IPL/grub/whatever installer
> > queries the boot file and the boot file is at block 12345678901 (a
> > 34-bit number), this interface truncates that to 3755744309 (a 32-bit
> > number) and that's where the bootloader will think its boot files are.
> > The installation succeeds, the user reboots and *kaboom* the system no
> > longer boots because the contents of block 3755744309 is not a bootloader.
> > 
> > Worse yet, grub1 used FIBMAP data to record the location of the grub
> > environment file and installed itself between the MBR and the start of
> > partition 1.  If the environment file is at offset 1234578901, grub will
> > write status data to its environment file (which it thinks is at
> > 3755744309) and *KABOOM* we've just destroyed whatever was in that
> > block.
> > 
> > Far better for the bootloader installation script to hit an error and
> > force the admin to deal with the situation than for the system to become
> > unbootable.  That's *why* the (newer) iomap bmap implementation does not
> > return truncated mappings, even though the classic implementation does.
> > 
> > The classic code returning truncated results is a broken behavior.
> 
> How long as it been broken for? And if we do fix it, I'd just like for
> a nice commit lot describing potential risks of not applying it. *If*
> the issue exists as-is today, the above contains a lot of information
> for addressing potential issues, even if theoretical.
> 

It's broken since forever. This has always been the FIBMAP behavior.


>   Luis
Carlos Maiolino Aug. 8, 2019, 7:17 a.m. UTC | #13
On Tue, Aug 06, 2019 at 07:48:00AM -0700, Darrick J. Wong wrote:
> On Tue, Aug 06, 2019 at 02:07:24PM +0200, Carlos Maiolino wrote:
> > On Tue, Aug 06, 2019 at 07:38:40AM +0200, Christoph Hellwig wrote:
> > > On Mon, Aug 05, 2019 at 08:12:58AM -0700, Darrick J. Wong wrote:
> > > > > returned. And IIRC, iomap is the only interface now that cares about issuing a
> > > > > warning.
> > > > >
> > > > > I think the *best* we could do here, is to make the new bmap() to issue the same
> > > > > kind of WARN() iomap does, but we can't really change the end result.
> > > > 
> > > > I'd rather we break legacy code than corrupt filesystems.
> > > 
> > 
> > Yes, I have the same feeling, but this patchset does not have the goal to fix
> > the broken api.
> > 
> > > This particular patch should keep existing behavior as is, as the intent
> > > is to not change functionality.  Throwing in another patch to have saner
> > > error behavior now that we have a saner in-kernel interface that cleary
> > > documents what it is breaking and why on the other hand sounds like a
> > > very good idea.
> > 
> > I totally agree here, and to be honest, I think such change should be in a
> > different patchset rather than a new patch in this series. I can do it for sure,
> > but this discussion IMHO should be done not only here in linux-fsdevel, but also
> > in linux-api, which well, I don't think cc'ing this whole patchset there will do
> > any good other than keep the change discussion more complicated than it should
> > be. I'd rather finish the design and implementation of this patchset, and I'll
> > follow-up it, once it's all set, with a new patch to change the truncation
> > behavior, it will make the discussion way easier than mixing up subjects. What
> > you guys think?
> 
> I probably would've fixed the truncation behavior in the old code and
> based the fiemap-fibmap conversion on that so that anyone who wants to
> backport the behavior change to an old kernel has an easier time of it.
> 

Well, another problem in fixing it in the old code, is that bmap() can't
properly return errors :P
After this patchset, bmap() will be able to return errors, so we can easily fix
it, once we won't need to 'guess' what a zero return mean from bmap()


> But afterwards probably works just as well since I don't feel like tying
> ourselves in more knots over an old interface. ;)
> 
> --D
> 
> > Cheers
> > 
> > 
> > -- 
> > Carlos
Andreas Dilger Aug. 8, 2019, 6:53 p.m. UTC | #14
On Aug 8, 2019, at 1:12 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
> 
>>> 
>>>> Maybe I am not seeing something or having a different thinking you have, but
>>>> this is the behavior we have now, without my patches. And we can't really change
>>>> it; the user view of this implementation.
>>>> That's why I didn't try to change the result, so the truncation still happens.
>>> 
>>> I understand that we're not generally supposed to change existing
>>> userspace interfaces, but the fact remains that allowing truncated
>>> responses causes *filesystem corruption*.
>>> 
>>> We know that the most well known FIBMAP callers are bootloaders, and we
>>> know what they do with the information they get -- they use it to record
>>> the block map of boot files.  So if the IPL/grub/whatever installer
>>> queries the boot file and the boot file is at block 12345678901 (a
>>> 34-bit number), this interface truncates that to 3755744309 (a 32-bit
>>> number) and that's where the bootloader will think its boot files are.
>>> The installation succeeds, the user reboots and *kaboom* the system no
>>> longer boots because the contents of block 3755744309 is not a bootloader.
>>> 
>>> Worse yet, grub1 used FIBMAP data to record the location of the grub
>>> environment file and installed itself between the MBR and the start of
>>> partition 1.  If the environment file is at offset 1234578901, grub will
>>> write status data to its environment file (which it thinks is at
>>> 3755744309) and *KABOOM* we've just destroyed whatever was in that
>>> block.
>>> 
>>> Far better for the bootloader installation script to hit an error and
>>> force the admin to deal with the situation than for the system to become
>>> unbootable.  That's *why* the (newer) iomap bmap implementation does not
>>> return truncated mappings, even though the classic implementation does.
>>> 
>>> The classic code returning truncated results is a broken behavior.
>> 
>> How long as it been broken for? And if we do fix it, I'd just like for
>> a nice commit lot describing potential risks of not applying it. *If*
>> the issue exists as-is today, the above contains a lot of information
>> for addressing potential issues, even if theoretical.
>> 
> 
> It's broken since forever. This has always been the FIBMAP behavior.

It's been broken since forever, but only for filesystems larger than 4TB or
16TB (2^32 blocks), which are only becoming commonplace for root disks recently.
Also, doesn't LILO have a limit on the location of the kernel image, in the
first 1GB or similar?

So maybe this is not an issue that FIBMAP users ever hit in practise anyway,
but I agree that it doesn't make sense to return bad data (32-bit wrapped block
numbers) and 0 should be returned in such cases.


Cheers, Andreas
Carlos Maiolino Aug. 19, 2019, 10:10 a.m. UTC | #15
Meh... Sorry andreas, your reply became disconnected from the thread, and I
think I didn't reply.

On Thu, Aug 08, 2019 at 12:53:25PM -0600, Andreas Dilger wrote:
> On Aug 8, 2019, at 1:12 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
> > 
> >>> 
> >>>> Maybe I am not seeing something or having a different thinking you have, but
> >>>> this is the behavior we have now, without my patches. And we can't really change
> >>>> it; the user view of this implementation.
> >>>> That's why I didn't try to change the result, so the truncation still happens.
> >>> 
> >>> I understand that we're not generally supposed to change existing
> >>> userspace interfaces, but the fact remains that allowing truncated
> >>> responses causes *filesystem corruption*.
> >>> 
> >>> We know that the most well known FIBMAP callers are bootloaders, and we
> >>> know what they do with the information they get -- they use it to record
> >>> the block map of boot files.  So if the IPL/grub/whatever installer
> >>> queries the boot file and the boot file is at block 12345678901 (a
> >>> 34-bit number), this interface truncates that to 3755744309 (a 32-bit
> >>> number) and that's where the bootloader will think its boot files are.
> >>> The installation succeeds, the user reboots and *kaboom* the system no
> >>> longer boots because the contents of block 3755744309 is not a bootloader.
> >>> 
> >>> Worse yet, grub1 used FIBMAP data to record the location of the grub
> >>> environment file and installed itself between the MBR and the start of
> >>> partition 1.  If the environment file is at offset 1234578901, grub will
> >>> write status data to its environment file (which it thinks is at
> >>> 3755744309) and *KABOOM* we've just destroyed whatever was in that
> >>> block.
> >>> 
> >>> Far better for the bootloader installation script to hit an error and
> >>> force the admin to deal with the situation than for the system to become
> >>> unbootable.  That's *why* the (newer) iomap bmap implementation does not
> >>> return truncated mappings, even though the classic implementation does.
> >>> 
> >>> The classic code returning truncated results is a broken behavior.
> >> 
> >> How long as it been broken for? And if we do fix it, I'd just like for
> >> a nice commit lot describing potential risks of not applying it. *If*
> >> the issue exists as-is today, the above contains a lot of information
> >> for addressing potential issues, even if theoretical.
> >> 
> > 
> > It's broken since forever. This has always been the FIBMAP behavior.
> 
> It's been broken since forever, but only for filesystems larger than 4TB or
> 16TB (2^32 blocks), which are only becoming commonplace for root disks recently.
> Also, doesn't LILO have a limit on the location of the kernel image, in the
> first 1GB or similar?
> 
> So maybe this is not an issue that FIBMAP users ever hit in practise anyway,
> but I agree that it doesn't make sense to return bad data (32-bit wrapped block
> numbers) and 0 should be returned in such cases.
> 

Thanks for the input, but TBH I don't use LILO for a long time, and I don't
remember exactly how it works.

Anyway, I have 2 bugs to fix in this code, after I can get this series in, one
is the overflow we'll probably need kernel-api approval, and another one is the
acceptance of negative values into FIBMAP, which we have no protection at all.
I'll fix both once I can get the main series in.

Cheers


> 
> Cheers, Andreas
> 
> 
> 
> 
>
diff mbox series

Patch

diff --git a/fs/ioctl.c b/fs/ioctl.c
index fef3a6bf7c78..6b589c873bc2 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -53,19 +53,28 @@  EXPORT_SYMBOL(vfs_ioctl);
 
 static int ioctl_fibmap(struct file *filp, int __user *p)
 {
-	struct address_space *mapping = filp->f_mapping;
-	int res, block;
+	struct inode *inode = file_inode(filp);
+	int error, ur_block;
+	sector_t block;
 
-	/* do we support this mess? */
-	if (!mapping->a_ops->bmap)
-		return -EINVAL;
 	if (!capable(CAP_SYS_RAWIO))
 		return -EPERM;
-	res = get_user(block, p);
-	if (res)
-		return res;
-	res = mapping->a_ops->bmap(mapping, block);
-	return put_user(res, p);
+
+	error = get_user(ur_block, p);
+	if (error)
+		return error;
+
+	block = ur_block;
+	error = bmap(inode, &block);
+
+	if (error)
+		ur_block = 0;
+	else
+		ur_block = block;
+
+	error = put_user(ur_block, p);
+
+	return error;
 }
 
 /**