[3/5] fs: remove ki_nbytes

Message ID	1422381313-24034-4-git-send-email-hch@lst.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Christoph Hellwig <hch@lst.de> To: Al Viro <viro@zeniv.linux.org.uk> Cc: Miklos Szeredi <mszeredi@suse.cz>, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 3/5] fs: remove ki_nbytes Date: Tue, 27 Jan 2015 18:55:11 +0100 Message-Id: <1422381313-24034-4-git-send-email-hch@lst.de> In-Reply-To: <1422381313-24034-1-git-send-email-hch@lst.de> References: <1422381313-24034-1-git-send-email-hch@lst.de> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Christoph Hellwig Jan. 27, 2015, 5:55 p.m. UTC

There is no need to pass the total request length in the kiocb.  We
already have it in the iov_iter, and for those few methods not converted
to iov_iter we can easily calculate it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/staging/android/logger.c   |  2 +-
 drivers/usb/gadget/function/f_fs.c |  4 ++--
 drivers/usb/gadget/legacy/inode.c  | 10 +++++-----
 fs/aio.c                           | 34 ++++++++++++++++++----------------
 fs/ceph/file.c                     |  2 +-
 fs/nfs/direct.c                    |  2 +-
 fs/ocfs2/file.c                    |  8 +++-----
 fs/read_write.c                    |  8 --------
 fs/udf/file.c                      |  2 +-
 include/linux/aio.h                |  1 -
 kernel/printk/printk.c             |  2 +-
 mm/page_io.c                       |  1 -
 net/socket.c                       |  3 +--
 13 files changed, 34 insertions(+), 45 deletions(-)

Al Viro Jan. 31, 2015, 6:08 a.m. UTC | #1

> @@ -717,14 +718,14 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  		unsigned long nr_segs, loff_t o)
>  {
>  	struct ep_data		*epdata = iocb->ki_filp->private_data;
> +	size_t			len = iov_length(iov, nr_segs);
>  	char			*buf;
> -	size_t			len = 0;
>  	int			i = 0;
>  
>  	if (unlikely(!usb_endpoint_dir_in(&epdata->desc)))
>  		return -EINVAL;
>  
> -	buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
> +	buf = kmalloc(len, GFP_KERNEL);
>  	if (unlikely(!buf))
>  		return -ENOMEM;
>  
> @@ -734,7 +735,6 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  			kfree(buf);
>  			return -EFAULT;
>  		}
> -		len += iov[i].iov_len;
>  	}
>  	return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0);

WTF bother?  Just switch it to ->write_iter():

static ssize_t
ep_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct ep_data *epdata = iocb->ki_filp->private_data;
	size_t len = iov_iter_count(from);
	void *buf;

	if (unlikely(!usb_endpoint_dir_in(&epdata->desc)))
		return -EINVAL;

	buf = kmalloc(iov_iter_count(from), GFP_KERNEL);
	if (unlikely(!buf))
		return -ENOMEM;

	if (copy_from_iter(buf, from, len) != len) {
		kfree(buf);
		return -EFAULT;
	}

	return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0);
}

and be done with that.  I'll put drivers/usb/gadget patches into a stable
branch and ask use folks to pull from it - that's the simplest of this
series, actually...

> @@ -903,10 +903,9 @@ static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
>  	if (pos != 0)
>  		return -ESPIPE;
>  
> -	if (iocb->ki_nbytes == 0)	/* Match SYS5 behaviour */
> +	if (!iov_length(iov, nr_segs))	/* Match SYS5 behaviour */
>  		return 0;

FWIW, it's switched to ->read_iter/->write_iter in vfs.git#for-davem.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Feb. 2, 2015, 8:07 a.m. UTC | #2

On Sat, Jan 31, 2015 at 06:08:41AM +0000, Al Viro wrote:
> and be done with that.  I'll put drivers/usb/gadget patches into a stable
> branch and ask use folks to pull from it - that's the simplest of this
> series, actually...

use folks?  Btw, does this mean you patches to switch it over, or do
you want me to finish my conversion.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 2, 2015, 8:11 a.m. UTC | #3

On Mon, Feb 02, 2015 at 09:07:29AM +0100, Christoph Hellwig wrote:
> On Sat, Jan 31, 2015 at 06:08:41AM +0000, Al Viro wrote:
> > and be done with that.  I'll put drivers/usb/gadget patches into a stable
> > branch and ask use folks to pull from it - that's the simplest of this
> > series, actually...
> 
> use folks?  Btw, does this mean you patches to switch it over, or do

usb.

> you want me to finish my conversion.

I have f_fs.c conversion and partial legacy/inode.c one.  Will post tomorrow,
about to fall asleep right now...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 2, 2015, 8:14 a.m. UTC | #4

On Mon, Feb 02, 2015 at 08:11:12AM +0000, Al Viro wrote:
> On Mon, Feb 02, 2015 at 09:07:29AM +0100, Christoph Hellwig wrote:
> > On Sat, Jan 31, 2015 at 06:08:41AM +0000, Al Viro wrote:
> > > and be done with that.  I'll put drivers/usb/gadget patches into a stable
> > > branch and ask use folks to pull from it - that's the simplest of this
> > > series, actually...
> > 
> > use folks?  Btw, does this mean you patches to switch it over, or do
> 
> usb.
> 
> > you want me to finish my conversion.
> 
> I have f_fs.c conversion and partial legacy/inode.c one.  Will post tomorrow,
> about to fall asleep right now...

FWIW, here's the current situation with read/write methods:

1) a bunch of file_operations instances have only ->read() and/or ->write();
no ->aio_... or ->..._iter versions, no ->splice_write().  Those are basically
device drivers:
	readv is equivalent to loop over segments, with read on each.
	writev is equivalent to loop over segments, with write on each.
	AIO read and write are all synchronous
	splice_write is does kmap on each buffer in turn and write that
Note that splitting a buffer into several adjacent pieces and submitting
a vectored IO is *not* guaranteed to be equivalent to plain IO on the entire
buffer - it has datagram-like semantics and boundaries do matter.

2) regular ->read_iter/->write_iter users:
	->read is new_sync_read() (or NULL, if ->read_iter is NULL)
	->write is new_sync_write() (or NULL, if ->write_iter is NULL)
	->aio_read and ->aio_write are NULL
	->splice_write is iter_splice_write (or NULL, if ->write_iter is NULL)
Those are stream-style ones; what matter is concatenation of the buffers, not
the boundaries.  Right now the following is true: ->read_iter/->write_iter on
a sync kiocb never returns EIOCBQUEUED and iterator argument is advanced
exactly by the amount of data transferred.

3) ones like (2), but with NULL ->splice_write() in spite of present
->write_iter().  AFAICS they can be bulk-converted to (2).  This is what the
bulk of irregularities boil down to.

4) ones like (2) or (3), except ->read() and ->write() are left NULL instead
of doing new_sync_{read,write}().  Mostly equivalent to (2); some codepaths
call ->read() or ->write() directly, instead of going through vfs_... wrappers,
and for those it can be a bit of a headache.  In any case, there are very
few such instances (only 3) and they are trivial to convert to (2).

5) Two instances in fs/9p have ->write_iter() *and* ->write(), the latter not
being new_sync_write().  Ditto for ->read_iter() and ->read().  No other
instance has such combinations.  FWIW, they are *almost* regular - their
->read() and ->write() are new_sync_... unless it's an O_DIRECT file.
It might make sense to try and convert those to ->direct_IO(); as it is, their
O_DIRECT is ignored for readv()/writev(), which is arguably a bug.

6) drivers/char/mem.c stuff; they are equivalent to (2), but somewhat
optimised.  Some of that might make sense to convert to (2); there _are_ hot
paths involved, so we'd better be careful there.

7) bad_file_ops.  It is equivalent to (2); it's also a very special
case.  FWIW, I'm not at all sure that we *need* most of the methods in
there.  We never replace ->f_op of an opened file with that, so if we
end up with that sucker, it happens on fresh open of something that had
its ->i_fop replaced with bad_file_ops.  And that will instantly fail in
bad_file_open().  Why do we need the rest of the methods in there?
AFAICS, we should drop all but ->open().

8) hypfs - AFAICS, converts to (2) easily; I'll do that.

9) FUSE - ->aio_read/->aio_write user, with unusual ->splice_write as
well.  I _think_ it can be converted to (2), but that'll take a bit of
work.

10) sockets; conversion to ->read_iter/->write_iter had been posted to
netdev, ->splice_write() (the only user of ->sendpage(), BTW) is a work
for the next cycle.

11) NTFS with rw support enabled.  ->aio_write() user, needs to be converted
to ->write_iter().  AFAICS, it wasn't particularly high priority for Anton
(along with all rw stuff in fs/ntfs); it doesn't look terribly hard, but then
it wasn't a high priority for me either.

12) virtio_console.  Interesting ->splice_write() there; hadn't looked deep
enough.

13) two odd drivers/usb/gadget instances.  I have conversion for f_fs.c,
but legacy/inode.c (ep_read() et.al.) is trickier.  The problem in there
is that writev() on a single-element vector is *not* equivalent to plain
write().  The former treats the wrong-direction endpoint as EINVAL; the
latter does
                if (usb_endpoint_xfer_isoc(&data->desc)) {
                        mutex_unlock(&data->lock);
                        return -EINVAL;
                }
                DBG (data->dev, "%s halt\n", data->name);
                spin_lock_irq (&data->dev->lock);
                if (likely (data->ep != NULL))
                        usb_ep_set_halt (data->ep);
                spin_unlock_irq (&data->dev->lock);
                mutex_unlock(&data->lock);
                return -EBADMSG;
instead.  IOW, for isochronous endpoints behaviour is the same, but the
rest behaves differently.  If not for that, that sucker would convert
to (3) easily; ->splice_write() semantics is potentially interesting -
the question is where do we want transfer boundaries.  As it is, writev()
collects the entire iovec and shoves it into single transfer; splice() to
that thing ends up with each pipe_buf going in a separate transfer, mergable
or not.  I would really appreciate comments on semantics of that thing from
USB folks...

14) ipathfs and qibfs: seriously different semantics for write and writev/AIO
write.  As in "different set of commands recognized"; AIO write plays like
writev, whether it's vectored or not (and it's always synchronous).
I've no idea who had come up with that... highly innovative API or why
hadn't they simply added two files (it's all on their virtual filesystem,
so they had full control of layout) rather that multiplexing two different
command sets in such a fashion.

15) /dev/snd/pcmC*D*[cp].  Again, different semantics for write and writev,
with the latter wanting nr_seqs equal to the number of channels.  AIO
non-vectored write fails unless there's only one channel.  Not sure how
ALSA userland uses that thing; AIO side is always synchronous, so it might
be simply never used.  FWIW, I'm not sure that write() on a single-channel
one is equivalent to 1-element writev() - silencing-related logics seem to
differ.

That's it.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Feb. 2, 2015, 2:20 p.m. UTC | #5

On Mon, Feb 02, 2015 at 08:11:12AM +0000, Al Viro wrote:
> > you want me to finish my conversion.
> 
> I have f_fs.c conversion and partial legacy/inode.c one.  Will post tomorrow,
> about to fall asleep right now...

Already.  I've already fixed up the remaining commets on the series,
so if you post it soon I'll have the series one for this merge window.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Feb. 2, 2015, 2:26 p.m. UTC | #6

On Mon, Feb 02, 2015 at 08:14:31AM +0000, Al Viro wrote:
> 13) two odd drivers/usb/gadget instances.  I have conversion for f_fs.c,
> but legacy/inode.c (ep_read() et.al.) is trickier.  The problem in there
> is that writev() on a single-element vector is *not* equivalent to plain
> write().  The former treats the wrong-direction endpoint as EINVAL; the
> latter does
>                 if (usb_endpoint_xfer_isoc(&data->desc)) {
>                         mutex_unlock(&data->lock);
>                         return -EINVAL;
>                 }
>                 DBG (data->dev, "%s halt\n", data->name);
>                 spin_lock_irq (&data->dev->lock);
>                 if (likely (data->ep != NULL))
>                         usb_ep_set_halt (data->ep);
>                 spin_unlock_irq (&data->dev->lock);
>                 mutex_unlock(&data->lock);
>                 return -EBADMSG;
> instead.  IOW, for isochronous endpoints behaviour is the same, but the
> rest behaves differently.  If not for that, that sucker would convert
> to (3) easily;

I would bet the behavior difference is a bug, might be worth to Cc the
usb folks on this issue.  I bet we'd want the more complex behavior
for both variants.

> 14) ipathfs and qibfs: seriously different semantics for write and writev/AIO
> write.  As in "different set of commands recognized"; AIO write plays like
> writev, whether it's vectored or not (and it's always synchronous).
> I've no idea who had come up with that... highly innovative API or why
> hadn't they simply added two files (it's all on their virtual filesystem,
> so they had full control of layout) rather that multiplexing two different
> command sets in such a fashion.
> 
> 15) /dev/snd/pcmC*D*[cp].  Again, different semantics for write and writev,
> with the latter wanting nr_seqs equal to the number of channels.  AIO
> non-vectored write fails unless there's only one channel.  Not sure how
> ALSA userland uses that thing; AIO side is always synchronous, so it might
> be simply never used.  FWIW, I'm not sure that write() on a single-channel
> one is equivalent to 1-element writev() - silencing-related logics seem to
> differ.

For these weirdos we can pass down a flag in the kiocb about the source
of the I/O.  We'll need that flags field for non-blocking buffered reads
and per-I/O O_SYNC anyway, and it will be very useful for fixing the
races around changing the O_DIRECT flag at run time.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 4, 2015, 8:34 a.m. UTC | #7

[USB folks Cc'd]

On Mon, Feb 02, 2015 at 03:26:17PM +0100, Christoph Hellwig wrote:

> I would bet the behavior difference is a bug, might be worth to Cc the
> usb folks on this issue.  I bet we'd want the more complex behavior
> for both variants.

[Context for USB people: The difference in question is what ep_read() does
when it is called on write endpoint that isn't isochronous; it halts the
sucker and fails with EBADMSG, while ep_aio_read() handles all write endpoints
as isochronous ones - fails with EINVAL; FWIW, I agree that it's probably
a bug]

Sadly, that's not the only problem in there ;-/  _This_ one really has
the "what if single-segment AIO read tries to dereference iovec after
the caller is gone" bug you suspected in fs/direct-io.c; we have
static void ep_user_copy_worker(struct work_struct *work)
{
        struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work);
        struct mm_struct *mm = priv->mm;
        struct kiocb *iocb = priv->iocb;
        size_t ret;

        use_mm(mm); 
        ret = ep_copy_to_user(priv);
        unuse_mm(mm);

        /* completing the iocb can drop the ctx and mm, don't touch mm after */
        aio_complete(iocb, ret, ret);

        kfree(priv->buf);
        kfree(priv);
}
called via schedule_work() from ->complete() of usb_request allocated and
queued by ->aio_read().  It very definitely _can_ be executed after return
from ->aio_read() and aio_run_iocb().  And ep_copy_to_user() dereferences
the iovec given to ->aio_read(); _not_ its copy as f_fs.c does.

Do io_submit(2) with several IOCB_CMD_PREAD requests, and you'll almost
certainly get the data from the first one copied to the destination of
the second one instead.  It shouldn't be hard to reproduce.  And that,
of course, is not the worst possible outcome...

I'm going to add copying of iovec in async read case.  And AFAICS, that one
is -stable fodder.  See vfs.git#gadget for f_fs.c conversion; I haven't
pushed legacy/inode.c stuff yet - I need to pull the fix of the bug above
into the beginning of that pile first.

FWIW, I don't believe that it makes sense to do iovec copying in
aio_run_iocb(); note that most of the instances will be done with
iovec before they return there.  These two were the sole exceptions;
function/f_fs.c did copying, legacy/inode.c didn't.  Most of the
->aio_read/->read_iter instances (including ones that *do* return
EIOCBQUEUED) only access iovec synchronously; usually that's done
by grabbing the pages to copy into before we get aronud to starting
IO.  legacy/inode.c is the only instance to step into that kind of bug.
function/f_fs.c also had a fun bug, BTW - failure in AIO ended up leaking
io_data (plus iovec copy in case of aio_read()).  Looks like another
-stable fodder, if less serious one...  See b17d2ded6 (gadget/function/f_fs.c:
close leaks) in vfs.git#gadget for that one.

I plan to pull the fix for use-after-free in the beginning of that queue
(in an easy to backport form) and then have ep_aio_read/ep_aio_write
start doing the halt-related bits as in ep_read/ep_write.  With that it's
trivial to convert that sucker along the same lines as function/f_fs.c.

All of that, assuming that anybody gives a damn about the driver in question.
The things like
                        spin_lock_irq (&dev->lock);
			....
// FIXME don't call this with the spinlock held ...
                                if (copy_to_user (buf, dev->req->buf, len))
seem to indicate that nobody does, seeing that this bug had been there
since 2003, complete with FIXME ;-/

If nobody cares about that sucker, git rm would be a better solution, IMO...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Alan Stern Feb. 4, 2015, 6:17 p.m. UTC | #8

On Wed, 4 Feb 2015, Al Viro wrote:

> [USB folks Cc'd]

Incidentally, Al, have you seen this email?

	http://marc.info/?l=linux-usb&m=142295011402339&w=2

I encouraged the writer to send in a patch but so far there has been no 
reply.

> On Mon, Feb 02, 2015 at 03:26:17PM +0100, Christoph Hellwig wrote:
> 
> > I would bet the behavior difference is a bug, might be worth to Cc the
> > usb folks on this issue.  I bet we'd want the more complex behavior
> > for both variants.
> 
> [Context for USB people: The difference in question is what ep_read() does
> when it is called on write endpoint that isn't isochronous;

You're talking about drivers/usb/gadget/legacy/inode.c, right?

>  it halts the
> sucker and fails with EBADMSG, while ep_aio_read() handles all write endpoints
> as isochronous ones - fails with EINVAL; FWIW, I agree that it's probably
> a bug]

It's not a bug; it's by design.  That's how you halt an endpoint in 
gadgetfs -- by doing a synchronous I/O call in the "wrong" direction.

> Sadly, that's not the only problem in there ;-/  _This_ one really has
> the "what if single-segment AIO read tries to dereference iovec after
> the caller is gone" bug you suspected in fs/direct-io.c; we have
> static void ep_user_copy_worker(struct work_struct *work)
> {
>         struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work);
>         struct mm_struct *mm = priv->mm;
>         struct kiocb *iocb = priv->iocb;
>         size_t ret;
> 
>         use_mm(mm); 
>         ret = ep_copy_to_user(priv);
>         unuse_mm(mm);
> 
>         /* completing the iocb can drop the ctx and mm, don't touch mm after */
>         aio_complete(iocb, ret, ret);
> 
>         kfree(priv->buf);
>         kfree(priv);
> }
> called via schedule_work() from ->complete() of usb_request allocated and
> queued by ->aio_read().  It very definitely _can_ be executed after return
> from ->aio_read() and aio_run_iocb().  And ep_copy_to_user() dereferences
> the iovec given to ->aio_read(); _not_ its copy as f_fs.c does.
> 
> Do io_submit(2) with several IOCB_CMD_PREAD requests, and you'll almost
> certainly get the data from the first one copied to the destination of
> the second one instead.  It shouldn't be hard to reproduce.  And that,
> of course, is not the worst possible outcome...
> 
> I'm going to add copying of iovec in async read case.  And AFAICS, that one
> is -stable fodder.  See vfs.git#gadget for f_fs.c conversion; I haven't
> pushed legacy/inode.c stuff yet - I need to pull the fix of the bug above
> into the beginning of that pile first.
> 
> FWIW, I don't believe that it makes sense to do iovec copying in
> aio_run_iocb(); note that most of the instances will be done with
> iovec before they return there.

That's true even for gadgetfs in the write case.

>  These two were the sole exceptions;
> function/f_fs.c did copying, legacy/inode.c didn't.  Most of the
> ->aio_read/->read_iter instances (including ones that *do* return
> EIOCBQUEUED) only access iovec synchronously; usually that's done
> by grabbing the pages to copy into before we get aronud to starting
> IO.  legacy/inode.c is the only instance to step into that kind of bug.
> function/f_fs.c also had a fun bug, BTW - failure in AIO ended up leaking
> io_data (plus iovec copy in case of aio_read()).  Looks like another
> -stable fodder, if less serious one...  See b17d2ded6 (gadget/function/f_fs.c:
> close leaks) in vfs.git#gadget for that one.
> 
> I plan to pull the fix for use-after-free in the beginning of that queue
> (in an easy to backport form) and then have ep_aio_read/ep_aio_write
> start doing the halt-related bits as in ep_read/ep_write.  With that it's
> trivial to convert that sucker along the same lines as function/f_fs.c.

I don't think there's any need to make the async routines do the
halt-related stuff.  After all, it's silly for users to call an async
I/O routine to perform a synchronous action like halting an endpoint.

On the other hand, it would be reasonable to replace the -EBADMSG with
some massaged version of the return code from usb_ep_set_halt(), which
is supposed to return -EAGAIN under some circumstances.  But that would
be an API change, so we probably shouldn't do it...

> All of that, assuming that anybody gives a damn about the driver in question.
> The things like
>                         spin_lock_irq (&dev->lock);
> 			....
> // FIXME don't call this with the spinlock held ...
>                                 if (copy_to_user (buf, dev->req->buf, len))
> seem to indicate that nobody does, seeing that this bug had been there
> since 2003, complete with FIXME ;-/
> 
> If nobody cares about that sucker, git rm would be a better solution, IMO...

It is a legacy driver after all, but some people still use it.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 4, 2015, 7:06 p.m. UTC | #9

On Wed, Feb 04, 2015 at 01:17:30PM -0500, Alan Stern wrote:
> On Wed, 4 Feb 2015, Al Viro wrote:
> 
> > [USB folks Cc'd]
> 
> Incidentally, Al, have you seen this email?
> 
> 	http://marc.info/?l=linux-usb&m=142295011402339&w=2
> 
> I encouraged the writer to send in a patch but so far there has been no 
> reply.

Yecchhh...  Anything that changes ->f_op *after* return from ->open() is
doing a nasty, nasty thing.  What's to guarantee that any checks for
NULL fields will stay valid, etc.?

FWIW, in all the tree there are only 4 places where that would be happening;
	* i810_map_buffer() screwing around with having vm_mmap() done,
only it wants its own thing called as ->mmap() (and a bit of extra data
stashed for it).  Racy as hell (if another thread calls mmap() on the
same file, you'll get a nasty surprise).  Driver's too old and brittle to
touch, according to drm folks...
	* TTY hangup logics.  Nasty (and might be broken around ->fasync()),
but it's a very special case.
	* snd_card_disconnect().  Analogue of TTY hangup, actually; both are
trying to do a form of revoke().
	* this one.  Note that you are not guaranteed that ep_config() won't
be called more than once - two threads might race in write(2), with the loser
getting through mutex_lock_interruptible(&data->lock); in ep_config() only
after the winner has already gotten through write(), switched ->f_op, returned
to userland and started doing read()/write()/etc.  If nothing else,
the contents of data->desc and data->hs_desc can be buggered by arbitrary
data, no matter how bogus, right as the first thread is doing IO.

> > [Context for USB people: The difference in question is what ep_read() does
> > when it is called on write endpoint that isn't isochronous;
> 
> You're talking about drivers/usb/gadget/legacy/inode.c, right?

Yes.

> >  it halts the
> > sucker and fails with EBADMSG, while ep_aio_read() handles all write endpoints
> > as isochronous ones - fails with EINVAL; FWIW, I agree that it's probably
> > a bug]
> 
> It's not a bug; it's by design.  That's how you halt an endpoint in 
> gadgetfs -- by doing a synchronous I/O call in the "wrong" direction.

Yes, but you have readv() on single-element vector behave different from
read(), which is surprising, to put it mildly.

> > I plan to pull the fix for use-after-free in the beginning of that queue
> > (in an easy to backport form) and then have ep_aio_read/ep_aio_write
> > start doing the halt-related bits as in ep_read/ep_write.  With that it's
> > trivial to convert that sucker along the same lines as function/f_fs.c.
> 
> I don't think there's any need to make the async routines do the
> halt-related stuff.  After all, it's silly for users to call an async
> I/O routine to perform a synchronous action like halting an endpoint.

Um...  readv() is also going through ->aio_read().  I can tie that to
sync vs. async, though - is_sync_kiocb() will do just that, if you are
OK with having readv() act the same as read() in that respect.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Alan Stern Feb. 4, 2015, 8:30 p.m. UTC | #10

On Wed, 4 Feb 2015, Al Viro wrote:

> On Wed, Feb 04, 2015 at 01:17:30PM -0500, Alan Stern wrote:
> > On Wed, 4 Feb 2015, Al Viro wrote:
> > 
> > > [USB folks Cc'd]
> > 
> > Incidentally, Al, have you seen this email?
> > 
> > 	http://marc.info/?l=linux-usb&m=142295011402339&w=2
> > 
> > I encouraged the writer to send in a patch but so far there has been no 
> > reply.
> 
> Yecchhh...  Anything that changes ->f_op *after* return from ->open() is
> doing a nasty, nasty thing.  What's to guarantee that any checks for
> NULL fields will stay valid, etc.?
> 
> FWIW, in all the tree there are only 4 places where that would be happening;
> 	* i810_map_buffer() screwing around with having vm_mmap() done,
> only it wants its own thing called as ->mmap() (and a bit of extra data
> stashed for it).  Racy as hell (if another thread calls mmap() on the
> same file, you'll get a nasty surprise).  Driver's too old and brittle to
> touch, according to drm folks...
> 	* TTY hangup logics.  Nasty (and might be broken around ->fasync()),
> but it's a very special case.
> 	* snd_card_disconnect().  Analogue of TTY hangup, actually; both are
> trying to do a form of revoke().
> 	* this one.  Note that you are not guaranteed that ep_config() won't
> be called more than once - two threads might race in write(2), with the loser
> getting through mutex_lock_interruptible(&data->lock); in ep_config() only
> after the winner has already gotten through write(), switched ->f_op, returned
> to userland and started doing read()/write()/etc.  If nothing else,
> the contents of data->desc and data->hs_desc can be buggered by arbitrary
> data, no matter how bogus, right as the first thread is doing IO.

Well, this one certainly can be fixed to avoid altering ->f_op, at the 
cost of adding an extra check at the start of each I/O operation.

> > >  it halts the
> > > sucker and fails with EBADMSG, while ep_aio_read() handles all write endpoints
> > > as isochronous ones - fails with EINVAL; FWIW, I agree that it's probably
> > > a bug]
> > 
> > It's not a bug; it's by design.  That's how you halt an endpoint in 
> > gadgetfs -- by doing a synchronous I/O call in the "wrong" direction.
> 
> Yes, but you have readv() on single-element vector behave different from
> read(), which is surprising, to put it mildly.
> 
> > > I plan to pull the fix for use-after-free in the beginning of that queue
> > > (in an easy to backport form) and then have ep_aio_read/ep_aio_write
> > > start doing the halt-related bits as in ep_read/ep_write.  With that it's
> > > trivial to convert that sucker along the same lines as function/f_fs.c.
> > 
> > I don't think there's any need to make the async routines do the
> > halt-related stuff.  After all, it's silly for users to call an async
> > I/O routine to perform a synchronous action like halting an endpoint.
> 
> Um...  readv() is also going through ->aio_read().

Why does readv() do this but not read()?  Wouldn't it make more sense 
to have all the read* calls use the same internal interface?

>  I can tie that to
> sync vs. async, though - is_sync_kiocb() will do just that, if you are
> OK with having readv() act the same as read() in that respect.

I don't really care one way or the other.  In fact, it doesn't matter
if the same behavior applies to all the async calls as well as the sync
calls -- I just doubt that anybody will ever use them.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 4, 2015, 11:07 p.m. UTC | #11

On Wed, Feb 04, 2015 at 03:30:32PM -0500, Alan Stern wrote:
> > 	* this one.  Note that you are not guaranteed that ep_config() won't
> > be called more than once - two threads might race in write(2), with the loser
> > getting through mutex_lock_interruptible(&data->lock); in ep_config() only
> > after the winner has already gotten through write(), switched ->f_op, returned
> > to userland and started doing read()/write()/etc.  If nothing else,
> > the contents of data->desc and data->hs_desc can be buggered by arbitrary
> > data, no matter how bogus, right as the first thread is doing IO.
> 
> Well, this one certainly can be fixed to avoid altering ->f_op, at the 
> cost of adding an extra check at the start of each I/O operation.
 
> > Um...  readv() is also going through ->aio_read().
> 
> Why does readv() do this but not read()?  Wouldn't it make more sense 
> to have all the read* calls use the same internal interface?

Because there are two partially overlapping classes wrt vector IO semantics:
	1) datagram-style.  Vectored read/write is equivalent to simple
read/write done on each vector component.  And IO boundaries matter - if
your driver treats any write() as datagram that starts e.g. with
fixed-sized table in the beginning + arbitrary amount of data following
it, you will get very different results from write(fd, buf, 200) and
writev(fd, (struct iovec[2]){{buf, 100}, {buf+100, 100}}, 2).  A _lot_ of
drivers are like that - they supply ->read() and ->write() for single-range
IO and VFS construct the rest of operations out of those.

	2) stream-style.  Vectored read is guaranteed to behave the same
way as simple read on a range with size being the sum of vector element
sizes, except that the data ends in ranges covered by vector elements instead
of a single array.  Vectored write is guaranteed to behave the same way
as simple write from a buffer containing the concatenation of ranges covered
by vector elements.  Boundaries between the elements do not matter at all.
Regular files on storage filesystems are like that.  So are FIFOs and pipes
and so are sockets.  Even for datagram protocols, boundaries between the
vector elements are ignored; boundaries between syscalls provide the datagram
boundaries, but you can e.g. do writev(udp_socket_fd, (struct iovec[3]){
{const_header, sizeof(const_header)}, {&n, 4}, {s, strlen(s)}}, 3) and have
only one UDP packet sent.  IOW, it's general-purpose scatter-gather for read
and write.

	The last example shows that (2) isn't a subset of (1) - it's not
always possible to call ->write() in loop and get the right behaviour.
For regular files (and pure stream sockets, etc.) it would work, but for
stuff like UDP sockets it would break.  Moreover, even for regular files on
storage filesystems it would be quite inefficient - we'd need to acquire and
release a bunch of locks, poke through metadata, etc., for each segment.

	As the result, there was a couple of new methods added, inventively
called ->readv() and ->writev().  do_sync_read() was supposed to be used
as ->read() instance - it's "feed a single-element vector to ->readv()" and
similar for s/read/write/.

	Note that both (1) and (2) satisfy the following sanity requirement -
single-element readv() is always equivalent to single-element() read().  You
could violate that, by providing completely unrelated ->read() and ->readv(),
but very few drivers went for that - too insane.

	Then, when AIO had been added, those had grown an argument pointing
to iocb (instead of file and ppos - for those we use iocb->ki_filp and
&iocb->ki_pos resp.) and they got renamed into ->aio_read() and ->aio_write().
Note that non-vectored AIO uses the same methods - ->read() and ->write() had
too many instances to convert and most of those would end up just using those
two iocb fields instead of the old arguments - tons of churn for no good
reason.  ->readv() and ->writev() had fewer instances (very common was the
use of generic_file_aio_{read,write}()) and conversion was less painful.
So there was no ->aio_read() and ->aio_write().  That, in principle, was a
bit of constraint - you couldn't make single-element AIO_PREADV behave
different from AIO_PREAD, but nobody had been insane enough to ask for that.

	Moreover, keeping ->readv() and ->writev() was pointless. There is
cheap way to tell whether ->aio_{read,write}() call is due to io_submit(2)
or to readv()/writev() - is_sync_kiocb(iocb) tells which one it is, so if
driver really wanted different semantics for sync vs. async, it could check
that.

	So we ended up with ->read/->write for sync non-vectored and
->aio_read()/->aio_write() for sync vectored *and* async anything.  Usually
you provide one or the other - NULL ->aio_... means loop calling ->read/write
on each element, NULL ->read/write (or do_sync_... for them - it's the same
thing) means feeding sync iocb and single-element vector to ->aio_....
You *can* provide both, but that's rare and almost always pointless.

	These days ->aio_{read,write} is almost gone - replaced with
->{read,write}_iter().  That sucker is equivalent, except that you
get a pointer struct iov_iter instead iovec, nr_seg, size triple.  And
you use linux/uio.h primitives for accessing the data (it might be backed
by something other than userland ranges).  The primitives in there are
not harder to use than copy_..._user(), and you avoid any need to loop over
iovec array, etc.  Those primitives generally don't give a damn about the
range boundaries; the only exception is iov_iter_single_seg_count(), which
tells how much data is left to be consumed in the current segment; very
few things are using it.  is_sync_kiocb() is still available to tell io_submit
from synchronous syscalls.

	I don't believe that it's worth switching the (1) \setminus (2) to
those; it's still too much churn, plus we'd need the loop over segments
somewhere to keep the semantics.  It *can* be expressed by ->read_iter()
and ->write_iter(), but it'll be tons of boilerplate code in a lot of
drivers.  So ->read() and ->write() are there to stay.  However, I think
that we can get to the situation when it's really either one or another -
if you have ->read_iter()/->write_iter(), don't mess with custom ->read()
and ->write().

	Both function/f_fs.c and legacy/inode.c are in class (2) - they
gather data from iovec into temp buffer in ->aio_write() and pass the buffer to 
the code doing actual IO, and on ->aio_read() side they give a buffer to
read into, then arrange for it to be scattered into our iovec upon IO
completion.  And they are doing non-trivial async stuff, so I'd prefer to
switch them completely to ->read_iter/->write_iter.  The only obstacle is
read vs. single-element readv and write vs. single-element writev behaviour
difference.

	For some files it really can't be helped - {ipath,qib}fs has completely
unrelated sets of commands accepted by write and writev, including the
single-element one.  And /dev/snd/pcm*, while somewhat milder, has non-trivial
behaviour differences.  I think that trick suggested by hch (put several
flags in iocb, encoding the reason for the call; that would allow to tell
one from another) is the best way to deal with those.  But I would really
prefer to avoid that; IMO those examples are simply bad userland APIs.
legacy/inode.c is the third (and last) beast like those two.  And there the
difference is small enough to simply change readv/writev to be like read/write
in that respect, i.e. also halt the endpoint when called on the isochronous
one with wrong direction.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Robert Baldyga Feb. 5, 2015, 8:24 a.m. UTC | #12

On 02/05/2015 12:07 AM, Al Viro wrote:
> On Wed, Feb 04, 2015 at 03:30:32PM -0500, Alan Stern wrote:
>>> 	* this one.  Note that you are not guaranteed that ep_config() won't
>>> be called more than once - two threads might race in write(2), with the loser
>>> getting through mutex_lock_interruptible(&data->lock); in ep_config() only
>>> after the winner has already gotten through write(), switched ->f_op, returned
>>> to userland and started doing read()/write()/etc.  If nothing else,
>>> the contents of data->desc and data->hs_desc can be buggered by arbitrary
>>> data, no matter how bogus, right as the first thread is doing IO.
>>
>> Well, this one certainly can be fixed to avoid altering ->f_op, at the 
>> cost of adding an extra check at the start of each I/O operation.
>  
>>> Um...  readv() is also going through ->aio_read().
>>
>> Why does readv() do this but not read()?  Wouldn't it make more sense 
>> to have all the read* calls use the same internal interface?
> 
> Because there are two partially overlapping classes wrt vector IO semantics:
> 	1) datagram-style.  Vectored read/write is equivalent to simple
> read/write done on each vector component.  And IO boundaries matter - if
> your driver treats any write() as datagram that starts e.g. with
> fixed-sized table in the beginning + arbitrary amount of data following
> it, you will get very different results from write(fd, buf, 200) and
> writev(fd, (struct iovec[2]){{buf, 100}, {buf+100, 100}}, 2).  A _lot_ of
> drivers are like that - they supply ->read() and ->write() for single-range
> IO and VFS construct the rest of operations out of those.
> 
> 	2) stream-style.  Vectored read is guaranteed to behave the same
> way as simple read on a range with size being the sum of vector element
> sizes, except that the data ends in ranges covered by vector elements instead
> of a single array.  Vectored write is guaranteed to behave the same way
> as simple write from a buffer containing the concatenation of ranges covered
> by vector elements.  Boundaries between the elements do not matter at all.
> Regular files on storage filesystems are like that.  So are FIFOs and pipes
> and so are sockets.  Even for datagram protocols, boundaries between the
> vector elements are ignored; boundaries between syscalls provide the datagram
> boundaries, but you can e.g. do writev(udp_socket_fd, (struct iovec[3]){
> {const_header, sizeof(const_header)}, {&n, 4}, {s, strlen(s)}}, 3) and have
> only one UDP packet sent.  IOW, it's general-purpose scatter-gather for read
> and write.
> 
> 	The last example shows that (2) isn't a subset of (1) - it's not
> always possible to call ->write() in loop and get the right behaviour.
> For regular files (and pure stream sockets, etc.) it would work, but for
> stuff like UDP sockets it would break.  Moreover, even for regular files on
> storage filesystems it would be quite inefficient - we'd need to acquire and
> release a bunch of locks, poke through metadata, etc., for each segment.
> 
> 	As the result, there was a couple of new methods added, inventively
> called ->readv() and ->writev().  do_sync_read() was supposed to be used
> as ->read() instance - it's "feed a single-element vector to ->readv()" and
> similar for s/read/write/.
> 
> 	Note that both (1) and (2) satisfy the following sanity requirement -
> single-element readv() is always equivalent to single-element() read().  You
> could violate that, by providing completely unrelated ->read() and ->readv(),
> but very few drivers went for that - too insane.
> 
> 	Then, when AIO had been added, those had grown an argument pointing
> to iocb (instead of file and ppos - for those we use iocb->ki_filp and
> &iocb->ki_pos resp.) and they got renamed into ->aio_read() and ->aio_write().
> Note that non-vectored AIO uses the same methods - ->read() and ->write() had
> too many instances to convert and most of those would end up just using those
> two iocb fields instead of the old arguments - tons of churn for no good
> reason.  ->readv() and ->writev() had fewer instances (very common was the
> use of generic_file_aio_{read,write}()) and conversion was less painful.
> So there was no ->aio_read() and ->aio_write().  That, in principle, was a
> bit of constraint - you couldn't make single-element AIO_PREADV behave
> different from AIO_PREAD, but nobody had been insane enough to ask for that.
> 
> 	Moreover, keeping ->readv() and ->writev() was pointless. There is
> cheap way to tell whether ->aio_{read,write}() call is due to io_submit(2)
> or to readv()/writev() - is_sync_kiocb(iocb) tells which one it is, so if
> driver really wanted different semantics for sync vs. async, it could check
> that.
> 
> 	So we ended up with ->read/->write for sync non-vectored and
> ->aio_read()/->aio_write() for sync vectored *and* async anything.  Usually
> you provide one or the other - NULL ->aio_... means loop calling ->read/write
> on each element, NULL ->read/write (or do_sync_... for them - it's the same
> thing) means feeding sync iocb and single-element vector to ->aio_....
> You *can* provide both, but that's rare and almost always pointless.
> 
> 	These days ->aio_{read,write} is almost gone - replaced with
> ->{read,write}_iter().  That sucker is equivalent, except that you
> get a pointer struct iov_iter instead iovec, nr_seg, size triple.  And
> you use linux/uio.h primitives for accessing the data (it might be backed
> by something other than userland ranges).  The primitives in there are
> not harder to use than copy_..._user(), and you avoid any need to loop over
> iovec array, etc.  Those primitives generally don't give a damn about the
> range boundaries; the only exception is iov_iter_single_seg_count(), which
> tells how much data is left to be consumed in the current segment; very
> few things are using it.  is_sync_kiocb() is still available to tell io_submit
> from synchronous syscalls.
> 
> 	I don't believe that it's worth switching the (1) \setminus (2) to
> those; it's still too much churn, plus we'd need the loop over segments
> somewhere to keep the semantics.  It *can* be expressed by ->read_iter()
> and ->write_iter(), but it'll be tons of boilerplate code in a lot of
> drivers.  So ->read() and ->write() are there to stay.  However, I think
> that we can get to the situation when it's really either one or another -
> if you have ->read_iter()/->write_iter(), don't mess with custom ->read()
> and ->write().
> 
> 	Both function/f_fs.c and legacy/inode.c are in class (2) - they
> gather data from iovec into temp buffer in ->aio_write() and pass the buffer to 
> the code doing actual IO, and on ->aio_read() side they give a buffer to
> read into, then arrange for it to be scattered into our iovec upon IO
> completion.  And they are doing non-trivial async stuff, so I'd prefer to
> switch them completely to ->read_iter/->write_iter.  The only obstacle is
> read vs. single-element readv and write vs. single-element writev behaviour
> difference.

No, function/f_fs.c and legacy/inode.c are in class (1). They have
datagram semantics - each vector element is submitted in separate USB
request. Each single request is sent in separate USB data packet (for
bulk endpoints it can be more than one packet). In fact sync
read()/write() also will give different results while called once with
some block of data or in loop with the same block of data splitted into
a few parts.

> 
> 	For some files it really can't be helped - {ipath,qib}fs has completely
> unrelated sets of commands accepted by write and writev, including the
> single-element one.  And /dev/snd/pcm*, while somewhat milder, has non-trivial
> behaviour differences.  I think that trick suggested by hch (put several
> flags in iocb, encoding the reason for the call; that would allow to tell
> one from another) is the best way to deal with those.  But I would really
> prefer to avoid that; IMO those examples are simply bad userland APIs.
> legacy/inode.c is the third (and last) beast like those two.  And there the
> difference is small enough to simply change readv/writev to be like read/write
> in that respect, i.e. also halt the endpoint when called on the isochronous
> one with wrong direction.

It shouldn't be a big problem to halt endpoints in
ep_aio_write()/ep_aio_read() to have similar behaviour for
single-element readv()/writev() and read()/write() operations.

Robert Baldyga
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 5, 2015, 8:47 a.m. UTC | #13

On Thu, Feb 05, 2015 at 09:24:32AM +0100, Robert Baldyga wrote:

> No, function/f_fs.c and legacy/inode.c are in class (1). They have
> datagram semantics - each vector element is submitted in separate USB
> request. Each single request is sent in separate USB data packet (for
> bulk endpoints it can be more than one packet). In fact sync
> read()/write() also will give different results while called once with
> some block of data or in loop with the same block of data splitted into
> a few parts.

No, they don't.  This is from ffs_epfile_io():

                data = kmalloc(data_len, GFP_KERNEL);
                if (unlikely(!data))
                        return -ENOMEM;
                if (io_data->aio && !io_data->read) {
                        int i;
                        size_t pos = 0;
                        for (i = 0; i < io_data->nr_segs; i++) {
                                if (unlikely(copy_from_user(&data[pos],
                                             io_data->iovec[i].iov_base,
                                             io_data->iovec[i].iov_len))) {
                                        ret = -EFAULT;
                                        goto error;
                                }
                                pos += io_data->iovec[i].iov_len;
                        }

and that's the last point where it looks at iovec.  After that all work
is done to the copy in data, where no information about the boundaries
survives.  And ep_aio_write() (in legacy/inode.c) is the same way.

	You are confusing datagram-per-syscall (which they are) with
datagram-per-iovec (which they are definitely not).  IOW, they behave
as UDP sockets - writev() is purely scatter-gather variant of write(),
with datagram per syscall and all vector elements silently concatenated.
That's class 2, and _not_ in its intersection with class 1.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 5, 2015, 9:03 a.m. UTC | #14

On Thu, Feb 05, 2015 at 08:47:29AM +0000, Al Viro wrote:
> 	You are confusing datagram-per-syscall (which they are) with
> datagram-per-iovec (which they are definitely not).  IOW, they behave
> as UDP sockets - writev() is purely scatter-gather variant of write(),
> with datagram per syscall and all vector elements silently concatenated.
> That's class 2, and _not_ in its intersection with class 1.

PS: you want class 1, look at something like /proc/sys/kernel/domainname
(or any other sysctl of that sort).  write "foobar" there and
cat /proc/sys/kernel/domainname will print foorbat.  writev an array consisting
of "foo" and "bar", and you'll see bar afterwards, same as you would
after writing first "foo", then "bar".  There the iovec boundaries affect
the result - ->no aio_write() for that sucker, so we get two calls of
->write(), with expected results.  And there are character devices like that
as well.  _That_ is class 1 outside of intersection with class 2.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Robert Baldyga Feb. 5, 2015, 9:15 a.m. UTC | #15

On 02/05/2015 10:03 AM, Al Viro wrote:
> On Thu, Feb 05, 2015 at 08:47:29AM +0000, Al Viro wrote:
>> 	You are confusing datagram-per-syscall (which they are) with
>> datagram-per-iovec (which they are definitely not).  IOW, they behave
>> as UDP sockets - writev() is purely scatter-gather variant of write(),
>> with datagram per syscall and all vector elements silently concatenated.
>> That's class 2, and _not_ in its intersection with class 1.
> 
> PS: you want class 1, look at something like /proc/sys/kernel/domainname
> (or any other sysctl of that sort).  write "foobar" there and
> cat /proc/sys/kernel/domainname will print foorbat.  writev an array consisting
> of "foo" and "bar", and you'll see bar afterwards, same as you would
> after writing first "foo", then "bar".  There the iovec boundaries affect
> the result - ->no aio_write() for that sucker, so we get two calls of
> ->write(), with expected results.  And there are character devices like that
> as well.  _That_ is class 1 outside of intersection with class 2.
> 

Oh, I see. Thanks.

So we only need to add endpoint halting to aio_read()/aio_write(), to
make their behaviour similar to sync ones, right?

Robert Baldyga
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Alan Stern Feb. 5, 2015, 3:29 p.m. UTC | #16

On Wed, 4 Feb 2015, Al Viro wrote:

> > > Um...  readv() is also going through ->aio_read().
> > 
> > Why does readv() do this but not read()?  Wouldn't it make more sense 
> > to have all the read* calls use the same internal interface?
> 
> Because there are two partially overlapping classes wrt vector IO semantics:
...

Thanks for the detailed explanation.  It appears to boil down to a 
series of historical accidents.

In any case, feel free to copy the non-isochronous behavior of the
synchronous routines in the async routines.  It certainly won't hurt 
anything.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 6, 2015, 7:03 a.m. UTC | #17

On Thu, Feb 05, 2015 at 10:29:42AM -0500, Alan Stern wrote:
> On Wed, 4 Feb 2015, Al Viro wrote:
> 
> > > > Um...  readv() is also going through ->aio_read().
> > > 
> > > Why does readv() do this but not read()?  Wouldn't it make more sense 
> > > to have all the read* calls use the same internal interface?
> > 
> > Because there are two partially overlapping classes wrt vector IO semantics:
> ...
> 
> Thanks for the detailed explanation.  It appears to boil down to a 
> series of historical accidents.
> 
> In any case, feel free to copy the non-isochronous behavior of the
> synchronous routines in the async routines.  It certainly won't hurt 
> anything.

Hmm...  What happens if f_fs.c successfully queues struct usb_request, returns
-EIOCBQUEUED and then gets hit by io_cancel(2)?  AFAICS, you get
ffs_aio_cancel() called, which dequeues usb_request and buggers off.
The thing is, freeing io_data and stuff hanging off it would be done by
ffs_user_copy_worker(), which would be scheduled via schedule_work() by
ffs_epfile_async_io_complete(), i.e. usb_request ->complete() callback.
And usb_ep_dequeue() (aka. ep->ops->dequeue()) has tons of instances, but
AFAICS some of them might not trigger usb_gadget_giveback_request(), which
would normally call ->complete()...

Example:
net2272_dequeue(struct usb_ep *_ep, struct usb_request *_req)
{
        struct net2272_ep *ep;
        struct net2272_request *req;
        unsigned long flags;
        int stopped;

        ep = container_of(_ep, struct net2272_ep, ep);
        if (!_ep || (!ep->desc && ep->num != 0) || !_req)
                return -EINVAL;

        spin_lock_irqsave(&ep->dev->lock, flags);
        stopped = ep->stopped;
        ep->stopped = 1;

        /* make sure it's still queued on this endpoint */
        list_for_each_entry(req, &ep->queue, queue) {
                if (&req->req == _req)
                        break;
        }
        if (&req->req != _req) {
                spin_unlock_irqrestore(&ep->dev->lock, flags);
                return -EINVAL;
        }

        /* queue head may be partially complete */
        if (ep->queue.next == &req->queue) {
                dev_dbg(ep->dev->dev, "unlink (%s) pio\n", _ep->name);
                net2272_done(ep, req, -ECONNRESET);
        }
        req = NULL;
        ep->stopped = stopped;

        spin_unlock_irqrestore(&ep->dev->lock, flags);
        return 0;
}

Note that net2272_done(), which would call usb_gadget_giveback_request(),
is only called if the victim happens to be queue head.  Is that just a
net2272.c bug, or am I missing something subtle here?  Looks like
at least on that hardware io_cancel() could leak io_data and everything
that hangs off it...

FWIW, net2272.c was the first one I looked at (happened to be on the last
line of screen during git grep for \.dequeue in drivers/usb/gadget ;-)
and after checking several more it seems that it's a Sod's Law in action -
I'd checked about 5 of them and they seem to call usb_gadget_giveback_request()
as long as they find the sucker in queue, head or no head.  OTOH, there's
a lot more of those guys, so that observation is not worth much...

IOW, is that a net2272.c bug, or a f_fs.c one? 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Robert Baldyga Feb. 6, 2015, 8:44 a.m. UTC | #18

On 02/06/2015 08:03 AM, Al Viro wrote:
> On Thu, Feb 05, 2015 at 10:29:42AM -0500, Alan Stern wrote:
>> On Wed, 4 Feb 2015, Al Viro wrote:
>>
>>>>> Um...  readv() is also going through ->aio_read().
>>>>
>>>> Why does readv() do this but not read()?  Wouldn't it make more sense 
>>>> to have all the read* calls use the same internal interface?
>>>
>>> Because there are two partially overlapping classes wrt vector IO semantics:
>> ...
>>
>> Thanks for the detailed explanation.  It appears to boil down to a 
>> series of historical accidents.
>>
>> In any case, feel free to copy the non-isochronous behavior of the
>> synchronous routines in the async routines.  It certainly won't hurt 
>> anything.
> 
> Hmm...  What happens if f_fs.c successfully queues struct usb_request, returns
> -EIOCBQUEUED and then gets hit by io_cancel(2)?  AFAICS, you get
> ffs_aio_cancel() called, which dequeues usb_request and buggers off.
> The thing is, freeing io_data and stuff hanging off it would be done by
> ffs_user_copy_worker(), which would be scheduled via schedule_work() by
> ffs_epfile_async_io_complete(), i.e. usb_request ->complete() callback.
> And usb_ep_dequeue() (aka. ep->ops->dequeue()) has tons of instances, but
> AFAICS some of them might not trigger usb_gadget_giveback_request(), which
> would normally call ->complete()...
> 
> Example:
> net2272_dequeue(struct usb_ep *_ep, struct usb_request *_req)
> {
>         struct net2272_ep *ep;
>         struct net2272_request *req;
>         unsigned long flags;
>         int stopped;
> 
>         ep = container_of(_ep, struct net2272_ep, ep);
>         if (!_ep || (!ep->desc && ep->num != 0) || !_req)
>                 return -EINVAL;
> 
>         spin_lock_irqsave(&ep->dev->lock, flags);
>         stopped = ep->stopped;
>         ep->stopped = 1;
> 
>         /* make sure it's still queued on this endpoint */
>         list_for_each_entry(req, &ep->queue, queue) {
>                 if (&req->req == _req)
>                         break;
>         }
>         if (&req->req != _req) {
>                 spin_unlock_irqrestore(&ep->dev->lock, flags);
>                 return -EINVAL;
>         }
> 
>         /* queue head may be partially complete */
>         if (ep->queue.next == &req->queue) {
>                 dev_dbg(ep->dev->dev, "unlink (%s) pio\n", _ep->name);
>                 net2272_done(ep, req, -ECONNRESET);
>         }
>         req = NULL;
>         ep->stopped = stopped;
> 
>         spin_unlock_irqrestore(&ep->dev->lock, flags);
>         return 0;
> }
> 
> Note that net2272_done(), which would call usb_gadget_giveback_request(),
> is only called if the victim happens to be queue head.  Is that just a
> net2272.c bug, or am I missing something subtle here?  Looks like
> at least on that hardware io_cancel() could leak io_data and everything
> that hangs off it...
> 
> FWIW, net2272.c was the first one I looked at (happened to be on the last
> line of screen during git grep for \.dequeue in drivers/usb/gadget ;-)
> and after checking several more it seems that it's a Sod's Law in action -
> I'd checked about 5 of them and they seem to call usb_gadget_giveback_request()
> as long as they find the sucker in queue, head or no head.  OTOH, there's
> a lot more of those guys, so that observation is not worth much...
> 
> IOW, is that a net2272.c bug, or a f_fs.c one? 

AFAIK usb request should be completed in all cases, and many gadget
drivers make assumptions that complete() of each request will be called,
so it's definitely bug in net2272 driver.

Robert Baldyga
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 7, 2015, 5:44 a.m. UTC | #19

On Thu, Feb 05, 2015 at 10:29:42AM -0500, Alan Stern wrote:
> On Wed, 4 Feb 2015, Al Viro wrote:
> 
> > > > Um...  readv() is also going through ->aio_read().
> > > 
> > > Why does readv() do this but not read()?  Wouldn't it make more sense 
> > > to have all the read* calls use the same internal interface?
> > 
> > Because there are two partially overlapping classes wrt vector IO semantics:
> ...
> 
> Thanks for the detailed explanation.  It appears to boil down to a 
> series of historical accidents.
> 
> In any case, feel free to copy the non-isochronous behavior of the
> synchronous routines in the async routines.  It certainly won't hurt 
> anything.

OK, I've limited it to sync ones, actually.  Preliminary series in followups.
Those who prefer to read in git, see
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #gadget

WARNING: completely untested

Al Viro (6):
      new helper: dup_iter()
      gadget/function/f_fs.c: close leaks
      gadget/function/f_fs.c: use put iov_iter into io_data
      gadget/function/f_fs.c: switch to ->{read,write}_iter()
      gadgetfs: use-after-free in ->aio_read()
      gadget: switch ep_io_operations to ->read_iter/->write_iter

 drivers/usb/gadget/function/f_fs.c | 204 +++++++++-------------
 drivers/usb/gadget/legacy/inode.c  | 346 +++++++++++++++----------------------
 include/linux/uio.h                |   2 +
 mm/iov_iter.c                      |  15 ++
 4 files changed, 237 insertions(+), 330 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[3/5] fs: remove ki_nbytes

Commit Message

Comments

Patch