Message ID | 1422381313-24034-4-git-send-email-hch@lst.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
> @@ -717,14 +718,14 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov, > unsigned long nr_segs, loff_t o) > { > struct ep_data *epdata = iocb->ki_filp->private_data; > + size_t len = iov_length(iov, nr_segs); > char *buf; > - size_t len = 0; > int i = 0; > > if (unlikely(!usb_endpoint_dir_in(&epdata->desc))) > return -EINVAL; > > - buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL); > + buf = kmalloc(len, GFP_KERNEL); > if (unlikely(!buf)) > return -ENOMEM; > > @@ -734,7 +735,6 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov, > kfree(buf); > return -EFAULT; > } > - len += iov[i].iov_len; > } > return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0); WTF bother? Just switch it to ->write_iter(): static ssize_t ep_write_iter(struct kiocb *iocb, struct iov_iter *from) { struct ep_data *epdata = iocb->ki_filp->private_data; size_t len = iov_iter_count(from); void *buf; if (unlikely(!usb_endpoint_dir_in(&epdata->desc))) return -EINVAL; buf = kmalloc(iov_iter_count(from), GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; if (copy_from_iter(buf, from, len) != len) { kfree(buf); return -EFAULT; } return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0); } and be done with that. I'll put drivers/usb/gadget patches into a stable branch and ask use folks to pull from it - that's the simplest of this series, actually... > @@ -903,10 +903,9 @@ static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, > if (pos != 0) > return -ESPIPE; > > - if (iocb->ki_nbytes == 0) /* Match SYS5 behaviour */ > + if (!iov_length(iov, nr_segs)) /* Match SYS5 behaviour */ > return 0; FWIW, it's switched to ->read_iter/->write_iter in vfs.git#for-davem. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Jan 31, 2015 at 06:08:41AM +0000, Al Viro wrote: > and be done with that. I'll put drivers/usb/gadget patches into a stable > branch and ask use folks to pull from it - that's the simplest of this > series, actually... use folks? Btw, does this mean you patches to switch it over, or do you want me to finish my conversion. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 02, 2015 at 09:07:29AM +0100, Christoph Hellwig wrote: > On Sat, Jan 31, 2015 at 06:08:41AM +0000, Al Viro wrote: > > and be done with that. I'll put drivers/usb/gadget patches into a stable > > branch and ask use folks to pull from it - that's the simplest of this > > series, actually... > > use folks? Btw, does this mean you patches to switch it over, or do usb. > you want me to finish my conversion. I have f_fs.c conversion and partial legacy/inode.c one. Will post tomorrow, about to fall asleep right now... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 02, 2015 at 08:11:12AM +0000, Al Viro wrote: > On Mon, Feb 02, 2015 at 09:07:29AM +0100, Christoph Hellwig wrote: > > On Sat, Jan 31, 2015 at 06:08:41AM +0000, Al Viro wrote: > > > and be done with that. I'll put drivers/usb/gadget patches into a stable > > > branch and ask use folks to pull from it - that's the simplest of this > > > series, actually... > > > > use folks? Btw, does this mean you patches to switch it over, or do > > usb. > > > you want me to finish my conversion. > > I have f_fs.c conversion and partial legacy/inode.c one. Will post tomorrow, > about to fall asleep right now... FWIW, here's the current situation with read/write methods: 1) a bunch of file_operations instances have only ->read() and/or ->write(); no ->aio_... or ->..._iter versions, no ->splice_write(). Those are basically device drivers: readv is equivalent to loop over segments, with read on each. writev is equivalent to loop over segments, with write on each. AIO read and write are all synchronous splice_write is does kmap on each buffer in turn and write that Note that splitting a buffer into several adjacent pieces and submitting a vectored IO is *not* guaranteed to be equivalent to plain IO on the entire buffer - it has datagram-like semantics and boundaries do matter. 2) regular ->read_iter/->write_iter users: ->read is new_sync_read() (or NULL, if ->read_iter is NULL) ->write is new_sync_write() (or NULL, if ->write_iter is NULL) ->aio_read and ->aio_write are NULL ->splice_write is iter_splice_write (or NULL, if ->write_iter is NULL) Those are stream-style ones; what matter is concatenation of the buffers, not the boundaries. Right now the following is true: ->read_iter/->write_iter on a sync kiocb never returns EIOCBQUEUED and iterator argument is advanced exactly by the amount of data transferred. 3) ones like (2), but with NULL ->splice_write() in spite of present ->write_iter(). AFAICS they can be bulk-converted to (2). This is what the bulk of irregularities boil down to. 4) ones like (2) or (3), except ->read() and ->write() are left NULL instead of doing new_sync_{read,write}(). Mostly equivalent to (2); some codepaths call ->read() or ->write() directly, instead of going through vfs_... wrappers, and for those it can be a bit of a headache. In any case, there are very few such instances (only 3) and they are trivial to convert to (2). 5) Two instances in fs/9p have ->write_iter() *and* ->write(), the latter not being new_sync_write(). Ditto for ->read_iter() and ->read(). No other instance has such combinations. FWIW, they are *almost* regular - their ->read() and ->write() are new_sync_... unless it's an O_DIRECT file. It might make sense to try and convert those to ->direct_IO(); as it is, their O_DIRECT is ignored for readv()/writev(), which is arguably a bug. 6) drivers/char/mem.c stuff; they are equivalent to (2), but somewhat optimised. Some of that might make sense to convert to (2); there _are_ hot paths involved, so we'd better be careful there. 7) bad_file_ops. It is equivalent to (2); it's also a very special case. FWIW, I'm not at all sure that we *need* most of the methods in there. We never replace ->f_op of an opened file with that, so if we end up with that sucker, it happens on fresh open of something that had its ->i_fop replaced with bad_file_ops. And that will instantly fail in bad_file_open(). Why do we need the rest of the methods in there? AFAICS, we should drop all but ->open(). 8) hypfs - AFAICS, converts to (2) easily; I'll do that. 9) FUSE - ->aio_read/->aio_write user, with unusual ->splice_write as well. I _think_ it can be converted to (2), but that'll take a bit of work. 10) sockets; conversion to ->read_iter/->write_iter had been posted to netdev, ->splice_write() (the only user of ->sendpage(), BTW) is a work for the next cycle. 11) NTFS with rw support enabled. ->aio_write() user, needs to be converted to ->write_iter(). AFAICS, it wasn't particularly high priority for Anton (along with all rw stuff in fs/ntfs); it doesn't look terribly hard, but then it wasn't a high priority for me either. 12) virtio_console. Interesting ->splice_write() there; hadn't looked deep enough. 13) two odd drivers/usb/gadget instances. I have conversion for f_fs.c, but legacy/inode.c (ep_read() et.al.) is trickier. The problem in there is that writev() on a single-element vector is *not* equivalent to plain write(). The former treats the wrong-direction endpoint as EINVAL; the latter does if (usb_endpoint_xfer_isoc(&data->desc)) { mutex_unlock(&data->lock); return -EINVAL; } DBG (data->dev, "%s halt\n", data->name); spin_lock_irq (&data->dev->lock); if (likely (data->ep != NULL)) usb_ep_set_halt (data->ep); spin_unlock_irq (&data->dev->lock); mutex_unlock(&data->lock); return -EBADMSG; instead. IOW, for isochronous endpoints behaviour is the same, but the rest behaves differently. If not for that, that sucker would convert to (3) easily; ->splice_write() semantics is potentially interesting - the question is where do we want transfer boundaries. As it is, writev() collects the entire iovec and shoves it into single transfer; splice() to that thing ends up with each pipe_buf going in a separate transfer, mergable or not. I would really appreciate comments on semantics of that thing from USB folks... 14) ipathfs and qibfs: seriously different semantics for write and writev/AIO write. As in "different set of commands recognized"; AIO write plays like writev, whether it's vectored or not (and it's always synchronous). I've no idea who had come up with that... highly innovative API or why hadn't they simply added two files (it's all on their virtual filesystem, so they had full control of layout) rather that multiplexing two different command sets in such a fashion. 15) /dev/snd/pcmC*D*[cp]. Again, different semantics for write and writev, with the latter wanting nr_seqs equal to the number of channels. AIO non-vectored write fails unless there's only one channel. Not sure how ALSA userland uses that thing; AIO side is always synchronous, so it might be simply never used. FWIW, I'm not sure that write() on a single-channel one is equivalent to 1-element writev() - silencing-related logics seem to differ. That's it. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 02, 2015 at 08:11:12AM +0000, Al Viro wrote: > > you want me to finish my conversion. > > I have f_fs.c conversion and partial legacy/inode.c one. Will post tomorrow, > about to fall asleep right now... Already. I've already fixed up the remaining commets on the series, so if you post it soon I'll have the series one for this merge window. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 02, 2015 at 08:14:31AM +0000, Al Viro wrote: > 13) two odd drivers/usb/gadget instances. I have conversion for f_fs.c, > but legacy/inode.c (ep_read() et.al.) is trickier. The problem in there > is that writev() on a single-element vector is *not* equivalent to plain > write(). The former treats the wrong-direction endpoint as EINVAL; the > latter does > if (usb_endpoint_xfer_isoc(&data->desc)) { > mutex_unlock(&data->lock); > return -EINVAL; > } > DBG (data->dev, "%s halt\n", data->name); > spin_lock_irq (&data->dev->lock); > if (likely (data->ep != NULL)) > usb_ep_set_halt (data->ep); > spin_unlock_irq (&data->dev->lock); > mutex_unlock(&data->lock); > return -EBADMSG; > instead. IOW, for isochronous endpoints behaviour is the same, but the > rest behaves differently. If not for that, that sucker would convert > to (3) easily; I would bet the behavior difference is a bug, might be worth to Cc the usb folks on this issue. I bet we'd want the more complex behavior for both variants. > 14) ipathfs and qibfs: seriously different semantics for write and writev/AIO > write. As in "different set of commands recognized"; AIO write plays like > writev, whether it's vectored or not (and it's always synchronous). > I've no idea who had come up with that... highly innovative API or why > hadn't they simply added two files (it's all on their virtual filesystem, > so they had full control of layout) rather that multiplexing two different > command sets in such a fashion. > > 15) /dev/snd/pcmC*D*[cp]. Again, different semantics for write and writev, > with the latter wanting nr_seqs equal to the number of channels. AIO > non-vectored write fails unless there's only one channel. Not sure how > ALSA userland uses that thing; AIO side is always synchronous, so it might > be simply never used. FWIW, I'm not sure that write() on a single-channel > one is equivalent to 1-element writev() - silencing-related logics seem to > differ. For these weirdos we can pass down a flag in the kiocb about the source of the I/O. We'll need that flags field for non-blocking buffered reads and per-I/O O_SYNC anyway, and it will be very useful for fixing the races around changing the O_DIRECT flag at run time. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[USB folks Cc'd] On Mon, Feb 02, 2015 at 03:26:17PM +0100, Christoph Hellwig wrote: > I would bet the behavior difference is a bug, might be worth to Cc the > usb folks on this issue. I bet we'd want the more complex behavior > for both variants. [Context for USB people: The difference in question is what ep_read() does when it is called on write endpoint that isn't isochronous; it halts the sucker and fails with EBADMSG, while ep_aio_read() handles all write endpoints as isochronous ones - fails with EINVAL; FWIW, I agree that it's probably a bug] Sadly, that's not the only problem in there ;-/ _This_ one really has the "what if single-segment AIO read tries to dereference iovec after the caller is gone" bug you suspected in fs/direct-io.c; we have static void ep_user_copy_worker(struct work_struct *work) { struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work); struct mm_struct *mm = priv->mm; struct kiocb *iocb = priv->iocb; size_t ret; use_mm(mm); ret = ep_copy_to_user(priv); unuse_mm(mm); /* completing the iocb can drop the ctx and mm, don't touch mm after */ aio_complete(iocb, ret, ret); kfree(priv->buf); kfree(priv); } called via schedule_work() from ->complete() of usb_request allocated and queued by ->aio_read(). It very definitely _can_ be executed after return from ->aio_read() and aio_run_iocb(). And ep_copy_to_user() dereferences the iovec given to ->aio_read(); _not_ its copy as f_fs.c does. Do io_submit(2) with several IOCB_CMD_PREAD requests, and you'll almost certainly get the data from the first one copied to the destination of the second one instead. It shouldn't be hard to reproduce. And that, of course, is not the worst possible outcome... I'm going to add copying of iovec in async read case. And AFAICS, that one is -stable fodder. See vfs.git#gadget for f_fs.c conversion; I haven't pushed legacy/inode.c stuff yet - I need to pull the fix of the bug above into the beginning of that pile first. FWIW, I don't believe that it makes sense to do iovec copying in aio_run_iocb(); note that most of the instances will be done with iovec before they return there. These two were the sole exceptions; function/f_fs.c did copying, legacy/inode.c didn't. Most of the ->aio_read/->read_iter instances (including ones that *do* return EIOCBQUEUED) only access iovec synchronously; usually that's done by grabbing the pages to copy into before we get aronud to starting IO. legacy/inode.c is the only instance to step into that kind of bug. function/f_fs.c also had a fun bug, BTW - failure in AIO ended up leaking io_data (plus iovec copy in case of aio_read()). Looks like another -stable fodder, if less serious one... See b17d2ded6 (gadget/function/f_fs.c: close leaks) in vfs.git#gadget for that one. I plan to pull the fix for use-after-free in the beginning of that queue (in an easy to backport form) and then have ep_aio_read/ep_aio_write start doing the halt-related bits as in ep_read/ep_write. With that it's trivial to convert that sucker along the same lines as function/f_fs.c. All of that, assuming that anybody gives a damn about the driver in question. The things like spin_lock_irq (&dev->lock); .... // FIXME don't call this with the spinlock held ... if (copy_to_user (buf, dev->req->buf, len)) seem to indicate that nobody does, seeing that this bug had been there since 2003, complete with FIXME ;-/ If nobody cares about that sucker, git rm would be a better solution, IMO... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 4 Feb 2015, Al Viro wrote: > [USB folks Cc'd] Incidentally, Al, have you seen this email? http://marc.info/?l=linux-usb&m=142295011402339&w=2 I encouraged the writer to send in a patch but so far there has been no reply. > On Mon, Feb 02, 2015 at 03:26:17PM +0100, Christoph Hellwig wrote: > > > I would bet the behavior difference is a bug, might be worth to Cc the > > usb folks on this issue. I bet we'd want the more complex behavior > > for both variants. > > [Context for USB people: The difference in question is what ep_read() does > when it is called on write endpoint that isn't isochronous; You're talking about drivers/usb/gadget/legacy/inode.c, right? > it halts the > sucker and fails with EBADMSG, while ep_aio_read() handles all write endpoints > as isochronous ones - fails with EINVAL; FWIW, I agree that it's probably > a bug] It's not a bug; it's by design. That's how you halt an endpoint in gadgetfs -- by doing a synchronous I/O call in the "wrong" direction. > Sadly, that's not the only problem in there ;-/ _This_ one really has > the "what if single-segment AIO read tries to dereference iovec after > the caller is gone" bug you suspected in fs/direct-io.c; we have > static void ep_user_copy_worker(struct work_struct *work) > { > struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work); > struct mm_struct *mm = priv->mm; > struct kiocb *iocb = priv->iocb; > size_t ret; > > use_mm(mm); > ret = ep_copy_to_user(priv); > unuse_mm(mm); > > /* completing the iocb can drop the ctx and mm, don't touch mm after */ > aio_complete(iocb, ret, ret); > > kfree(priv->buf); > kfree(priv); > } > called via schedule_work() from ->complete() of usb_request allocated and > queued by ->aio_read(). It very definitely _can_ be executed after return > from ->aio_read() and aio_run_iocb(). And ep_copy_to_user() dereferences > the iovec given to ->aio_read(); _not_ its copy as f_fs.c does. > > Do io_submit(2) with several IOCB_CMD_PREAD requests, and you'll almost > certainly get the data from the first one copied to the destination of > the second one instead. It shouldn't be hard to reproduce. And that, > of course, is not the worst possible outcome... > > I'm going to add copying of iovec in async read case. And AFAICS, that one > is -stable fodder. See vfs.git#gadget for f_fs.c conversion; I haven't > pushed legacy/inode.c stuff yet - I need to pull the fix of the bug above > into the beginning of that pile first. > > FWIW, I don't believe that it makes sense to do iovec copying in > aio_run_iocb(); note that most of the instances will be done with > iovec before they return there. That's true even for gadgetfs in the write case. > These two were the sole exceptions; > function/f_fs.c did copying, legacy/inode.c didn't. Most of the > ->aio_read/->read_iter instances (including ones that *do* return > EIOCBQUEUED) only access iovec synchronously; usually that's done > by grabbing the pages to copy into before we get aronud to starting > IO. legacy/inode.c is the only instance to step into that kind of bug. > function/f_fs.c also had a fun bug, BTW - failure in AIO ended up leaking > io_data (plus iovec copy in case of aio_read()). Looks like another > -stable fodder, if less serious one... See b17d2ded6 (gadget/function/f_fs.c: > close leaks) in vfs.git#gadget for that one. > > I plan to pull the fix for use-after-free in the beginning of that queue > (in an easy to backport form) and then have ep_aio_read/ep_aio_write > start doing the halt-related bits as in ep_read/ep_write. With that it's > trivial to convert that sucker along the same lines as function/f_fs.c. I don't think there's any need to make the async routines do the halt-related stuff. After all, it's silly for users to call an async I/O routine to perform a synchronous action like halting an endpoint. On the other hand, it would be reasonable to replace the -EBADMSG with some massaged version of the return code from usb_ep_set_halt(), which is supposed to return -EAGAIN under some circumstances. But that would be an API change, so we probably shouldn't do it... > All of that, assuming that anybody gives a damn about the driver in question. > The things like > spin_lock_irq (&dev->lock); > .... > // FIXME don't call this with the spinlock held ... > if (copy_to_user (buf, dev->req->buf, len)) > seem to indicate that nobody does, seeing that this bug had been there > since 2003, complete with FIXME ;-/ > > If nobody cares about that sucker, git rm would be a better solution, IMO... It is a legacy driver after all, but some people still use it. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 04, 2015 at 01:17:30PM -0500, Alan Stern wrote: > On Wed, 4 Feb 2015, Al Viro wrote: > > > [USB folks Cc'd] > > Incidentally, Al, have you seen this email? > > http://marc.info/?l=linux-usb&m=142295011402339&w=2 > > I encouraged the writer to send in a patch but so far there has been no > reply. Yecchhh... Anything that changes ->f_op *after* return from ->open() is doing a nasty, nasty thing. What's to guarantee that any checks for NULL fields will stay valid, etc.? FWIW, in all the tree there are only 4 places where that would be happening; * i810_map_buffer() screwing around with having vm_mmap() done, only it wants its own thing called as ->mmap() (and a bit of extra data stashed for it). Racy as hell (if another thread calls mmap() on the same file, you'll get a nasty surprise). Driver's too old and brittle to touch, according to drm folks... * TTY hangup logics. Nasty (and might be broken around ->fasync()), but it's a very special case. * snd_card_disconnect(). Analogue of TTY hangup, actually; both are trying to do a form of revoke(). * this one. Note that you are not guaranteed that ep_config() won't be called more than once - two threads might race in write(2), with the loser getting through mutex_lock_interruptible(&data->lock); in ep_config() only after the winner has already gotten through write(), switched ->f_op, returned to userland and started doing read()/write()/etc. If nothing else, the contents of data->desc and data->hs_desc can be buggered by arbitrary data, no matter how bogus, right as the first thread is doing IO. > > [Context for USB people: The difference in question is what ep_read() does > > when it is called on write endpoint that isn't isochronous; > > You're talking about drivers/usb/gadget/legacy/inode.c, right? Yes. > > it halts the > > sucker and fails with EBADMSG, while ep_aio_read() handles all write endpoints > > as isochronous ones - fails with EINVAL; FWIW, I agree that it's probably > > a bug] > > It's not a bug; it's by design. That's how you halt an endpoint in > gadgetfs -- by doing a synchronous I/O call in the "wrong" direction. Yes, but you have readv() on single-element vector behave different from read(), which is surprising, to put it mildly. > > I plan to pull the fix for use-after-free in the beginning of that queue > > (in an easy to backport form) and then have ep_aio_read/ep_aio_write > > start doing the halt-related bits as in ep_read/ep_write. With that it's > > trivial to convert that sucker along the same lines as function/f_fs.c. > > I don't think there's any need to make the async routines do the > halt-related stuff. After all, it's silly for users to call an async > I/O routine to perform a synchronous action like halting an endpoint. Um... readv() is also going through ->aio_read(). I can tie that to sync vs. async, though - is_sync_kiocb() will do just that, if you are OK with having readv() act the same as read() in that respect. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 4 Feb 2015, Al Viro wrote: > On Wed, Feb 04, 2015 at 01:17:30PM -0500, Alan Stern wrote: > > On Wed, 4 Feb 2015, Al Viro wrote: > > > > > [USB folks Cc'd] > > > > Incidentally, Al, have you seen this email? > > > > http://marc.info/?l=linux-usb&m=142295011402339&w=2 > > > > I encouraged the writer to send in a patch but so far there has been no > > reply. > > Yecchhh... Anything that changes ->f_op *after* return from ->open() is > doing a nasty, nasty thing. What's to guarantee that any checks for > NULL fields will stay valid, etc.? > > FWIW, in all the tree there are only 4 places where that would be happening; > * i810_map_buffer() screwing around with having vm_mmap() done, > only it wants its own thing called as ->mmap() (and a bit of extra data > stashed for it). Racy as hell (if another thread calls mmap() on the > same file, you'll get a nasty surprise). Driver's too old and brittle to > touch, according to drm folks... > * TTY hangup logics. Nasty (and might be broken around ->fasync()), > but it's a very special case. > * snd_card_disconnect(). Analogue of TTY hangup, actually; both are > trying to do a form of revoke(). > * this one. Note that you are not guaranteed that ep_config() won't > be called more than once - two threads might race in write(2), with the loser > getting through mutex_lock_interruptible(&data->lock); in ep_config() only > after the winner has already gotten through write(), switched ->f_op, returned > to userland and started doing read()/write()/etc. If nothing else, > the contents of data->desc and data->hs_desc can be buggered by arbitrary > data, no matter how bogus, right as the first thread is doing IO. Well, this one certainly can be fixed to avoid altering ->f_op, at the cost of adding an extra check at the start of each I/O operation. > > > it halts the > > > sucker and fails with EBADMSG, while ep_aio_read() handles all write endpoints > > > as isochronous ones - fails with EINVAL; FWIW, I agree that it's probably > > > a bug] > > > > It's not a bug; it's by design. That's how you halt an endpoint in > > gadgetfs -- by doing a synchronous I/O call in the "wrong" direction. > > Yes, but you have readv() on single-element vector behave different from > read(), which is surprising, to put it mildly. > > > > I plan to pull the fix for use-after-free in the beginning of that queue > > > (in an easy to backport form) and then have ep_aio_read/ep_aio_write > > > start doing the halt-related bits as in ep_read/ep_write. With that it's > > > trivial to convert that sucker along the same lines as function/f_fs.c. > > > > I don't think there's any need to make the async routines do the > > halt-related stuff. After all, it's silly for users to call an async > > I/O routine to perform a synchronous action like halting an endpoint. > > Um... readv() is also going through ->aio_read(). Why does readv() do this but not read()? Wouldn't it make more sense to have all the read* calls use the same internal interface? > I can tie that to > sync vs. async, though - is_sync_kiocb() will do just that, if you are > OK with having readv() act the same as read() in that respect. I don't really care one way or the other. In fact, it doesn't matter if the same behavior applies to all the async calls as well as the sync calls -- I just doubt that anybody will ever use them. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 04, 2015 at 03:30:32PM -0500, Alan Stern wrote: > > * this one. Note that you are not guaranteed that ep_config() won't > > be called more than once - two threads might race in write(2), with the loser > > getting through mutex_lock_interruptible(&data->lock); in ep_config() only > > after the winner has already gotten through write(), switched ->f_op, returned > > to userland and started doing read()/write()/etc. If nothing else, > > the contents of data->desc and data->hs_desc can be buggered by arbitrary > > data, no matter how bogus, right as the first thread is doing IO. > > Well, this one certainly can be fixed to avoid altering ->f_op, at the > cost of adding an extra check at the start of each I/O operation. > > Um... readv() is also going through ->aio_read(). > > Why does readv() do this but not read()? Wouldn't it make more sense > to have all the read* calls use the same internal interface? Because there are two partially overlapping classes wrt vector IO semantics: 1) datagram-style. Vectored read/write is equivalent to simple read/write done on each vector component. And IO boundaries matter - if your driver treats any write() as datagram that starts e.g. with fixed-sized table in the beginning + arbitrary amount of data following it, you will get very different results from write(fd, buf, 200) and writev(fd, (struct iovec[2]){{buf, 100}, {buf+100, 100}}, 2). A _lot_ of drivers are like that - they supply ->read() and ->write() for single-range IO and VFS construct the rest of operations out of those. 2) stream-style. Vectored read is guaranteed to behave the same way as simple read on a range with size being the sum of vector element sizes, except that the data ends in ranges covered by vector elements instead of a single array. Vectored write is guaranteed to behave the same way as simple write from a buffer containing the concatenation of ranges covered by vector elements. Boundaries between the elements do not matter at all. Regular files on storage filesystems are like that. So are FIFOs and pipes and so are sockets. Even for datagram protocols, boundaries between the vector elements are ignored; boundaries between syscalls provide the datagram boundaries, but you can e.g. do writev(udp_socket_fd, (struct iovec[3]){ {const_header, sizeof(const_header)}, {&n, 4}, {s, strlen(s)}}, 3) and have only one UDP packet sent. IOW, it's general-purpose scatter-gather for read and write. The last example shows that (2) isn't a subset of (1) - it's not always possible to call ->write() in loop and get the right behaviour. For regular files (and pure stream sockets, etc.) it would work, but for stuff like UDP sockets it would break. Moreover, even for regular files on storage filesystems it would be quite inefficient - we'd need to acquire and release a bunch of locks, poke through metadata, etc., for each segment. As the result, there was a couple of new methods added, inventively called ->readv() and ->writev(). do_sync_read() was supposed to be used as ->read() instance - it's "feed a single-element vector to ->readv()" and similar for s/read/write/. Note that both (1) and (2) satisfy the following sanity requirement - single-element readv() is always equivalent to single-element() read(). You could violate that, by providing completely unrelated ->read() and ->readv(), but very few drivers went for that - too insane. Then, when AIO had been added, those had grown an argument pointing to iocb (instead of file and ppos - for those we use iocb->ki_filp and &iocb->ki_pos resp.) and they got renamed into ->aio_read() and ->aio_write(). Note that non-vectored AIO uses the same methods - ->read() and ->write() had too many instances to convert and most of those would end up just using those two iocb fields instead of the old arguments - tons of churn for no good reason. ->readv() and ->writev() had fewer instances (very common was the use of generic_file_aio_{read,write}()) and conversion was less painful. So there was no ->aio_read() and ->aio_write(). That, in principle, was a bit of constraint - you couldn't make single-element AIO_PREADV behave different from AIO_PREAD, but nobody had been insane enough to ask for that. Moreover, keeping ->readv() and ->writev() was pointless. There is cheap way to tell whether ->aio_{read,write}() call is due to io_submit(2) or to readv()/writev() - is_sync_kiocb(iocb) tells which one it is, so if driver really wanted different semantics for sync vs. async, it could check that. So we ended up with ->read/->write for sync non-vectored and ->aio_read()/->aio_write() for sync vectored *and* async anything. Usually you provide one or the other - NULL ->aio_... means loop calling ->read/write on each element, NULL ->read/write (or do_sync_... for them - it's the same thing) means feeding sync iocb and single-element vector to ->aio_.... You *can* provide both, but that's rare and almost always pointless. These days ->aio_{read,write} is almost gone - replaced with ->{read,write}_iter(). That sucker is equivalent, except that you get a pointer struct iov_iter instead iovec, nr_seg, size triple. And you use linux/uio.h primitives for accessing the data (it might be backed by something other than userland ranges). The primitives in there are not harder to use than copy_..._user(), and you avoid any need to loop over iovec array, etc. Those primitives generally don't give a damn about the range boundaries; the only exception is iov_iter_single_seg_count(), which tells how much data is left to be consumed in the current segment; very few things are using it. is_sync_kiocb() is still available to tell io_submit from synchronous syscalls. I don't believe that it's worth switching the (1) \setminus (2) to those; it's still too much churn, plus we'd need the loop over segments somewhere to keep the semantics. It *can* be expressed by ->read_iter() and ->write_iter(), but it'll be tons of boilerplate code in a lot of drivers. So ->read() and ->write() are there to stay. However, I think that we can get to the situation when it's really either one or another - if you have ->read_iter()/->write_iter(), don't mess with custom ->read() and ->write(). Both function/f_fs.c and legacy/inode.c are in class (2) - they gather data from iovec into temp buffer in ->aio_write() and pass the buffer to the code doing actual IO, and on ->aio_read() side they give a buffer to read into, then arrange for it to be scattered into our iovec upon IO completion. And they are doing non-trivial async stuff, so I'd prefer to switch them completely to ->read_iter/->write_iter. The only obstacle is read vs. single-element readv and write vs. single-element writev behaviour difference. For some files it really can't be helped - {ipath,qib}fs has completely unrelated sets of commands accepted by write and writev, including the single-element one. And /dev/snd/pcm*, while somewhat milder, has non-trivial behaviour differences. I think that trick suggested by hch (put several flags in iocb, encoding the reason for the call; that would allow to tell one from another) is the best way to deal with those. But I would really prefer to avoid that; IMO those examples are simply bad userland APIs. legacy/inode.c is the third (and last) beast like those two. And there the difference is small enough to simply change readv/writev to be like read/write in that respect, i.e. also halt the endpoint when called on the isochronous one with wrong direction. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/05/2015 12:07 AM, Al Viro wrote: > On Wed, Feb 04, 2015 at 03:30:32PM -0500, Alan Stern wrote: >>> * this one. Note that you are not guaranteed that ep_config() won't >>> be called more than once - two threads might race in write(2), with the loser >>> getting through mutex_lock_interruptible(&data->lock); in ep_config() only >>> after the winner has already gotten through write(), switched ->f_op, returned >>> to userland and started doing read()/write()/etc. If nothing else, >>> the contents of data->desc and data->hs_desc can be buggered by arbitrary >>> data, no matter how bogus, right as the first thread is doing IO. >> >> Well, this one certainly can be fixed to avoid altering ->f_op, at the >> cost of adding an extra check at the start of each I/O operation. > >>> Um... readv() is also going through ->aio_read(). >> >> Why does readv() do this but not read()? Wouldn't it make more sense >> to have all the read* calls use the same internal interface? > > Because there are two partially overlapping classes wrt vector IO semantics: > 1) datagram-style. Vectored read/write is equivalent to simple > read/write done on each vector component. And IO boundaries matter - if > your driver treats any write() as datagram that starts e.g. with > fixed-sized table in the beginning + arbitrary amount of data following > it, you will get very different results from write(fd, buf, 200) and > writev(fd, (struct iovec[2]){{buf, 100}, {buf+100, 100}}, 2). A _lot_ of > drivers are like that - they supply ->read() and ->write() for single-range > IO and VFS construct the rest of operations out of those. > > 2) stream-style. Vectored read is guaranteed to behave the same > way as simple read on a range with size being the sum of vector element > sizes, except that the data ends in ranges covered by vector elements instead > of a single array. Vectored write is guaranteed to behave the same way > as simple write from a buffer containing the concatenation of ranges covered > by vector elements. Boundaries between the elements do not matter at all. > Regular files on storage filesystems are like that. So are FIFOs and pipes > and so are sockets. Even for datagram protocols, boundaries between the > vector elements are ignored; boundaries between syscalls provide the datagram > boundaries, but you can e.g. do writev(udp_socket_fd, (struct iovec[3]){ > {const_header, sizeof(const_header)}, {&n, 4}, {s, strlen(s)}}, 3) and have > only one UDP packet sent. IOW, it's general-purpose scatter-gather for read > and write. > > The last example shows that (2) isn't a subset of (1) - it's not > always possible to call ->write() in loop and get the right behaviour. > For regular files (and pure stream sockets, etc.) it would work, but for > stuff like UDP sockets it would break. Moreover, even for regular files on > storage filesystems it would be quite inefficient - we'd need to acquire and > release a bunch of locks, poke through metadata, etc., for each segment. > > As the result, there was a couple of new methods added, inventively > called ->readv() and ->writev(). do_sync_read() was supposed to be used > as ->read() instance - it's "feed a single-element vector to ->readv()" and > similar for s/read/write/. > > Note that both (1) and (2) satisfy the following sanity requirement - > single-element readv() is always equivalent to single-element() read(). You > could violate that, by providing completely unrelated ->read() and ->readv(), > but very few drivers went for that - too insane. > > Then, when AIO had been added, those had grown an argument pointing > to iocb (instead of file and ppos - for those we use iocb->ki_filp and > &iocb->ki_pos resp.) and they got renamed into ->aio_read() and ->aio_write(). > Note that non-vectored AIO uses the same methods - ->read() and ->write() had > too many instances to convert and most of those would end up just using those > two iocb fields instead of the old arguments - tons of churn for no good > reason. ->readv() and ->writev() had fewer instances (very common was the > use of generic_file_aio_{read,write}()) and conversion was less painful. > So there was no ->aio_read() and ->aio_write(). That, in principle, was a > bit of constraint - you couldn't make single-element AIO_PREADV behave > different from AIO_PREAD, but nobody had been insane enough to ask for that. > > Moreover, keeping ->readv() and ->writev() was pointless. There is > cheap way to tell whether ->aio_{read,write}() call is due to io_submit(2) > or to readv()/writev() - is_sync_kiocb(iocb) tells which one it is, so if > driver really wanted different semantics for sync vs. async, it could check > that. > > So we ended up with ->read/->write for sync non-vectored and > ->aio_read()/->aio_write() for sync vectored *and* async anything. Usually > you provide one or the other - NULL ->aio_... means loop calling ->read/write > on each element, NULL ->read/write (or do_sync_... for them - it's the same > thing) means feeding sync iocb and single-element vector to ->aio_.... > You *can* provide both, but that's rare and almost always pointless. > > These days ->aio_{read,write} is almost gone - replaced with > ->{read,write}_iter(). That sucker is equivalent, except that you > get a pointer struct iov_iter instead iovec, nr_seg, size triple. And > you use linux/uio.h primitives for accessing the data (it might be backed > by something other than userland ranges). The primitives in there are > not harder to use than copy_..._user(), and you avoid any need to loop over > iovec array, etc. Those primitives generally don't give a damn about the > range boundaries; the only exception is iov_iter_single_seg_count(), which > tells how much data is left to be consumed in the current segment; very > few things are using it. is_sync_kiocb() is still available to tell io_submit > from synchronous syscalls. > > I don't believe that it's worth switching the (1) \setminus (2) to > those; it's still too much churn, plus we'd need the loop over segments > somewhere to keep the semantics. It *can* be expressed by ->read_iter() > and ->write_iter(), but it'll be tons of boilerplate code in a lot of > drivers. So ->read() and ->write() are there to stay. However, I think > that we can get to the situation when it's really either one or another - > if you have ->read_iter()/->write_iter(), don't mess with custom ->read() > and ->write(). > > Both function/f_fs.c and legacy/inode.c are in class (2) - they > gather data from iovec into temp buffer in ->aio_write() and pass the buffer to > the code doing actual IO, and on ->aio_read() side they give a buffer to > read into, then arrange for it to be scattered into our iovec upon IO > completion. And they are doing non-trivial async stuff, so I'd prefer to > switch them completely to ->read_iter/->write_iter. The only obstacle is > read vs. single-element readv and write vs. single-element writev behaviour > difference. No, function/f_fs.c and legacy/inode.c are in class (1). They have datagram semantics - each vector element is submitted in separate USB request. Each single request is sent in separate USB data packet (for bulk endpoints it can be more than one packet). In fact sync read()/write() also will give different results while called once with some block of data or in loop with the same block of data splitted into a few parts. > > For some files it really can't be helped - {ipath,qib}fs has completely > unrelated sets of commands accepted by write and writev, including the > single-element one. And /dev/snd/pcm*, while somewhat milder, has non-trivial > behaviour differences. I think that trick suggested by hch (put several > flags in iocb, encoding the reason for the call; that would allow to tell > one from another) is the best way to deal with those. But I would really > prefer to avoid that; IMO those examples are simply bad userland APIs. > legacy/inode.c is the third (and last) beast like those two. And there the > difference is small enough to simply change readv/writev to be like read/write > in that respect, i.e. also halt the endpoint when called on the isochronous > one with wrong direction. It shouldn't be a big problem to halt endpoints in ep_aio_write()/ep_aio_read() to have similar behaviour for single-element readv()/writev() and read()/write() operations. Robert Baldyga -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 05, 2015 at 09:24:32AM +0100, Robert Baldyga wrote: > No, function/f_fs.c and legacy/inode.c are in class (1). They have > datagram semantics - each vector element is submitted in separate USB > request. Each single request is sent in separate USB data packet (for > bulk endpoints it can be more than one packet). In fact sync > read()/write() also will give different results while called once with > some block of data or in loop with the same block of data splitted into > a few parts. No, they don't. This is from ffs_epfile_io(): data = kmalloc(data_len, GFP_KERNEL); if (unlikely(!data)) return -ENOMEM; if (io_data->aio && !io_data->read) { int i; size_t pos = 0; for (i = 0; i < io_data->nr_segs; i++) { if (unlikely(copy_from_user(&data[pos], io_data->iovec[i].iov_base, io_data->iovec[i].iov_len))) { ret = -EFAULT; goto error; } pos += io_data->iovec[i].iov_len; } and that's the last point where it looks at iovec. After that all work is done to the copy in data, where no information about the boundaries survives. And ep_aio_write() (in legacy/inode.c) is the same way. You are confusing datagram-per-syscall (which they are) with datagram-per-iovec (which they are definitely not). IOW, they behave as UDP sockets - writev() is purely scatter-gather variant of write(), with datagram per syscall and all vector elements silently concatenated. That's class 2, and _not_ in its intersection with class 1. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 05, 2015 at 08:47:29AM +0000, Al Viro wrote: > You are confusing datagram-per-syscall (which they are) with > datagram-per-iovec (which they are definitely not). IOW, they behave > as UDP sockets - writev() is purely scatter-gather variant of write(), > with datagram per syscall and all vector elements silently concatenated. > That's class 2, and _not_ in its intersection with class 1. PS: you want class 1, look at something like /proc/sys/kernel/domainname (or any other sysctl of that sort). write "foobar" there and cat /proc/sys/kernel/domainname will print foorbat. writev an array consisting of "foo" and "bar", and you'll see bar afterwards, same as you would after writing first "foo", then "bar". There the iovec boundaries affect the result - ->no aio_write() for that sucker, so we get two calls of ->write(), with expected results. And there are character devices like that as well. _That_ is class 1 outside of intersection with class 2. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/05/2015 10:03 AM, Al Viro wrote: > On Thu, Feb 05, 2015 at 08:47:29AM +0000, Al Viro wrote: >> You are confusing datagram-per-syscall (which they are) with >> datagram-per-iovec (which they are definitely not). IOW, they behave >> as UDP sockets - writev() is purely scatter-gather variant of write(), >> with datagram per syscall and all vector elements silently concatenated. >> That's class 2, and _not_ in its intersection with class 1. > > PS: you want class 1, look at something like /proc/sys/kernel/domainname > (or any other sysctl of that sort). write "foobar" there and > cat /proc/sys/kernel/domainname will print foorbat. writev an array consisting > of "foo" and "bar", and you'll see bar afterwards, same as you would > after writing first "foo", then "bar". There the iovec boundaries affect > the result - ->no aio_write() for that sucker, so we get two calls of > ->write(), with expected results. And there are character devices like that > as well. _That_ is class 1 outside of intersection with class 2. > Oh, I see. Thanks. So we only need to add endpoint halting to aio_read()/aio_write(), to make their behaviour similar to sync ones, right? Robert Baldyga -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 4 Feb 2015, Al Viro wrote: > > > Um... readv() is also going through ->aio_read(). > > > > Why does readv() do this but not read()? Wouldn't it make more sense > > to have all the read* calls use the same internal interface? > > Because there are two partially overlapping classes wrt vector IO semantics: ... Thanks for the detailed explanation. It appears to boil down to a series of historical accidents. In any case, feel free to copy the non-isochronous behavior of the synchronous routines in the async routines. It certainly won't hurt anything. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 05, 2015 at 10:29:42AM -0500, Alan Stern wrote: > On Wed, 4 Feb 2015, Al Viro wrote: > > > > > Um... readv() is also going through ->aio_read(). > > > > > > Why does readv() do this but not read()? Wouldn't it make more sense > > > to have all the read* calls use the same internal interface? > > > > Because there are two partially overlapping classes wrt vector IO semantics: > ... > > Thanks for the detailed explanation. It appears to boil down to a > series of historical accidents. > > In any case, feel free to copy the non-isochronous behavior of the > synchronous routines in the async routines. It certainly won't hurt > anything. Hmm... What happens if f_fs.c successfully queues struct usb_request, returns -EIOCBQUEUED and then gets hit by io_cancel(2)? AFAICS, you get ffs_aio_cancel() called, which dequeues usb_request and buggers off. The thing is, freeing io_data and stuff hanging off it would be done by ffs_user_copy_worker(), which would be scheduled via schedule_work() by ffs_epfile_async_io_complete(), i.e. usb_request ->complete() callback. And usb_ep_dequeue() (aka. ep->ops->dequeue()) has tons of instances, but AFAICS some of them might not trigger usb_gadget_giveback_request(), which would normally call ->complete()... Example: net2272_dequeue(struct usb_ep *_ep, struct usb_request *_req) { struct net2272_ep *ep; struct net2272_request *req; unsigned long flags; int stopped; ep = container_of(_ep, struct net2272_ep, ep); if (!_ep || (!ep->desc && ep->num != 0) || !_req) return -EINVAL; spin_lock_irqsave(&ep->dev->lock, flags); stopped = ep->stopped; ep->stopped = 1; /* make sure it's still queued on this endpoint */ list_for_each_entry(req, &ep->queue, queue) { if (&req->req == _req) break; } if (&req->req != _req) { spin_unlock_irqrestore(&ep->dev->lock, flags); return -EINVAL; } /* queue head may be partially complete */ if (ep->queue.next == &req->queue) { dev_dbg(ep->dev->dev, "unlink (%s) pio\n", _ep->name); net2272_done(ep, req, -ECONNRESET); } req = NULL; ep->stopped = stopped; spin_unlock_irqrestore(&ep->dev->lock, flags); return 0; } Note that net2272_done(), which would call usb_gadget_giveback_request(), is only called if the victim happens to be queue head. Is that just a net2272.c bug, or am I missing something subtle here? Looks like at least on that hardware io_cancel() could leak io_data and everything that hangs off it... FWIW, net2272.c was the first one I looked at (happened to be on the last line of screen during git grep for \.dequeue in drivers/usb/gadget ;-) and after checking several more it seems that it's a Sod's Law in action - I'd checked about 5 of them and they seem to call usb_gadget_giveback_request() as long as they find the sucker in queue, head or no head. OTOH, there's a lot more of those guys, so that observation is not worth much... IOW, is that a net2272.c bug, or a f_fs.c one? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/06/2015 08:03 AM, Al Viro wrote: > On Thu, Feb 05, 2015 at 10:29:42AM -0500, Alan Stern wrote: >> On Wed, 4 Feb 2015, Al Viro wrote: >> >>>>> Um... readv() is also going through ->aio_read(). >>>> >>>> Why does readv() do this but not read()? Wouldn't it make more sense >>>> to have all the read* calls use the same internal interface? >>> >>> Because there are two partially overlapping classes wrt vector IO semantics: >> ... >> >> Thanks for the detailed explanation. It appears to boil down to a >> series of historical accidents. >> >> In any case, feel free to copy the non-isochronous behavior of the >> synchronous routines in the async routines. It certainly won't hurt >> anything. > > Hmm... What happens if f_fs.c successfully queues struct usb_request, returns > -EIOCBQUEUED and then gets hit by io_cancel(2)? AFAICS, you get > ffs_aio_cancel() called, which dequeues usb_request and buggers off. > The thing is, freeing io_data and stuff hanging off it would be done by > ffs_user_copy_worker(), which would be scheduled via schedule_work() by > ffs_epfile_async_io_complete(), i.e. usb_request ->complete() callback. > And usb_ep_dequeue() (aka. ep->ops->dequeue()) has tons of instances, but > AFAICS some of them might not trigger usb_gadget_giveback_request(), which > would normally call ->complete()... > > Example: > net2272_dequeue(struct usb_ep *_ep, struct usb_request *_req) > { > struct net2272_ep *ep; > struct net2272_request *req; > unsigned long flags; > int stopped; > > ep = container_of(_ep, struct net2272_ep, ep); > if (!_ep || (!ep->desc && ep->num != 0) || !_req) > return -EINVAL; > > spin_lock_irqsave(&ep->dev->lock, flags); > stopped = ep->stopped; > ep->stopped = 1; > > /* make sure it's still queued on this endpoint */ > list_for_each_entry(req, &ep->queue, queue) { > if (&req->req == _req) > break; > } > if (&req->req != _req) { > spin_unlock_irqrestore(&ep->dev->lock, flags); > return -EINVAL; > } > > /* queue head may be partially complete */ > if (ep->queue.next == &req->queue) { > dev_dbg(ep->dev->dev, "unlink (%s) pio\n", _ep->name); > net2272_done(ep, req, -ECONNRESET); > } > req = NULL; > ep->stopped = stopped; > > spin_unlock_irqrestore(&ep->dev->lock, flags); > return 0; > } > > Note that net2272_done(), which would call usb_gadget_giveback_request(), > is only called if the victim happens to be queue head. Is that just a > net2272.c bug, or am I missing something subtle here? Looks like > at least on that hardware io_cancel() could leak io_data and everything > that hangs off it... > > FWIW, net2272.c was the first one I looked at (happened to be on the last > line of screen during git grep for \.dequeue in drivers/usb/gadget ;-) > and after checking several more it seems that it's a Sod's Law in action - > I'd checked about 5 of them and they seem to call usb_gadget_giveback_request() > as long as they find the sucker in queue, head or no head. OTOH, there's > a lot more of those guys, so that observation is not worth much... > > IOW, is that a net2272.c bug, or a f_fs.c one? AFAIK usb request should be completed in all cases, and many gadget drivers make assumptions that complete() of each request will be called, so it's definitely bug in net2272 driver. Robert Baldyga -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 05, 2015 at 10:29:42AM -0500, Alan Stern wrote: > On Wed, 4 Feb 2015, Al Viro wrote: > > > > > Um... readv() is also going through ->aio_read(). > > > > > > Why does readv() do this but not read()? Wouldn't it make more sense > > > to have all the read* calls use the same internal interface? > > > > Because there are two partially overlapping classes wrt vector IO semantics: > ... > > Thanks for the detailed explanation. It appears to boil down to a > series of historical accidents. > > In any case, feel free to copy the non-isochronous behavior of the > synchronous routines in the async routines. It certainly won't hurt > anything. OK, I've limited it to sync ones, actually. Preliminary series in followups. Those who prefer to read in git, see git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #gadget WARNING: completely untested Al Viro (6): new helper: dup_iter() gadget/function/f_fs.c: close leaks gadget/function/f_fs.c: use put iov_iter into io_data gadget/function/f_fs.c: switch to ->{read,write}_iter() gadgetfs: use-after-free in ->aio_read() gadget: switch ep_io_operations to ->read_iter/->write_iter drivers/usb/gadget/function/f_fs.c | 204 +++++++++------------- drivers/usb/gadget/legacy/inode.c | 346 +++++++++++++++---------------------- include/linux/uio.h | 2 + mm/iov_iter.c | 15 ++ 4 files changed, 237 insertions(+), 330 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/staging/android/logger.c b/drivers/staging/android/logger.c index a673ffa..b0dfda1 100644 --- a/drivers/staging/android/logger.c +++ b/drivers/staging/android/logger.c @@ -422,7 +422,7 @@ static ssize_t logger_write_iter(struct kiocb *iocb, struct iov_iter *from) struct timespec now; size_t len, count, w_off; - count = min_t(size_t, iocb->ki_nbytes, LOGGER_ENTRY_MAX_PAYLOAD); + count = min_t(size_t, iov_iter_count(from), LOGGER_ENTRY_MAX_PAYLOAD); now = current_kernel_time(); diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c index c30d125..7fc43c7 100644 --- a/drivers/usb/gadget/function/f_fs.c +++ b/drivers/usb/gadget/function/f_fs.c @@ -974,7 +974,7 @@ static ssize_t ffs_epfile_aio_write(struct kiocb *kiocb, io_data->kiocb = kiocb; io_data->iovec = iovec; io_data->nr_segs = nr_segs; - io_data->len = kiocb->ki_nbytes; + io_data->len = iov_length(iovec, nr_segs); io_data->mm = current->mm; kiocb->private = io_data; @@ -1010,7 +1010,7 @@ static ssize_t ffs_epfile_aio_read(struct kiocb *kiocb, io_data->kiocb = kiocb; io_data->iovec = iovec_copy; io_data->nr_segs = nr_segs; - io_data->len = kiocb->ki_nbytes; + io_data->len = iov_length(iovec, nr_segs); io_data->mm = current->mm; kiocb->private = io_data; diff --git a/drivers/usb/gadget/legacy/inode.c b/drivers/usb/gadget/legacy/inode.c index 987a461..25a4e6f 100644 --- a/drivers/usb/gadget/legacy/inode.c +++ b/drivers/usb/gadget/legacy/inode.c @@ -700,16 +700,17 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t o) { struct ep_data *epdata = iocb->ki_filp->private_data; + size_t len = iov_length(iov, nr_segs); char *buf; if (unlikely(usb_endpoint_dir_in(&epdata->desc))) return -EINVAL; - buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL); + buf = kmalloc(len, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; - return ep_aio_rwtail(iocb, buf, iocb->ki_nbytes, epdata, iov, nr_segs); + return ep_aio_rwtail(iocb, buf, len, epdata, iov, nr_segs); } static ssize_t @@ -717,14 +718,14 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t o) { struct ep_data *epdata = iocb->ki_filp->private_data; + size_t len = iov_length(iov, nr_segs); char *buf; - size_t len = 0; int i = 0; if (unlikely(!usb_endpoint_dir_in(&epdata->desc))) return -EINVAL; - buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL); + buf = kmalloc(len, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; @@ -734,7 +735,6 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov, kfree(buf); return -EFAULT; } - len += iov[i].iov_len; } return ep_aio_rwtail(iocb, buf, len, epdata, NULL, 0); } diff --git a/fs/aio.c b/fs/aio.c index 52f36ee..fa079bc 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1327,12 +1327,13 @@ typedef ssize_t (rw_iter_op)(struct kiocb *, struct iov_iter *); static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb, int rw, char __user *buf, unsigned long *nr_segs, + size_t *len, struct iovec **iovec, bool compat) { ssize_t ret; - *nr_segs = kiocb->ki_nbytes; + *nr_segs = *len; #ifdef CONFIG_COMPAT if (compat) @@ -1347,21 +1348,22 @@ static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb, if (ret < 0) return ret; - /* ki_nbytes now reflect bytes instead of segs */ - kiocb->ki_nbytes = ret; + /* len now reflect bytes instead of segs */ + *len = ret; return 0; } static ssize_t aio_setup_single_vector(struct kiocb *kiocb, int rw, char __user *buf, unsigned long *nr_segs, + size_t len, struct iovec *iovec) { - if (unlikely(!access_ok(!rw, buf, kiocb->ki_nbytes))) + if (unlikely(!access_ok(!rw, buf, len))) return -EFAULT; iovec->iov_base = buf; - iovec->iov_len = kiocb->ki_nbytes; + iovec->iov_len = len; *nr_segs = 1; return 0; } @@ -1371,7 +1373,7 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb, * Performs the initial checks and io submission. */ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode, - char __user *buf, bool compat) + char __user *buf, size_t len, bool compat) { struct file *file = req->ki_filp; ssize_t ret; @@ -1406,21 +1408,21 @@ rw_common: if (!rw_op && !iter_op) return -EINVAL; - ret = (opcode == IOCB_CMD_PREADV || - opcode == IOCB_CMD_PWRITEV) - ? aio_setup_vectored_rw(req, rw, buf, &nr_segs, - &iovec, compat) - : aio_setup_single_vector(req, rw, buf, &nr_segs, - iovec); + if (opcode == IOCB_CMD_PREADV || opcode == IOCB_CMD_PWRITEV) + ret = aio_setup_vectored_rw(req, rw, buf, &nr_segs, + &len, &iovec, compat); + else + ret = aio_setup_single_vector(req, rw, buf, &nr_segs, + len, iovec); if (!ret) - ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes); + ret = rw_verify_area(rw, file, &req->ki_pos, len); if (ret < 0) { if (iovec != inline_vecs) kfree(iovec); return ret; } - req->ki_nbytes = ret; + len = ret; /* XXX: move/kill - rw_verify_area()? */ /* This matches the pread()/pwrite() logic */ @@ -1433,7 +1435,7 @@ rw_common: file_start_write(file); if (iter_op) { - iov_iter_init(&iter, rw, iovec, nr_segs, req->ki_nbytes); + iov_iter_init(&iter, rw, iovec, nr_segs, len); ret = iter_op(req, &iter); } else { ret = rw_op(req, iovec, nr_segs, req->ki_pos); @@ -1536,10 +1538,10 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, req->ki_obj.user = user_iocb; req->ki_user_data = iocb->aio_data; req->ki_pos = iocb->aio_offset; - req->ki_nbytes = iocb->aio_nbytes; ret = aio_run_iocb(req, iocb->aio_lio_opcode, (char __user *)(unsigned long)iocb->aio_buf, + iocb->aio_nbytes, compat); if (ret) goto out_put_req; diff --git a/fs/ceph/file.c b/fs/ceph/file.c index ce74b39..0c93f62 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -807,7 +807,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *filp = iocb->ki_filp; struct ceph_file_info *fi = filp->private_data; - size_t len = iocb->ki_nbytes; + size_t len = iov_iter_count(to); struct inode *inode = file_inode(filp); struct ceph_inode_info *ci = ceph_inode(inode); struct page *pinned_page = NULL; diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index 9441c4c..c416e84 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -218,7 +218,7 @@ ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter, loff_t return -EINVAL; #else - VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE); + VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE); if (rw == READ) return nfs_file_direct_read(iocb, iter, pos); diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 3950693..5f3fbecf 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2254,7 +2254,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb, file->f_path.dentry->d_name.name, (unsigned int)from->nr_segs); /* GRRRRR */ - if (iocb->ki_nbytes == 0) + if (count == 0) return 0; appending = file->f_flags & O_APPEND ? 1 : 0; @@ -2304,8 +2304,7 @@ relock: } can_do_direct = direct_io; - ret = ocfs2_prepare_inode_for_write(file, ppos, - iocb->ki_nbytes, appending, + ret = ocfs2_prepare_inode_for_write(file, ppos, count, appending, &can_do_direct, &has_refcount); if (ret < 0) { mlog_errno(ret); @@ -2313,8 +2312,7 @@ relock: } if (direct_io && !is_sync_kiocb(iocb)) - unaligned_dio = ocfs2_is_io_unaligned(inode, iocb->ki_nbytes, - *ppos); + unaligned_dio = ocfs2_is_io_unaligned(inode, count, *ppos); /* * We can't complete the direct I/O as requested, fall back to diff --git a/fs/read_write.c b/fs/read_write.c index adab608..955cad3 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -343,7 +343,6 @@ ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos) init_sync_kiocb(&kiocb, file); kiocb.ki_pos = *ppos; - kiocb.ki_nbytes = iov_iter_count(iter); iter->type |= READ; ret = file->f_op->read_iter(&kiocb, iter); @@ -364,7 +363,6 @@ ssize_t vfs_iter_write(struct file *file, struct iov_iter *iter, loff_t *ppos) init_sync_kiocb(&kiocb, file); kiocb.ki_pos = *ppos; - kiocb.ki_nbytes = iov_iter_count(iter); iter->type |= WRITE; ret = file->f_op->write_iter(&kiocb, iter); @@ -422,7 +420,6 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *pp init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - kiocb.ki_nbytes = len; ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); BUG_ON(ret == -EIOCBQUEUED); @@ -441,7 +438,6 @@ ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *p init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - kiocb.ki_nbytes = len; iov_iter_init(&iter, READ, &iov, 1, len); ret = filp->f_op->read_iter(&kiocb, &iter); @@ -504,7 +500,6 @@ ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, lof init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - kiocb.ki_nbytes = len; ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos); BUG_ON(ret == -EIOCBQUEUED); @@ -523,7 +518,6 @@ ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, lo init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - kiocb.ki_nbytes = len; iov_iter_init(&iter, WRITE, &iov, 1, len); ret = filp->f_op->write_iter(&kiocb, &iter); @@ -711,7 +705,6 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - kiocb.ki_nbytes = len; iov_iter_init(&iter, rw, iov, nr_segs, len); ret = fn(&kiocb, &iter); @@ -728,7 +721,6 @@ static ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov, init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; - kiocb.ki_nbytes = len; ret = fn(&kiocb, iov, nr_segs, kiocb.ki_pos); BUG_ON(ret == -EIOCBQUEUED); diff --git a/fs/udf/file.c b/fs/udf/file.c index bb15771..c3d7ce0 100644 --- a/fs/udf/file.c +++ b/fs/udf/file.c @@ -122,7 +122,7 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from) struct file *file = iocb->ki_filp; struct inode *inode = file_inode(file); int err, pos; - size_t count = iocb->ki_nbytes; + size_t count = iov_iter_count(from); struct udf_inode_info *iinfo = UDF_I(inode); mutex_lock(&inode->i_mutex); diff --git a/include/linux/aio.h b/include/linux/aio.h index 5657bd2..46ead10 100644 --- a/include/linux/aio.h +++ b/include/linux/aio.h @@ -41,7 +41,6 @@ struct kiocb { __u64 ki_user_data; /* user's data for completion */ loff_t ki_pos; - size_t ki_nbytes; /* copy of iocb->aio_nbytes */ struct list_head ki_list; /* the aio core uses this * for cancellation */ diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 02d6b6d..a87301c 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -521,7 +521,7 @@ static ssize_t devkmsg_write(struct kiocb *iocb, struct iov_iter *from) int i; int level = default_message_loglevel; int facility = 1; /* LOG_USER */ - size_t len = iocb->ki_nbytes; + size_t len = iov_iter_count(from); ssize_t ret = len; if (len > LOG_LINE_MAX) diff --git a/mm/page_io.c b/mm/page_io.c index e604580..7ef2157 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -274,7 +274,6 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, iov_iter_bvec(&from, ITER_BVEC | WRITE, &bv, 1, PAGE_SIZE); init_sync_kiocb(&kiocb, swap_file); kiocb.ki_pos = page_file_offset(page); - kiocb.ki_nbytes = PAGE_SIZE; set_page_writeback(page); unlock_page(page); diff --git a/net/socket.c b/net/socket.c index 4032931..121b976 100644 --- a/net/socket.c +++ b/net/socket.c @@ -903,10 +903,9 @@ static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, if (pos != 0) return -ESPIPE; - if (iocb->ki_nbytes == 0) /* Match SYS5 behaviour */ + if (!iov_length(iov, nr_segs)) /* Match SYS5 behaviour */ return 0; - x = alloc_sock_iocb(iocb, &siocb); if (!x) return -ENOMEM;
There is no need to pass the total request length in the kiocb. We already have it in the iov_iter, and for those few methods not converted to iov_iter we can easily calculate it. Signed-off-by: Christoph Hellwig <hch@lst.de> --- drivers/staging/android/logger.c | 2 +- drivers/usb/gadget/function/f_fs.c | 4 ++-- drivers/usb/gadget/legacy/inode.c | 10 +++++----- fs/aio.c | 34 ++++++++++++++++++---------------- fs/ceph/file.c | 2 +- fs/nfs/direct.c | 2 +- fs/ocfs2/file.c | 8 +++----- fs/read_write.c | 8 -------- fs/udf/file.c | 2 +- include/linux/aio.h | 1 - kernel/printk/printk.c | 2 +- mm/page_io.c | 1 - net/socket.c | 3 +-- 13 files changed, 34 insertions(+), 45 deletions(-)