Message ID: CAOg9mSQf4m-MM_kKd=UcGsCx_LZg=+nFESd8jhzDEbcM9DOGjg@mail.gmail.com (mailing list archive)
State: New, archived
On Fri, Jan 22, 2016 at 11:59:40AM -0500, Mike Marshall wrote:
> Hi Al...
>
> I moved a tiny bit of your work around so it would compile, but I booted a
> kernel from it, and ran some tests, it seems to work OK... doing this
> work from home makes me remember writing cobol programs on a
> silent 700... my co-workers are helping me look at
> wait_for_cancellation_downcall... we recently made some
> improvements there based on some problems we were
> having in production with our out-of-tree Frankenstein
> module... I'm glad you are also looking there.

BTW, what should happen to requests that got a buggered response in
orangefs_devreq_write_iter()? As it is, you just free them and to hell
with whatever might've been waiting on them; that's easy to fix, but
what about the submitter? Should we let it time out/simulate an error
properly returned by daemon/something else?

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
I think I see what you are saying... it seems like we have always just
freed them and to heck with whatever was waiting on them.

Maybe in all the places I detect something bad, set the return code and
goto out, perhaps I should also jam the bad return code into
op->downcall.status and goto wakeup instead of out?

Our infant fuzzer isn't really up to making test errors show up there,
I'd probably have to put some test weirdo stuff in the userspace part
that does the writev to make some test failures...

-Mike

On Fri, Jan 22, 2016 at 12:08 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 11:59:40AM -0500, Mike Marshall wrote:
>> Hi Al...
>>
>> I moved a tiny bit of your work around so it would compile, but I booted a
>> kernel from it, and ran some tests, it seems to work OK... doing this
>> work from home makes me remember writing cobol programs on a
>> silent 700... my co-workers are helping me look at
>> wait_for_cancellation_downcall... we recently made some
>> improvements there based on some problems we were
>> having in production with our out-of-tree Frankenstein
>> module... I'm glad you are also looking there.
>
> BTW, what should happen to requests that got a buggered response in
> orangefs_devreq_write_iter()? As it is, you just free them and to hell
> with whatever might've been waiting on them; that's easy to fix, but
> what about the submitter? Should we let it time out/simulate an error
> properly returned by daemon/something else?
On Fri, Jan 22, 2016 at 05:08:38PM +0000, Al Viro wrote:
> BTW, what should happen to requests that got a buggered response in
> orangefs_devreq_write_iter()? As it is, you just free them and to hell
> with whatever might've been waiting on them; that's easy to fix, but
> what about the submitter? Should we let it time out/simulate an error
> properly returned by daemon/something else?

FWIW, here's what I have in mind for lifetime rules:

1) op_alloc() creates them with refcount 1

2) when we decide to pass the sucker to daemon, we bump the refcount
before taking the thing out of request list. After successful copying
to userland, we check if the submitter has given up on it. If it hasn't,
we move the sucker to hash and decrement the refcount (it can't be the
last reference there). If it has, we do put_op(). After _failed_ copying
to userland, we also check if it's been given up. If it hasn't, we move
it back to request list and decrement the refcount. If it has, we do
put_op().

3) when we get a response from daemon, we bump the refcount before
taking the thing out of hash. Once we are done, we call put_op() - in
*all* cases. Malformed response is treated like a well-formed error.

4) if submitter decides to give up upon a request, it sets a flag in
->op_state. From that point on nobody else is going to touch op->list.

5) once submitter is done, it does put_op() - *NOT* op_release().
Normally it ends up with request freed, but if we'd raced with
interaction with daemon we'll get it freed by put_op() done in (2) or
(3). No

	/*
	 * tell the device file owner waiting on I/O that this read has
	 * completed and it can return now. in this exact case, on
	 * wakeup the daemon will free the op, so we *cannot* touch it
	 * after this.
	 */
	wake_up_daemon_for_return(new_op);
	new_op = NULL;

anymore - just wake it up and do put_op() regardless. In that case daemon
has already done get_op() and will free the sucker once it does its
put_op().

Objections, comments?
BTW, it really looks like we might be better off without atomic_t here -
op->lock is held for all non-trivial transitions and we might as well have
put_op() require op->lock held and do something like

	if (--op->ref_count) {
		spin_unlock(op->lock);
		return;
	}
	spin_unlock(op->lock);
	op_release(op);

Speaking of op->lock:

	/*
	 * let process sleep for a few seconds so shared
	 * memory system can be initialized.
	 */
	spin_lock_irqsave(&op->lock, irqflags);
	prepare_to_wait(&orangefs_bufmap_init_waitq,
			&wait_entry,
			TASK_INTERRUPTIBLE);
	spin_unlock_irqrestore(&op->lock, irqflags);

What do you need to block interrupts for?

Oh, and why do you call orangefs_op_initialize(orangefs_op) from
op_release(), when you immediately follow it with freeing orangefs_op
and do it on allocation anyway?
>> Objections, comments?

I have no objections so far to your suggestions, I'm trying to keep
my co-workers looking in on this discussion... most other days
we're all lined up in adjacent cubes...

Your attention and fixes will almost certainly all be improvements...
I found a few fixes you made to the old version of the writev code
that were off base, but that code made it look like impossible
situations needed to be handled...

I'll compile your repo again and run the quick tests when you think
your additions are stable...

-Mike

On Fri, Jan 22, 2016 at 12:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 05:08:38PM +0000, Al Viro wrote:
>
>> BTW, what should happen to requests that got a buggered response in
>> orangefs_devreq_write_iter()? As it is, you just free them and to hell
>> with whatever might've been waiting on them; that's easy to fix, but
>> what about the submitter? Should we let it time out/simulate an error
>> properly returned by daemon/something else?
>
> FWIW, here's what I have in mind for lifetime rules:
>
> 1) op_alloc() creates them with refcount 1
>
> 2) when we decide to pass the sucker to daemon, we bump the refcount
> before taking the thing out of request list. After successful copying
> to userland, we check if the submitter has given up on it. If it hasn't,
> we move the sucker to hash and decrement the refcount (it can't be the
> last reference there). If it has, we do put_op(). After _failed_ copying
> to userland, we also check if it's been given up. If it hasn't, we move
> it back to request list and decrement the refcount. If it has, we do
> put_op().
>
> 3) when we get a response from daemon, we bump the refcount before
> taking the thing out of hash. Once we are done, we call put_op() - in
> *all* cases. Malformed response is treated like a well-formed error.
>
> 4) if submitter decides to give up upon a request, it sets a flag in
> ->op_state. From that point on nobody else is going to touch op->list.
>
> 5) once submitter is done, it does put_op() - *NOT* op_release().
> Normally it ends up with request freed, but if we'd raced with
> interaction with daemon we'll get it freed by put_op() done in (2) or
> (3). No
>
> 	/*
> 	 * tell the device file owner waiting on I/O that this read has
> 	 * completed and it can return now. in this exact case, on
> 	 * wakeup the daemon will free the op, so we *cannot* touch it
> 	 * after this.
> 	 */
> 	wake_up_daemon_for_return(new_op);
> 	new_op = NULL;
>
> anymore - just wake it up and do put_op() regardless. In that case daemon
> has already done get_op() and will free the sucker once it does its
> put_op().
>
> Objections, comments?
>
> BTW, it really looks like we might be better off without atomic_t here -
> op->lock is held for all non-trivial transitions and we might as well have
> put_op() require op->lock held and do something like
>
> 	if (--op->ref_count) {
> 		spin_unlock(op->lock);
> 		return;
> 	}
> 	spin_unlock(op->lock);
> 	op_release(op);
>
> Speaking of op->lock:
>
> 	/*
> 	 * let process sleep for a few seconds so shared
> 	 * memory system can be initialized.
> 	 */
> 	spin_lock_irqsave(&op->lock, irqflags);
> 	prepare_to_wait(&orangefs_bufmap_init_waitq,
> 			&wait_entry,
> 			TASK_INTERRUPTIBLE);
> 	spin_unlock_irqrestore(&op->lock, irqflags);
>
> What do you need to block interrupts for?
>
> Oh, and why do you call orangefs_op_initialize(orangefs_op) from
> op_release(), when you immediately follow it with freeing orangefs_op
> and do it on allocation anyway?
On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
> >> Objections, comments?
>
> I have no objections so far to your suggestions, I'm trying to keep
> my co-workers looking in on this discussion... most other days
> we're all lined up in adjacent cubes...

email and IRC exist for purpose... BTW, another thing: if you've managed
to get to orangefs_exit() with anything still in request list or hash,
you are seriously fscked. Whatever had allocated them is presumably still
around and it's running in that module's .text...
Martin's the only other one subscribed to fs-devel, the
rest are looking in here today:

http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b

Walt thinks you deserve an Orangefs Contributor's Jacket...

-Mike

On Fri, Jan 22, 2016 at 1:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
>> >> Objections, comments?
>>
>> I have no objections so far to your suggestions, I'm trying to keep
>> my co-workers looking in on this discussion... most other days
>> we're all lined up in adjacent cubes...
>
> email and IRC exist for purpose... BTW, another thing: if you've managed
> to get to orangefs_exit() with anything still in request list or hash,
> you are seriously fscked. Whatever had allocated them is presumably still
> around and it's running in that module's .text...
Al...

I think that my commit f987f4c28
"don't trigger copy_attributes_to_inode from d_revalidate."
doesn't really do the job... I think I at least need to
write code that checks some of the attributes and
fails the revalidate if they got changed via some
external process... does that sound right?

-Mike

On Fri, Jan 22, 2016 at 2:07 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Martin's the only other one subscribed to fs-devel, the
> rest are looking in here today:
>
> http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b
>
> Walt thinks you deserve an Orangefs Contributor's Jacket...
>
> -Mike
>
> On Fri, Jan 22, 2016 at 1:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
>>> >> Objections, comments?
>>>
>>> I have no objections so far to your suggestions, I'm trying to keep
>>> my co-workers looking in on this discussion... most other days
>>> we're all lined up in adjacent cubes...
>>
>> email and IRC exist for purpose... BTW, another thing: if you've managed
>> to get to orangefs_exit() with anything still in request list or hash,
>> you are seriously fscked. Whatever had allocated them is presumably still
>> around and it's running in that module's .text...
On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
> >> Objections, comments?
>
> I have no objections so far to your suggestions, I'm trying to keep
> my co-workers looking in on this discussion... most other days
> we're all lined up in adjacent cubes...

FWIW, I do see a problem there. This "submitter has given up" flag would
need to be cleared on retries ;-/

When can we end up retrying? We'd need
	* wait_for_cancellation_downcall/wait_for_matching_downcall finished
	* op_state_serviced(op) false
	* op->downcall.status set to -EAGAIN

wait_for_matching_downcall() either returns with op_state_serviced() or
returns one of -ETIMEDOUT, -EINTR, -EAGAIN or -EIO, with its return value
ending up in op->downcall.status. So if it was wait_for_matching_downcall(),
we'd have to have returned -EAGAIN. Which can happen only with
op_state_purged(op), i.e. with the character device having been closed
after the sucker had been placed on the lists...

wait_for_cancellation_downcall() is similar, but it never returns -EAGAIN.
IOW, for it retries are irrelevant.

Damnit, what's the point of picking a purged request from the request
list in orangefs_devreq_read()? Ditto for orangefs_devreq_write_iter()
and the request hash... Note that purge *can't* happen during either of
those - ->release() can't be called while something is in ->read() or
->write_iter(). So if we would ignore purged requests in those, the
problem with retries clearing the "I gave up" flag wouldn't exist...
It's a race anyway - the submitter had been woken up when we'd been
marking the request as purged, and it's going to remove the sucker from
whatever list it's on very soon after getting woken up.

Do you see any problems with skipping such request list/hash elements in
orangefs_devreq_read()/orangefs_devreq_write_iter() resp.?
On Fri, Jan 22, 2016 at 02:07:23PM -0500, Mike Marshall wrote:
> Martin's the only other one subscribed to fs-devel, the
> rest are looking in here today:
>
> http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b

Umm... Why not Cc to your maillist
(pvfs2-developers@beowulf-underground.org?) if it's unmoderated?
On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
> Al...
>
> I think that my commit f987f4c28
> "don't trigger copy_attributes_to_inode from d_revalidate."
> doesn't really do the job... I think I at least need to
> write code that checks some of the attributes and
> fails the revalidate if they got changed via some
> external process... does that sound right?

Won't your orangefs_revalidate_lookup() spot the changed handle? Anyway,
some sanity checks in there might be a good idea - at least "has the
object type somehow changed without handle going stale?"...

Said that, what should be picking e.g. chmod/chown by external source?
You don't have ->permission() instances in there, so inode->i_mode is
used directly by generic_inode_permission()...
The userspace daemon (client-core) that reads/writes to the device
restarts automatically if it stops for some reason... I believe active
ops are marked "purged" when this happens, and when client-core
restarts "purged" ops are retried (once)... see the comment
in waitqueue.c "if the operation was purged in the meantime..."

I've tried to rattle Walt and Becky's chains to see if they
can describe it better...

-Mike

On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>> Al...
>>
>> I think that my commit f987f4c28
>> "don't trigger copy_attributes_to_inode from d_revalidate."
>> doesn't really do the job... I think I at least need to
>> write code that checks some of the attributes and
>> fails the revalidate if they got changed via some
>> external process... does that sound right?
>
> Won't your orangefs_revalidate_lookup() spot the changed handle? Anyway,
> some sanity checks in there might be a good idea - at least "has the
> object type somehow changed without handle going stale?"...
>
> Said that, what should be picking e.g. chmod/chown by external source?
> You don't have ->permission() instances in there, so inode->i_mode is
> used directly by generic_inode_permission()...
Nope... I guess that's why the original developer called
copy_attributes_to_inode every time... I need to at least check
permissions... if a file's permissions change via an external source,
say from 755 to 700, the other guy can still cat the file once...

-Mike

On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>> Al...
>>
>> I think that my commit f987f4c28
>> "don't trigger copy_attributes_to_inode from d_revalidate."
>> doesn't really do the job... I think I at least need to
>> write code that checks some of the attributes and
>> fails the revalidate if they got changed via some
>> external process... does that sound right?
>
> Won't your orangefs_revalidate_lookup() spot the changed handle? Anyway,
> some sanity checks in there might be a good idea - at least "has the
> object type somehow changed without handle going stale?"...
>
> Said that, what should be picking e.g. chmod/chown by external source?
> You don't have ->permission() instances in there, so inode->i_mode is
> used directly by generic_inode_permission()...
I built and tested orangefs-untested... Orangefs still seems to work
OK... I built it as a module and didn't trigger the BUG_ON when I
unloaded it.

-Mike

On Fri, Jan 22, 2016 at 3:51 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Nope... I guess that's why the original developer called
> copy_attributes_to_inode every time... I need to at least check
> permissions... if a file's permissions change via an external source,
> say from 755 to 700, the other guy can still cat the file once...
>
> -Mike
>
> On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>>> Al...
>>>
>>> I think that my commit f987f4c28
>>> "don't trigger copy_attributes_to_inode from d_revalidate."
>>> doesn't really do the job... I think I at least need to
>>> write code that checks some of the attributes and
>>> fails the revalidate if they got changed via some
>>> external process... does that sound right?
>>
>> Won't your orangefs_revalidate_lookup() spot the changed handle? Anyway,
>> some sanity checks in there might be a good idea - at least "has the
>> object type somehow changed without handle going stale?"...
>>
>> Said that, what should be picking e.g. chmod/chown by external source?
>> You don't have ->permission() instances in there, so inode->i_mode is
>> used directly by generic_inode_permission()...
On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
> The userspace daemon (client-core) that reads/writes to the device
> restarts automatically if it stops for some reason... I believe active
> ops are marked "purged" when this happens, and when client-core
> restarts "purged" ops are retried (once)... see the comment
> in waitqueue.c "if the operation was purged in the meantime..."
>
> I've tried to rattle Walt and Becky's chains to see if they
> can describe it better...

What I mean is the following sequence:

Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
Daemon:  exits, marks op purged, wakes Syscall up
Daemon:  gets restarted
Daemon:  calls read(), finds op still on the list
Syscall: finally gets the timeslice, removes op from the list, decides to
         resubmit

This is very hard to hit - normally by the time we get around to read()
from the restarted daemon the waiter has already been woken up and has
already removed the purged op from the list. So in practice you probably
have never hit that case. However, it is theoretically possible.

What I propose to do is to have purged requests that are still in the
lists be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
IOW, pretend that the race had been won by whatever had been waiting on
that request and got woken up when it had been purged.

Note that by the time it gets resubmitted, it already has the 'purged'
flag removed - set_op_state_waiting(op) is done when we are inserting
into the request list and it leaves no trace of OP_VFS_STATE_PURGED. So
I'm not talking about the resubmitted stuff; just the one that had been
in the queue since before the daemon restart and hadn't been removed
from there yet.
On Sat, Jan 23, 2016 at 12:12:02AM +0000, Al Viro wrote:
> On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
> > The userspace daemon (client-core) that reads/writes to the device
> > restarts automatically if it stops for some reason... I believe active
> > ops are marked "purged" when this happens, and when client-core
> > restarts "purged" ops are retried (once)... see the comment
> > in waitqueue.c "if the operation was purged in the meantime..."
> >
> > I've tried to rattle Walt and Becky's chains to see if they
> > can describe it better...
>
> What I mean is the following sequence:
>
> Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
> Daemon:  exits, marks op purged, wakes Syscall up
> Daemon:  gets restarted
> Daemon:  calls read(), finds op still on the list
> Syscall: finally gets the timeslice, removes op from the list, decides to
>          resubmit
>
> This is very hard to hit - normally by the time we get around to read()
> from the restarted daemon the waiter has already been woken up and has
> already removed the purged op from the list. So in practice you probably
> have never hit that case. However, it is theoretically possible.
>
> What I propose to do is to have purged requests that are still in the
> lists be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
> IOW, pretend that the race had been won by whatever had been waiting on
> that request and got woken up when it had been purged.
>
> Note that by the time it gets resubmitted, it already has the 'purged'
> flag removed - set_op_state_waiting(op) is done when we are inserting
> into the request list and it leaves no trace of OP_VFS_STATE_PURGED. So
> I'm not talking about the resubmitted stuff; just the one that had been
> in the queue since before the daemon restart and hadn't been removed
> from there yet.

OK, aforementioned locking/refcounting scheme implemented (and completely
untested). See #orangefs-untested.

Rules:

* refcounting is real refcounting - objects are created with refcount 1,
get_op() increments the refcount, op_release() decrements and frees when
it hits zero.

* daemon interaction (read/write_iter) is refcount-neutral - it grabs
a reference when picking a request from the list/hash and always drops
it in the end. Submitters always release the reference acquired when
allocating the request.

* when a submitter decides to give up on a request (for any reason -
timeout, signal, daemon disconnect) it marks it with a new bit -
OP_VFS_STATE_GIVEN_UP. Once that is done, nobody else is allowed to
touch its ->list.

* a request is inserted into the hash only when we'd succeeded copying
to the daemon's memory (and only if the request hadn't been given up
while we'd been copying it). If copying fails, we only have to refile it
to the request list (and only if it hadn't been given up).

* when copying a response from the daemon, a request is only marked
serviced if it hadn't been given up while we were parsing the response.
Malformed responses, failed copying of a response, etc. are treated as a
well-formed error response would be. The error value is the one we'll be
returning to the daemon - might or might not be right from the error
recovery standpoint.

* the hash lock nests *outside* of op->lock now and all work with the
hash is protected by it.

* daemon interaction skips the requests that had been around since before
the control device had been opened, acting as if the (already woken up)
submitters had already gotten around to removing those from the list/hash.
That's the normal outcome of such a race anyway, and it simplifies the
analysis.

Again, it's completely untested; might oops, break stuff, etc. I _think_
it makes sense from the lifetime rules and locking POV, but...

I still would like to understand what's going on in
wait_for_cancellation_request() - it *probably* shouldn't matter for the
correctness of that massage, but there might be dragons. I don't
understand in which situation wrt pending signals it is supposed to be
called and the comments in there contradict the actual code.
Well... that all seems awesome, and compiled the first time and all my
quick tests on my dinky vm make it seem fine... It is Becky that
recently spent a bunch of time fighting the cancellation dragons, I'll
see if I can't get her to weigh in on wait_for_cancellation_downcall
tomorrow.

We have some gnarly tests we were running on real hardware that helped
reproduce the problems she was seeing in production with Clemson's
Palmetto Cluster, I'll run them, but maybe not until Monday with the
ice storm...

Thanks Al...

-Mike

On Fri, Jan 22, 2016 at 8:28 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 23, 2016 at 12:12:02AM +0000, Al Viro wrote:
>> On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
>> > The userspace daemon (client-core) that reads/writes to the device
>> > restarts automatically if it stops for some reason... I believe active
>> > ops are marked "purged" when this happens, and when client-core
>> > restarts "purged" ops are retried (once)... see the comment
>> > in waitqueue.c "if the operation was purged in the meantime..."
>> >
>> > I've tried to rattle Walt and Becky's chains to see if they
>> > can describe it better...
>>
>> What I mean is the following sequence:
>>
>> Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
>> Daemon:  exits, marks op purged, wakes Syscall up
>> Daemon:  gets restarted
>> Daemon:  calls read(), finds op still on the list
>> Syscall: finally gets the timeslice, removes op from the list, decides to
>>          resubmit
>>
>> This is very hard to hit - normally by the time we get around to read()
>> from the restarted daemon the waiter has already been woken up and has
>> already removed the purged op from the list. So in practice you probably
>> have never hit that case. However, it is theoretically possible.
>>
>> What I propose to do is to have purged requests that are still in the
>> lists be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
>> IOW, pretend that the race had been won by whatever had been waiting on
>> that request and got woken up when it had been purged.
>>
>> Note that by the time it gets resubmitted, it already has the 'purged'
>> flag removed - set_op_state_waiting(op) is done when we are inserting
>> into the request list and it leaves no trace of OP_VFS_STATE_PURGED. So
>> I'm not talking about the resubmitted stuff; just the one that had been
>> in the queue since before the daemon restart and hadn't been removed
>> from there yet.
>
> OK, aforementioned locking/refcounting scheme implemented (and completely
> untested). See #orangefs-untested.
>
> Rules:
>
> * refcounting is real refcounting - objects are created with refcount 1,
> get_op() increments the refcount, op_release() decrements and frees when
> it hits zero.
>
> * daemon interaction (read/write_iter) is refcount-neutral - it grabs
> a reference when picking a request from the list/hash and always drops
> it in the end. Submitters always release the reference acquired when
> allocating the request.
>
> * when a submitter decides to give up on a request (for any reason -
> timeout, signal, daemon disconnect) it marks it with a new bit -
> OP_VFS_STATE_GIVEN_UP. Once that is done, nobody else is allowed to
> touch its ->list.
>
> * a request is inserted into the hash only when we'd succeeded copying
> to the daemon's memory (and only if the request hadn't been given up
> while we'd been copying it). If copying fails, we only have to refile it
> to the request list (and only if it hadn't been given up).
>
> * when copying a response from the daemon, a request is only marked
> serviced if it hadn't been given up while we were parsing the response.
> Malformed responses, failed copying of a response, etc. are treated as a
> well-formed error response would be. The error value is the one we'll be
> returning to the daemon - might or might not be right from the error
> recovery standpoint.
>
> * the hash lock nests *outside* of op->lock now and all work with the
> hash is protected by it.
>
> * daemon interaction skips the requests that had been around since before
> the control device had been opened, acting as if the (already woken up)
> submitters had already gotten around to removing those from the list/hash.
> That's the normal outcome of such a race anyway, and it simplifies the
> analysis.
>
> Again, it's completely untested; might oops, break stuff, etc. I _think_
> it makes sense from the lifetime rules and locking POV, but...
>
> I still would like to understand what's going on in
> wait_for_cancellation_request() - it *probably* shouldn't matter for the
> correctness of that massage, but there might be dragons. I don't
> understand in which situation wrt pending signals it is supposed to be
> called and the comments in there contradict the actual code.
On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
> Well... that all seems awesome, and compiled the first
> time and all my quick tests on my dinky vm make
> it seem fine... It is Becky that recently spent a
> bunch of time fighting the cancellation dragons,
> I'll see if I can't get her to weigh in on
> wait_for_cancellation_downcall tomorrow.
>
> We have some gnarly tests we were running on
> real hardware that helped reproduce the problems
> she was seeing in production with Clemson's
> Palmetto Cluster, I'll run them, but maybe not
> until Monday with the ice storm...

OK, several more pushed. The most interesting part is probably the switch
to real completions - you'd been open-coding them for no good reason (and
as always with reinventing locking primitives, asking for trouble).

New bits just as untested as the earlier ones, of course...
OK, I'll get them momentarily...

I merged your other patches, and there was a merge
conflict I had to work around... you're working from
an orangefs tree that lacks one commit I had made
last week... my linux-next tree has all your patches
through yesterday in it now...

I am setting up "the gnarly test" (at home from a VM,
though) that should cause a bunch of cancellations,
I want to see if I can get
wait_for_cancellation_downcall to ever
flow past that "if (signal_pending(current)) {"
block... if it does, that demonstrates where
the comments conflict with the code, right?

-Mike

On Sat, Jan 23, 2016 at 2:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
>> Well... that all seems awesome, and compiled the first
>> time and all my quick tests on my dinky vm make
>> it seem fine... It is Becky that recently spent a
>> bunch of time fighting the cancellation dragons,
>> I'll see if I can't get her to weigh in on
>> wait_for_cancellation_downcall tomorrow.
>>
>> We have some gnarly tests we were running on
>> real hardware that helped reproduce the problems
>> she was seeing in production with Clemson's
>> Palmetto Cluster, I'll run them, but maybe not
>> until Monday with the ice storm...
>
> OK, several more pushed.  The most interesting part is probably switch
> to real completions - you'd been open-coding them for no good reason
> (and as always with reinventing locking primitives, asking for trouble).
>
> New bits just as untested as the earlier ones, of course...
I compiled and tested the new patches,
they seem to work more than great, unless
it is just my imagination that the kernel
module is much faster now. I'll measure it
with more than seat-of-the-pants to see
for sure. The patches are pushed to
the for-next branch.

My "gnarly test" can get the code to flow
into wait_for_cancellation_downcall, but
never would flow past the
"if (signal_pending(current)) {" block,
though that doesn't prove anything...

I had to look at the wiki page for "cargo culting" <g>...
When Becky was working on the cancellation
problem I alluded to earlier, we talked about and
suspected the spin_lock_irqsaves in
service_operation were not appropriate...

Thanks again Al...

-Mike

On Sat, Jan 23, 2016 at 2:24 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> OK, I'll get them momentarily...
>
> I merged your other patches, and there was a merge
> conflict I had to work around... you're working from
> an orangefs tree that lacks one commit I had made
> last week... my linux-next tree has all your patches
> through yesterday in it now...
>
> I am setting up "the gnarly test" (at home from a VM,
> though) that should cause a bunch of cancellations,
> I want to see if I can get
> wait_for_cancellation_downcall to ever
> flow past that "if (signal_pending(current)) {"
> block... if it does, that demonstrates where
> the comments conflict with the code, right?
>
> -Mike
>
> On Sat, Jan 23, 2016 at 2:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
>>> Well... that all seems awesome, and compiled the first
>>> time and all my quick tests on my dinky vm make
>>> it seem fine... It is Becky that recently spent a
>>> bunch of time fighting the cancellation dragons,
>>> I'll see if I can't get her to weigh in on
>>> wait_for_cancellation_downcall tomorrow.
>>>
>>> We have some gnarly tests we were running on
>>> real hardware that helped reproduce the problems
>>> she was seeing in production with Clemson's
>>> Palmetto Cluster, I'll run them, but maybe not
>>> until Monday with the ice storm...
>>
>> OK, several more pushed.  The most interesting part is probably switch
>> to real completions - you'd been open-coding them for no good reason
>> (and as always with reinventing locking primitives, asking for trouble).
>>
>> New bits just as untested as the earlier ones, of course...
On Sat, Jan 23, 2016 at 04:35:32PM -0500, Mike Marshall wrote:
> I compiled and tested the new patches,
> they seem to work more than great, unless
> it is just my imagination that the kernel
> module is much faster now. I'll measure it
> with more than seat-of-the-pants to see
> for sure. The patches are pushed to
> the for-next branch.

As long as the speedup is not due to not actually doing the IO... ;-)

> My "gnarly test" can get the code to flow
> into wait_for_cancellation_downcall, but
> never would flow past the
> "if (signal_pending(current)) {" block,
> though that doesn't prove anything...

AFAICS, it is possible to get there without a signal - you need the first
attempt of file_read / file_write to fail on disconnect, be retried and
that retry to fail on timeout.  _Then_ you get to sending a cancel without
any signals ever being sent to you.

Said that, I'm not sure that this "we don't wait at all if the signals
are pending" is right - consider e.g. a read getting killed after it had
been picked by daemon and before the daemon gets a chance to reply.  We
submit a cancel, but unless daemon picks it immediately, we don't even
bother waiting - we see that cancel hadn't been serviced yet, that a
signal is pending and we remove the cancel from request list.  In that
scenario daemon doesn't get a chance to see the cancel request at all...

In which situations do we want cancel to be given to daemon and do we
need to wait for its response?

> I had to look at the wiki page for "cargo culting" <g>...
> When Becky was working on the cancellation
> problem I alluded to earlier, we talked about and
> suspected the spin_lock_irqsaves in
> service_operation were not appropriate...

It isn't - first of all, prepare_to_wait() does spin_lock_irqsave()
itself (on the queue lock), so interrupt-disabling bits are there anyway.
And the exclusion on op->lock is completely pointless, obviously - you
are waiting for bufmap initialization, which has nothing to do with
specific op.  Note that by that point op is not on any lists, so there's
nobody who could see it, let alone grab the same spinlock...
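Al's point about prepare_to_wait() can be made concrete with the canonical wait-loop pattern, shown here as kernel-style pseudocode (the waitqueue and condition names below are invented for illustration, not OrangeFS's actual identifiers). prepare_to_wait() already takes the waitqueue's own lock with spin_lock_irqsave() internally while setting the task state, so wrapping this loop in a caller-side spin_lock_irqsave() on op->lock disables interrupts for no additional protection:

```c
/* Illustrative sketch only -- names are placeholders. */
DEFINE_WAIT(wait);

for (;;) {
	/* prepare_to_wait() internally does spin_lock_irqsave() on the
	 * waitqueue lock and sets TASK_INTERRUPTIBLE, closing the window
	 * between the condition check and going to sleep. */
	prepare_to_wait(&bufmap_init_waitq, &wait, TASK_INTERRUPTIBLE);
	if (bufmap_is_initialized())	/* invented condition helper */
		break;
	if (signal_pending(current))
		break;
	schedule();			/* actually sleep */
}
finish_wait(&bufmap_init_waitq, &wait);
```

Because nothing here touches op itself, and the op is on no list yet, there is no state for op->lock to protect at this point.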
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h index b926fe77..88e606a 100644 --- a/fs/orangefs/orangefs-kernel.h +++ b/fs/orangefs/orangefs-kernel.h @@ -103,24 +103,6 @@ enum orangefs_vfs_op_states { OP_VFS_STATE_PURGED = 8, }; -#define set_op_state_waiting(op) ((op)->op_state = OP_VFS_STATE_WAITING) -#define set_op_state_inprogress(op) ((op)->op_state = OP_VFS_STATE_INPROGR) -static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op) -{ - op->op_state = OP_VFS_STATE_SERVICED; - wake_up_interruptible(&op->waitq); -} -static inline void set_op_state_purged(struct orangefs_kernel_op_s *op) -{ - op->op_state |= OP_VFS_STATE_PURGED; - wake_up_interruptible(&op->waitq); -} - -#define op_state_waiting(op) ((op)->op_state & OP_VFS_STATE_WAITING) -#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR) -#define op_state_serviced(op) ((op)->op_state & OP_VFS_STATE_SERVICED) -#define op_state_purged(op) ((op)->op_state & OP_VFS_STATE_PURGED) - #define get_op(op) \ do { \ atomic_inc(&(op)->ref_count); \ @@ -259,6 +241,25 @@ struct orangefs_kernel_op_s { struct list_head list; }; +#define set_op_state_waiting(op) ((op)->op_state = OP_VFS_STATE_WAITING) +#define set_op_state_inprogress(op) ((op)->op_state = OP_VFS_STATE_INPROGR) +static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op) +{ + op->op_state = OP_VFS_STATE_SERVICED; + wake_up_interruptible(&op->waitq); +} +static inline void set_op_state_purged(struct orangefs_kernel_op_s *op) +{ + op->op_state |= OP_VFS_STATE_PURGED; + wake_up_interruptible(&op->waitq); +} + +#define op_state_waiting(op) ((op)->op_state & OP_VFS_STATE_WAITING) +#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR) +#define op_state_serviced(op) ((op)->op_state & OP_VFS_STATE_SERVICED) +#define op_state_purged(op) ((op)->op_state & OP_VFS_STATE_PURGED) + + /* per inode private orangefs info */ struct orangefs_inode_s { struct orangefs_object_kref refn; 
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c index 641de05..a257891 100644 --- a/fs/orangefs/waitqueue.c +++ b/fs/orangefs/waitqueue.c @@ -16,6 +16,9 @@ #include "orangefs-kernel.h" #include "orangefs-bufmap.h" +static int wait_for_cancellation_downcall(struct orangefs_kernel_op_s *); +static int wait_for_matching_downcall(struct orangefs_kernel_op_s *); + /* * What we do in this function is to walk the list of operations that are * present in the request queue and mark them as purged.