write() semantics (Re: Orangefs ABI documentation)

Message ID	CAOg9mSTDxoLfHceXOJAwNZ6mhwZs4o=d9sQUjWcAQsQbUPiE4w@mail.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> MIME-Version: 1.0 In-Reply-To: <CA+55aFw_3RpvBuTjrMank5CVRYp+RqOZn53Q3OrV+MNur_vZ6A@mail.gmail.com> References: <CAOg9mSTCkr5_or_VSOHnNSsVqd5S=ENuGYBsJkgitqgworN9Eg@mail.gmail.com> <CAOg9mSRoEFvFkdJmEmFOO_ErT7X3Q51=X=HP5Li6g-dVFANOOg@mail.gmail.com> <20160122200442.GF17997@ZenIV.linux.org.uk> <CAOg9mSTPbx0231RoERbbdeyR+nDCD4wGHHFKAzxmowai8GtNgg@mail.gmail.com> <20160123001202.GJ17997@ZenIV.linux.org.uk> <20160123012808.GK17997@ZenIV.linux.org.uk> <CAOg9mST8rpzSs4FGB=7yiGJ2PZBqw2s=H1ozQBiWjDF232ZnhQ@mail.gmail.com> <20160123191055.GN17997@ZenIV.linux.org.uk> <CAOg9mSR667SdVvR323MOaqEs0DVGKcxr8e0tyx=Bu_=ENops0g@mail.gmail.com> <20160123214006.GO17997@ZenIV.linux.org.uk> <20160123224632.GQ17997@ZenIV.linux.org.uk> <CA+55aFw_3RpvBuTjrMank5CVRYp+RqOZn53Q3OrV+MNur_vZ6A@mail.gmail.com> Date: Thu, 3 Mar 2016 17:25:08 -0500 Message-ID: <CAOg9mSTDxoLfHceXOJAwNZ6mhwZs4o=d9sQUjWcAQsQbUPiE4w@mail.gmail.com> Subject: Re: write() semantics (Re: Orangefs ABI documentation) From: Mike Marshall <hubcap@omnibond.com> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Mike Marshall <hubcap@omnibond.com> Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Message ID

CAOg9mSTDxoLfHceXOJAwNZ6mhwZs4o=d9sQUjWcAQsQbUPiE4w@mail.gmail.com (mailing list archive)

State

New, archived

Headers

MIME-Version: 1.0
In-Reply-To: <CA+55aFw_3RpvBuTjrMank5CVRYp+RqOZn53Q3OrV+MNur_vZ6A@mail.gmail.com>
References: <CAOg9mSTCkr5_or_VSOHnNSsVqd5S=ENuGYBsJkgitqgworN9Eg@mail.gmail.com>
	<CAOg9mSRoEFvFkdJmEmFOO_ErT7X3Q51=X=HP5Li6g-dVFANOOg@mail.gmail.com>
	<20160122200442.GF17997@ZenIV.linux.org.uk>
	<CAOg9mSTPbx0231RoERbbdeyR+nDCD4wGHHFKAzxmowai8GtNgg@mail.gmail.com>
	<20160123001202.GJ17997@ZenIV.linux.org.uk>
	<20160123012808.GK17997@ZenIV.linux.org.uk>
	<CAOg9mST8rpzSs4FGB=7yiGJ2PZBqw2s=H1ozQBiWjDF232ZnhQ@mail.gmail.com>
	<20160123191055.GN17997@ZenIV.linux.org.uk>
	<CAOg9mSR667SdVvR323MOaqEs0DVGKcxr8e0tyx=Bu_=ENops0g@mail.gmail.com>
	<20160123214006.GO17997@ZenIV.linux.org.uk>
	<20160123224632.GQ17997@ZenIV.linux.org.uk>
	<CA+55aFw_3RpvBuTjrMank5CVRYp+RqOZn53Q3OrV+MNur_vZ6A@mail.gmail.com>
Date: Thu, 3 Mar 2016 17:25:08 -0500
Message-ID: <CAOg9mSTDxoLfHceXOJAwNZ6mhwZs4o=d9sQUjWcAQsQbUPiE4w@mail.gmail.com>
Subject: Re: write() semantics (Re: Orangefs ABI documentation)
From: Mike Marshall <hubcap@omnibond.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Mike Marshall <hubcap@omnibond.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Commit Message

Mike Marshall March 3, 2016, 10:25 p.m. UTC

Here is what I have come up with to try and make our return to
interrupted writes more acceptable... in my tests it seems to
work. My test involved running the client-core with a tiny IO
buffer size (4k) and a C program with a large write buffer(32M).
That way there was plenty of time for me to fire off the C program
and hit Ctrl-C while the write was chugging along.

The return value from do_readv_writev always matches the size
of the file written by the aborted C program in my tests.

I changed the C program around so that sometimes I ran it with
(O_CREAT | O_RDWR) on open and sometimes with (O_CREAT | O_RDWR | O_APPEND)
and it seemed to do the right thing. I didn't try to set up a
signal handler so that the signal wasn't fatal to the process,
I guess that would be a test to actually see and verify the correct
short return code to write...

Do you all think this looks like it should work in principle?

BTW: in the distant past someone else attempted to solve this problem
the "nfs intr" way - we have an intr mount option, and that's why there's
all that sweating over whether or not stuff is "interruptible" in
waitqueue.c... I'm not sure if our intr mount option is relevant anymore
given the way the op handling code has evolved...

On Sat, Jan 23, 2016 at 6:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Jan 23, 2016 at 2:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> What should we get?  -EINTR, despite having written some data?
>
> No, that's not acceptable.
>
> Either all or nothing (which is POSIX) or the NFS 'intr' mount
> behavior (partial write return, -EINTR only when nothing was written
> at all). And, like NFS, a mount option might be a good thing.
>
> And of course, for the usual reasons, fatal signals are special in
> that for them we generally say "screw posix, nobody sees the return
> value anyway", but even there the filesystem might as well still
> return the partial return value (just to not introduce yet another
> special case).
>
> In fact, I think that with our "fatal signals interrupt" behavior,
> nobody should likely use the "intr" mount option on NFS. Even if the
> semantics may be "better", there are likely simply just too many
> programs that don't check the return value of "write()" at all, much
> less handle partial writes correctly.
>
> (And yes, our "screw posix" behavior wrt fatal signals is strictly
> wrong even _despite_ the fact that nobody sees the return value -
> other processes can still obviously see that the whole write wasn't
> done. But blocking on a fatal signal is _so_ annoying that it's one of
> those things where we just say "posix was wrong on this one, and if we
> squint a bit we look _almost_ like we're compliant").
>
>               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Mike Marshall March 4, 2016, 8:55 p.m. UTC | #1

I added a signal handler to my test program today, and now I see the
giant difference between wait_for_completion_interruptible_timeout,
which is what you get when you use the -intr mount option, and
wait_for_completion_killable_timeout, which is what you get when
you don't, so I retract what I said about the -intr mount option
not being relevant <g>...

It also seems like it would be easy (and correct) to modify the
patch a little so that -EINTR would still be returned on the off-chance
that the interrupt was caught before any data was written...

-Mike

On Thu, Mar 3, 2016 at 5:25 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Here is what I have come up with to try and make our return to
> interrupted writes more acceptable... in my tests it seems to
> work. My test involved running the client-core with a tiny IO
> buffer size (4k) and a C program with a large write buffer(32M).
> That way there was plenty of time for me to fire off the C program
> and hit Ctrl-C while the write was chugging along.
>
> The return value from do_readv_writev always matches the size
> of the file written by the aborted C program in my tests.
>
> I changed the C program around so that sometimes I ran it with
> (O_CREAT | O_RDWR) on open and sometimes with (O_CREAT | O_RDWR | O_APPEND)
> and it seemed to do the right thing. I didn't try to set up a
> signal handler so that the signal wasn't fatal to the process,
> I guess that would be a test to actually see and verify the correct
> short return code to write...
>
> Do you all think this looks like it should work in principle?
>
> BTW: in the distant past someone else attempted to solve this problem
> the "nfs intr" way - we have an intr mount option, and that's why there's
> all that sweating over whether or not stuff is "interruptible" in
> waitqueue.c... I'm not sure if our intr mount option is relevant anymore
> given the way the op handling code has evolved...
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index 6f2e0f7..4349c9d 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -180,21 +180,54 @@ populate_shared_memory:
>   }
>
>   if (ret < 0) {
> - /*
> - * don't write an error to syslog on signaled operation
> - * termination unless we've got debugging turned on, as
> - * this can happen regularly (i.e. ctrl-c)
> - */
> - if (ret == -EINTR)
> + if (ret == -EINTR) {
> + /*
> + * We can't return EINTR if any data was written,
> + * it's not POSIX. It is minimally acceptable
> + * to give a partial write, the way NFS does.
> + *
> + * It would be optimal to return all or nothing,
> + * but if a userspace write is bigger than
> + * an IO buffer, and the interrupt occurs
> + * between buffer writes, that would not be
> + * possible.
> + */
> + switch (new_op->op_state - OP_VFS_STATE_GIVEN_UP) {
> + /*
> + * If the op was waiting when the interrupt
> + * occurred, then the client-core did not
> + * trigger the write.
> + */
> + case OP_VFS_STATE_WAITING:
> + ret = 0;
> + break;
> + /*
> + * If the op was in progress when the interrupt
> + * occurred, then the client-core was able to
> + * trigger the write.
> + */
> + case OP_VFS_STATE_INPROGR:
> + ret = total_size;
> + break;
> + default:
> + gossip_err("%s: unexpected op state :%d:.\n",
> +   __func__,
> +   new_op->op_state);
> + ret = 0;
> + break;
> + }
>   gossip_debug(GOSSIP_FILE_DEBUG,
> -     "%s: returning error %ld\n", __func__,
> -     (long)ret);
> - else
> +     "%s: got EINTR, state:%d: %p\n",
> +     __func__,
> +     new_op->op_state,
> +     new_op);
> + } else {
>   gossip_err("%s: error in %s handle %pU, returning %zd\n",
>   __func__,
>   type == ORANGEFS_IO_READ ?
>   "read from" : "write to",
>   handle, ret);
> + }
>   if (orangefs_cancel_op_in_progress(new_op))
>   return ret;
>
>
> On Sat, Jan 23, 2016 at 6:35 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Sat, Jan 23, 2016 at 2:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>
>>> What should we get?  -EINTR, despite having written some data?
>>
>> No, that's not acceptable.
>>
>> Either all or nothing (which is POSIX) or the NFS 'intr' mount
>> behavior (partial write return, -EINTR only when nothing was written
>> at all). And, like NFS, a mount option might be a good thing.
>>
>> And of course, for the usual reasons, fatal signals are special in
>> that for them we generally say "screw posix, nobody sees the return
>> value anyway", but even there the filesystem might as well still
>> return the partial return value (just to not introduce yet another
>> special case).
>>
>> In fact, I think that with our "fatal signals interrupt" behavior,
>> nobody should likely use the "intr" mount option on NFS. Even if the
>> semantics may be "better", there are likely simply just too many
>> programs that don't check the return value of "write()" at all, much
>> less handle partial writes correctly.
>>
>> (And yes, our "screw posix" behavior wrt fatal signals is strictly
>> wrong even _despite_ the fact that nobody sees the return value -
>> other processes can still obviously see that the whole write wasn't
>> done. But blocking on a fatal signal is _so_ annoying that it's one of
>> those things where we just say "posix was wrong on this one, and if we
>> squint a bit we look _almost_ like we're compliant").
>>
>>               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 6f2e0f7..4349c9d 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -180,21 +180,54 @@  populate_shared_memory:
  }

  if (ret < 0) {
- /*
- * don't write an error to syslog on signaled operation
- * termination unless we've got debugging turned on, as
- * this can happen regularly (i.e. ctrl-c)
- */
- if (ret == -EINTR)
+ if (ret == -EINTR) {
+ /*
+ * We can't return EINTR if any data was written,
+ * it's not POSIX. It is minimally acceptable
+ * to give a partial write, the way NFS does.
+ *
+ * It would be optimal to return all or nothing,
+ * but if a userspace write is bigger than
+ * an IO buffer, and the interrupt occurs
+ * between buffer writes, that would not be
+ * possible.
+ */
+ switch (new_op->op_state - OP_VFS_STATE_GIVEN_UP) {
+ /*
+ * If the op was waiting when the interrupt
+ * occurred, then the client-core did not
+ * trigger the write.
+ */
+ case OP_VFS_STATE_WAITING:
+ ret = 0;
+ break;
+ /*
+ * If the op was in progress when the interrupt
+ * occurred, then the client-core was able to
+ * trigger the write.
+ */
+ case OP_VFS_STATE_INPROGR:
+ ret = total_size;
+ break;
+ default:
+ gossip_err("%s: unexpected op state :%d:.\n",
+   __func__,
+   new_op->op_state);
+ ret = 0;
+ break;
+ }
  gossip_debug(GOSSIP_FILE_DEBUG,
-     "%s: returning error %ld\n", __func__,
-     (long)ret);
- else
+     "%s: got EINTR, state:%d: %p\n",
+     __func__,
+     new_op->op_state,
+     new_op);
+ } else {
  gossip_err("%s: error in %s handle %pU, returning %zd\n",
  __func__,
  type == ORANGEFS_IO_READ ?
  "read from" : "write to",
  handle, ret);
+ }
  if (orangefs_cancel_op_in_progress(new_op))
  return ret;

write() semantics (Re: Orangefs ABI documentation)

Commit Message

Comments

Patch