[RFC] Fix deadlock on regular nonseekable files

Message ID	CANZZXhCCUh12_1gt0ppc4cMMR-BUsq4rySqNihsEJziOGYbh3A@mail.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> MIME-Version: 1.0 Date: Fri, 20 Mar 2015 15:17:54 +0200 Message-ID: <CANZZXhCCUh12_1gt0ppc4cMMR-BUsq4rySqNihsEJziOGYbh3A@mail.gmail.com> Subject: [PATCH] [RFC] Fix deadlock on regular nonseekable files From: Vitaly Chernooky <vitalii.chernookyi@globallogic.com> To: Alexander Viro <viro@zeniv.linux.org.uk>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Linus Torvalds <torvalds@linux-foundation.org> Cc: David Vrabel <david.vrabel@citrix.com>, =?UTF-8?Q?Marek_Marczykowski=2DG=C3=B3recki?= <marmarek@invisiblethingslab.com>, Iurii Konovalenko <iurii.konovalenko@globallogic.com>, Ian Campbell <ian.campbell@citrix.com>, Boris Ostrovsky <boris.ostrovsky@oracle.com>, Andrii Anisov <andrii.anisov@globallogic.com>, Artem Mygaiev <artem.mygaiev@globallogic.com> Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Vitaly Chernooky March 20, 2015, 1:17 p.m. UTC

From 8ef72cde695d1b1a3e9f6165477c9e7415fca2b1 Mon Sep 17 00:00:00 2001
From: Vitaly Chernooky <vitaly.chernooky@globallogic.com>
Date: Fri, 20 Mar 2015 12:26:37 +0200
Subject: [PATCH] Fix deadlock on regular nonseekable files

'Commit 9c225f2655e36a470c4f58dbbc99244c5fc7f2d4 ("vfs: atomic f_pos
accesses as per POSIX")' introduce following regression. If some program
does multithreaded IO on file in pseudo-filesystem, like procfs, with
nonseekable files marked as regular, we get deadlock on f_pos_lock
mutex, if there are simultaneous reading and writing by different
threads.

Signed-off-by: Vitaly Chernooky <vitaly.chernooky@globallogic.com>
---
 fs/open.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Al Viro March 20, 2015, 1:42 p.m. UTC | #1

On Fri, Mar 20, 2015 at 03:17:54PM +0200, Vitaly Chernooky wrote:
> >From 8ef72cde695d1b1a3e9f6165477c9e7415fca2b1 Mon Sep 17 00:00:00 2001
> From: Vitaly Chernooky <vitaly.chernooky@globallogic.com>
> Date: Fri, 20 Mar 2015 12:26:37 +0200
> Subject: [PATCH] Fix deadlock on regular nonseekable files
> 
> 'Commit 9c225f2655e36a470c4f58dbbc99244c5fc7f2d4 ("vfs: atomic f_pos
> accesses as per POSIX")' introduce following regression. If some program
> does multithreaded IO on file in pseudo-filesystem, like procfs, with
> nonseekable files marked as regular, we get deadlock on f_pos_lock
> mutex, if there are simultaneous reading and writing by different
> threads.
 
Details of deadlock, please.  How do we manage that when it's always the
outermost lock to be taken?  Describe the minimal deadlocked set of threads -
thread 1 holds <this> and is blocked trying to get <that>, etc.

AFAICS, any threads blocked on f_pos_lock are not holding anything else and
cannot impede the rest.  What am I missing here?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vitaly Chernooky March 20, 2015, 2:22 p.m. UTC | #2

> AFAICS, any threads blocked on f_pos_lock are not holding anything else and
> cannot impede the rest.  What am I missing here?

As far as I understand it is true only for files on regular filesystem
like ext4. Let's to see how XEN guys run into trouble with that
f_pos_lock:

---------- Forwarded message ----------
From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Date: Thu, Mar 19, 2015 at 3:19 AM
Subject: [Xen-devel] Deadlock in /proc/xen/xenbus watch+read on 3.17+
(maybe earlier)
To: xen-devel <xen-devel@lists.xen.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>, David Vrabel
<david.vrabel@citrix.com>


Hi,

I've hit some deadlock in kernel xenstore client exposed via
/proc/xen/xenbus. Steps to reproduce are simple:
int main() {
        struct xs_handle *xs;
        xs = xs_open(0);
        xs_watch(xs, "domid", "token");
        xs_read(xs, 0, "name", NULL);
        return 0;
}

xs_watch internally creates new thread, which uses read to wait for the
watch. And in the same time, the program tries to read some value,
but actually it hangs at sending the command (before even sending a path to be
read). Strace gives this (simplified for readability):
[pid  2494] write(3, "\4\0\0\0\0\0\0\0\0\0\0\0\f\0\0\0", 160 = 16
[pid  2494] write(3, "domid\0", 6)      = 6
[pid  2494] write(3, "token\0", 6)      = 6
[pid  2495] read(3,  <unfinished ...>
[pid  2494] futex(0x71c0d4, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid  2495] <... read resumed>
"\17\0\0\0\377\377\377\377\220~\255\27\f\0\0\0", 16) = 16
[pid  2495] read(3, "domid\0token\0", 12) = 12
[pid  2495] read(3, "\4\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0", 16) = 16
[pid  2495] read(3, "OK\0", 3)          = 3
[pid  2495] futex(0x71c0d4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x71c0d0,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
[pid  2494] <... futex resumed> )       = 0
[pid  2495] <... futex resumed> )       = 1
[pid  2494] futex(0x71c0a8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  2495] futex(0x71c0a8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  2494] <... futex resumed> )       = -1 EAGAIN (Resource
temporarily unavailable)
[pid  2495] <... futex resumed> )       = 0
[pid  2494] futex(0x71c0a8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  2495] read(3,  <unfinished ...>
[pid  2494] <... futex resumed> )       = 0
[pid  2494] rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTORER,
0x7fc78c1488f0}, NULL, 8) = 0
[pid  2494] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER,
0x7fc78c1488f0}, {SIG_DFL, [], SA_RESTORER, 0x7fc78c1488f0}, 8) = 0
[pid  2494] write(3, "\2\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0", 16

And thats all - 2494 is waiting on write, 2495 is waiting on read.

On 3.12.x it is working. On 3.17.0 and 3.18.7 it is broken. I haven't
checked versions in the middle.

Any ideas?

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




And Iurii Konovalenko has debugged that the root of their troubles is
that Commit 9c225f2655e36a470c4f58dbbc99244c5fc7f2d4.

With best regards,

Al Viro March 20, 2015, 2:46 p.m. UTC | #3

On Fri, Mar 20, 2015 at 04:22:11PM +0200, Vitaly Chernooky wrote:
> > AFAICS, any threads blocked on f_pos_lock are not holding anything else and
> > cannot impede the rest.  What am I missing here?
> 
> As far as I understand it is true only for files on regular filesystem
> like ext4. Let's to see how XEN guys run into trouble with that
> f_pos_lock:

What does it have to do with filesystem type involved?  The only place that
takes f_pos_lock is __fdget_pos(), with only one caller in the entire tree -
fdget_pos().  Which is static in fs/read_write.c and all its callers are
in right in the beginning of sys_<something>.

Is it just that they have read() on procfs file blocked waiting for something
and a bunch of other read/write on the same descriptor blocked on ->f_pos_mutex
waiting for that sucker to finish?

Then basically they are asking to waive XSI 2.9.7 for that file - behaviour
*is* required by POSIX.

What file it is and what is the first read() (or write(), whatever) blocked on?
Stack traces would be useful for the latter...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vitaly Chernooky March 20, 2015, 5:37 p.m. UTC | #4

> What does it have to do with filesystem type involved?

Discussed trouble directly depends on is it used nonseekable_open() in
fs driver or not. Regular fss like ext4 does not use
nonseekable_open() so there is no troubles. But others like ubifs,
debugfs, fuse and xenfs use it and are affected by discussed
regression.

> Then basically they are asking to waive XSI 2.9.7 for that file - behaviour
> *is* required by POSIX.

Yes, but in steps to follow XSI 2.9.7 we were not enough meticulous
and had broken things which are *out of scope of XSI 2.9.7* and are
subject for other standards.

So, in steps to create files broken filesystems do something like it:

res = finish_open(file, dentry, nonseekable_open, &opened);

Let's to look what happen during this call:

1. we call do_dentry_open(file, open, current_cred())
2. f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
                FMODE_PREAD | FMODE_PWRITE;
3. Let's to mention FMODE_LSEEK above.
4. if (S_ISREG(inode->i_mode))
        f->f_mode |= FMODE_ATOMIC_POS;
5. Yes, above we do right things because our file is *regular* and we
want to follow XSI 2.9.7 for *regular* files.
6. error = open(inode, f);
7. In line above by fact we call nonseekable_open()
8. filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
9. Let's stop here and mention than in line above we clear FMODE_LSEEK
bit for *regular* files. So we have got *regular* file without
FMODE_LSEEK. But:

- IEEE Std 1003.1, 2013 Edition,
- ISO/IEC 9945-1: 1996 (E) and 1003.1, 1996 Edition,

and few other standards declare than files with cleared FMODE_LSEEK
*is not* regular. I do not know who, when and why has introduced this
mess. It is pre-git historical code. And now by introducing
FMODE_ATOMIC_POS we change behavior of this mature practice and affect
at least few filesystems. Yes, following standards is good, but I do
not accept than breaking mature code is good idea. So I have chosen to
clear FMODE_ATOMIC_POS in nonseekable_open() simultaneously with
FMODE_LSEEK because it returns that nonseekable semi-standard files
back to life. I guess, it is the best solution for now. And, also, XEN
guys are happy with this solution.

So what do you think about all this mess and may be it is possible to
have got better solution?

I apologize for this long and confuses explanation :(

With best regards,

Al Viro March 20, 2015, 5:55 p.m. UTC | #5

On Fri, Mar 20, 2015 at 07:37:58PM +0200, Vitaly Chernooky wrote:
> > What does it have to do with filesystem type involved?
> 
> Discussed trouble directly depends on is it used nonseekable_open() in
> fs driver or not. Regular fss like ext4 does not use
> nonseekable_open() so there is no troubles. But others like ubifs,
> debugfs, fuse and xenfs use it and are affected by discussed
> regression.

What does nonseekable_open() have to do with that, other than as
a heuristics for "we want to break 2.9.7"?

> I do not know who, when and why has introduced this
> mess. It is pre-git historical code. And now by introducing
> FMODE_ATOMIC_POS we change behavior of this mature practice and affect
> at least few filesystems. Yes, following standards is good, but I do
> not accept than breaking mature code is good idea. So I have chosen to
> clear FMODE_ATOMIC_POS in nonseekable_open() simultaneously with
> FMODE_LSEEK because it returns that nonseekable semi-standard files
> back to life. I guess, it is the best solution for now. And, also, XEN
> guys are happy with this solution.
> 
> So what do you think about all this mess and may be it is possible to
> have got better solution?

	What the devil does that have to do with seeks, anyway?  Exact
same problem will happen for blocking read() vs. another read() attempts
on the same descriptor.  With perfectly accepted lseek() (which will also
have to block, as per 2.9.7).

	Which file is that?  And what behaviour did you get on old kernels
with it?  All reads block inside ->read() rather than sys_read()?

	Details, please.  Your strace doesn't show the relevant open(),
so it's hard to tell what's really going on there in that regression.
I agree that user-visible behaviour changes need to be dealt with; it's
the nature of your fix I've a problem with.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marek Marczykowski-Górecki March 20, 2015, 7 p.m. UTC | #6

On Fri, Mar 20, 2015 at 05:55:17PM +0000, Al Viro wrote:
> On Fri, Mar 20, 2015 at 07:37:58PM +0200, Vitaly Chernooky wrote:
> > > What does it have to do with filesystem type involved?
> > 
> > Discussed trouble directly depends on is it used nonseekable_open() in
> > fs driver or not. Regular fss like ext4 does not use
> > nonseekable_open() so there is no troubles. But others like ubifs,
> > debugfs, fuse and xenfs use it and are affected by discussed
> > regression.
> 
> What does nonseekable_open() have to do with that, other than as
> a heuristics for "we want to break 2.9.7"?
> 
> > I do not know who, when and why has introduced this
> > mess. It is pre-git historical code. And now by introducing
> > FMODE_ATOMIC_POS we change behavior of this mature practice and affect
> > at least few filesystems. Yes, following standards is good, but I do
> > not accept than breaking mature code is good idea. So I have chosen to
> > clear FMODE_ATOMIC_POS in nonseekable_open() simultaneously with
> > FMODE_LSEEK because it returns that nonseekable semi-standard files
> > back to life. I guess, it is the best solution for now. And, also, XEN
> > guys are happy with this solution.
> > 
> > So what do you think about all this mess and may be it is possible to
> > have got better solution?
> 
> 	What the devil does that have to do with seeks, anyway?  Exact
> same problem will happen for blocking read() vs. another read() attempts
> on the same descriptor.  With perfectly accepted lseek() (which will also
> have to block, as per 2.9.7).

Yes, the problem here is because this particular file (/proc/xen/xenbus)
blocks the read() operation waiting for new events. Because of said
commit, now it also blocks write() operation used to send some request
(which would result in some response, so unblocking read() call). It
shouldn't be a normal file in the first place...

> 	Which file is that?  And what behaviour did you get on old kernels
> with it?  All reads block inside ->read() rather than sys_read()?

This is /proc/xen/xenbus. 

> 	Details, please.  Your strace doesn't show the relevant open(),
> so it's hard to tell what's really going on there in that regression.
> I agree that user-visible behaviour changes need to be dealt with; it's
> the nature of your fix I've a problem with.

Maybe its better to change the file type? Character device? Pipe?

Al Viro March 20, 2015, 7:35 p.m. UTC | #7

On Fri, Mar 20, 2015 at 08:00:52PM +0100, Marek Marczykowski-Górecki wrote:
> > 	What the devil does that have to do with seeks, anyway?  Exact
> > same problem will happen for blocking read() vs. another read() attempts
> > on the same descriptor.  With perfectly accepted lseek() (which will also
> > have to block, as per 2.9.7).
> 
> Yes, the problem here is because this particular file (/proc/xen/xenbus)
> blocks the read() operation waiting for new events. Because of said
> commit, now it also blocks write() operation used to send some request
> (which would result in some response, so unblocking read() call). It
> shouldn't be a normal file in the first place...

Aha.  OK, so you have something that looks a whole lot like a FIFO in
that respect, and this semantics simply isn't compatible with read()
being atomic wrt write().

So just have that flag explicitly knocked out in your ->open(), preferably
with a comment explaining why is that done.  Having lseek() is a red herring
in that respect - the same problem would exist if that file *did* have
something done on lseek().

That's actually what I'm objecting against - "uses nonseekable_open()" is
used a weird proxy for "can't have read(), write(), etc. atomic wrt each
other".  It's not true in either direction - there's a lot of e.g. procfs
files that are just fine with current exclusion and there can very well
be files _not_ using nonseekable_open() that would break the same way
and for the same reasons as /proc/xen/xenbus does.

It's trivial to fix - either by explicit filp->f_mode &= ~FMODE_ATOMIC_POS;
in xenbus_file_open(), or by adding
static inline void no_atomic_pos(struct file *f)
{
	f->f_mode &= ~FMODE_ATOMIC_POS;
}
somewhere in include/linux/fs.h and having it called in the same
xenbus_file_open().  Either way, it ought to come with something
along the lines of
	/*
	 * we can't live with read() vs. write() atomicity, since we use
	 * write() as source of events returned by read() and write()
	 * called after another thread has blocked in read() waiting for
	 * events cannot be required to wait for that read() to finish.
	 */
next to this removal of FMODE_ATOMIC_POS, whichever way we express it...

Objections?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC] Fix deadlock on regular nonseekable files

Commit Message

Comments

Patch