Message ID | CANZZXhCCUh12_1gt0ppc4cMMR-BUsq4rySqNihsEJziOGYbh3A@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Mar 20, 2015 at 03:17:54PM +0200, Vitaly Chernooky wrote: > >From 8ef72cde695d1b1a3e9f6165477c9e7415fca2b1 Mon Sep 17 00:00:00 2001 > From: Vitaly Chernooky <vitaly.chernooky@globallogic.com> > Date: Fri, 20 Mar 2015 12:26:37 +0200 > Subject: [PATCH] Fix deadlock on regular nonseekable files > > 'Commit 9c225f2655e36a470c4f58dbbc99244c5fc7f2d4 ("vfs: atomic f_pos > accesses as per POSIX")' introduce following regression. If some program > does multithreaded IO on file in pseudo-filesystem, like procfs, with > nonseekable files marked as regular, we get deadlock on f_pos_lock > mutex, if there are simultaneous reading and writing by different > threads. Details of deadlock, please. How do we manage that when it's always the outermost lock to be taken? Describe the minimal deadlocked set of threads - thread 1 holds <this> and is blocked trying to get <that>, etc. AFAICS, any threads blocked on f_pos_lock are not holding anything else and cannot impede the rest. What am I missing here? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> AFAICS, any threads blocked on f_pos_lock are not holding anything else and > cannot impede the rest. What am I missing here? As far as I understand it is true only for files on regular filesystem like ext4. Let's to see how XEN guys run into trouble with that f_pos_lock: ---------- Forwarded message ---------- From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Date: Thu, Mar 19, 2015 at 3:19 AM Subject: [Xen-devel] Deadlock in /proc/xen/xenbus watch+read on 3.17+ (maybe earlier) To: xen-devel <xen-devel@lists.xen.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>, David Vrabel <david.vrabel@citrix.com> Hi, I've hit some deadlock in kernel xenstore client exposed via /proc/xen/xenbus. Steps to reproduce are simple: int main() { struct xs_handle *xs; xs = xs_open(0); xs_watch(xs, "domid", "token"); xs_read(xs, 0, "name", NULL); return 0; } xs_watch internally creates new thread, which uses read to wait for the watch. And in the same time, the program tries to read some value, but actually it hangs at sending the command (before even sending a path to be read). Strace gives this (simplified for readability): [pid 2494] write(3, "\4\0\0\0\0\0\0\0\0\0\0\0\f\0\0\0", 160 = 16 [pid 2494] write(3, "domid\0", 6) = 6 [pid 2494] write(3, "token\0", 6) = 6 [pid 2495] read(3, <unfinished ...> [pid 2494] futex(0x71c0d4, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...> [pid 2495] <... read resumed> "\17\0\0\0\377\377\377\377\220~\255\27\f\0\0\0", 16) = 16 [pid 2495] read(3, "domid\0token\0", 12) = 12 [pid 2495] read(3, "\4\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0", 16) = 16 [pid 2495] read(3, "OK\0", 3) = 3 [pid 2495] futex(0x71c0d4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x71c0d0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 2494] <... futex resumed> ) = 0 [pid 2495] <... futex resumed> ) = 1 [pid 2494] futex(0x71c0a8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 2495] futex(0x71c0a8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 2494] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 2495] <... futex resumed> ) = 0 [pid 2494] futex(0x71c0a8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 2495] read(3, <unfinished ...> [pid 2494] <... futex resumed> ) = 0 [pid 2494] rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTORER, 0x7fc78c1488f0}, NULL, 8) = 0 [pid 2494] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7fc78c1488f0}, {SIG_DFL, [], SA_RESTORER, 0x7fc78c1488f0}, 8) = 0 [pid 2494] write(3, "\2\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0", 16 And thats all - 2494 is waiting on write, 2495 is waiting on read. On 3.12.x it is working. On 3.17.0 and 3.18.7 it is broken. I haven't checked versions in the middle. Any ideas? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab And Iurii Konovalenko has debugged that the root of their troubles is that Commit 9c225f2655e36a470c4f58dbbc99244c5fc7f2d4. With best regards,
On Fri, Mar 20, 2015 at 04:22:11PM +0200, Vitaly Chernooky wrote: > > AFAICS, any threads blocked on f_pos_lock are not holding anything else and > > cannot impede the rest. What am I missing here? > > As far as I understand it is true only for files on regular filesystem > like ext4. Let's to see how XEN guys run into trouble with that > f_pos_lock: What does it have to do with filesystem type involved? The only place that takes f_pos_lock is __fdget_pos(), with only one caller in the entire tree - fdget_pos(). Which is static in fs/read_write.c and all its callers are in right in the beginning of sys_<something>. Is it just that they have read() on procfs file blocked waiting for something and a bunch of other read/write on the same descriptor blocked on ->f_pos_mutex waiting for that sucker to finish? Then basically they are asking to waive XSI 2.9.7 for that file - behaviour *is* required by POSIX. What file it is and what is the first read() (or write(), whatever) blocked on? Stack traces would be useful for the latter... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> What does it have to do with filesystem type involved? Discussed trouble directly depends on is it used nonseekable_open() in fs driver or not. Regular fss like ext4 does not use nonseekable_open() so there is no troubles. But others like ubifs, debugfs, fuse and xenfs use it and are affected by discussed regression. > Then basically they are asking to waive XSI 2.9.7 for that file - behaviour > *is* required by POSIX. Yes, but in steps to follow XSI 2.9.7 we were not enough meticulous and had broken things which are *out of scope of XSI 2.9.7* and are subject for other standards. So, in steps to create files broken filesystems do something like it: res = finish_open(file, dentry, nonseekable_open, &opened); Let's to look what happen during this call: 1. we call do_dentry_open(file, open, current_cred()) 2. f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; 3. Let's to mention FMODE_LSEEK above. 4. if (S_ISREG(inode->i_mode)) f->f_mode |= FMODE_ATOMIC_POS; 5. Yes, above we do right things because our file is *regular* and we want to follow XSI 2.9.7 for *regular* files. 6. error = open(inode, f); 7. In line above by fact we call nonseekable_open() 8. filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE); 9. Let's stop here and mention than in line above we clear FMODE_LSEEK bit for *regular* files. So we have got *regular* file without FMODE_LSEEK. But: - IEEE Std 1003.1, 2013 Edition, - ISO/IEC 9945-1: 1996 (E) and 1003.1, 1996 Edition, and few other standards declare than files with cleared FMODE_LSEEK *is not* regular. I do not know who, when and why has introduced this mess. It is pre-git historical code. And now by introducing FMODE_ATOMIC_POS we change behavior of this mature practice and affect at least few filesystems. Yes, following standards is good, but I do not accept than breaking mature code is good idea. So I have chosen to clear FMODE_ATOMIC_POS in nonseekable_open() simultaneously with FMODE_LSEEK because it returns that nonseekable semi-standard files back to life. I guess, it is the best solution for now. And, also, XEN guys are happy with this solution. So what do you think about all this mess and may be it is possible to have got better solution? I apologize for this long and confuses explanation :( With best regards,
On Fri, Mar 20, 2015 at 07:37:58PM +0200, Vitaly Chernooky wrote: > > What does it have to do with filesystem type involved? > > Discussed trouble directly depends on is it used nonseekable_open() in > fs driver or not. Regular fss like ext4 does not use > nonseekable_open() so there is no troubles. But others like ubifs, > debugfs, fuse and xenfs use it and are affected by discussed > regression. What does nonseekable_open() have to do with that, other than as a heuristics for "we want to break 2.9.7"? > I do not know who, when and why has introduced this > mess. It is pre-git historical code. And now by introducing > FMODE_ATOMIC_POS we change behavior of this mature practice and affect > at least few filesystems. Yes, following standards is good, but I do > not accept than breaking mature code is good idea. So I have chosen to > clear FMODE_ATOMIC_POS in nonseekable_open() simultaneously with > FMODE_LSEEK because it returns that nonseekable semi-standard files > back to life. I guess, it is the best solution for now. And, also, XEN > guys are happy with this solution. > > So what do you think about all this mess and may be it is possible to > have got better solution? What the devil does that have to do with seeks, anyway? Exact same problem will happen for blocking read() vs. another read() attempts on the same descriptor. With perfectly accepted lseek() (which will also have to block, as per 2.9.7). Which file is that? And what behaviour did you get on old kernels with it? All reads block inside ->read() rather than sys_read()? Details, please. Your strace doesn't show the relevant open(), so it's hard to tell what's really going on there in that regression. I agree that user-visible behaviour changes need to be dealt with; it's the nature of your fix I've a problem with. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 20, 2015 at 05:55:17PM +0000, Al Viro wrote: > On Fri, Mar 20, 2015 at 07:37:58PM +0200, Vitaly Chernooky wrote: > > > What does it have to do with filesystem type involved? > > > > Discussed trouble directly depends on is it used nonseekable_open() in > > fs driver or not. Regular fss like ext4 does not use > > nonseekable_open() so there is no troubles. But others like ubifs, > > debugfs, fuse and xenfs use it and are affected by discussed > > regression. > > What does nonseekable_open() have to do with that, other than as > a heuristics for "we want to break 2.9.7"? > > > I do not know who, when and why has introduced this > > mess. It is pre-git historical code. And now by introducing > > FMODE_ATOMIC_POS we change behavior of this mature practice and affect > > at least few filesystems. Yes, following standards is good, but I do > > not accept than breaking mature code is good idea. So I have chosen to > > clear FMODE_ATOMIC_POS in nonseekable_open() simultaneously with > > FMODE_LSEEK because it returns that nonseekable semi-standard files > > back to life. I guess, it is the best solution for now. And, also, XEN > > guys are happy with this solution. > > > > So what do you think about all this mess and may be it is possible to > > have got better solution? > > What the devil does that have to do with seeks, anyway? Exact > same problem will happen for blocking read() vs. another read() attempts > on the same descriptor. With perfectly accepted lseek() (which will also > have to block, as per 2.9.7). Yes, the problem here is because this particular file (/proc/xen/xenbus) blocks the read() operation waiting for new events. Because of said commit, now it also blocks write() operation used to send some request (which would result in some response, so unblocking read() call). It shouldn't be a normal file in the first place... > Which file is that? And what behaviour did you get on old kernels > with it? All reads block inside ->read() rather than sys_read()? This is /proc/xen/xenbus. > Details, please. Your strace doesn't show the relevant open(), > so it's hard to tell what's really going on there in that regression. > I agree that user-visible behaviour changes need to be dealt with; it's > the nature of your fix I've a problem with. Maybe its better to change the file type? Character device? Pipe?
On Fri, Mar 20, 2015 at 08:00:52PM +0100, Marek Marczykowski-Górecki wrote: > > What the devil does that have to do with seeks, anyway? Exact > > same problem will happen for blocking read() vs. another read() attempts > > on the same descriptor. With perfectly accepted lseek() (which will also > > have to block, as per 2.9.7). > > Yes, the problem here is because this particular file (/proc/xen/xenbus) > blocks the read() operation waiting for new events. Because of said > commit, now it also blocks write() operation used to send some request > (which would result in some response, so unblocking read() call). It > shouldn't be a normal file in the first place... Aha. OK, so you have something that looks a whole lot like a FIFO in that respect, and this semantics simply isn't compatible with read() being atomic wrt write(). So just have that flag explicitly knocked out in your ->open(), preferably with a comment explaining why is that done. Having lseek() is a red herring in that respect - the same problem would exist if that file *did* have something done on lseek(). That's actually what I'm objecting against - "uses nonseekable_open()" is used a weird proxy for "can't have read(), write(), etc. atomic wrt each other". It's not true in either direction - there's a lot of e.g. procfs files that are just fine with current exclusion and there can very well be files _not_ using nonseekable_open() that would break the same way and for the same reasons as /proc/xen/xenbus does. It's trivial to fix - either by explicit filp->f_mode &= ~FMODE_ATOMIC_POS; in xenbus_file_open(), or by adding static inline void no_atomic_pos(struct file *f) { f->f_mode &= ~FMODE_ATOMIC_POS; } somewhere in include/linux/fs.h and having it called in the same xenbus_file_open(). Either way, it ought to come with something along the lines of /* * we can't live with read() vs. write() atomicity, since we use * write() as source of events returned by read() and write() * called after another thread has blocked in read() waiting for * events cannot be required to wait for that read() to finish. */ next to this removal of FMODE_ATOMIC_POS, whichever way we express it... Objections? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/open.c b/fs/open.c index 33f9cbf..1a4f84b 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1135,7 +1135,7 @@ EXPORT_SYMBOL(generic_file_open); */ int nonseekable_open(struct inode *inode, struct file *filp) { - filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE); + filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS); return 0; }