Message ID: CAOg9mSQf4m-MM_kKd=UcGsCx_LZg=+nFESd8jhzDEbcM9DOGjg@mail.gmail.com (mailing list archive)
State: New, archived
On Fri, Jan 22, 2016 at 11:59:40AM -0500, Mike Marshall wrote:
> Hi Al...
>
> I moved a tiny bit of your work around so it would compile, but I booted a
> kernel from it, and ran some tests, it seems to work OK... doing this
> work from home makes me remember writing cobol programs on a
> silent 700... my co-workers are helping me look at
> wait_for_cancellation_downcall... we recently made some
> improvements there based on some problems we were
> having in production with our out-of-tree Frankenstein
> module... I'm glad you are also looking there.

BTW, what should happen to requests that got a buggered response in
orangefs_devreq_write_iter()? As it is, you just free them and to hell
with whatever might've been waiting on them; that's easy to fix, but
what about the submitter? Should we let it time out/simulate an error
properly returned by daemon/something else?

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
I think I see what you are saying... it seems like we have always just
freed them and to heck with whatever was waiting on them.

Maybe in all the places I detect something bad, set the return code and
goto out, perhaps I should also jam the bad return code into
op->downcall.status and goto wakeup instead of out?

Our infant fuzzer isn't really up to making test errors show up there,
I'd probably have to put some test weirdo stuff in the userspace part
that does the writev to make some test failures...

-Mike

On Fri, Jan 22, 2016 at 12:08 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 11:59:40AM -0500, Mike Marshall wrote:
>> Hi Al...
>>
>> I moved a tiny bit of your work around so it would compile, but I booted a
>> kernel from it, and ran some tests, it seems to work OK... doing this
>> work from home makes me remember writing cobol programs on a
>> silent 700... my co-workers are helping me look at
>> wait_for_cancellation_downcall... we recently made some
>> improvements there based on some problems we were
>> having in production with our out-of-tree Frankenstein
>> module... I'm glad you are also looking there.
>
> BTW, what should happen to requests that got a buggered response in
> orangefs_devreq_write_iter()? As it is, you just free them and to hell
> with whatever might've been waiting on them; that's easy to fix, but
> what about the submitter? Should we let it time out/simulate an error
> properly returned by daemon/something else?
On Fri, Jan 22, 2016 at 05:08:38PM +0000, Al Viro wrote:
> BTW, what should happen to requests that got a buggered response in
> orangefs_devreq_write_iter()? As it is, you just free them and to hell
> with whatever might've been waiting on them; that's easy to fix, but
> what about the submitter? Should we let it time out/simulate an error
> properly returned by daemon/something else?

FWIW, here's what I have in mind for lifetime rules:

1) op_alloc() creates them with refcount 1

2) when we decide to pass the sucker to daemon, we bump the refcount
before taking the thing out of request list. After successful copying
to userland, we check if the submitter has given up on it. If it hasn't,
we move the sucker to hash and decrement the refcount (it can't be the
last reference there). If it has, we do put_op(). After _failed_ copying
to userland, we also check if it's been given up. If it hasn't, we move
it back to request list and decrement the refcount. If it has, we do
put_op().

3) when we get a response from daemon, we bump the refcount before
taking the thing out of hash. Once we are done, we call put_op() - in
*all* cases. Malformed response is treated like a well-formed error.

4) if submitter decides to give up upon a request, it sets a flag in
->op_state. From that point on nobody else is going to touch op->list.

5) once submitter is done, it does put_op() - *NOT* op_release().
Normally it ends up with request freed, but if we'd raced with
interaction with daemon we'll get it freed by put_op() done in (2) or
(3). No

	/*
	 * tell the device file owner waiting on I/O that this read has
	 * completed and it can return now. in this exact case, on
	 * wakeup the daemon will free the op, so we *cannot* touch it
	 * after this.
	 */
	wake_up_daemon_for_return(new_op);
	new_op = NULL;

anymore - just wake it up and do put_op() regardless. In that case daemon
has already done get_op() and will free the sucker once it does its
put_op().

Objections, comments?
BTW, it really looks like we might be better off without atomic_t here -
op->lock is held for all non-trivial transitions and we might as well have
put_op() require op->lock held and do something like

	if (--op->ref_count) {
		spin_unlock(op->lock);
		return;
	}
	spin_unlock(op->lock);
	op_release(op);

Speaking of op->lock:

	/*
	 * let process sleep for a few seconds so shared
	 * memory system can be initialized.
	 */
	spin_lock_irqsave(&op->lock, irqflags);
	prepare_to_wait(&orangefs_bufmap_init_waitq,
			&wait_entry,
			TASK_INTERRUPTIBLE);
	spin_unlock_irqrestore(&op->lock, irqflags);

What do you need to block interrupts for?

Oh, and why do you call orangefs_op_initialize(orangefs_op) from
op_release(), when you immediately follow it with freeing orangefs_op
and do it on allocation anyway?
>> Objections, comments?

I have no objections so far to your suggestions, I'm trying to keep
my co-workers looking in on this discussion... most other days
we're all lined up in adjacent cubes...

Your attention and fixes will almost certainly all be improvements...
I found a few fixes you made to the old version of the writev code
that were off base, but that code made it look like impossible
situations needed to be handled...

I'll compile your repo again and run the quick tests when you think
your additions are stable...

-Mike

On Fri, Jan 22, 2016 at 12:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 05:08:38PM +0000, Al Viro wrote:
>
>> BTW, what should happen to requests that got a buggered response in
>> orangefs_devreq_write_iter()? As it is, you just free them and to hell
>> with whatever might've been waiting on them; that's easy to fix, but
>> what about the submitter? Should we let it time out/simulate an error
>> properly returned by daemon/something else?
>
> FWIW, here's what I have in mind for lifetime rules:
>
> 1) op_alloc() creates them with refcount 1
>
> 2) when we decide to pass the sucker to daemon, we bump the refcount
> before taking the thing out of request list. After successful copying
> to userland, we check if the submitter has given up on it. If it hasn't,
> we move the sucker to hash and decrement the refcount (it can't be the
> last reference there). If it has, we do put_op(). After _failed_ copying
> to userland, we also check if it's been given up. If it hasn't, we move
> it back to request list and decrement the refcount. If it has, we do
> put_op().
>
> 3) when we get a response from daemon, we bump the refcount before
> taking the thing out of hash. Once we are done, we call put_op() - in
> *all* cases. Malformed response is treated like a well-formed error.
>
> 4) if submitter decides to give up upon a request, it sets a flag in
> ->op_state. From that point on nobody else is going to touch op->list.
>
> 5) once submitter is done, it does put_op() - *NOT* op_release().
> Normally it ends up with request freed, but if we'd raced with
> interaction with daemon we'll get it freed by put_op() done in (2) or
> (3). No
>
> 	/*
> 	 * tell the device file owner waiting on I/O that this read has
> 	 * completed and it can return now. in this exact case, on
> 	 * wakeup the daemon will free the op, so we *cannot* touch it
> 	 * after this.
> 	 */
> 	wake_up_daemon_for_return(new_op);
> 	new_op = NULL;
>
> anymore - just wake it up and do put_op() regardless. In that case daemon
> has already done get_op() and will free the sucker once it does its
> put_op().
>
> Objections, comments?
>
> BTW, it really looks like we might be better off without atomic_t here -
> op->lock is held for all non-trivial transitions and we might as well have
> put_op() require op->lock held and do something like
>
> 	if (--op->ref_count) {
> 		spin_unlock(op->lock);
> 		return;
> 	}
> 	spin_unlock(op->lock);
> 	op_release(op);
>
> Speaking of op->lock:
>
> 	/*
> 	 * let process sleep for a few seconds so shared
> 	 * memory system can be initialized.
> 	 */
> 	spin_lock_irqsave(&op->lock, irqflags);
> 	prepare_to_wait(&orangefs_bufmap_init_waitq,
> 			&wait_entry,
> 			TASK_INTERRUPTIBLE);
> 	spin_unlock_irqrestore(&op->lock, irqflags);
>
> What do you need to block interrupts for?
>
> Oh, and why do you call orangefs_op_initialize(orangefs_op) from
> op_release(), when you immediately follow it with freeing orangefs_op
> and do it on allocation anyway?
On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
> >> Objections, comments?
>
> I have no objections so far to your suggestions, I'm trying to keep
> my co-workers looking in on this discussion... most other days
> we're all lined up in adjacent cubes...

email and IRC exist for purpose... BTW, another thing: if you've managed
to get to orangefs_exit() with anything still in request list or hash,
you are seriously fscked. Whatever had allocated them is presumably still
around and it's running in that module's .text...
Martin's the only other one subscribed to fs-devel, the
rest are looking in here today:

http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b

Walt thinks you deserve an Orangefs Contributor's Jacket...

-Mike

On Fri, Jan 22, 2016 at 1:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
>> >> Objections, comments?
>>
>> I have no objections so far to your suggestions, I'm trying to keep
>> my co-workers looking in on this discussion... most other days
>> we're all lined up in adjacent cubes...
>
> email and IRC exist for purpose... BTW, another thing: if you've managed
> to get to orangefs_exit() with anything still in request list or hash,
> you are seriously fscked. Whatever had allocated them is presumably still
> around and it's running in that module's .text...
Al...

I think that my commit f987f4c28
"don't trigger copy_attributes_to_inode from d_revalidate."
doesn't really do the job... I think I at least need to
write code that checks some of the attributes and
fails the revalidate if they got changed via some
external process... does that sound right?

-Mike

On Fri, Jan 22, 2016 at 2:07 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Martin's the only other one subscribed to fs-devel, the
> rest are looking in here today:
>
> http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b
>
> Walt thinks you deserve an Orangefs Contributor's Jacket...
>
> -Mike
>
> On Fri, Jan 22, 2016 at 1:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
>>> >> Objections, comments?
>>>
>>> I have no objections so far to your suggestions, I'm trying to keep
>>> my co-workers looking in on this discussion... most other days
>>> we're all lined up in adjacent cubes...
>>
>> email and IRC exist for purpose... BTW, another thing: if you've managed
>> to get to orangefs_exit() with anything still in request list or hash,
>> you are seriously fscked. Whatever had allocated them is presumably still
>> around and it's running in that module's .text...
On Fri, Jan 22, 2016 at 01:17:40PM -0500, Mike Marshall wrote:
> >> Objections, comments?
>
> I have no objections so far to your suggestions, I'm trying to keep
> my co-workers looking in on this discussion... most other days
> we're all lined up in adjacent cubes...

FWIW, I do see a problem there. This "submitter has given up" flag would
need to be cleared on retries ;-/

When can we end up retrying? We'd need
	* wait_for_cancellation_downcall/wait_for_matching_downcall finished
	* op_state_serviced(op) false
	* op->downcall.status set to -EAGAIN

wait_for_matching_downcall() either returns with op_state_serviced() or
returns one of -ETIMEDOUT, -EINTR, -EAGAIN or -EIO, with its return value
ending up in op->downcall.status. So if it was wait_for_matching_downcall(),
we'd have to have returned -EAGAIN. Which can happen only with
op_state_purged(op), i.e. with the character device having been closed
after the sucker had been placed on the lists...

wait_for_cancellation_downcall() is similar, but it never returns -EAGAIN.
IOW, for it retries are irrelevant.

Damnit, what's the point of picking a purged request from the request
list in orangefs_devreq_read()? Ditto for orangefs_devreq_write_iter()
and the request hash... Note that purge *can't* happen during either of
those - ->release() can't be called while something is in ->read() or
->write_iter(). So if we would ignore purged requests in those, the
problem with retries clearing the "I gave up" flag wouldn't exist...
It's a race anyway - the submitter had been woken up when we'd been
marking the request as purged, and it's going to remove the sucker from
whatever list it's on very soon after getting woken up.

Do you see any problems with skipping such request list/hash elements in
orangefs_devreq_read()/orangefs_devreq_write_iter() resp.?
On Fri, Jan 22, 2016 at 02:07:23PM -0500, Mike Marshall wrote:
> Martin's the only other one subscribed to fs-devel, the
> rest are looking in here today:
>
> http://marc.info/?l=linux-fsdevel&w=2&r=1&s=orangefs&q=b

Umm... Why not Cc to your maillist
(pvfs2-developers@beowulf-underground.org?) if it's unmoderated?
On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
> Al...
>
> I think that my commit f987f4c28
> "don't trigger copy_attributes_to_inode from d_revalidate."
> doesn't really do the job... I think I at least need to
> write code that checks some of the attributes and
> fails the revalidate if they got changed via some
> external process... does that sound right?

Won't your orangefs_revalidate_lookup() spot the changed handle? Anyway,
some sanity checks in there might be a good idea - at least "has the
object type somehow changed without handle going stale?"...

Said that, what should be picking e.g. chmod/chown by external source?
You don't have ->permission() instances in there, so inode->i_mode is
used directly by generic_inode_permission()...
The userspace daemon (client-core) that reads/writes to the device
restarts automatically if it stops for some reason... I believe active
ops are marked "purged" when this happens, and when client-core
restarts "purged" ops are retried (once)... see the comment
in waitqueue.c "if the operation was purged in the meantime..."

I've tried to rattle Walt and Becky's chains to see if they
can describe it better...

-Mike

On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>> Al...
>>
>> I think that my commit f987f4c28
>> "don't trigger copy_attributes_to_inode from d_revalidate."
>> doesn't really do the job... I think I at least need to
>> write code that checks some of the attributes and
>> fails the revalidate if they got changed via some
>> external process... does that sound right?
>
> Won't your orangefs_revalidate_lookup() spot the changed handle? Anyway,
> some sanity checks in there might be a good idea - at least "has the
> object type somehow changed without handle going stale?"...
>
> Said that, what should be picking e.g. chmod/chown by external source?
> You don't have ->permission() instances in there, so inode->i_mode is
> used directly by generic_inode_permission()...
Nope... I guess that's why the original developer called
copy_attributes_to_inode every time... I need to at least check
permissions... if a file's permissions change via an external source,
say from 755 to 700, the other guy can still cat the file once...

-Mike

On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>> Al...
>>
>> I think that my commit f987f4c28
>> "don't trigger copy_attributes_to_inode from d_revalidate."
>> doesn't really do the job... I think I at least need to
>> write code that checks some of the attributes and
>> fails the revalidate if they got changed via some
>> external process... does that sound right?
>
> Won't your orangefs_revalidate_lookup() spot the changed handle? Anyway,
> some sanity checks in there might be a good idea - at least "has the
> object type somehow changed without handle going stale?"...
>
> Said that, what should be picking e.g. chmod/chown by external source?
> You don't have ->permission() instances in there, so inode->i_mode is
> used directly by generic_inode_permission()...
I built and tested orangefs-untested... Orangefs still seems to work
OK... I built it as a module and didn't trigger the BUG_ON when I
unloaded it.

-Mike

On Fri, Jan 22, 2016 at 3:51 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Nope... I guess that's why the original developer called
> copy_attributes_to_inode every time... I need to at least check
> permissions... if a file's permissions change via an external source,
> say from 755 to 700, the other guy can still cat the file once...
>
> -Mike
>
> On Fri, Jan 22, 2016 at 3:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 02:21:44PM -0500, Mike Marshall wrote:
>>> Al...
>>>
>>> I think that my commit f987f4c28
>>> "don't trigger copy_attributes_to_inode from d_revalidate."
>>> doesn't really do the job... I think I at least need to
>>> write code that checks some of the attributes and
>>> fails the revalidate if they got changed via some
>>> external process... does that sound right?
>>
>> Won't your orangefs_revalidate_lookup() spot the changed handle? Anyway,
>> some sanity checks in there might be a good idea - at least "has the
>> object type somehow changed without handle going stale?"...
>>
>> Said that, what should be picking e.g. chmod/chown by external source?
>> You don't have ->permission() instances in there, so inode->i_mode is
>> used directly by generic_inode_permission()...
On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
> The userspace daemon (client-core) that reads/writes to the device
> restarts automatically if it stops for some reason... I believe active
> ops are marked "purged" when this happens, and when client-core
> restarts "purged" ops are retried (once)... see the comment
> in waitqueue.c "if the operation was purged in the meantime..."
>
> I've tried to rattle Walt and Becky's chains to see if they
> can describe it better...

What I mean is the following sequence:

Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
Daemon:  exits, marks op purged, wakes Syscall up
Daemon:  gets restarted
Daemon:  calls read(), finds op still on the list
Syscall: finally gets the timeslice, removes op from the list, decides to
         resubmit

This is very hard to hit - normally by the time we get around to read()
from the restarted daemon the waiter has already been woken up and has
already removed the purged op from the list. So in practice you probably
have never hit that case. However, it is theoretically possible.

What I propose to do is to have purged requests that are still in the
lists be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
IOW, pretend that the race had been won by whatever had been waiting on
that request and got woken up when it had been purged.

Note that by the time it gets resubmitted, it already has the 'purged'
flag removed - set_op_state_waiting(op) is done when we are inserting
into the request list and it leaves no trace of OP_VFS_STATE_PURGED. So
I'm not talking about the resubmitted stuff; just the one that had been
in the queue since before the daemon restart and hadn't been removed
from there yet.
On Sat, Jan 23, 2016 at 12:12:02AM +0000, Al Viro wrote:
> On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
> > The userspace daemon (client-core) that reads/writes to the device
> > restarts automatically if it stops for some reason... I believe active
> > ops are marked "purged" when this happens, and when client-core
> > restarts "purged" ops are retried (once)... see the comment
> > in waitqueue.c "if the operation was purged in the meantime..."
> >
> > I've tried to rattle Walt and Becky's chains to see if they
> > can describe it better...
>
> What I mean is the following sequence:
>
> Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
> Daemon:  exits, marks op purged, wakes Syscall up
> Daemon:  gets restarted
> Daemon:  calls read(), finds op still on the list
> Syscall: finally gets the timeslice, removes op from the list, decides to
>          resubmit
>
> This is very hard to hit - normally by the time we get around to read()
> from the restarted daemon the waiter has already been woken up and has
> already removed the purged op from the list. So in practice you probably
> have never hit that case. However, it is theoretically possible.
>
> What I propose to do is to have purged requests that are still in the
> lists be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
> IOW, pretend that the race had been won by whatever had been waiting on
> that request and got woken up when it had been purged.
>
> Note that by the time it gets resubmitted, it already has the 'purged'
> flag removed - set_op_state_waiting(op) is done when we are inserting
> into the request list and it leaves no trace of OP_VFS_STATE_PURGED. So
> I'm not talking about the resubmitted stuff; just the one that had been
> in the queue since before the daemon restart and hadn't been removed
> from there yet.

OK, aforementioned locking/refcounting scheme implemented (and completely
untested). See #orangefs-untested.

Rules:

* refcounting is real refcounting - objects are created with refcount 1,
get_op() increments the refcount, op_release() decrements and frees when
it hits zero.

* daemon interaction (read/write_iter) is refcount-neutral - it grabs
a reference when picking a request from the list/hash and always drops
it in the end. Submitters always release the reference acquired when
allocating the request.

* when a submitter decides to give up on a request (for any reason -
timeout, signal, daemon disconnect) it marks it with a new bit -
OP_VFS_STATE_GIVEN_UP. Once that is done, nobody else is allowed to
touch its ->list.

* a request is inserted into the hash only when we'd succeeded copying
to the daemon's memory (and only if the request hadn't been given up
while we'd been copying it). If copying fails, we only have to refile it
to the request list (and only if it hadn't been given up).

* when copying a response from the daemon, a request is only marked
serviced if it hadn't been given up while we were parsing the response.
Malformed responses, failed copying of a response, etc. are treated as a
well-formed error response would be. The error value is the one we'll be
returning to the daemon - might or might not be right from the error
recovery standpoint.

* the hash lock nests *outside* of op->lock now and all work with the
hash is protected by it.

* daemon interaction skips the requests that had been around since before
the control device had been opened, acting as if the (already woken up)
submitters had already gotten around to removing those from the list/hash.
That's the normal outcome of such a race anyway, and it simplifies the
analysis.

Again, it's completely untested; might oops, break stuff, etc. I _think_
it makes sense from the lifetime rules and locking POV, but...

I still would like to understand what's going on in
wait_for_cancellation_request() - it *probably* shouldn't matter for the
correctness of that massage, but there might be dragons. I don't
understand in which situation wrt pending signals it is supposed to be
called and the comments in there contradict the actual code.
Well... that all seems awesome, and compiled the first time and all my
quick tests on my dinky vm make it seem fine... It is Becky that
recently spent a bunch of time fighting the cancellation dragons, I'll
see if I can't get her to weigh in on wait_for_cancellation_downcall
tomorrow.

We have some gnarly tests we were running on real hardware that helped
reproduce the problems she was seeing in production with Clemson's
Palmetto Cluster, I'll run them, but maybe not until Monday with the
ice storm...

Thanks Al...

-Mike

On Fri, Jan 22, 2016 at 8:28 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 23, 2016 at 12:12:02AM +0000, Al Viro wrote:
>> On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote:
>> > The userspace daemon (client-core) that reads/writes to the device
>> > restarts automatically if it stops for some reason... I believe active
>> > ops are marked "purged" when this happens, and when client-core
>> > restarts "purged" ops are retried (once)... see the comment
>> > in waitqueue.c "if the operation was purged in the meantime..."
>> >
>> > I've tried to rattle Walt and Becky's chains to see if they
>> > can describe it better...
>>
>> What I mean is the following sequence:
>>
>> Syscall: puts op into request list, sleeps in wait_for_matching_downcall()
>> Daemon:  exits, marks op purged, wakes Syscall up
>> Daemon:  gets restarted
>> Daemon:  calls read(), finds op still on the list
>> Syscall: finally gets the timeslice, removes op from the list, decides to
>>          resubmit
>>
>> This is very hard to hit - normally by the time we get around to read()
>> from the restarted daemon the waiter has already been woken up and has
>> already removed the purged op from the list. So in practice you probably
>> have never hit that case. However, it is theoretically possible.
>>
>> What I propose to do is to have purged requests that are still in the
>> lists be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op().
>> IOW, pretend that the race had been won by whatever had been waiting on
>> that request and got woken up when it had been purged.
>>
>> Note that by the time it gets resubmitted, it already has the 'purged'
>> flag removed - set_op_state_waiting(op) is done when we are inserting
>> into the request list and it leaves no trace of OP_VFS_STATE_PURGED. So
>> I'm not talking about the resubmitted stuff; just the one that had been
>> in the queue since before the daemon restart and hadn't been removed
>> from there yet.
>
> OK, aforementioned locking/refcounting scheme implemented (and completely
> untested). See #orangefs-untested.
>
> Rules:
>
> * refcounting is real refcounting - objects are created with refcount 1,
> get_op() increments the refcount, op_release() decrements and frees when
> it hits zero.
>
> * daemon interaction (read/write_iter) is refcount-neutral - it grabs
> a reference when picking a request from the list/hash and always drops
> it in the end. Submitters always release the reference acquired when
> allocating the request.
>
> * when a submitter decides to give up on a request (for any reason -
> timeout, signal, daemon disconnect) it marks it with a new bit -
> OP_VFS_STATE_GIVEN_UP. Once that is done, nobody else is allowed to
> touch its ->list.
>
> * a request is inserted into the hash only when we'd succeeded copying
> to the daemon's memory (and only if the request hadn't been given up
> while we'd been copying it). If copying fails, we only have to refile it
> to the request list (and only if it hadn't been given up).
>
> * when copying a response from the daemon, a request is only marked
> serviced if it hadn't been given up while we were parsing the response.
> Malformed responses, failed copying of a response, etc. are treated as a
> well-formed error response would be. The error value is the one we'll be
> returning to the daemon - might or might not be right from the error
> recovery standpoint.
>
> * the hash lock nests *outside* of op->lock now and all work with the
> hash is protected by it.
>
> * daemon interaction skips the requests that had been around since before
> the control device had been opened, acting as if the (already woken up)
> submitters had already gotten around to removing those from the list/hash.
> That's the normal outcome of such a race anyway, and it simplifies the
> analysis.
>
> Again, it's completely untested; might oops, break stuff, etc. I _think_
> it makes sense from the lifetime rules and locking POV, but...
>
> I still would like to understand what's going on in
> wait_for_cancellation_request() - it *probably* shouldn't matter for the
> correctness of that massage, but there might be dragons. I don't
> understand in which situation wrt pending signals it is supposed to be
> called and the comments in there contradict the actual code.
On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
> Well... that all seems awesome, and compiled the first
> time and all my quick tests on my dinky vm make
> it seem fine... It is Becky that recently spent a
> bunch of time fighting the cancellation dragons,
> I'll see if I can't get her to weigh in on
> wait_for_cancellation_downcall tomorrow.
>
> We have some gnarly tests we were running on
> real hardware that helped reproduce the problems
> she was seeing in production with Clemson's
> Palmetto Cluster, I'll run them, but maybe not
> until Monday with the ice storm...

OK, several more pushed. The most interesting part is probably the switch
to real completions - you'd been open-coding them for no good reason (and
as always with reinventing locking primitives, asking for trouble).

New bits just as untested as the earlier ones, of course...
OK, I'll get them momentarily...

I merged your other patches, and there was a merge
conflict I had to work around... you're working from
an orangefs tree that lacks one commit I had made
last week... my linux-next tree has all your patches
through yesterday in it now...

I am setting up "the gnarly test" (at home from a VM,
though) that should cause a bunch of cancellations,
I want to see if I can get
wait_for_cancellation_downcall to ever
flow past that "if (signal_pending(current)) {"
block... if it does, that demonstrates where
the comments conflict with the code, right?

-Mike

On Sat, Jan 23, 2016 at 2:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
>> Well... that all seems awesome, and compiled the first
>> time and all my quick tests on my dinky vm make
>> it seem fine... It is Becky that recently spent a
>> bunch of time fighting the cancellation dragons,
>> I'll see if I can't get her to weigh in on
>> wait_for_cancellation_downcall tomorrow.
>>
>> We have some gnarly tests we were running on
>> real hardware that helped reproduce the problems
>> she was seeing in production with Clemson's
>> Palmetto Cluster, I'll run them, but maybe not
>> until Monday with the ice storm...
>
> OK, several more pushed.  The most interesting part is probably switch
> to real completions - you'd been open-coding them for no good reason
> (and as always with reinventing locking primitives, asking for trouble).
>
> New bits just as untested as the earlier ones, of course...
I compiled and tested the new patches,
they seem to work more than great, unless
it is just my imagination that the kernel
module is much faster now. I'll measure it
with more than seat-of-the-pants to see
for sure. The patches are pushed to
the for-next branch.

My "gnarly test" can get the code to flow
into wait_for_cancellation_downcall, but
never would flow past the
"if (signal_pending(current)) {" block,
though that doesn't prove anything...

I had to look at the wiki page for "cargo culting" <g>...
When Becky was working on the cancellation
problem I alluded to earlier, we talked about and
suspected the spin_lock_irqsaves in
service_operation were not appropriate...

Thanks again Al...

-Mike

On Sat, Jan 23, 2016 at 2:24 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> OK, I'll get them momentarily...
>
> I merged your other patches, and there was a merge
> conflict I had to work around... you're working from
> an orangefs tree that lacks one commit I had made
> last week... my linux-next tree has all your patches
> through yesterday in it now...
>
> I am setting up "the gnarly test" (at home from a VM,
> though) that should cause a bunch of cancellations,
> I want to see if I can get
> wait_for_cancellation_downcall to ever
> flow past that "if (signal_pending(current)) {"
> block... if it does, that demonstrates where
> the comments conflict with the code, right?
>
> -Mike
>
> On Sat, Jan 23, 2016 at 2:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jan 22, 2016 at 09:54:48PM -0500, Mike Marshall wrote:
>>> Well... that all seems awesome, and compiled the first
>>> time and all my quick tests on my dinky vm make
>>> it seem fine... It is Becky that recently spent a
>>> bunch of time fighting the cancellation dragons,
>>> I'll see if I can't get her to weigh in on
>>> wait_for_cancellation_downcall tomorrow.
>>>
>>> We have some gnarly tests we were running on
>>> real hardware that helped reproduce the problems
>>> she was seeing in production with Clemson's
>>> Palmetto Cluster, I'll run them, but maybe not
>>> until Monday with the ice storm...
>>
>> OK, several more pushed.  The most interesting part is probably switch
>> to real completions - you'd been open-coding them for no good reason
>> (and as always with reinventing locking primitives, asking for trouble).
>>
>> New bits just as untested as the earlier ones, of course...
On Sat, Jan 23, 2016 at 04:35:32PM -0500, Mike Marshall wrote:
> I compiled and tested the new patches,
> they seem to work more than great, unless
> it is just my imagination that the kernel
> module is much faster now. I'll measure it
> with more than seat-of-the-pants to see
> for sure. The patches are pushed to
> the for-next branch.

As long as the speedup is not due to not actually doing the IO... ;-)

> My "gnarly test" can get the code to flow
> into wait_for_cancellation_downcall, but
> never would flow past the
> "if (signal_pending(current)) {" block,
> though that doesn't prove anything...

AFAICS, it is possible to get there without a signal - you need the first
attempt of file_read / file_write to fail on disconnect, be retried and
that retry to fail on timeout.  _Then_ you get to sending a cancel without
any signals ever being sent to you.

Said that, I'm not sure that this "we don't wait at all if the signals
are pending" is right - consider e.g. a read getting killed after it had
been picked by daemon and before the daemon gets a chance to reply.  We
submit a cancel, but unless daemon picks it immediately, we don't even
bother waiting - we see that cancel hadn't been serviced yet, that a
signal is pending and we remove the cancel from request list.  In that
scenario daemon doesn't get a chance to see the cancel request at all...

In which situations do we want cancel to be given to daemon and do we
need to wait for its response?

> I had to look at the wiki page for "cargo culting" <g>...
> When Becky was working on the cancellation
> problem I alluded to earlier, we talked about and
> suspected the spin_lock_irqsaves in
> service_operation were not appropriate...

It isn't - first of all, prepare_to_wait() does spin_lock_irqsave()
itself (on the queue lock), so interrupt-disabling bits are there anyway.
And the exclusion on op->lock is completely pointless, obviously - you
are waiting for bufmap initialization, which has nothing to do with
specific op.  Note that by that point op is not on any lists, so there's
nobody who could see it, let alone grab the same spinlock...
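Al's point about prepare_to_wait() can be made concrete with the canonical wait-loop pattern, shown here as kernel-style pseudocode (the waitqueue and condition names below are invented for illustration, not OrangeFS's actual identifiers). prepare_to_wait() already takes the waitqueue's own lock with spin_lock_irqsave() internally while setting the task state, so wrapping this loop in a caller-side spin_lock_irqsave() on op->lock disables interrupts for no additional protection:

```c
/* Illustrative sketch only -- names are placeholders. */
DEFINE_WAIT(wait);

for (;;) {
	/* prepare_to_wait() internally does spin_lock_irqsave() on the
	 * waitqueue lock and sets TASK_INTERRUPTIBLE, closing the window
	 * between the condition check and going to sleep. */
	prepare_to_wait(&bufmap_init_waitq, &wait, TASK_INTERRUPTIBLE);
	if (bufmap_is_initialized())	/* invented condition helper */
		break;
	if (signal_pending(current))
		break;
	schedule();			/* actually sleep */
}
finish_wait(&bufmap_init_waitq, &wait);
```

Because nothing here touches op itself, and the op is on no list yet, there is no state for op->lock to protect at this point.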
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h index b926fe77..88e606a 100644 --- a/fs/orangefs/orangefs-kernel.h +++ b/fs/orangefs/orangefs-kernel.h @@ -103,24 +103,6 @@ enum orangefs_vfs_op_states { OP_VFS_STATE_PURGED = 8, }; -#define set_op_state_waiting(op) ((op)->op_state = OP_VFS_STATE_WAITING) -#define set_op_state_inprogress(op) ((op)->op_state = OP_VFS_STATE_INPROGR) -static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op) -{ - op->op_state = OP_VFS_STATE_SERVICED; - wake_up_interruptible(&op->waitq); -} -static inline void set_op_state_purged(struct orangefs_kernel_op_s *op) -{ - op->op_state |= OP_VFS_STATE_PURGED; - wake_up_interruptible(&op->waitq); -} - -#define op_state_waiting(op) ((op)->op_state & OP_VFS_STATE_WAITING) -#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR) -#define op_state_serviced(op) ((op)->op_state & OP_VFS_STATE_SERVICED) -#define op_state_purged(op) ((op)->op_state & OP_VFS_STATE_PURGED) - #define get_op(op) \ do { \ atomic_inc(&(op)->ref_count); \ @@ -259,6 +241,25 @@ struct orangefs_kernel_op_s { struct list_head list; }; +#define set_op_state_waiting(op) ((op)->op_state = OP_VFS_STATE_WAITING) +#define set_op_state_inprogress(op) ((op)->op_state = OP_VFS_STATE_INPROGR) +static inline void set_op_state_serviced(struct orangefs_kernel_op_s *op) +{ + op->op_state = OP_VFS_STATE_SERVICED; + wake_up_interruptible(&op->waitq); +} +static inline void set_op_state_purged(struct orangefs_kernel_op_s *op) +{ + op->op_state |= OP_VFS_STATE_PURGED; + wake_up_interruptible(&op->waitq); +} + +#define op_state_waiting(op) ((op)->op_state & OP_VFS_STATE_WAITING) +#define op_state_in_progress(op) ((op)->op_state & OP_VFS_STATE_INPROGR) +#define op_state_serviced(op) ((op)->op_state & OP_VFS_STATE_SERVICED) +#define op_state_purged(op) ((op)->op_state & OP_VFS_STATE_PURGED) + + /* per inode private orangefs info */ struct orangefs_inode_s { struct orangefs_object_kref refn; 
diff --git a/fs/orangefs/waitqueue.c b/fs/orangefs/waitqueue.c index 641de05..a257891 100644 --- a/fs/orangefs/waitqueue.c +++ b/fs/orangefs/waitqueue.c @@ -16,6 +16,9 @@ #include "orangefs-kernel.h" #include "orangefs-bufmap.h" +static int wait_for_cancellation_downcall(struct orangefs_kernel_op_s *); +static int wait_for_matching_downcall(struct orangefs_kernel_op_s *); + /* * What we do in this function is to walk the list of operations that are * present in the request queue and mark them as purged.