Message ID | 773e0780-6641-ec85-5e78-d04e5a82d6b1@fb.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Oct 24, 2016 at 10:28:14AM -0600, Jens Axboe wrote: > How about the below? Bump the timeout to 5 min, 1 min is a little on the > short side, we want normal error handling to be out of the way before > that happens. And additionally, break out if we have been marked as > reaped/exited, so we avoid grabbing the stat mutex again. Yep, that works. I tried a test with just the second change: > + /* > + * If we took too long to shut down, the main thread could > + * already consider us reaped/exited. If that happens, break > + * out and clean up. > + */ > + if (td->runstate >= TD_EXITED) > + break; > + And that's sufficient to solve the problem. Increasing the timeout to 5 minute also would be a good idea, so we can let the worker threads exit cleanly so the reported stats will be completely accurate. Thanks for your help in figuring out this long-standing problem! - Ted -- To unsubscribe from this list: send the line "unsubscribe fstests" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/24/2016 08:54 PM, Theodore Ts'o wrote: > On Mon, Oct 24, 2016 at 10:28:14AM -0600, Jens Axboe wrote: > >> How about the below? Bump the timeout to 5 min, 1 min is a little on the >> short side, we want normal error handling to be out of the way before >> that happens. And additionally, break out if we have been marked as >> reaped/exited, so we avoid grabbing the stat mutex again. > > Yep, that works. I tried a test with just the second change: > >> + /* >> + * If we took too long to shut down, the main thread could >> + * already consider us reaped/exited. If that happens, break >> + * out and clean up. >> + */ >> + if (td->runstate >= TD_EXITED) >> + break; >> + > > And that's sufficient to solve the problem. Yes, it should be, so glad that it is! > Increasing the timeout to 5 minute also would be a good idea, so we > can let the worker threads exit cleanly so the reported stats will be > completely accurate. I made that separate change as well. If the job is stuck in the kernel for some sync operation, we could feasibly be uninterruptible for minutes. So 1 minutes is too short in any case, and I'd rather just make this check than sending kill signals since it won't fix the uninterruptible problem. > Thanks for your help in figuring out this long-standing problem! It was easy based on all your info, since I could not reproduce. So thanks for your help! Everything should be committed now, and I'll cut a new release tomorrow so we can hopefully put this behind us.
diff --git a/backend.c b/backend.c index 093b6a3a290e..f0927abfccb0 100644 --- a/backend.c +++ b/backend.c @@ -1723,6 +1723,14 @@ static void *thread_main(void *data) } } + /* + * If we took too long to shut down, the main thread could + * already consider us reaped/exited. If that happens, break + * out and clean up. + */ + if (td->runstate >= TD_EXITED) + break; + clear_state = 1; /* diff --git a/fio.h b/fio.h index 080842aef4f8..74c1b306af26 100644 --- a/fio.h +++ b/fio.h @@ -588,7 +588,7 @@ extern const char *runstate_to_name(int runstate); * Allow 60 seconds for a job to quit on its own, otherwise reap with * a vengeance. */ -#define FIO_REAP_TIMEOUT 60 +#define FIO_REAP_TIMEOUT 300 #define TERMINATE_ALL (-1U) extern void fio_terminate_threads(unsigned int);