Test generic/299 stalling forever

Message ID

773e0780-6641-ec85-5e78-d04e5a82d6b1@fb.com (mailing list archive)

State

New, archived

Headers

Subject: Re: Test generic/299 stalling forever
To: "Theodore Ts'o" <tytso@mit.edu>
References: <20161013231923.j2fidfbtzdp66x3t@thunk.org>
	<20161018180107.fscbfm66yidwhey4@thunk.org>
	<7856791a-0795-9183-6057-6ce8fd0e3d58@fb.com>
	<30fef8cd-67cc-da49-77d9-9d1a833f8a48@fb.com>
	<20161019203233.mbbmskpn5ekgl7og@thunk.org>
	<1fb60e7c-a558-80df-09da-d3c36863a461@fb.com>
	<20161021221551.sdv4hgw33zjxnkvu@thunk.org>
	<53fe5a98-6ff9-4fa1-e84c-8a3e16cc0f50@fb.com>
	<20161023193320.rlzlaxdi4vbyu7of@thunk.org>
	<20161023212408.cjqmnzw3547ujzil@thunk.org>
	<20161024033852.quinlee4a24mb2e2@thunk.org>
CC: Dave Chinner <david@fromorbit.com>, <linux-ext4@vger.kernel.org>,
	<fstests@vger.kernel.org>, <tarasov@vasily.name>
From: Jens Axboe <axboe@fb.com>
Message-ID: <773e0780-6641-ec85-5e78-d04e5a82d6b1@fb.com>
Date: Mon, 24 Oct 2016 10:28:14 -0600
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
	Thunderbird/45.3.0
MIME-Version: 1.0
In-Reply-To: <20161024033852.quinlee4a24mb2e2@thunk.org>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Received-SPF: None (protection.outlook.com: fb.com does not designate
	permitted sender hosts)
SpamDiagnosticOutput: 1:99
SpamDiagnosticMetadata: NSPM
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Oct 2016 16:28:18.6710
	(UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-Transport-CrossTenantHeadersStamped: BN6PR15MB1187
X-OriginatorOrg: fb.com
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, ,
	definitions=2016-10-24_10:, , signatures=0
Sender: fstests-owner@vger.kernel.org
Precedence: bulk
List-ID: <fstests.vger.kernel.org>
X-Mailing-List: fstests@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Commit Message

Jens Axboe Oct. 24, 2016, 4:28 p.m. UTC

On 10/23/2016 09:38 PM, Theodore Ts'o wrote:
> I enabled some more debugging and it's become more clear what's going
> on.   (See attached for the full log).
>
> The main issue seems to be that once one of the fio processes is done, it kills off
> the other threads (actually, we're using processes):
>
> process  31848 terminate group_id=0
> process  31848 setting terminate on direct_aio/31846
> process  31848 setting terminate on direct_aio/31848
> process  31848 setting terminate on direct_aio/31849
> process  31848 setting terminate on direct_aio/31851
> process  31848 setting terminate on aio-dio-verifier/31852
> process  31848 setting terminate on buffered-aio-verifier/31854
> process  31851 pid=31851: runstate RUNNING -> FINISHING
> process  31851 terminate group_id=0
> process  31851 setting terminate on direct_aio/31846
> process  31851 setting terminate on direct_aio/31848
> process  31851 setting terminate on direct_aio/31849
> process  31851 setting terminate on direct_aio/31851
> process  31851 setting terminate on aio-dio-verifier/31852
> process  31851 setting terminate on buffered-aio-verifier/31854
> process  31852 pid=31852: runstate RUNNING -> FINISHING
> process  31846 pid=31846: runstate RUNNING -> FINISHING
>     ...
>
> but one or more of the threads doesn't exit within 60 seconds:
>
> fio: job 'direct_aio' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job.
> process  31794 pid=31849: runstate RUNNING -> REAPED
> fio: job 'buffered-aio-verifier' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job.
> process  31794 pid=31854: runstate RUNNING -> REAPED
> process  31794 terminate group_id=-1
>
> The main thread then prints all of the statistics, and calls stat_exit():
>
> stat_exit called by tid: 31794       <---- debugging message which prints gettid()
>
> Unfortunately, these processes aren't actually killed; they are
> marked as reaped, but they are still in the process listing:
>
> root@xfstests:~# ps augxww | grep fio
> root      1585  0.0  0.0      0     0 ?        S<   18:45   0:00 [dm_bufio_cache]
> root      7191  0.0  0.0  12732  2200 pts/1    S+   23:05   0:00 grep fio
> root     31849  1.5  0.2 407208 18876 ?        Ss   22:36   0:26 /root/xfstests/bin/fio /tmp/31503.fio
> root     31854  1.2  0.1 398480 10240 ?        Ssl  22:36   0:22 /root/xfstests/bin/fio /tmp/31503.fio
>
> And if you attach to them with gdb, they are spinning trying to grab
> the stat_mutex(), which they can't get because the main thread has
> already called stat_exit() and then has exited.  So these two threads
> did eventually return, but some time after 60 seconds had passed, and
> then they hung waiting for stat_mutex(), which they will never get
> because the main thread has already called stat_exit().
>
> This probably also explains why you had trouble reproducing it.  It
> requires a disk whose performance is variable enough that under heavy
> load, it might take more than 60 seconds for the direct_aio or
> buffered-aio-verifier thread to close itself out.

Good catch! Yes, that could certainly explain why we are stuck on that 
stat_mutex and why the main thread just gave up on it and ended up in 
stat_exit() with a thread (or more) still running.
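
To make the sequence Ted describes concrete, here is a minimal sketch in C. It is a toy model, not fio code: `demo_runstate`, `demo_job`, `demo_outcome`, and `demo_run` are hypothetical names, time is reduced to plain integers, and the stat lock is reduced to a boolean. It only shows how a worker that needs longer than the reap timeout ends up waiting on a lock the main thread has already torn down.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified run states; only their ordering matters here. */
enum demo_runstate { DEMO_RUNNING, DEMO_FINISHING, DEMO_EXITED, DEMO_REAPED };

struct demo_job {
	enum demo_runstate runstate;
	int shutdown_secs;	/* time the worker needs to wind down */
};

struct demo_outcome {
	bool force_reaped;	/* main thread gave up waiting */
	bool worker_stuck;	/* worker later waits on a dead lock */
};

/* Model one terminate/reap cycle for a single job (assumed behaviour). */
struct demo_outcome demo_run(struct demo_job *job, int reap_timeout_secs)
{
	struct demo_outcome out = { false, false };
	bool stat_lock_torn_down = false;

	assert(job && reap_timeout_secs >= 0);

	/* The job has been told to terminate and starts winding down. */
	job->runstate = DEMO_FINISHING;

	/* Worker finished within the timeout: normal, clean exit. */
	if (job->shutdown_secs <= reap_timeout_secs) {
		job->runstate = DEMO_EXITED;
		return out;
	}

	/* Timeout expired: main thread marks the job reaped and moves on. */
	job->runstate = DEMO_REAPED;
	out.force_reaped = true;

	/* Main thread prints stats, tears down the stat lock, and exits. */
	stat_lock_torn_down = true;

	/* The worker finishes later and tries to take that lock anyway. */
	out.worker_stuck = stat_lock_torn_down;
	return out;
}
```

Under these assumptions, a worker that needs 90 seconds is force-reaped and left stuck with a 60-second timeout, while the same worker exits normally with a 300-second timeout.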

> And I suspect once the main thread exited, it probably also closed out
> the debugging channel so the deadlock detector did probably trip, but
> somehow we just didn't see the output.
>
> So I can imagine some possible fixes.  We could make the thread
> timeout configurable, and/or increase it from 60 seconds to something like
> 300 seconds.  We could make stat_exit() a no-op --- after all, if the
> main thread is exiting, there's no real point to down and then destroy
> the stat_mutex.  And/or we could change the forced reap to send a kill
> -9 to the thread, instead of just marking it as reaped.

We have to clean up - for normal runs, it's not a big deal, but if fio
is run as a client/server setup, the backend will persist across runs.
If we leak, then that could be a concern.

How about the below? Bump the timeout to 5 min, 1 min is a little on the
short side, we want normal error handling to be out of the way before
that happens. And additionally, break out if we have been marked as
reaped/exited, so we avoid grabbing the stat mutex again.
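
As a rough illustration of the second part of that proposal, the sketch below models a worker loop that checks its run state before going on to work that would take the stat lock. All names (`demo_thread`, `demo_worker_loop`, the `check_runstate` flag, `reap_on_pass`) are hypothetical and exist only to compare behaviour with and without the check; the real `thread_main()` loop does far more than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified run states; REAPED sorts after EXITED. */
enum demo_runstate { DEMO_RUNNING, DEMO_FINISHING, DEMO_EXITED, DEMO_REAPED };

struct demo_thread {
	enum demo_runstate runstate;
	int stat_lock_attempts;	/* times the worker went for the stat lock */
};

/*
 * Run up to 'passes' loop iterations. On iteration 'reap_on_pass' the
 * main thread is assumed to have given up and marked us reaped while we
 * were busy with I/O. Returns the number of completed iterations.
 */
int demo_worker_loop(struct demo_thread *td, int passes, int reap_on_pass,
		     bool check_runstate)
{
	int pass;

	assert(td);

	for (pass = 0; pass < passes; pass++) {
		/* Stand-in for the I/O phase of one loop iteration. */
		if (pass == reap_on_pass)
			td->runstate = DEMO_REAPED;

		/* The proposed check: stop if already reaped/exited. */
		if (check_runstate && td->runstate >= DEMO_EXITED)
			break;

		/* Stand-in for per-iteration work that needs the stat lock. */
		td->stat_lock_attempts++;
	}

	return pass;
}
```

With the check enabled and the job reaped on the third pass, the loop stops after two completed passes and never goes for the lock again; with the check disabled it keeps going for all passes, which is where the real worker would hang.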

Comments

Theodore Ts'o Oct. 25, 2016, 2:54 a.m. UTC | #1

On Mon, Oct 24, 2016 at 10:28:14AM -0600, Jens Axboe wrote:

> How about the below? Bump the timeout to 5 min, 1 min is a little on the
> short side, we want normal error handling to be out of the way before
> that happens. And additionally, break out if we have been marked as
> reaped/exited, so we avoid grabbing the stat mutex again.

Yep, that works.  I tried a test with just the second change:

> +		/*
> +		 * If we took too long to shut down, the main thread could
> +		 * already consider us reaped/exited. If that happens, break
> +		 * out and clean up.
> +		 */
> +		if (td->runstate >= TD_EXITED)
> +			break;
> +

And that's sufficient to solve the problem.

Increasing the timeout to 5 minutes also would be a good idea, so we
can let the worker threads exit cleanly so the reported stats will be
completely accurate.

Thanks for your help in figuring out this long-standing problem!

       	   	     		     	  		- Ted

Jens Axboe Oct. 25, 2016, 2:59 a.m. UTC | #2

On 10/24/2016 08:54 PM, Theodore Ts'o wrote:
> On Mon, Oct 24, 2016 at 10:28:14AM -0600, Jens Axboe wrote:
>
>> How about the below? Bump the timeout to 5 min, 1 min is a little on the
>> short side, we want normal error handling to be out of the way before
>> that happens. And additionally, break out if we have been marked as
>> reaped/exited, so we avoid grabbing the stat mutex again.
>
> Yep, that works.  I tried a test with just the second change:
>
>> +		/*
>> +		 * If we took too long to shut down, the main thread could
>> +		 * already consider us reaped/exited. If that happens, break
>> +		 * out and clean up.
>> +		 */
>> +		if (td->runstate >= TD_EXITED)
>> +			break;
>> +
>
> And that's sufficient to solve the problem.

Yes, it should be, so glad that it is!

> Increasing the timeout to 5 minutes also would be a good idea, so we
> can let the worker threads exit cleanly so the reported stats will be
> completely accurate.

I made that separate change as well. If the job is stuck in the kernel
for some sync operation, we could feasibly be uninterruptible for
minutes. So 1 minute is too short in any case, and I'd rather just make
this check than send kill signals, since a signal won't fix the
uninterruptible problem.

> Thanks for your help in figuring out this long-standing problem!

It was easy based on all your info, since I could not reproduce. So
thanks for your help! Everything should be committed now, and I'll cut a
new release tomorrow so we can hopefully put this behind us.

Patch

diff --git a/backend.c b/backend.c
index 093b6a3a290e..f0927abfccb0 100644
--- a/backend.c
+++ b/backend.c
@@ -1723,6 +1723,14 @@  static void *thread_main(void *data)
  			}
  		}

+		/*
+		 * If we took too long to shut down, the main thread could
+		 * already consider us reaped/exited. If that happens, break
+		 * out and clean up.
+		 */
+		if (td->runstate >= TD_EXITED)
+			break;
+
  		clear_state = 1;

  		/*
diff --git a/fio.h b/fio.h
index 080842aef4f8..74c1b306af26 100644
--- a/fio.h
+++ b/fio.h
@@ -588,7 +588,7 @@  extern const char *runstate_to_name(int runstate);
   * Allow 60 seconds for a job to quit on its own, otherwise reap with
   * a vengeance.
   */
-#define FIO_REAP_TIMEOUT	60
+#define FIO_REAP_TIMEOUT	300

  #define TERMINATE_ALL		(-1U)
  extern void fio_terminate_threads(unsigned int);
