From patchwork Fri Sep 11 01:15:42 2009
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
X-Patchwork-Id: 46760
Received: from hormel.redhat.com (hormel1.redhat.com [209.132.177.33])
	by demeter.kernel.org (8.14.2/8.14.2) with ESMTP id n8B1HnN5004101
	for <patchwork-dm-devel@patchwork.kernel.org>;
	Fri, 11 Sep 2009 01:17:49 GMT
Received: from listman.util.phx.redhat.com (listman.util.phx.redhat.com
	[10.8.4.110])
	by hormel.redhat.com (Postfix) with ESMTP id 22D856196DD;
	Thu, 10 Sep 2009 21:17:48 -0400 (EDT)
Received: from int-mx08.intmail.prod.int.phx2.redhat.com
	(nat-pool.util.phx.redhat.com [10.8.5.200])
	by listman.util.phx.redhat.com (8.13.1/8.13.1) with ESMTP id
	n8B1Hgl4014561 for <dm-devel@listman.util.phx.redhat.com>;
	Thu, 10 Sep 2009 21:17:42 -0400
Received: from mx1.redhat.com (ext-mx03.extmail.prod.ext.phx2.redhat.com
	[10.5.110.7])
	by int-mx08.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with
	ESMTP id n8B1Hc1Z028016; Thu, 10 Sep 2009 21:17:38 -0400
Received: from song.cn.fujitsu.com (cn.fujitsu.com [222.73.24.84] (may be
	forged))
	by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id n8B1HRBR024358;
	Thu, 10 Sep 2009 21:17:28 -0400
Received: from tang.cn.fujitsu.com (tang.cn.fujitsu.com [10.167.250.3])
	by song.cn.fujitsu.com (Postfix) with ESMTP id CCF7617003F;
	Fri, 11 Sep 2009 09:17:26 +0800 (CST)
Received: from fnst.cn.fujitsu.com (tang.cn.fujitsu.com [127.0.0.1])
	by tang.cn.fujitsu.com (8.14.3/8.13.1) with ESMTP id n8B1HHdI000484;
	Fri, 11 Sep 2009 09:17:17 +0800
Received: from [127.0.0.1] (unknown [10.167.141.226])
	by fnst.cn.fujitsu.com (Postfix) with ESMTPA id F22482928C5;
	Fri, 11 Sep 2009 09:18:08 +0800 (CST)
Message-ID: <4AA9A4BE.30005@cn.fujitsu.com>
Date: Fri, 11 Sep 2009 09:15:42 +0800
From: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
User-Agent: Thunderbird 2.0.0.5 (Windows/20070716)
MIME-Version: 1.0
To: Vivek Goyal <vgoyal@redhat.com>, jens.axboe@oracle.com
References: <1251495072-7780-1-git-send-email-vgoyal@redhat.com>
	<4AA4B905.8010801@cn.fujitsu.com>
	<20090908191941.GF15974@redhat.com>
	<4AA75B71.5060109@cn.fujitsu.com>
	<20090909150537.GD8256@redhat.com>
In-Reply-To: <20090909150537.GD8256@redhat.com>
X-RedHat-Spam-Score: -0.942  (AWL,RDNS_NONE)
X-Scanned-By: MIMEDefang 2.67 on 10.5.11.21
X-Scanned-By: MIMEDefang 2.67 on 10.5.110.7
X-loop: dm-devel@redhat.com
Cc: dhaval@linux.vnet.ibm.com, peterz@infradead.org, dm-devel@redhat.com,
	dpshah@google.com, agk@redhat.com, balbir@linux.vnet.ibm.com,
	paolo.valente@unimore.it, jmarchan@redhat.com, fernando@oss.ntt.co.jp,
	mikew@google.com, jmoyer@redhat.com, nauman@google.com, mingo@elte.hu,
	m-ikeda@ds.jp.nec.com, riel@redhat.com, lizf@cn.fujitsu.com,
	fchecconi@gmail.com, s-uchida@ap.jp.nec.com,
	containers@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, righi.andrea@gmail.com,
	torvalds@linux-foundation.org
Subject: [dm-devel] [PATCH] io-controller: Fix task hang when there is
	more than one group
X-BeenThere: dm-devel@redhat.com
X-Mailman-Version: 2.1.5
Precedence: junk
Reply-To: device-mapper development <dm-devel@redhat.com>
List-Id: device-mapper development <dm-devel.redhat.com>
List-Unsubscribe: <https://www.redhat.com/mailman/listinfo/dm-devel>,
	<mailto:dm-devel-request@redhat.com?subject=unsubscribe>
List-Archive: <https://www.redhat.com/archives/dm-devel>
List-Post: <mailto:dm-devel@redhat.com>
List-Help: <mailto:dm-devel-request@redhat.com?subject=help>
List-Subscribe: <https://www.redhat.com/mailman/listinfo/dm-devel>,
	<mailto:dm-devel-request@redhat.com?subject=subscribe>
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com

Vivek Goyal wrote:
> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>> Hi Vivek,
>>>>
>>>> I happened to encounter a bug while testing IO Controller V9.
>>>> When three tasks run concurrently in three groups,
>>>> that is, one in the parent group and the other two
>>>> in two different child groups, reading or writing
>>>> files on some disk, say disk "hdb", a task may hang, and
>>>> other tasks that access "hdb" will also hang.
>>>>
>>>> The bug only happens when using the AS io scheduler.
>>>> The following script can reproduce this bug on my box.
>>>>
>>> Hi Gui,
>>>
>>> I tried reproducing this on my system and can't reproduce it. All the
>>> three processes get killed and system does not hang.
>>>
>>> Can you please dig deeper a bit into it. 
>>>
>>> - If whole system hangs or it is just IO to disk seems to be hung.
>>     The task hangs only when it tries to do IO to the disk.
>>
>>> - Does io scheduler switch on the device work
>>     yes, io scheduler can be switched, and the hung task will be resumed.
>>
>>> - If the system is not hung, can you capture the blktrace on the device.
>>>   Trace might give some idea, what's happening.
>> I ran a "find" task to do some IO on that disk, and it seems the task hangs
>> while issuing the getdents() syscall.
>> The kernel generates the following message:
>>
>> INFO: task find:3260 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> find          D a1e95787  1912  3260   2897 0x00000004
>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>> Call Trace:
>>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>  [<c068ab68>] io_schedule+0x47/0x79
>>  [<c04c12ee>] sync_buffer+0x36/0x3a
>>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>>  [<c04b12fd>] vfs_readdir+0x68/0x94
>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>>  [<c04028b4>] sysenter_do_call+0x12/0x32
>> 1 lock held by find/3260:
>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>
>> ext3 calls wait_on_buffer() to wait on the buffer, which schedules the task out in
>> TASK_UNINTERRUPTIBLE state. I found the task is resumed only after quite a long period (more than 10 minutes).
> 
> Thanks Gui. As Jens said, it does look like a case of missing queue
> restart somewhere and now we are stuck, no requests are being dispatched
> to the disk and queue is already unplugged.
> 
> Can you please also try capturing the trace of events at io scheduler
> (blktrace) to see how did we get into that situation.
> 
> Are you using ide drivers and not libata? As jens said, I will try to make
> use of ide drivers and see if I can reproduce it.
> 

Hi Vivek, Jens,

Currently, if only the root cgroup exists and there are no child cgroups, io-controller skips
expiring the current ioq, on the assumption that the current ioq belongs to the root group. In
some cases this assumption does not hold. Consider the following scenario: a child cgroup exists
under the root cgroup, and task A runs in that child cgroup and issues some IO. We then kill task A
and remove the child cgroup, so only the root cgroup is left. The ioq is still under service,
however, and from now on it will never expire because of the "only root" optimization.
The following patch makes sure the ioq really does belong to the root group when only the root
group exists.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
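For illustration only, the difference between the old and the new check
can be modelled in user space. This is a rough sketch, not kernel code:
the structures are cut-down, hypothetical stand-ins that keep only the
fields the new check reads (busy_queues, root_group, ioq), the "root has
children" flag replaces the list_empty() test on the cgroup's children
list, and the values used in the scenario are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct io_queue { int unused; };
struct io_group { struct io_queue *ioq; };	/* the group's ioq, if any */
struct elv_fq_data {
	int busy_queues;			/* busy ioqs on this device */
	struct io_group *root_group;
};

/* Old check: only asks whether the root cgroup has any children. */
static bool only_root_old(bool root_has_children)
{
	return !root_has_children;
}

/* New check: additionally requires exactly one busy queue and a root ioq. */
static bool only_root_new(bool root_has_children,
			  const struct elv_fq_data *efqd)
{
	return !root_has_children &&
	       efqd->busy_queues == 1 &&
	       efqd->root_group->ioq != NULL;
}

/*
 * Assumed scenario: the child cgroup has just been removed, its ioq is
 * the only busy queue, and the root group has no ioq of its own.
 */
void demo_leftover_child_ioq(void)
{
	struct io_group root = { .ioq = NULL };
	struct elv_fq_data efqd = { .busy_queues = 1, .root_group = &root };

	/* Old check takes the "no expiry" shortcut for the leftover ioq. */
	assert(only_root_old(false));
	/* New check declines the shortcut, so the ioq can still expire. */
	assert(!only_root_new(false, &efqd));
}
```

In this model the shortcut is taken again once the root group owns the
only busy ioq, which is the case the optimization was meant for.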
 block/elevator-fq.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b723c12..3f86552 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2338,9 +2338,10 @@ void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
 	}
 }
 
-static inline int is_only_root_group(void)
+static inline int is_only_root_group(struct elv_fq_data *efqd)
 {
-	if (list_empty(&io_root_cgroup.css.cgroup->children))
+	if (list_empty(&io_root_cgroup.css.cgroup->children) &&
+	    efqd->busy_queues == 1 && efqd->root_group->ioq)
 		return 1;
 
 	return 0;
@@ -2383,7 +2384,7 @@ static void io_free_root_group(struct elevator_queue *e)
 int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
 EXPORT_SYMBOL(elv_iog_should_idle);
 
-static inline int is_only_root_group(void)
+static inline int is_only_root_group(struct elv_fq_data *efqd)
 {
 	return 1;
 }
@@ -2547,7 +2548,7 @@ elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
 	struct elevator_queue *e = q->elevator;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
 	int ret = 1;
-
+	
 	if (e->ops->elevator_expire_ioq_fn) {
 		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
 							slice_expired, force);
@@ -2969,7 +2970,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
 	 * single queue ioschedulers (noop, deadline, AS).
 	 */
 
-	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
+	if (is_only_root_group(efqd) && elv_iosched_single_ioq(q->elevator))
 		goto keep_queue;
 
 	/* We are waiting for this group to become busy before it expires.*/
@@ -3180,7 +3181,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * unnecessary overhead.
 		 */
 
-		if (is_only_root_group() &&
+		if (is_only_root_group(ioq->efqd) &&
 			elv_iosched_single_ioq(q->elevator)) {
 			elv_log_ioq(efqd, ioq, "select: only root group,"
 					" no expiry");