From patchwork Fri May 27 19:18:27 2011
X-Patchwork-Submitter: Gregory Farnum
X-Patchwork-Id: 825202
From: Gregory Farnum
To: Fyodor Ustinov
Cc: ceph-devel@vger.kernel.org
Date: Fri, 27 May 2011 12:18:27 -0700
Subject: Re: OSD crash
X-Mailing-List: ceph-devel@vger.kernel.org

Can you check out the recoverywq_fix branch and see if that prevents this
issue? Or just apply the patch I've included below. :)
-Greg

On Fri, May 27, 2011 at 9:49 AM, Gregory Farnum wrote:
> Okay, one of us should have a patch for you later today to try out;
> the only cause I can come up with seems unlikely but there aren't many
> options so it should be pretty simple. :)
> I opened a tracker issue for it: http://tracker.newdream.net/issues/1116
> -Greg
>
> On Fri, May 27, 2011 at 9:41 AM, Fyodor Ustinov wrote:
>> On 05/27/2011 06:16 PM, Gregory Farnum wrote:
>>>
>>> This is an interesting one -- the invariant that assert is checking
>>> isn't too complicated (that the object lives on the RecoveryWQ's
>>> queue) and seems to hold everywhere the RecoveryWQ is called.
>>> And the functions modifying the queue are always called under the
>>> workqueue lock, and do maintenance if the xlist::item is on a
>>> different list. Which makes me think that the problem must be from
>>> conflating the RecoveryWQ lock and the PG lock in the few places that
>>> modify the PG::recovery_item directly, rather than via RecoveryWQ
>>> functions. Anybody more familiar than me with this have ideas?
>>> Fyodor, based on the time stamps and output you've given us, I assume
>>> you don't have more detailed logs?
>>> -Greg
>>
>> Greg, I got this crash again.
>> Let me tell you the configuration and what happened.
>>
>> Configuration:
>> 6 osd servers: 4G RAM, 4*1T hdd (mdadm raid0), 2*1G etherchannel
>> ethernet, Ubuntu server 11.04/64 with kernel 2.6.39 (hand compiled).
>> mon+mds server: 24G RAM, same OS.
>>
>> On each OSD the journal is placed on a 1G tmpfs. OSD data is on xfs in
>> this case.
>>
>> Configuration file:
>>
>> [global]
>>        max open files = 131072
>>        log file = /var/log/ceph/$name.log
>>        pid file = /var/run/ceph/$name.pid
>>
>> [mon]
>>        mon data = /mfs/mon$id
>>
>> [mon.0]
>>        mon addr = 10.5.51.230:6789
>>
>> [mds]
>>        keyring = /mfs/mds/keyring.$name
>>
>> [mds.0]
>>        host = mds0
>>
>> [osd]
>>        osd data = /$name
>>        osd journal = /journal/$name
>>        osd journal size = 950
>>        journal dio = false
>>
>> [osd.0]
>>        host = osd0
>>        cluster addr = 10.5.51.10
>>        public addr = 10.5.51.140
>>
>> [osd.1]
>>        host = osd1
>>        cluster addr = 10.5.51.11
>>        public addr = 10.5.51.141
>>
>> [osd.2]
>>        host = osd2
>>        cluster addr = 10.5.51.12
>>        public addr = 10.5.51.142
>>
>> [osd.3]
>>        host = osd3
>>        cluster addr = 10.5.51.13
>>        public addr = 10.5.51.143
>>
>> [osd.4]
>>        host = osd4
>>        cluster addr = 10.5.51.14
>>        public addr = 10.5.51.144
>>
>> [osd.5]
>>        host = osd5
>>        cluster addr = 10.5.51.15
>>        public addr = 10.5.51.145
>>
>> What happened: osd2 crashed and was rebooted; its osd data and journal
>> were created from scratch with "cosd --mkfs -i 2 --monmap /tmp/monmap"
>> and the server was started.
>> Additionally, "writeahead" journal mode is enabled on osd2, but I don't
>> think that matters in this case.
>>
>> Well, the server started rebalancing:
>>
>> 2011-05-27 15:12:49.323558 7f3b69de5740 ceph version 0.28.1.commit: d66c6ca19bbde3c363b135b66072de44e67c6632. process: cosd. pid: 1694
>> 2011-05-27 15:12:49.325331 7f3b69de5740 filestore(/osd.2) mount FIEMAP ioctl is NOT supported
>> 2011-05-27 15:12:49.325378 7f3b69de5740 filestore(/osd.2) mount did NOT detect btrfs
>> 2011-05-27 15:12:49.325467 7f3b69de5740 filestore(/osd.2) mount found snaps <>
>> 2011-05-27 15:12:49.325512 7f3b69de5740 filestore(/osd.2) mount: WRITEAHEAD journal mode explicitly enabled in conf
>> 2011-05-27 15:12:49.325526 7f3b69de5740 filestore(/osd.2) mount WARNING: not btrfs or ext3; data may be lost
>> 2011-05-27 15:12:49.325606 7f3b69de5740 journal _open /journal/osd.2 fd 11: 996147200 bytes, block size 4096 bytes, directio = 0
>> 2011-05-27 15:12:49.325641 7f3b69de5740 journal read_entry 4096 : seq 1 203 bytes
>> 2011-05-27 15:12:49.325698 7f3b69de5740 journal _open /journal/osd.2 fd 11: 996147200 bytes, block size 4096 bytes, directio = 0
>> 2011-05-27 15:12:49.544716 7f3b59656700 -- 10.5.51.12:6801/1694 >> 10.5.51.14:6801/5070 pipe(0x1239d20 sd=27 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
>> 2011-05-27 15:12:49.544798 7f3b59c5c700 -- 10.5.51.12:6801/1694 >> 10.5.51.13:6801/5165 pipe(0x104b950 sd=14 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
>> 2011-05-27 15:12:49.544864 7f3b59757700 -- 10.5.51.12:6801/1694 >> 10.5.51.15:6801/1574 pipe(0x11e7cd0 sd=16 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
>> 2011-05-27 15:12:49.544909 7f3b59959700 -- 10.5.51.12:6801/1694 >> 10.5.51.10:6801/6148 pipe(0x11d7d30 sd=15 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
>> 2011-05-27 15:13:23.015637 7f3b64579700 journal check_for_full at 66404352 : JOURNAL FULL 66404352 >= 851967 (max_size 996147200 start 67256320)
>> 2011-05-27 15:13:25.586081 7f3b5dc6b700 journal throttle: waited for bytes
>> 2011-05-27 15:13:25.601789 7f3b5d46a700 journal throttle: waited for bytes
>>
>> [...]
>> and after 2 hours:
>>
>> 2011-05-27 17:30:21.355034 7f3b64579700 journal check_for_full at 415199232 : JOURNAL FULL 415199232 >= 778239 (max_size 996147200 start 415977472)
>> 2011-05-27 17:30:23.441445 7f3b5d46a700 journal throttle: waited for bytes
>> 2011-05-27 17:30:36.362877 7f3b64579700 journal check_for_full at 414326784 : JOURNAL FULL 414326784 >= 872447 (max_size 996147200 start 415199232)
>> 2011-05-27 17:30:38.391372 7f3b5d46a700 journal throttle: waited for bytes
>> 2011-05-27 17:30:50.373936 7f3b64579700 journal check_for_full at 414314496 : JOURNAL FULL 414314496 >= 12287 (max_size 996147200 start 414326784)
>> ./include/xlist.h: In function 'void xlist<T>::remove(xlist<T>::item*) [with T = PG*]', in thread '0x7f3b5cc69700'
>> ./include/xlist.h: 107: FAILED assert(i->_list == this)
>>  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
>>  1: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
>>  2: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
>>  3: (ThreadPool::worker()+0x10a) [0x65799a]
>>  4: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
>>  5: (()+0x6d8c) [0x7f3b697b5d8c]
>>  6: (clone()+0x6d) [0x7f3b6866804d]
>>  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
>>  1: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
>>  2: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
>>  3: (ThreadPool::worker()+0x10a) [0x65799a]
>>  4: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
>>  5: (()+0x6d8c) [0x7f3b697b5d8c]
>>  6: (clone()+0x6d) [0x7f3b6866804d]
>> *** Caught signal (Aborted) **
>>  in thread 0x7f3b5cc69700
>>  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
>>  1: /usr/bin/cosd() [0x6729f9]
>>  2: (()+0xfc60) [0x7f3b697bec60]
>>  3: (gsignal()+0x35) [0x7f3b685b5d05]
>>  4: (abort()+0x186) [0x7f3b685b9ab6]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f3b68e6c6dd]
>>  6: (()+0xb9926) [0x7f3b68e6a926]
>>  7: (()+0xb9953) [0x7f3b68e6a953]
>>  8: (()+0xb9a5e) [0x7f3b68e6aa5e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x362) [0x655e32]
>>  10: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
>>  11: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
>>  12: (ThreadPool::worker()+0x10a) [0x65799a]
>>  13: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
>>  14: (()+0x6d8c) [0x7f3b697b5d8c]
>>  15: (clone()+0x6d) [0x7f3b6866804d]
>>
>> I.e. it's not an easily reproduced bug. While I had less data in the
>> cluster, I did not see this error.
>>
>> I don't think I have enough space for a "full" log covering 2-3 hours.
>> Sorry.
>>
>> WBR,
>>    Fyodor.

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 2224e67..41c09d6 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -4789,11 +4789,17 @@ void OSD::do_recovery(PG *pg)
       do_queries(query_map);
     else {
       dout(10) << "do_recovery no luck, giving up on this pg for now" << dendl;
+      recovery_wq.lock();
       pg->recovery_item.remove_myself();	// sigh...
+      recovery_wq.unlock();
+    }
   }
-  else if (started < max)
+  else if (started < max) {
+    recovery_wq.lock();
     pg->recovery_item.remove_myself();
+    recovery_wq.unlock();
+  }
   pg->unlock();
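
For anyone trying to follow the assert itself: here is a simplified sketch of
the bookkeeping that include/xlist.h enforces. It is illustration only -- not
the real xlist, which is an intrusive list -- but it shows what
assert(i->_list == this) is protecting:

// xlist_sketch.cc -- simplified stand-in for include/xlist.h (illustration only)
#include <cassert>
#include <list>

template <typename T>
struct xlist {
  struct item {
    T owner;
    xlist *_list;                     // which xlist currently holds this item
    explicit item(T o) : owner(o), _list(0) {}
    void remove_myself() {            // what PG::recovery_item offers
      if (_list)
        _list->remove(this);
    }
  };

  std::list<item*> members;           // stand-in for the intrusive links

  void push_back(item *i) {
    assert(i->_list == 0);            // must not already be queued elsewhere
    i->_list = this;
    members.push_back(i);
  }

  void remove(item *i) {
    assert(i->_list == this);         // the assert that fails at xlist.h:107
    members.remove(i);
    i->_list = 0;
  }

  item *pop_front() {                 // the path RecoveryWQ::_dequeue() takes
    item *i = members.front();
    remove(i);
    return i;
  }
};

int main() {
  int pg = 0;                         // stand-in for a PG
  xlist<int*> queue;
  xlist<int*>::item it(&pg);
  queue.push_back(&it);
  it.remove_myself();                 // fine when serialized with the queue
  assert(queue.members.empty());
  return 0;
}

RecoveryWQ::_dequeue() pops the front item while holding the workqueue lock,
but OSD::do_recovery() calls pg->recovery_item.remove_myself() holding only
the PG lock. The theory above is that the two can interleave, so that by the
time pop_front() reaches remove() the item's _list pointer no longer matches
the queue and the assert fires -- matching the backtrace above
(RecoveryWQ::_dequeue -> pop_front -> FAILED assert).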
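
The patch above just applies the corresponding rule in OSD::do_recovery():
any place that touches PG::recovery_item directly, rather than through the
RecoveryWQ helpers, should do it under the workqueue lock, i.e.

  recovery_wq.lock();
  pg->recovery_item.remove_myself();
  recovery_wq.unlock();

so that it is serialized against RecoveryWQ::_dequeue(), which already runs
under that lock.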