From patchwork Wed Jun 24 21:40:24 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Yehuda Sadeh <yehuda@redhat.com>
X-Patchwork-Id: 6670101
Return-Path: <ceph-devel-owner@kernel.org>
X-Original-To: patchwork-ceph-devel@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.136])
	by patchwork2.web.kernel.org (Postfix) with ESMTP id 5ABCFC05AC
	for <patchwork-ceph-devel@patchwork.kernel.org>;
	Wed, 24 Jun 2015 21:40:32 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
	by mail.kernel.org (Postfix) with ESMTP id 383AB20575
	for <patchwork-ceph-devel@patchwork.kernel.org>;
	Wed, 24 Jun 2015 21:40:31 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id D91C220574
	for <patchwork-ceph-devel@patchwork.kernel.org>;
	Wed, 24 Jun 2015 21:40:29 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751685AbbFXVk1 (ORCPT
	<rfc822;patchwork-ceph-devel@patchwork.kernel.org>);
	Wed, 24 Jun 2015 17:40:27 -0400
Received: from mx5-phx2.redhat.com ([209.132.183.37]:45789 "EHLO
	mx5-phx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750780AbbFXVk1 convert rfc822-to-8bit (ORCPT
	<rfc822; ceph-devel@vger.kernel.org>); Wed, 24 Jun 2015 17:40:27 -0400
Received: from zmail23.collab.prod.int.phx2.redhat.com
	(zmail23.collab.prod.int.phx2.redhat.com [10.5.83.28])
	by mx5-phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t5OLeO2O027928;
	Wed, 24 Jun 2015 17:40:25 -0400
Date: Wed, 24 Jun 2015 17:40:24 -0400 (EDT)
From: Yehuda Sadeh-Weinraub <yehuda@redhat.com>
To: GuangYang <yguang11@outlook.com>
Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
Message-ID: <445856759.19469749.1435182024677.JavaMail.zimbra@redhat.com>
In-Reply-To: <935875968.19447031.1435180864559.JavaMail.zimbra@redhat.com>
References: <BLU175-W1310E96C473E9ABED62D65DFAF0@phx.gbl>
	<460832233.19349487.1435171225932.JavaMail.zimbra@redhat.com>
	<BLU175-W467685E9DD4BACB37D7AB3DFAF0@phx.gbl>
	<463953216.19442278.1435179845477.JavaMail.zimbra@redhat.com>
	<BLU175-W7F6E88F53F4EF4DF45248DFAF0@phx.gbl>
	<935875968.19447031.1435180864559.JavaMail.zimbra@redhat.com>
Subject: Re: radosgw crash within libfcgi
MIME-Version: 1.0
X-Originating-IP: [10.17.97.101]
X-Mailer: Zimbra 8.0.6_GA_5922 (ZimbraWebClient - FF28 (Linux)/8.0.6_GA_5922)
Thread-Topic: radosgw crash within libfcgi
Thread-Index: Xlv++SzUoUa0ikXBOd7TBynyxDAx1c5jKibv
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID: <ceph-devel.vger.kernel.org>
X-Mailing-List: ceph-devel@vger.kernel.org
X-Spam-Status: No, score=-8.3 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI,
	RP_MATCHES_RCVD,
	UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Also, looking at the code, I see an extra call to FCGX_Finish_r() (see the
diff at the end of this message).

Maybe this is a problem with the specific libfcgi version that you're using?

----- Original Message -----
> From: "Yehuda Sadeh-Weinraub" <yehuda@redhat.com>
> To: "GuangYang" <yguang11@outlook.com>
> Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> Sent: Wednesday, June 24, 2015 2:21:04 PM
> Subject: Re: radosgw crash within libfcgi
> 
> 
> 
> ----- Original Message -----
> > From: "GuangYang" <yguang11@outlook.com>
> > To: "Yehuda Sadeh-Weinraub" <yehuda@redhat.com>
> > Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> > Sent: Wednesday, June 24, 2015 2:12:23 PM
> > Subject: RE: radosgw crash within libfcgi
> > 
> > ----------------------------------------
> > > Date: Wed, 24 Jun 2015 17:04:05 -0400
> > > From: yehuda@redhat.com
> > > To: yguang11@outlook.com
> > > CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
> > > Subject: Re: radosgw crash within libfcgi
> > >
> > >
> > >
> > > ----- Original Message -----
> > >> From: "GuangYang" <yguang11@outlook.com>
> > >> To: "Yehuda Sadeh-Weinraub" <yehuda@redhat.com>
> > >> Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> > >> Sent: Wednesday, June 24, 2015 1:53:20 PM
> > >> Subject: RE: radosgw crash within libfcgi
> > >>
> > >> Thanks Yehuda for the response.
> > >>
> > >> We already patched libfcgi to use poll instead of select to overcome the
> > >> limitation.
> > >>
> > >> Thanks,
> > >> Guang
> > >>
> > >>
> > >> ----------------------------------------
> > >>> Date: Wed, 24 Jun 2015 14:40:25 -0400
> > >>> From: yehuda@redhat.com
> > >>> To: yguang11@outlook.com
> > >>> CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
> > >>> Subject: Re: radosgw crash within libfcgi
> > >>>
> > >>>
> > >>>
> > >>> ----- Original Message -----
> > >>>> From: "GuangYang" <yguang11@outlook.com>
> > >>>> To: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com,
> > >>>> yehuda@redhat.com
> > >>>> Sent: Wednesday, June 24, 2015 10:09:58 AM
> > >>>> Subject: radosgw crash within libfcgi
> > >>>>
> > >>>> Hello Cephers,
> > >>>> Recently we have seen several radosgw daemon crashes, all with the
> > >>>> following kernel log:
> > >>>>
> > >>>> Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
> > >>>> 00007ffa069996f2 sp 00007ff55c432710 error 6 in
> > >
> > > error 6 is sigabrt, right? With invalid pointer I'd expect to get
> > > segfault.
> > > Is the pointer actually invalid?
> > Computing (ip - {load_address_of_the_shared_library}) gives the offset of
> > the instruction which caused this crash. The objdump output shows the crash
> > happened at instruction 46f2 (see below), which assigns -1 to
> > FCGX_Request::ipcFd, but I don't quite understand how/why it could crash
> > there.
> > 
> > 0000000000004690 <FCGX_Free>:
> >     4690:       48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
> >     4695:       48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
> >     469a:       48 83 ec 18             sub    $0x18,%rsp
> >     469e:       48 85 ff                test   %rdi,%rdi
> >     46a1:       48 89 fb                mov    %rdi,%rbx
> >     46a4:       89 f5                   mov    %esi,%ebp
> >     46a6:       74 28                   je     46d0 <FCGX_Free+0x40>
> >     46a8:       48 8d 7f 08             lea    0x8(%rdi),%rdi
> >     46ac:       e8 67 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
> >     46b1:       48 8d 7b 10             lea    0x10(%rbx),%rdi
> >     46b5:       e8 5e e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
> >     46ba:       48 8d 7b 18             lea    0x18(%rbx),%rdi
> >     46be:       e8 55 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
> >     46c3:       48 8d 7b 28             lea    0x28(%rbx),%rdi
> >     46c7:       e8 d4 f4 ff ff          callq  3ba0 <FCGX_PutS+0x40>
> >     46cc:       85 ed                   test   %ebp,%ebp
> >     46ce:       75 10                   jne    46e0 <FCGX_Free+0x50>
> >     46d0:       48 8b 5c 24 08          mov    0x8(%rsp),%rbx
> >     46d5:       48 8b 6c 24 10          mov    0x10(%rsp),%rbp
> >     46da:       48 83 c4 18             add    $0x18,%rsp
> >     46de:       c3                      retq
> >     46df:       90                      nop
> >     46e0:       31 f6                   xor    %esi,%esi
> >     46e2:       83 7b 4c 00             cmpl   $0x0,0x4c(%rbx)
> >     46e6:       8b 7b 30                mov    0x30(%rbx),%edi
> >     46e9:       40 0f 94 c6             sete   %sil
> >     46ed:       e8 86 e6 ff ff          callq  2d78 <OS_IpcClose@plt>
> >     46f2:       c7 43 30 ff ff ff ff    movl   $0xffffffff,0x30(%rbx)
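
As a cross-check on the arithmetic above, here is a minimal sketch that
reproduces the 46f2 offset from the kernel log line. It also decodes the
"error 6" field, which as far as I can tell is the x86 page-fault error code
rather than a signal number. The function names are made up for illustration,
and the %rbx estimate assumes the faulting instruction really is the movl at
46f2 storing to 0x30(%rbx).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Offset of the faulting instruction inside the mapped library:
// ip minus the load address the kernel prints in brackets.
constexpr std::uint64_t library_offset(std::uint64_t ip, std::uint64_t load_base) {
    return ip - load_base;
}

// Base register value implied by a faulting access at disp(%reg).
// Assumption: the fault comes from a plain base+displacement operand.
constexpr std::uint64_t base_from_fault(std::uint64_t fault_addr, std::uint64_t disp) {
    return fault_addr - disp;
}

// Decode the low three bits of the x86 page-fault error code.
inline std::string decode_fault(unsigned err) {
    std::string s;
    s += (err & 0x1) ? "protection violation" : "page not present";
    s += (err & 0x2) ? ", write" : ", read";
    s += (err & 0x4) ? ", user mode" : ", kernel mode";
    return s;
}

// Values taken from the log line: ip 00007ffa069996f2,
// libfcgi.so.0.0.0[7ffa06995000+a000], "segfault at f0".
static_assert(library_offset(0x7ffa069996f2ULL, 0x7ffa06995000ULL) == 0x46f2,
              "ip maps to offset 46f2 inside libfcgi");
static_assert(base_from_fault(0xf0, 0x30) == 0xc0,
              "a store to 0x30(%rbx) faulting at f0 implies rbx == 0xc0");

// Runtime check of the error-code decoding for "error 6".
inline void check_log_line() {
    assert(decode_fault(6) == "page not present, write, user mode");
}
```

If that reading holds, the request pointer in %rbx would be a small bogus
value (0xc0), not a valid heap address.
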
> 
> info registers?
> 
> Not too familiar with the specific message, but it could be that
> OS_IpcClose() aborts (not highly unlikely) and it only dumps the return
> address of the current function (shouldn't be referenced as ip though).
> 
> What's rbx? Is the memory at %rbx + 0x30 valid?
> 
> Also, did you by any chance upgrade the binaries while the code was running?
> Is the code running over NFS?
> 
> Yehuda
> 
> > >
> > > Yehuda
> > >
> > >
> > >>>> libfcgi.so.0.0.0[7ffa06995000+a000] in
> > >>>> libfcgi.so.0.0.0[7ffa06995000+a000]
> > >>>>
> > >>>> Looking at the assembly, it seems to be crashing at this point -
> > >>>> http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035,
> > >>>> which confused me. I tried to find any other reference holding the
> > >>>> FCGX_Request that might release the handle, without any luck.
> > >>>>
> > >>>> There are also other observations:
> > >>>> 1> Several radosgw daemons across different hosts crashed around the
> > >>>> same time.
> > >>>> 2> Apache's error log has some fcgi errors complaining ##idle timeout##
> > >>>> during that time.
> > >>>>
> > >>>> Has anyone experienced a similar issue?
> > >>>>
> > >>>
> > >>> In the past we've had issues with libfcgi that were related to the
> > >>> number
> > >>> of open fds on the process (> 1024). The issue was a buggy libfcgi that
> > >>> was using select() instead of poll(), so this might be the issue you're
> > >>> noticing.
> > >>>
> > >>> Yehuda
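
To illustrate the select() limitation mentioned above: an fd_set can only
hold descriptors below FD_SETSIZE (commonly 1024), so calling FD_SET() on a
larger descriptor is undefined behaviour, while poll() takes the descriptor
as a plain int and has no such limit. The following is a rough sketch only,
assuming a POSIX system. It is not the libfcgi code, and the helper names
exceeds_select_limit() and wait_readable() are hypothetical.

```cpp
#include <poll.h>
#include <sys/select.h>

// True when fd cannot be stored safely in an fd_set, i.e. when
// FD_SET(fd, &set) would write outside the set.
inline bool exceeds_select_limit(int fd) {
    return fd < 0 || fd >= FD_SETSIZE;
}

// Wait for fd to become readable using poll() instead of select().
// Returns 1 if readable (or hung up, so read() will not block),
// 0 on timeout, -1 on error or invalid descriptor.
inline int wait_readable(int fd, int timeout_ms) {
    struct pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;

    int rc = poll(&p, 1, timeout_ms);
    if (rc <= 0)
        return rc;                      // 0: timeout, -1: poll() failed
    if (p.revents & (POLLERR | POLLNVAL))
        return -1;                      // descriptor is in an error state
    return (p.revents & (POLLIN | POLLHUP)) ? 1 : 0;
}
```

A process with more than FD_SETSIZE open descriptors can still call
wait_readable() on any of them, whereas the select() variant would have to
reject them up front using a check like exceeds_select_limit().
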
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>> in
> > >>> the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > 
>
---
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/src/rgw/rgw_main.cc b/src/rgw/rgw_main.cc
index 9a8aa5f..0aa7ded 100644
--- a/src/rgw/rgw_main.cc
+++ b/src/rgw/rgw_main.cc
@@ -669,8 +669,6 @@ void RGWFCGXProcess::handle_request(RGWRequest *r)
     dout(20) << "process_request() returned " << ret << dendl;
   }
 
-  FCGX_Finish_r(fcgx);
-
   delete req;
 }