From patchwork Wed Jun 24 21:40:24 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Yehuda Sadeh
X-Patchwork-Id: 6670101
Return-Path:
X-Original-To: patchwork-ceph-devel@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.136])
    by patchwork2.web.kernel.org (Postfix) with ESMTP id 5ABCFC05AC
    for ; Wed, 24 Jun 2015 21:40:32 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
    by mail.kernel.org (Postfix) with ESMTP id 383AB20575
    for ; Wed, 24 Jun 2015 21:40:31 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
    by mail.kernel.org (Postfix) with ESMTP id D91C220574
    for ; Wed, 24 Jun 2015 21:40:29 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1751685AbbFXVk1 (ORCPT ); Wed, 24 Jun 2015 17:40:27 -0400
Received: from mx5-phx2.redhat.com ([209.132.183.37]:45789 "EHLO
    mx5-phx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S1750780AbbFXVk1 convert rfc822-to-8bit (ORCPT );
    Wed, 24 Jun 2015 17:40:27 -0400
Received: from zmail23.collab.prod.int.phx2.redhat.com
    (zmail23.collab.prod.int.phx2.redhat.com [10.5.83.28])
    by mx5-phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t5OLeO2O027928;
    Wed, 24 Jun 2015 17:40:25 -0400
Date: Wed, 24 Jun 2015 17:40:24 -0400 (EDT)
From: Yehuda Sadeh-Weinraub
To: GuangYang
Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
Message-ID: <445856759.19469749.1435182024677.JavaMail.zimbra@redhat.com>
In-Reply-To: <935875968.19447031.1435180864559.JavaMail.zimbra@redhat.com>
References: <460832233.19349487.1435171225932.JavaMail.zimbra@redhat.com>
    <463953216.19442278.1435179845477.JavaMail.zimbra@redhat.com>
    <935875968.19447031.1435180864559.JavaMail.zimbra@redhat.com>
Subject: Re: radosgw crash within libfcgi
MIME-Version: 1.0
X-Originating-IP: [10.17.97.101]
X-Mailer: Zimbra 8.0.6_GA_5922 (ZimbraWebClient - FF28 (Linux)/8.0.6_GA_5922)
Thread-Topic: radosgw crash within libfcgi
Thread-Index: Xlv++SzUoUa0ikXBOd7TBynyxDAx1c5jKibv
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: ceph-devel@vger.kernel.org
X-Spam-Status: No, score=-8.3 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI,
    RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Also, looking at the code, I see an extra call to FCGX_Finish_r():

Maybe this is a problem on the specific libfcgi version that you're using?

----- Original Message -----
> From: "Yehuda Sadeh-Weinraub"
> To: "GuangYang"
> Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> Sent: Wednesday, June 24, 2015 2:21:04 PM
> Subject: Re: radosgw crash within libfcgi
>
>
>
> ----- Original Message -----
> > From: "GuangYang"
> > To: "Yehuda Sadeh-Weinraub"
> > Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> > Sent: Wednesday, June 24, 2015 2:12:23 PM
> > Subject: RE: radosgw crash within libfcgi
> >
> > ----------------------------------------
> > > Date: Wed, 24 Jun 2015 17:04:05 -0400
> > > From: yehuda@redhat.com
> > > To: yguang11@outlook.com
> > > CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
> > > Subject: Re: radosgw crash within libfcgi
> > >
> > >
> > >
> > > ----- Original Message -----
> > >> From: "GuangYang"
> > >> To: "Yehuda Sadeh-Weinraub"
> > >> Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
> > >> Sent: Wednesday, June 24, 2015 1:53:20 PM
> > >> Subject: RE: radosgw crash within libfcgi
> > >>
> > >> Thanks Yehuda for the response.
> > >>
> > >> We already patched libfcgi to use poll instead of select to overcome the
> > >> limitation.
> > >>
> > >> Thanks,
> > >> Guang
> > >>
> > >>
> > >> ----------------------------------------
> > >>> Date: Wed, 24 Jun 2015 14:40:25 -0400
> > >>> From: yehuda@redhat.com
> > >>> To: yguang11@outlook.com
> > >>> CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
> > >>> Subject: Re: radosgw crash within libfcgi
> > >>>
> > >>>
> > >>>
> > >>> ----- Original Message -----
> > >>>> From: "GuangYang"
> > >>>> To: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com,
> > >>>> yehuda@redhat.com
> > >>>> Sent: Wednesday, June 24, 2015 10:09:58 AM
> > >>>> Subject: radosgw crash within libfcgi
> > >>>>
> > >>>> Hello Cephers,
> > >>>> Recently we have had several radosgw daemon crashes, all with the
> > >>>> following kernel log:
> > >>>>
> > >>>> Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
> > >>>> 00007ffa069996f2 sp 00007ff55c432710 error 6 in
> > >
> > > error 6 is sigabrt, right? With invalid pointer I'd expect to get
> > > segfault.
> > > Is the pointer actually invalid?
> > Using (ip - {address_load_the_shared_library}) to find the instruction
> > which caused this crash, objdump shows the crash happened at instruction
> > 46f2 (see below), which assigns -1 to FCGX_Request::ipcFd, but I don't
> > quite understand how/why it could crash there.
> >
> > 0000000000004690 :
> >     4690:       48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
> >     4695:       48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
> >     469a:       48 83 ec 18             sub    $0x18,%rsp
> >     469e:       48 85 ff                test   %rdi,%rdi
> >     46a1:       48 89 fb                mov    %rdi,%rbx
> >     46a4:       89 f5                   mov    %esi,%ebp
> >     46a6:       74 28                   je     46d0
> >     46a8:       48 8d 7f 08             lea    0x8(%rdi),%rdi
> >     46ac:       e8 67 e3 ff ff          callq  2a18
> >     46b1:       48 8d 7b 10             lea    0x10(%rbx),%rdi
> >     46b5:       e8 5e e3 ff ff          callq  2a18
> >     46ba:       48 8d 7b 18             lea    0x18(%rbx),%rdi
> >     46be:       e8 55 e3 ff ff          callq  2a18
> >     46c3:       48 8d 7b 28             lea    0x28(%rbx),%rdi
> >     46c7:       e8 d4 f4 ff ff          callq  3ba0
> >     46cc:       85 ed                   test   %ebp,%ebp
> >     46ce:       75 10                   jne    46e0
> >     46d0:       48 8b 5c 24 08          mov    0x8(%rsp),%rbx
> >     46d5:       48 8b 6c 24 10          mov    0x10(%rsp),%rbp
> >     46da:       48 83 c4 18             add    $0x18,%rsp
> >     46de:       c3                      retq
> >     46df:       90                      nop
> >     46e0:       31 f6                   xor    %esi,%esi
> >     46e2:       83 7b 4c 00             cmpl   $0x0,0x4c(%rbx)
> >     46e6:       8b 7b 30                mov    0x30(%rbx),%edi
> >     46e9:       40 0f 94 c6             sete   %sil
> >     46ed:       e8 86 e6 ff ff          callq  2d78
> >     46f2:       c7 43 30 ff ff ff ff    movl   $0xffffffff,0x30(%rbx)
>
> info registers?
>
> Not too familiar with the specific message, but it could be that
> OS_IpcClose() aborts (not highly unlikely) and it only dumps the return
> address of the current function (shouldn't be referenced as ip though).
>
> What's rbx?
> Is the memory at %rbx + 0x30 valid?
>
> Also, did you by any chance upgrade the binaries while the code was running?
> Is the code running over nfs?
>
> Yehuda
>
> >
> >
> > > Yehuda
> > >
> > >
> > >>>> libfcgi.so.0.0.0[7ffa06995000+a000] in
> > >>>> libfcgi.so.0.0.0[7ffa06995000+a000]
> > >>>>
> > >>>> Looking at the assembly, it seems to be crashing at this point -
> > >>>> http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035,
> > >>>> which confused me. I tried to see if there is any other reference
> > >>>> holding the FCGX_Request which releases the handle, without any luck.
> > >>>>
> > >>>> There are also other observations:
> > >>>> 1> Several radosgw daemons across different hosts crashed around the
> > >>>> same time.
> > >>>> 2> Apache's error log has some fcgi errors complaining ##idle timeout##
> > >>>> during the time.
> > >>>>
> > >>>> Has anyone experienced a similar issue?
> > >>>>
> > >>>
> > >>> In the past we've had issues with libfcgi that were related to the
> > >>> number of open fds on the process (> 1024). The issue was a buggy
> > >>> libfcgi that was using select() instead of poll(), so this might be
> > >>> the issue you're noticing.
> > >>>
> > >>> Yehuda
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>> in the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
---
diff --git a/src/rgw/rgw_main.cc b/src/rgw/rgw_main.cc
index 9a8aa5f..0aa7ded 100644
--- a/src/rgw/rgw_main.cc
+++ b/src/rgw/rgw_main.cc
@@ -669,8 +669,6 @@ void RGWFCGXProcess::handle_request(RGWRequest *r)
     dout(20) << "process_request() returned " << ret << dendl;
   }
 
-  FCGX_Finish_r(fcgx);
-
   delete req;
 }
 
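
For reference, the address arithmetic discussed in the quoted thread (ip minus the load address of the shared library) can be reproduced with a short script. This is only a sketch: the log line is reassembled from the wrapped fragments quoted above, the helper name decode_segfault is made up for illustration, and the error-code decoding assumes the standard x86 page-fault bits (bit 0 = page present, bit 1 = write, bit 2 = user mode). The last value it computes, the base register implied by the fault address, additionally assumes the faulting instruction really is the movl at 46f2 with displacement 0x30. That is an inference from the log, not something confirmed in the thread.

```python
import re

# Kernel log line reassembled from the quoted report (assumption: the
# fragments wrapped across quoted lines belong to one message).
LOG = ("radosgw[68180]: segfault at f0 ip 00007ffa069996f2 "
       "sp 00007ff55c432710 error 6 in libfcgi.so.0.0.0[7ffa06995000+a000]")

PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: split a kernel segfault line into its fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    addr = int(m["addr"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "library": m["lib"],
        "fault_address": addr,
        # Offset of the faulting instruction inside the library; this is
        # the value to look up in the objdump listing.
        "offset": ip - base,
        # x86 page-fault error code bits.
        "page_present": bool(err & 0x1),
        "write": bool(err & 0x2),
        "user": bool(err & 0x4),
    }


if __name__ == "__main__":
    info = decode_segfault(LOG)
    print("offset in %s: %#x" % (info["library"], info["offset"]))
    print("write=%s user=%s page_present=%s"
          % (info["write"], info["user"], info["page_present"]))
    # Assumption: the fault is the store to 0x30(%rbx) at offset 46f2.
    print("implied %%rbx: %#x" % (info["fault_address"] - 0x30))
```

Under those assumptions the offset matches the 46f2 quoted from objdump, and error 6 decodes as a user-mode write to a page that is not mapped, rather than an abort.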