From patchwork Sun May 18 12:54:45 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Alexandre Oliva
X-Patchwork-Id: 4197931
Return-Path:
X-Original-To: patchwork-ceph-devel@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.19.201])
    by patchwork1.web.kernel.org (Postfix) with ESMTP id AAD259F1CD
    for ; Sun, 18 May 2014 13:03:10 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
    by mail.kernel.org (Postfix) with ESMTP id D791C20253
    for ; Sun, 18 May 2014 13:03:08 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
    by mail.kernel.org (Postfix) with ESMTP id ABC1C20176
    for ; Sun, 18 May 2014 13:03:07 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1751180AbaERNDF (ORCPT );
    Sun, 18 May 2014 09:03:05 -0400
Received: from linux-libre.fsfla.org ([208.118.235.54]:45693
    "EHLO linux-libre.fsfla.org" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1751166AbaERNDE
    convert rfc822-to-8bit (ORCPT );
    Sun, 18 May 2014 09:03:04 -0400
Received: from freie.home (home.lxoliva.fsfla.org [172.31.160.22])
    by linux-libre.fsfla.org (8.14.4/8.14.4/Debian-2ubuntu2.1)
    with ESMTP id s4ID31vc030229
    for ; Sun, 18 May 2014 13:03:02 GMT
Received: from free.home (free.home [172.31.160.1])
    by freie.home (8.14.8/8.14.7) with ESMTP id s4ICsmlj011796;
    Sun, 18 May 2014 09:54:51 -0300
From: Alexandre Oliva
To: ceph-devel@vger.kernel.org
Subject: [PATCH] osd: speedup startup by finishing pending removals in background
Organization: Free thinker, not speaking for the GNU Project
Date: Sun, 18 May 2014 09:54:45 -0300
Message-ID:
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
MIME-Version: 1.0
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: ceph-devel@vger.kernel.org
X-Spam-Status: No, score=-7.5 required=5.0
    tests=BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD,
    UNPARSEABLE_RELAY autolearn=ham version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

This patch applies on top of the one I just posted, Subject “osd: avoid
flushing every TEMP removal to speedup startup”.

When PG removals are underway and the OSD is restarted, or when
multiple removals are scheduled manually with ceph_filestore_dump
premove, the OSD may take a long time to process all pending removals
before it will join the cluster.

This patch introduces an option that enables the OSD to join the
cluster first, performing the removals in background while actively
participating in the cluster, as the OSD would if it hadn't been
restarted.

In hindsight, I suppose it might have been wiser to add a data member
to DeletingStateRef to hold the coll_t, instead of having to search
for it again, but this patch is what I tested, and it's likely good
enough for now.

Signed-off-by: Alexandre Oliva
---
 src/common/config_opts.h | 10 ++++++
 src/osd/OSD.cc           | 75 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index 2c65e6c..9baa356 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
@@ -583,6 +583,16 @@ OPTION(osd_client_op_priority, OPT_U32, 63)
 OPTION(osd_recovery_op_priority, OPT_U32, 10)
 OPTION(osd_recovery_op_warn_multiple, OPT_U32, 16)
 
+// Removal of PGs is done in background, but if the osd is restarted,
+// it will finish all pending removals before joining the cluster.
+// This can take a while. If this option is set to true, then pending
+// removals will be performed in background, while the osd runs
+// normally. This is a bit dangerous if the OSD gets a new copy of
+// the PG before the pending removal is completed: attributes stored
+// in the leveldb may be lost when removal cleans up an object's
+// attributes AFTER the new object is backfilled.
+OPTION(osd_startup_finish_remove_in_background, OPT_BOOL, false)
+
 // Max time to wait between notifying mon of shutdown and shutting down
 OPTION(osd_mon_shutdown_timeout, OPT_DOUBLE, 5)
 
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 504cb71..a6d58c1 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -1924,6 +1924,15 @@ OSD::res_result OSD::_try_resurrect_pg(
   if (!df)
     return RES_NONE; // good to go
 
+  // If we're background deleting a pg scheduled for removal in an
+  // earlier session, this will be NULL. We can't resurrect this one,
+  // nor should we create a new PG with the same pgid, so we'll fail
+  // this assert and let the user restart the osd without
+  // osd_startup_finish_remove_in_background, so that removal is
+  // completed before the osd gets a chance to try to create or
+  // resurrect the PG.
+  assert(df->old_pg_state);
+
   df->old_pg_state->lock();
   OSDMapRef create_map = df->old_pg_state->get_osdmap();
   df->old_pg_state->unlock();
@@ -2048,6 +2057,7 @@ void OSD::load_pgs()
 
   set head_pgs;
   map > pgs;
+  map *bgremove = NULL;
   bool flush = false;
   for (vector::iterator it = ls.begin();
        it != ls.end();
@@ -2056,14 +2066,30 @@ void OSD::load_pgs()
     snapid_t snap;
     uint64_t seq;
 
-    if (it->is_temp(pgid) ||
-        it->is_removal(&seq, &pgid)) {
-      dout(10) << "load_pgs " << *it << " clearing temp" << dendl;
+    if (it->is_temp(pgid)) {
+      dout(10) << "load_pgs " << *it << " clearing temp " << dendl;
       recursive_remove_collection(store, *it, false);
       flush = true;
       continue;
     }
 
+    if (it->is_removal(&seq, &pgid)) {
+      if (cct->_conf->osd_startup_finish_remove_in_background) {
+        dout(10) << "load_pgs " << *it
+                 << " delaying pending removal" << dendl;
+        if (seq >= next_removal_seq)
+          next_removal_seq = seq + 1;
+        if (!bgremove)
+          bgremove = new map();
+        (*bgremove)[seq] = pgid;
+      } else {
+        dout(10) << "load_pgs " << *it << " clearing pending removal " << dendl;
+        recursive_remove_collection(store, *it, false);
+        flush = true;
+      }
+      continue;
+    }
+
     if (it->is_pg(pgid, snap)) {
       if (snap != CEPH_NOSNAP) {
         dout(10) << "load_pgs skipping snapped dir " << *it
@@ -2081,6 +2107,18 @@ void OSD::load_pgs()
   if (flush)
     store->sync_and_flush();
 
+  if (bgremove) {
+    for (map::iterator it = bgremove->begin();
+         it != bgremove->end(); it++) {
+      dout(10) << "load_pgs FORREMOVAL_" << it->first << "_" << it->second
+               << " scheduling background removal " << dendl;
+      DeletingStateRef deleting = service.deleting_pgs.lookup_or_create
+        (it->second, make_pair(it->second, PGRef(0)));
+      remove_wq.queue(make_pair(PGRef(0), deleting));
+    }
+    delete bgremove;
+  }
+
   bool has_upgraded = false;
   for (map >::iterator i = pgs.begin();
        i != pgs.end();
@@ -3520,6 +3558,37 @@ void OSD::RemoveWQ::_process(
   ThreadPool::TPHandle &handle)
 {
   PGRef pg(item.first);
+
+  if (!pg) {
+    // this is for background live removal of pending FORREMOVAL pgs,
+    // remaining from earlier OSD sessions. This only happens if
+    // osd_startup_finish_remove_in_background is enabled.
+    if (!item.second->start_clearing())
+      return;
+
+    if (!item.second->start_deleting())
+      return;
+
+    vector ls;
+    int r = store->list_collections(ls);
+    assert (!(r < 0));
+
+    for (vector::iterator it = ls.begin();
+         it != ls.end();
+         ++it) {
+      spg_t pgid;
+      uint64_t seq;
+
+      if (it->is_removal(&seq, &pgid) && pgid == item.second->pgid) {
+        recursive_remove_collection(store, *it, false);
+        break;
+      }
+    }
+
+    item.second->finish_deleting();
+    return;
+  }
+
   SnapMapper &mapper = pg->snap_mapper;
   OSDriver &driver = pg->osdriver;
   coll_t coll = coll_t(pg->info.pgid);
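
The load_pgs() change above can be illustrated outside of Ceph with a
minimal, self-contained C++ sketch. It is not the patch's code:
PendingRemoval, StartupPlan and plan_startup_removals are hypothetical
names, and PG ids are plain strings here. It only models the decision
the new option controls. With the option off, pending removals are
handled before startup continues. With it on, they are collected in a
map keyed by removal sequence number, the next sequence number is moved
past every pending one, and the removals are then queued in ascending
sequence order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a FORREMOVAL collection found at startup:
// a removal sequence number plus the id of the PG being removed.
struct PendingRemoval {
  uint64_t seq;
  std::string pgid;
};

// Hypothetical summary of what the startup scan decided.
struct StartupPlan {
  uint64_t next_removal_seq;             // first sequence number still free
  std::vector<std::string> removed_now;  // removed before joining the cluster
  std::vector<std::pair<uint64_t, std::string> > queued;  // deferred, by seq
};

// Assumption: finish_in_background plays the role of
// osd_startup_finish_remove_in_background in the patch.
StartupPlan plan_startup_removals(const std::vector<PendingRemoval> &found,
                                  bool finish_in_background,
                                  uint64_t next_removal_seq = 0) {
  StartupPlan plan{next_removal_seq, {}, {}};
  std::map<uint64_t, std::string> deferred;  // std::map iterates in key order

  for (const PendingRemoval &p : found) {
    if (finish_in_background) {
      // Keep future sequence numbers clear of those still pending.
      if (p.seq >= plan.next_removal_seq)
        plan.next_removal_seq = p.seq + 1;
      deferred[p.seq] = p.pgid;
    } else {
      // Without the option the removal completes during the scan.
      plan.removed_now.push_back(p.pgid);
    }
  }

  // Hand deferred removals to the queue in ascending sequence order.
  for (const auto &entry : deferred)
    plan.queued.emplace_back(entry.first, entry.second);
  return plan;
}
```

Given pending removals {7, "1.2"} and {3, "0.5"} with the option
enabled, the sketch queues seq 3 before seq 7 and leaves
next_removal_seq at 8; with the option disabled, both are removed up
front and nothing is queued.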