From patchwork Sat Aug 24 14:56:35 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Alexandre Oliva X-Patchwork-Id: 2849217 Return-Path: X-Original-To: patchwork-ceph-devel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.19.201]) by patchwork2.web.kernel.org (Postfix) with ESMTP id 6EFC6BF546 for ; Sat, 24 Aug 2013 14:56:59 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 52B9B20264 for ; Sat, 24 Aug 2013 14:56:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 278492025D for ; Sat, 24 Aug 2013 14:56:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754870Ab3HXO4z (ORCPT ); Sat, 24 Aug 2013 10:56:55 -0400 Received: from linux-libre.fsfla.org ([208.118.235.54]:45975 "EHLO linux-libre.fsfla.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754719Ab3HXO4z (ORCPT ); Sat, 24 Aug 2013 10:56:55 -0400 Received: from freie (home.lxoliva.fsfla.org [172.31.160.22]) by linux-libre.fsfla.org (8.14.4/8.14.4/Debian-2ubuntu2) with ESMTP id r7OEupW6021072; Sat, 24 Aug 2013 14:56:51 GMT Received: from livre.home (livre.home [172.31.160.2]) by freie (8.14.7/8.14.6) with ESMTP id r7OEuc5p028333; Sat, 24 Aug 2013 11:56:41 -0300 From: Alexandre Oliva To: Sage Weil Cc: sam.just@inktank.com, ceph-devel@vger.kernel.org Subject: Re: [PATCH] reinstate ceph cluster_snap support Organization: Free thinker, not speaking for the GNU Project References: Date: Sat, 24 Aug 2013 11:56:35 -0300 In-Reply-To: (Sage Weil's message of "Fri, 23 Aug 2013 17:17:22 -0700 (PDT)") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux) MIME-Version: 1.0 Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org X-Spam-Status: No, score=-8.0 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, T_TVD_MIME_EPI, T_TVD_MIME_NO_HEADERS, UNPARSEABLE_RELAY, URIBL_BLACK autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On Aug 23, 2013, Sage Weil wrote: > FWIW Alexandre, this feature was never really complete. For it to work, > we also need to snapshot the monitors, and roll them back as well. That depends on what's expected from the feature, actually. One use is to roll back a single osd, and for that, the feature works just fine. Of course, for that one doesn't need the multi-osd snapshots to be mutually consistent, but it's still convenient to be able to take a global snapshot with a single command. Another use is to roll back the entire cluster to an earlier state, and for that, you *probably* want to roll back the monitors too, although it doesn't seem like this is strictly necessary, unless some significant configuration changes occurrend in the cluster since the snapshot was taken, and you want to roll back those too. In my experience, rolling back only osds has worked just fine, with the exception of cases in which the snapshot is much too old, and the mons have already expired osdmaps after the last one the osd got when the snapshot was taken. For this one case, I have a patch that enables the osd to rejoin the cluster in spite of the expired osdmaps, which has always worked for me, but I understand there may be exceptional cluster reconfigurations in which this wouldn't have worked. As for snapshotting monitors... I suppose the way to go is to start a store.db dump in background, instead of taking a btrfs snapshot, since the store.db is not created as a subvolume. That said, it would make some sense to make it so, to make it trivially snapshottable. Anyway, I found a problem in the earlier patch: when I added a new disk to my cluster this morning, it tried to iterate over osdmaps that were not available (e.g. the long-gone osdmap 1), and crashed. Here's a fixed version, that makes sure we don't start the iteration before m->get_first(). reinstate ceph cluster_snap support From: Alexandre Oliva This patch brings back and updates (for dumpling) the code originally introduced to support “ceph osd cluster_snap ”, that was disabled and partially removed before cuttlefish. Some minimal testing appears to indicate this even works: the modified mon actually generated an osdmap with the cluster_snap request, and starting a modified osd that was down and letting it catch up caused the osd to take the requested snapshot. I see no reason why it wouldn't have taken it if it was up and running, so... Why was this feature disabled in the first place? Signed-off-by: Alexandre Oliva --- src/mon/MonCommands.h | 6 ++++-- src/mon/OSDMonitor.cc | 11 +++++++---- src/osd/OSD.cc | 14 ++++++++++++++ 3 files changed, 25 insertions(+), 6 deletions(-) diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h index 8e9c2bb..225c687 100644 --- a/src/mon/MonCommands.h +++ b/src/mon/MonCommands.h @@ -431,8 +431,10 @@ COMMAND("osd set " \ COMMAND("osd unset " \ "name=key,type=CephChoices,strings=pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub", \ "unset ", "osd", "rw", "cli,rest") -COMMAND("osd cluster_snap", "take cluster snapshot (disabled)", \ - "osd", "r", "") +COMMAND("osd cluster_snap " \ + "name=snap,type=CephString", \ + "take cluster snapshot", \ + "osd", "r", "cli") COMMAND("osd down " \ "type=CephString,name=ids,n=N", \ "set osd(s) [...] down", "osd", "rw", "cli,rest") diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc index 07022ae..9bf9511 100644 --- a/src/mon/OSDMonitor.cc +++ b/src/mon/OSDMonitor.cc @@ -3099,10 +3099,13 @@ bool OSDMonitor::prepare_command(MMonCommand *m) return prepare_unset_flag(m, CEPH_OSDMAP_NODEEP_SCRUB); } else if (prefix == "osd cluster_snap") { - // ** DISABLE THIS FOR NOW ** - ss << "cluster snapshot currently disabled (broken implementation)"; - // ** DISABLE THIS FOR NOW ** - + string snap; + cmd_getval(g_ceph_context, cmdmap, "snap", snap); + pending_inc.cluster_snapshot = snap; + ss << "creating cluster snap " << snap; + getline(ss, rs); + wait_for_finished_proposal(new Monitor::C_Command(mon, m, 0, rs, get_last_committed())); + return true; } else if (prefix == "osd down" || prefix == "osd out" || prefix == "osd in" || diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc index 1a77dae..01527a6 100644 --- a/src/osd/OSD.cc +++ b/src/osd/OSD.cc @@ -5022,6 +5022,20 @@ void OSD::handle_osd_map(MOSDMap *m) assert(0 == "MOSDMap lied about what maps it had?"); } + // check for cluster snapshots + for (epoch_t cur = MAX(superblock.current_epoch + 1, m->get_first()); + cur <= m->get_last(); cur++) { + OSDMapRef newmap = get_map(cur); + string cluster_snap = newmap->get_cluster_snapshot(); + if (cluster_snap.length() == 0) + continue; + + dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl; + int r = store->snapshot(cluster_snap); + if (r) + dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl; + } + if (superblock.oldest_map) { int num = 0; epoch_t min(