Message ID | or7gb3ad3w.fsf@livre.home
---|---
State | New, archived
On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

>> Finally, eventually we should make this do a checkpoint on the mons too.
>> We can add the osd snapping back in first, but before this can/should
>> really be used the mons need to be snapshotted as well.  Probably that's
>> just adding in a snapshot() method to MonitorStore.h and doing either a
>> leveldb snap or making a full copy of store.db...  I forget what leveldb
>> is capable of here.

> I haven't looked into this yet.

I looked a bit at the leveldb interface.  It offers a facility to create
Snapshots, but they only last for the duration of one session of the
database.  A Snapshot can be used to create multiple iterators over one
state of the db, or to read multiple values from that same state, but
not to roll back to a state from an earlier session, e.g., after a
monitor restart.  So they won't help us.

I thus see a few possibilities (all of them to be done between taking
note of the request for the new snapshot and returning a response to the
requestor that the request was satisfied):

1. take a snapshot, create an iterator out of the snapshot, create a new
database named after the cluster_snap key, and go over all key/value
pairs that the iterator can see, adding each one to this new database.

2. close the database, create a dir named after the cluster_snap key,
create hardlinks to all files in the database tree in the cluster_snap
dir, and then reopen the database.

3. flush the leveldb (how? will a write with sync=true do? must we close
it?), take a btrfs snapshot of the store.db tree, named after the
cluster_snap key, and then reopen the database.

None of these is particularly appealing: (1) wastes disk space and cpu
cycles; (2) relies on leveldb internal implementation details, such as
the fact that files are never modified after they're first closed; and
(3) requires a btrfs subvol for the store.db.  My favorite choice would
be 3, but can we just fail mon snaps when this requirement is not met?
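[To make the session-scoped limitation concrete, here is a minimal
sketch (editorial, not from the thread) against the stock leveldb C++
API; the `/tmp/demo.db` path is arbitrary:]

```cpp
#include <cassert>
#include <string>
#include <leveldb/db.h>

int main() {
  leveldb::DB *db;
  leveldb::Options opts;
  opts.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(opts, "/tmp/demo.db", &db);
  assert(s.ok());

  db->Put(leveldb::WriteOptions(), "k", "v1");
  const leveldb::Snapshot *snap = db->GetSnapshot();  // pin current state
  db->Put(leveldb::WriteOptions(), "k", "v2");

  std::string val;
  leveldb::ReadOptions ro;
  ro.snapshot = snap;
  db->Get(ro, "k", &val);                      // sees "v1": the pinned state
  db->Get(leveldb::ReadOptions(), "k", &val);  // sees "v2": the live state

  // The Snapshot dies with this DB session; nothing about it persists,
  // so it cannot implement a cluster_snap that survives a mon restart.
  db->ReleaseSnapshot(snap);
  delete db;
}
```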
On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

> On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

>>> Finally, eventually we should make this do a checkpoint on the mons too.
>>> We can add the osd snapping back in first, but before this can/should
>>> really be used the mons need to be snapshotted as well.  Probably that's
>>> just adding in a snapshot() method to MonitorStore.h and doing either a
>>> leveldb snap or making a full copy of store.db...  I forget what leveldb
>>> is capable of here.

>> I haven't looked into this yet.

> None of these is particularly appealing: (1) wastes disk space and cpu
> cycles; (2) relies on leveldb internal implementation details, such as
> the fact that files are never modified after they're first closed; and
> (3) requires a btrfs subvol for the store.db.  My favorite choice would
> be 3, but can we just fail mon snaps when this requirement is not met?

Another aspect that needs to be considered is whether to take a snapshot
of the leader only, or of all monitors in the quorum.  The fact that the
snapshot operation may take a while to complete (particularly (1)), and
that monitors may not make progress while taking the snapshot (which
might cause the client and other monitors to assume those monitors have
failed), makes the whole thing rather more complex than I'd hoped.

Another point that may affect the decision is the amount of information
in store.db that may have to be retained.  E.g., if it's just a small
amount of information, creating a separate database makes far more sense
than taking a complete copy of the entire database, and it might even
make sense for the leader to include the full snapshot data in the
snapshot-taking message shared with the other monitors, so that they all
take exactly the same snapshot, even if they're not in the quorum and
receive the update at a later time.  Of course this wouldn't work if the
amount of snapshotted monitor data was more than reasonable for a
monitor message.

Anyway, this is probably more than I'd be able to undertake myself, at
least in part because, although I can see one place to add the
snapshot-taking code to the leader (assuming it's ok to take the
snapshot just before or right after all monitors agree on it), I have no
idea where to plug the snapshot-taking behavior into peon and recovering
monitors.  Absent a two-phase protocol, it seems to me that all monitors
ought to take snapshots tentatively when they issue or acknowledge the
snapshot-taking proposal, so as to make sure that if it succeeds we'll
have a quorum of snapshots; but if the proposal doesn't succeed at
first, I don't know how to deal with retries (overwrite existing
snapshots? discard the snapshot when its proposal fails?) or
cancellation (say, the client doesn't get confirmation from the leader,
the leader changes, the client retries a few times and eventually gives
up, but some monitors have already tentatively taken the snapshot in the
mean time).
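[The two retry options Alexandre raises (overwrite on retry, discard on
failure) can be rendered as code; this is a purely hypothetical sketch
with stand-in types, not monitor code:]

```cpp
#include <set>
#include <string>

// Stand-in store; real monitor code would snapshot leveldb, not a set.
struct SnapStore {
  std::set<std::string> snaps;
  void take(const std::string &s)   { snaps.insert(s); }
  void remove(const std::string &s) { snaps.erase(s);  }
};

// Tentative snapshot at propose/ack time: overwrite any earlier attempt
// so a retried proposal stays idempotent.
void on_snap_proposal(SnapStore &store, const std::string &snap_key) {
  store.remove(snap_key);   // retry: discard the previous tentative copy
  store.take(snap_key);     // tentative until the proposal commits
}

// Cancellation: if the proposal is known to have failed, clean up.
void on_snap_proposal_failed(SnapStore &store, const std::string &snap_key) {
  store.remove(snap_key);
}
```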
On Tue, Dec 17, 2013 at 4:14 AM, Alexandre Oliva <oliva@gnu.org> wrote:
> On Aug 27, 2013, Sage Weil <sage@inktank.com> wrote:
>
>> Hi,
>> On Sat, 24 Aug 2013, Alexandre Oliva wrote:
>>> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
>>>
>>> > FWIW Alexandre, this feature was never really complete.  For it to work,
>>> > we also need to snapshot the monitors, and roll them back as well.
>>>
>>> That depends on what's expected from the feature, actually.
>>>
>>> One use is to roll back a single osd, and for that, the feature works
>>> just fine.  Of course, for that one doesn't need the multi-osd snapshots
>>> to be mutually consistent, but it's still convenient to be able to take
>>> a global snapshot with a single command.
>
>> In principle, we can add this back in.  I think it needs a few changes,
>> though.
>
>> First, FileStore::snapshot() needs to pause and drain the workqueue before
>> taking the snapshot, similar to what is done with the sync sequence.
>> Otherwise it isn't a transactionally consistent snapshot and may tear some
>> update.  Because it is draining the work queue, it *might* also need to
>> drop some locks, but I'm hopeful that that isn't necessary.
>
> Hmm...  I don't quite get this.  The FileStore implementation of
> snapshot already performs a sync_and_flush before calling the backend's
> create_checkpoint.  Shouldn't that be enough?  FWIW, the code I brought
> in from argonaut didn't do any such thing; it did drop locks, but that
> doesn't seem to be necessary any more:

From a quick skim I think you're right about that.

The more serious concern in the OSDs (which motivated removing the
cluster snap) is what Sage mentioned: we used to be able to take a
snapshot for which all PGs were at the same epoch, and we can't do that
now.  It's possible that's okay, but it makes the semantics even weirder
than they used to be (you've never been getting a real point-in-time
snapshot, although as long as you didn't use external communication
channels you could at least be sure it contained a causal cut).

And of course that's nothing compared to snapshotting the monitors, as
you've noticed — but making it actually be a cluster snapshot (instead
of something you could basically do by taking a btrfs snapshot yourself)
is something I would want to see before we bring the feature back into
mainline.

On Tue, Dec 17, 2013 at 6:22 AM, Alexandre Oliva <oliva@gnu.org> wrote:
> On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
>
>> On Dec 17, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
>>>> Finally, eventually we should make this do a checkpoint on the mons too.
>>>> We can add the osd snapping back in first, but before this can/should
>>>> really be used the mons need to be snapshotted as well.  Probably that's
>>>> just adding in a snapshot() method to MonitorStore.h and doing either a
>>>> leveldb snap or making a full copy of store.db...  I forget what leveldb
>>>> is capable of here.
>
>>> I haven't looked into this yet.
>
>> None of these is particularly appealing: (1) wastes disk space and cpu
>> cycles; (2) relies on leveldb internal implementation details, such as
>> the fact that files are never modified after they're first closed; and
>> (3) requires a btrfs subvol for the store.db.  My favorite choice would
>> be 3, but can we just fail mon snaps when this requirement is not met?
>
> Another aspect that needs to be considered is whether to take a snapshot
> of the leader only, or of all monitors in the quorum.
> The fact that the snapshot operation may take a while to complete
> (particularly (1)), and that monitors may not make progress while
> taking the snapshot (which might cause the client and other monitors
> to assume those monitors have failed), makes the whole thing rather
> more complex than I'd hoped.
>
> Another point that may affect the decision is the amount of information
> in store.db that may have to be retained.  E.g., if it's just a small
> amount of information, creating a separate database makes far more
> sense than taking a complete copy of the entire database, and it might
> even make sense for the leader to include the full snapshot data in the
> snapshot-taking message shared with the other monitors, so that they
> all take exactly the same snapshot, even if they're not in the quorum
> and receive the update at a later time.  Of course this wouldn't work
> if the amount of snapshotted monitor data was more than reasonable for
> a monitor message.
>
> Anyway, this is probably more than I'd be able to undertake myself, at
> least in part because, although I can see one place to add the
> snapshot-taking code to the leader (assuming it's ok to take the
> snapshot just before or right after all monitors agree on it), I have
> no idea where to plug the snapshot-taking behavior into peon and
> recovering monitors.  Absent a two-phase protocol, it seems to me that
> all monitors ought to take snapshots tentatively when they issue or
> acknowledge the snapshot-taking proposal, so as to make sure that if
> it succeeds we'll have a quorum of snapshots; but if the proposal
> doesn't succeed at first, I don't know how to deal with retries
> (overwrite existing snapshots? discard the snapshot when its proposal
> fails?) or cancellation (say, the client doesn't get confirmation from
> the leader, the leader changes, the client retries a few times and
> eventually gives up, but some monitors have already tentatively taken
> the snapshot in the mean time).

The best way I can think of in a short time to solve these problems
would be to make snapshots first-class citizens in the monitor.  We
could extend the monitor store to handle multiple leveldb instances, and
then a snapshot would be an async operation which does a leveldb
snapshot inline and spins off a thread to clone that data into a new
leveldb instance.  When all the up monitors complete, the user gets a
report saying the snapshot was successful and it gets marked complete in
some snapshot map.  Any monitors which have to get a full store sync
would also sync any snapshots they don't already have.  If the monitors
can't complete a snapshot (all failing at once for some reason) then
they could block the user from doing anything except deleting them.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
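[A hedged sketch (editorial, not from the thread) of the design Greg
describes: an inline leveldb snapshot, cloned into a standalone database
by a helper thread.  `mark_snap_complete` and the paths are
hypothetical, and real monitor code would need error handling and
coordination with the quorum:]

```cpp
#include <string>
#include <thread>
#include <leveldb/db.h>

// Clone the monitor's current store state into a new leveldb instance
// named after the snap, without blocking the monitor's main thread.
void start_mon_snapshot(leveldb::DB *store, std::string snap_path) {
  const leveldb::Snapshot *snap = store->GetSnapshot();  // inline, cheap
  std::thread([store, snap, snap_path]() {
    leveldb::Options opts;
    opts.create_if_missing = true;
    opts.error_if_exists = true;
    leveldb::DB *clone = nullptr;
    if (leveldb::DB::Open(opts, snap_path, &clone).ok()) {
      leveldb::ReadOptions ro;
      ro.snapshot = snap;                    // iterate the frozen view
      leveldb::Iterator *it = store->NewIterator(ro);
      for (it->SeekToFirst(); it->Valid(); it->Next())
        clone->Put(leveldb::WriteOptions(), it->key(), it->value());
      delete it;
      delete clone;
      // mark_snap_complete(snap_path);      // hypothetical snapshot-map update
    }
    store->ReleaseSnapshot(snap);
  }).detach();
}
```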
On Dec 18, 2013, Gregory Farnum <greg@inktank.com> wrote:

> (you've never been getting a real point-in-time snapshot, although as
> long as you didn't use external communication channels you could at
> least be sure it contained a causal cut).

I never expected more than a causal cut, really (my wife got a PhD in
consistent checkpointing of distributed systems, so my expectations may
be somewhat better informed than those a random user might have ;-).
Even so, I've seldom got a snapshot in which the osd data differs across
replicas (I actually check for that; that's part of my reason for taking
the snapshots in the first place), even when I fail to explicitly make
the cluster quiescent.  But that's probably just “luck”, as my cluster
usually isn't busy when I take such snapshots ;-)

> And of course that's nothing compared to snapshotting the monitors, as
> you've noticed

I've given it some more thought, and it occurred to me that, if we make
mons take the snapshot when the snapshot-taking request is committed to
the cluster history, we should have the snapshots taken at the right
time, without the need for rolling them back and taking them again.

The idea is that, if the snapshot-taking is committed, eventually we'll
have a quorum carrying that commit, and thus each of the quorum members
will have taken a snapshot as soon as it got that commit, even if it did
so during recovery, or took so long to take the snapshot that it got
kicked out of the quorum for a while.  If a monitor gets actually
restarted, it will get the commit again and take the snapshot from the
beginning.  If all mons in the quorum that accepted the commit get
restarted so that none of them actually records the commit request, and
it doesn't get propagated to other mons that attempt to rejoin, well,
it's as if the request had never been committed.  OTOH, if it did get to
other mons, or if any of them survives, the committed request will make
it to a quorum and eventually to all monitors, each one taking its
snapshot at the time it gets the commit.

This should work as long as all mons get recovery info in the same
order, i.e., they won't apply to their database history information that
happens-after the snapshot commit before the snapshot commit itself, nor
fail to get information that happened-before the snapshot commit before
getting the snapshot commit.  That said, having little idea of the inner
workings of the monitors, I can't tell whether they actually meet this
“as long as” condition ;-(

> — but making it actually be a cluster snapshot (instead of something
> you could basically do by taking a btrfs snapshot yourself)

Taking btrfs snapshots manually over several osds on several hosts is
hardly a way to get a causal cut (but you already knew that ;-)
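[A minimal sketch of the commit-time rule Alexandre proposes, with
stand-in types (editorial, all names hypothetical): every monitor,
whether leader, peon, or recovering, snapshots at the point where the
request commits to its store, so a replayed commit after a restart
re-triggers the same snapshot rather than creating a new one:]

```cpp
#include <set>
#include <string>

// Minimal stand-ins; the real monitor store and transaction types differ.
struct Transaction { std::string cluster_snap; /* plus the usual ops */ };
struct MonStore {
  std::set<std::string> snaps;                    // completed snapshots
  void apply(const Transaction &) { /* write keys */ }
  void take_local_snapshot(const std::string &s) { snaps.insert(s); }
};

// Commit hook: snapshot exactly when the request commits, so all mons
// act at the same point in history and a replay is a harmless no-op.
void on_store_commit(MonStore &store, const Transaction &t) {
  store.apply(t);                                 // normal commit path
  if (!t.cluster_snap.empty() &&                  // commit carries a snap?
      !store.snaps.count(t.cluster_snap))         // idempotent across replay
    store.take_local_snapshot(t.cluster_snap);
}
```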
reinstate ceph cluster_snap support

From: Alexandre Oliva <oliva@gnu.org>

This patch brings back and updates (for dumpling) the code originally
introduced to support “ceph osd cluster_snap <snap>”, that was disabled
and partially removed before cuttlefish.

Some minimal testing appears to indicate this even works: the modified
mon actually generated an osdmap with the cluster_snap request, and
starting a modified osd that was down and letting it catch up caused the
osd to take the requested snapshot.  I see no reason why it wouldn't
have taken it if it was up and running, so...  Why was this feature
disabled in the first place?

Signed-off-by: Alexandre Oliva <oliva@gnu.org>

```diff
---
 src/mon/MonCommands.h |  6 ++++--
 src/mon/OSDMonitor.cc | 11 +++++++----
 src/osd/OSD.cc        |  8 ++++++++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 5a6ca6a..8977f29 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -445,8 +445,10 @@ COMMAND("osd set " \
 COMMAND("osd unset " \
 	"name=key,type=CephChoices,strings=pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub", \
 	"unset <key>", "osd", "rw", "cli,rest")
-COMMAND("osd cluster_snap", "take cluster snapshot (disabled)", \
-	"osd", "r", "")
+COMMAND("osd cluster_snap " \
+	"name=snap,type=CephString", \
+	"take cluster snapshot", \
+	"osd", "r", "cli")
 COMMAND("osd down " \
 	"type=CephString,name=ids,n=N", \
 	"set osd(s) <id> [<id>...] down", "osd", "rw", "cli,rest")
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 07775fc..9a46978 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -3428,10 +3428,13 @@ bool OSDMonitor::prepare_command(MMonCommand *m)
     return prepare_unset_flag(m, CEPH_OSDMAP_NODEEP_SCRUB);
 
   } else if (prefix == "osd cluster_snap") {
-    // ** DISABLE THIS FOR NOW **
-    ss << "cluster snapshot currently disabled (broken implementation)";
-    // ** DISABLE THIS FOR NOW **
-
+    string snap;
+    cmd_getval(g_ceph_context, cmdmap, "snap", snap);
+    pending_inc.cluster_snapshot = snap;
+    ss << "creating cluster snap " << snap;
+    getline(ss, rs);
+    wait_for_finished_proposal(new Monitor::C_Command(mon, m, 0, rs, get_last_committed()));
+    return true;
   } else if (prefix == "osd down" ||
 	     prefix == "osd out" ||
 	     prefix == "osd in" ||
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 8da4d96..b5720f7 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -5169,6 +5169,14 @@ void OSD::handle_osd_map(MOSDMap *m)
     }
   }
 
+  string cluster_snap = newmap->get_cluster_snapshot();
+  if (cluster_snap.length()) {
+    dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl;
+    int r = store->snapshot(cluster_snap);
+    if (r)
+      dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl;
+  }
+
   osdmap = newmap;
   superblock.current_epoch = cur;
```
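[For context on the `store->snapshot()` call in the OSD.cc hunk: per
Alexandre's note earlier in the thread, FileStore syncs and flushes
before asking the backend for a checkpoint.  A rough sketch of that
shape (editorial; simplified and not the actual dumpling source, and the
`can_checkpoint`/`create_checkpoint` backend split is assumed from the
thread's description):]

```cpp
// Rough shape of FileStore::snapshot() as described in the thread:
// quiesce pending transactions, then take a backend checkpoint (a
// btrfs subvolume snapshot on btrfs). Simplified; not the actual source.
int FileStore::snapshot(const string& name)
{
  dout(10) << "snapshot " << name << dendl;
  sync_and_flush();                  // drain and sync queued transactions

  if (!backend->can_checkpoint()) {  // e.g. store.db not on a btrfs subvol
    dout(0) << "snapshot " << name << " failed, not supported" << dendl;
    return -EOPNOTSUPP;
  }
  uint64_t cid;
  return backend->create_checkpoint(name, &cid);
}
```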