Message ID: d9aa96913b1730f1d0c238d7d52e27c20bc55390.1636544377.git.ps@pks.im
State: New, archived
Series: refs: sync loose refs to disk before committing them
On Wed, Nov 10 2021, Patrick Steinhardt wrote:

> [...]
> Fix this by introducing a new configuration "core.fsyncRefFiles". This
> config matches behaviour of "core.fsyncObjectFiles" in that it provides
> three different modes:
>
>  - "off" disables calling fsync on ref files. This is the default
>    behaviour previous to this change and remains the default after
>    this change.
>
>  - "on" enables calling fsync on ref files, where each reference is
>    flushed to disk before it is being committed. This is the safest
>    setting, but may incur significant performance overhead.
>
>  - "batch" will flush the page cache of each file as it is written to
>    ensure its data is persisted. After all refs have been written,
>    the directories which host refs are flushed.
>
> With this change in place and when "core.fsyncRefFiles" is set to either
> "on" or "batch", this kind of corruption shouldn't happen anymore.
>
> [1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

With the understanding that my grokking of this approach is still
somewhere between "uh, that works?" and "wow, voodoo FS magic!"...

I haven't looked at these changes in much detail, or Neeraj's recent
related changes, but...

> +core.fsyncRefFiles::
> +	A value indicating the level of effort Git will expend in trying to make
> +	refs added to the repo durable in the event of an unclean system
> +	shutdown. This setting currently only controls loose refs in the object
> +	store, so updates to packed refs may not be equally durable. Takes the
> +	same parameters as `core.fsyncObjectFiles`.
> +

...my understanding of it is basically a way of going back to what Linus
pointed out way back in aafe9fbaf4f (Add config option to enable
'fsync()' of object files, 2008-06-18).

I.e. we've got files x and y. POSIX sayeth we'd need to fsync them all
and the directory entry, but on some FS's it would be sufficient to
fsync() just y if they're created in that order. It'll imply an fsync of
both x and y, is that accurate?

If not you can probably discard the rest, but forging on:

Why do we then need to fsync() a "z" file in get_object_directory()
(i.e. .git/objects) then? That's a new "z", wasn't flushing "y" enough?

Or if you've written .git/objects/x and .git/refs/y I can imagine
wanting to create and sync a .git/z if the FS's semantics are to then
flush all remaining updates from that tree up, but here it's
.git/objects, not .git. That also seems to contradict this above:

> ensure its data is persisted. After all refs have been written,
> the directories which host refs are flushed.

I.e. that would be .git/refs (let's ignore .git/HEAD and the like for
now), not .git/objects or .git?

And again, forging on but more generally [continued below]...

> +	if (!strcmp(var, "core.fsyncreffiles")) {

UX side: now we've got a core.fsyncRefFiles and
core.fsyncWhateverItWasCalled in Neeraj's series. Let's make sure those
work together like say "fsck.*" and "fetch.fsck.*" do, i.e. you'd be
able to configure this once for objects and refs, or in two variables,
one for objects, one for refs...

> +static int sync_loose_ref(int fd)
> +{
> +	switch (fsync_ref_files) {
> +	case FSYNC_REF_FILES_OFF:
> +		return 0;
> +	case FSYNC_REF_FILES_ON:
> +		return git_fsync(fd, FSYNC_HARDWARE_FLUSH);
> +	case FSYNC_REF_FILES_BATCH:
> +		return git_fsync(fd, FSYNC_WRITEOUT_ONLY);
> +	default:
> +		BUG("invalid fsync mode %d", fsync_ref_files);
> +	}
> +}
> +
> +#define SYNC_LOOSE_REF_GITDIR (1 << 0)
> +#define SYNC_LOOSE_REF_COMMONDIR (1 << 1)

nit: make this an enum and ...
> +static int sync_loose_refs_flags(const char *refname)
> +{
> +	switch (ref_type(refname)) {
> +	case REF_TYPE_PER_WORKTREE:
> +	case REF_TYPE_PSEUDOREF:
> +		return SYNC_LOOSE_REF_GITDIR;
> +	case REF_TYPE_MAIN_PSEUDOREF:
> +	case REF_TYPE_OTHER_PSEUDOREF:
> +	case REF_TYPE_NORMAL:
> +		return SYNC_LOOSE_REF_COMMONDIR;
> +	default:
> +		BUG("unknown ref type %d of ref %s",
> +		    ref_type(refname), refname);

... you won't need this default case...

> [...]
>  /*
>   * Emit a better error message than lockfile.c's
>   * unable_to_lock_message() would in case there is a D/F conflict with
> @@ -1502,6 +1553,7 @@ static int files_copy_or_rename_ref(struct ref_store *ref_store,
>  	oidcpy(&lock->old_oid, &orig_oid);
>
>  	if (write_ref_to_lockfile(lock, &orig_oid, &err) ||
> +	    sync_loose_refs(refs, sync_loose_refs_flags(newrefname), &err) ||
>  	    commit_ref_update(refs, lock, &orig_oid, logmsg, &err)) {
>  		error("unable to write current sha1 into %s: %s", newrefname, err.buf);
>  		strbuf_release(&err);
> @@ -1522,6 +1574,7 @@ static int files_copy_or_rename_ref(struct ref_store *ref_store,
>  	flag = log_all_ref_updates;
>  	log_all_ref_updates = LOG_REFS_NONE;
>  	if (write_ref_to_lockfile(lock, &orig_oid, &err) ||
> +	    sync_loose_refs(refs, sync_loose_refs_flags(newrefname), &err) ||
>  	    commit_ref_update(refs, lock, &orig_oid, NULL, &err)) {
>  		error("unable to write current sha1 into %s: %s", oldrefname, err.buf);
>  		strbuf_release(&err);
> @@ -1781,6 +1834,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
>  	fd = get_lock_file_fd(&lock->lk);
>  	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
>  	    write_in_full(fd, &term, 1) < 0 ||
> +	    sync_loose_ref(fd) < 0 ||
>  	    close_ref_gently(lock) < 0) {
>  		strbuf_addf(err,
>  			    "couldn't write '%s'", get_lock_file_path(&lock->lk));
> @@ -2665,7 +2719,7 @@ static int files_transaction_prepare(struct ref_store *ref_store,
>  		files_downcast(ref_store, REF_STORE_WRITE,
>  			       "ref_transaction_prepare");
>  	size_t i;
> -	int ret = 0;
> +	int ret = 0, sync_flags = 0;
>  	struct string_list affected_refnames = STRING_LIST_INIT_NODUP;
>  	char *head_ref = NULL;
>  	int head_type;
> @@ -2777,8 +2831,14 @@ static int files_transaction_prepare(struct ref_store *ref_store,
>  					  &update->new_oid, NULL,
>  					  NULL);
>  		}
> +
> +		sync_flags |= sync_loose_refs_flags(update->refname);
>  	}
>
> +	ret = sync_loose_refs(refs, sync_flags, err);
> +	if (ret)
> +		goto cleanup;
> +
>  	if (packed_transaction) {
>  		if (packed_refs_lock(refs->packed_ref_store, 0, err)) {
>  			ret = TRANSACTION_GENERIC_ERROR;

...[continued from above]: Again, per my potentially wrong understanding
of syncing a "x" and "y" via an fsync of a subsequent "z" that's
adjacent on the FS to those two.

Isn't this setting us up for a really bad interaction between this
series and Neeraj's work? Well "bad" as in "bad for performance".

I.e. you'll turn on "use the batch thing for objects and refs" and we'll
do two fsyncs, one for the object update, and one for refs. The common
case is that we'll have both in play.

So shouldn't this go to a higher level for both so we only create a "z"
.git/sync-it-now-please.txt or whatever once we do all pending updates
on the .git/ directory?

I can also imagine that we'd want that at an even higher level, e.g. for
"git pull" surely we'd want it not for refs or objects, but in
builtin/pull.c somewhere because we'll be updating the .git/index after
we do both refs and objects, and you'd want to fsync at the very end,
no?

None of the above should mean we can't pursue a more narrow approach for
now.
I'm just:

1) Seeing if I understand what we're trying to do here, maybe not.

2) Encouraging you two to think about a holistic way to configure some
   logical conclusion to this topic at large, so we won't end up with
   core.fsyncConfigFiles, core.fsyncObjectFiles, core.fsyncIndexFile,
   core.fsyncRefFiles, core.fsyncTheCrapRebaseWritesOutFiles etc. :)

I'll send another more generic follow-up E-Mail for #2.
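For reference, the "strict POSIX" baseline referenced in the message above looks roughly like the sketch below. This is an illustration only, not code from the patch: the helper name is made up, and it assumes Linux-style semantics where a directory can be opened read-only and fsync()ed to persist a rename.

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/*
	 * Durably replace "path" with new content, the pedantic POSIX way:
	 * fsync the file itself, rename it into place, then fsync the
	 * containing directory so the rename is durable, too.
	 */
	static int write_file_durably(const char *dir, const char *path,
				      const char *tmp, const char *buf, size_t len)
	{
		int fd, dirfd;

		fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
		if (fd < 0)
			return -1;
		if (write(fd, buf, len) != (ssize_t)len ||
		    fsync(fd) < 0 || close(fd) < 0)
			return -1;
		if (rename(tmp, path) < 0)
			return -1;
		dirfd = open(dir, O_RDONLY);
		if (dirfd < 0)
			return -1;
		if (fsync(dirfd) < 0) {
			close(dirfd);
			return -1;
		}
		return close(dirfd);
	}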
On Wed, Nov 10, 2021 at 03:49:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Nov 10 2021, Patrick Steinhardt wrote:
>
> > +	shutdown. This setting currently only controls loose refs in the object
> > +	store, so updates to packed refs may not be equally durable. Takes the
> > +	same parameters as `core.fsyncObjectFiles`.
> > +
>
> ...my understanding of it is basically a way of going back to what Linus
> pointed out way back in aafe9fbaf4f (Add config option to enable
> 'fsync()' of object files, 2008-06-18).
>
> I.e. we've got files x and y. POSIX sayeth we'd need to fsync them all
> and the directory entry, but on some FS's it would be sufficient to
> fsync() just y if they're created in that order. It'll imply an fsync of
> both x and y, is that accurate?
>
> If not you can probably discard the rest, but forging on:
>
> Why do we then need to fsync() a "z" file in get_object_directory()
> (i.e. .git/objects) then? That's a new "z", wasn't flushing "y" enough?
>
> Or if you've written .git/objects/x and .git/refs/y I can imagine
> wanting to create and sync a .git/z if the FS's semantics are to then
> flush all remaining updates from that tree up, but here it's
> .git/objects, not .git. That also seems to contradict this above:

We're breaking through the abstraction provided by POSIX in 'batch' mode,
since the POSIX abstraction does not give us any option to be both safe
and reasonably fast on hardware that does something expensive upon fsync().

We need to ensure that 'x' and 'y' are handed off to the storage device via
a non-flushing pagecache cleaning call (e.g. sync_file_range or macOS fsync).
Then creating and flushing 'z' ensures that 'x' and 'y' will be visible after
successful completion. On FSes with a single journal that is flushed on fsync,
this should also ensure that any other files that have been cleaned out of the
pagecache and any other metadata updates are also persisted. The fsync provides
a barrier in addition to the durability.

> > ensure its data is persisted. After all refs have been written,
> > the directories which host refs are flushed.
>
> I.e. that would be .git/refs (let's ignore .git/HEAD and the like for
> now), not .git/objects or .git?
>
> And again, forging on but more generally [continued below]...
>
> > +	if (!strcmp(var, "core.fsyncreffiles")) {
>
> UX side: now we've got a core.fsyncRefFiles and
> core.fsyncWhateverItWasCalled in Neeraj's series. Let's make sure those
> work together like say "fsck.*" and "fetch.fsck.*" do, i.e. you'd be
> able to configure this once for objects and refs, or in two variables,
> one for objects, one for refs...
>

I agree with this feedback. We should have a core.fsync flag to control everything
that might be synced in the repo. However, we'd have to decide what we want
to do with packfiles and the index which are currently always fsynced.

> ...[continued from above]: Again, per my potentially wrong understanding
> of syncing a "x" and "y" via an fsync of a subsequent "z" that's
> adjacent on the FS to those two.

I suspect Patrick is concerned about the case where the worktree is on
a separate filesystem from the main repo, so there might be a motivation
to sync the worktree refs separately. Is that right, Patrick?

>
> Isn't this setting us up for a really bad interaction between this
> series and Neeraj's work? Well "bad" as in "bad for performance".
>
> I.e. you'll turn on "use the batch thing for objects and refs" and we'll
> do two fsyncs, one for the object update, and one for refs.
> The common case is that we'll have both in play.
>
> So shouldn't this go to a higher level for both so we only create a "z"
> .git/sync-it-now-please.txt or whatever once we do all pending updates
> on the .git/ directory?
>
> I can also imagine that we'd want that at an even higher level, e.g. for
> "git pull" surely we'd want it not for refs or objects, but in
> builtin/pull.c somewhere because we'll be updating the .git/index after
> we do both refs and objects, and you'd want to fsync at the very end,
> no?
>
> None of the above should mean we can't pursue a more narrow approach for
> now. I'm just:
>
> 1) Seeing if I understand what we're trying to do here, maybe not.
>
> 2) Encouraging you two to think about a holistic way to configure some
>    logical conclusion to this topic at large, so we won't end up with
>    core.fsyncConfigFiles, core.fsyncObjectFiles, core.fsyncIndexFile,
>    core.fsyncRefFiles, core.fsyncTheCrapRebaseWritesOutFiles etc. :)
>
> I'll send another more generic follow-up E-Mail for #2.

In my view there are two separable concerns:

1) What durability do we want for the user's data when an entire 'git'
command completes? I.e. if I run 'git add <1000 files>; git commit' and the
system loses power after both commands return, should I see all of those files
in the committed state of the repo?

2) What internal consistency do we want in the git databases (ODB and REFS) if
we crash midway through a 'git' command? I.e. if 'git add <1000 files>' crashes
before returning, can there be an invalid object or a torn ref state?

If we're only concerned with (1), then doing a single fsync at the end of the
high-level git command would be sufficient. However, (2) requires more fsyncs
to provide barriers between different phases internal to the git commands. For
instance, we need a barrier between creating the ODB state for a commit (blobs,
trees, commit obj) and the refs pointing to it.

I am not concerned with a few additional fsyncs for (2). On recent mainstream
filesystems/SSDs each fsync may cost tens of milliseconds, so the difference
between 1 and 3 fsyncs would not be user perceptible. However, going from a
couple fsyncs to 1000 fsyncs (one per blob added) would become apparent to the
user.

The more optimal way to handle consistency and durability is write-ahead
logging with log replay on crash recovery. That approach can reduce the number
of fsyncs in the active workload to at most two (one to sync the log with a
commit record and then one before truncating the log after updating the main
database). At that point, however, I think it would be best to use an existing
database engine with some modifications to create a good data layout for git.

Thanks,
Neeraj
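To make the 'batch' scheme described above concrete, a minimal sketch on Linux might look like the following. It is not the patch's code: the helper names are invented, sync_file_range(2) is Linux-specific (the actual series hides such calls behind git_fsync()), and whether the final fsync also commits the earlier metadata updates depends on the filesystem, per the discussion above.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/*
	 * Phase 1, once per written file: start a non-flushing pagecache
	 * writeout so the data is handed to the device without a hardware
	 * flush. Offset 0 with nbytes 0 means "from start to end of file".
	 */
	static int writeout_only(int fd)
	{
		return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
	}

	/*
	 * Phase 2, once after all files are written: a single real fsync
	 * against a dummy file 'z' on the same filesystem. On filesystems
	 * with a single journal this is expected to also commit the metadata
	 * of the files from phase 1: one flush for the whole batch.
	 */
	static int batch_flush(const char *dummy_path)
	{
		int fd = open(dummy_path, O_RDWR | O_CREAT, 0600);

		if (fd < 0)
			return -1;
		if (fsync(fd) < 0) {
			close(fd);
			return -1;
		}
		return close(fd);
	}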
On Wed, Nov 10 2021, Neeraj Singh wrote:

> On Wed, Nov 10, 2021 at 03:49:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Nov 10 2021, Patrick Steinhardt wrote:
>>
>> > +	shutdown. This setting currently only controls loose refs in the object
>> > +	store, so updates to packed refs may not be equally durable. Takes the
>> > +	same parameters as `core.fsyncObjectFiles`.
>> > +
>>
>> ...my understanding of it is basically a way of going back to what Linus
>> pointed out way back in aafe9fbaf4f (Add config option to enable
>> 'fsync()' of object files, 2008-06-18).
>>
>> I.e. we've got files x and y. POSIX sayeth we'd need to fsync them all
>> and the directory entry, but on some FS's it would be sufficient to
>> fsync() just y if they're created in that order. It'll imply an fsync of
>> both x and y, is that accurate?
>>
>> If not you can probably discard the rest, but forging on:
>>
>> Why do we then need to fsync() a "z" file in get_object_directory()
>> (i.e. .git/objects) then? That's a new "z", wasn't flushing "y" enough?
>>
>> Or if you've written .git/objects/x and .git/refs/y I can imagine
>> wanting to create and sync a .git/z if the FS's semantics are to then
>> flush all remaining updates from that tree up, but here it's
>> .git/objects, not .git. That also seems to contradict this above:
>
> We're breaking through the abstraction provided by POSIX in 'batch' mode,
> since the POSIX abstraction does not give us any option to be both safe
> and reasonably fast on hardware that does something expensive upon fsync().
>
> We need to ensure that 'x' and 'y' are handed off to the storage device via
> a non-flushing pagecache cleaning call (e.g. sync_file_range or macOS fsync).
> Then creating and flushing 'z' ensures that 'x' and 'y' will be visible after
> successful completion. On FSes with a single journal that is flushed on fsync,
> this should also ensure that any other files that have been cleaned out of the
> pagecache and any other metadata updates are also persisted. The fsync provides
> a barrier in addition to the durability.

Yes. I understand that we're not doing POSIX fsyncing().

I'm asking about something else, i.e. with this not-like-POSIX-sync why
it is that when you have a directory:

	A

and files:

	A/{X,Y}

That you'd write those two, and then proceed to do the "batch flush" by
creating and fsync()-ing a:

	B/Z

As opposed to either of:

	A/Z

Or:

	Z

I.e. why update .git/refs/* and create a flush file in .git/objects/* and
not .git/refs/* or .git/*?

Maybe the answer is just that this is WIP code copied from your
.git/objects/ fsync() implementation, or maybe it's more subtle and I'm
missing something.

>> > ensure its data is persisted. After all refs have been written,
>> > the directories which host refs are flushed.
>>
>> I.e. that would be .git/refs (let's ignore .git/HEAD and the like for
>> now), not .git/objects or .git?
>>
>> And again, forging on but more generally [continued below]...
>>
>> > +	if (!strcmp(var, "core.fsyncreffiles")) {
>>
>> UX side: now we've got a core.fsyncRefFiles and
>> core.fsyncWhateverItWasCalled in Neeraj's series. Let's make sure those
>> work together like say "fsck.*" and "fetch.fsck.*" do, i.e. you'd be
>> able to configure this once for objects and refs, or in two variables,
>> one for objects, one for refs...
>>
>
> I agree with this feedback. We should have a core.fsync flag to control everything
> that might be synced in the repo. However, we'd have to decide what we want
> to do with packfiles and the index which are currently always fsynced.

*nod*. See this if you haven't yet:
https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/T/#u

>> ...[continued from above]: Again, per my potentially wrong understanding
>> of syncing a "x" and "y" via an fsync of a subsequent "z" that's
>> adjacent on the FS to those two.
>
> I suspect Patrick is concerned about the case where the worktree is on
> a separate filesystem from the main repo, so there might be a motivation
> to sync the worktree refs separately. Is that right, Patrick?

But he's syncing .git/objects, not .git/worktrees/<NAME>/refs/, no?

>>
>> Isn't this setting us up for a really bad interaction between this
>> series and Neeraj's work? Well "bad" as in "bad for performance".
>>
>> I.e. you'll turn on "use the batch thing for objects and refs" and we'll
>> do two fsyncs, one for the object update, and one for refs. The common
>> case is that we'll have both in play.
>>
>> So shouldn't this go to a higher level for both so we only create a "z"
>> .git/sync-it-now-please.txt or whatever once we do all pending updates
>> on the .git/ directory?
>>
>> I can also imagine that we'd want that at an even higher level, e.g. for
>> "git pull" surely we'd want it not for refs or objects, but in
>> builtin/pull.c somewhere because we'll be updating the .git/index after
>> we do both refs and objects, and you'd want to fsync at the very end,
>> no?
>>
>> None of the above should mean we can't pursue a more narrow approach for
>> now. I'm just:
>>
>> 1) Seeing if I understand what we're trying to do here, maybe not.
>>
>> 2) Encouraging you two to think about a holistic way to configure some
>>    logical conclusion to this topic at large, so we won't end up with
>>    core.fsyncConfigFiles, core.fsyncObjectFiles, core.fsyncIndexFile,
>>    core.fsyncRefFiles, core.fsyncTheCrapRebaseWritesOutFiles etc. :)
>>
>> I'll send another more generic follow-up E-Mail for #2.
>
> In my view there are two separable concerns:
>
> 1) What durability do we want for the user's data when an entire 'git'
> command completes? I.e. if I run 'git add <1000 files>; git commit' and the
> system loses power after both commands return, should I see all of those files
> in the committed state of the repo?
>
> 2) What internal consistency do we want in the git databases (ODB and REFS) if
> we crash midway through a 'git' command? I.e. if 'git add <1000 files>' crashes
> before returning, can there be an invalid object or a torn ref state?
>
> If we're only concerned with (1), then doing a single fsync at the end of the
> high-level git command would be sufficient. However, (2) requires more fsyncs
> to provide barriers between different phases internal to the git commands. For
> instance, we need a barrier between creating the ODB state for a commit (blobs,
> trees, commit obj) and the refs pointing to it.
>
> I am not concerned with a few additional fsyncs for (2). On recent mainstream
> filesystems/SSDs each fsync may cost tens of milliseconds, so the difference
> between 1 and 3 fsyncs would not be user perceptible. However, going from a
> couple fsyncs to 1000 fsyncs (one per blob added) would become apparent to the
> user.
>
> The more optimal way to handle consistency and durability is write-ahead
> logging with log replay on crash recovery.
> That approach can reduce the number
> of fsyncs in the active workload to at most two (one to sync the log with a
> commit record and then one before truncating the log after updating the main
> database). At that point, however, I think it would be best to use an existing
> database engine with some modifications to create a good data layout for git.

I think that git should safely store your data by default, we clearly
don't do that well enough in some cases, and should fix it.

But there's also cases where people would much rather have performance,
and I think we should support that. E.g. running git in CI, doing a
one-off import/fetch that you'll invoke "sync(1)" yourself after
etc. This is the direction Eric Wong's patches are going in.

I understand from his patches/comments that you're not correct that just
1-3 fsyncs are OK, i.e. for some use-cases 0 is OK, and any >0 is too
many, since it'll force a flush of the whole disk or something.

Even when git is operating in a completely safe mode I think we'd
still have license to play it fast and loose with /some/ user data,
because users don't really care about all of their data in the .git/
directory.

I.e. you do care about the *.pack file, but an associated *.bitmap is a
derived file, so we probably don't need to fsync that too. Is it worth
worrying about? Probably not, but that's what I had in mind with the
above-linked proposed config schema.

Similarly for say writing 1k loose objects I think it would be
completely fine to end up with a corrupt object during a "git fetch", as
long as we also guarantee that we can gracefully recover from that
corruption.

I.e. 1) we didn't update the ref(s) involved 2) we know the FS has the
sorts of properties you're aiming for with "batch".

Now, as some patches I've had to object*.c recently show we don't handle
those cases gracefully either. I.e. if we find a short loose object
we'll panic, even if we'd be perfectly capable of getting it from
elsewhere, and nothing otherwise references it.

But if we fixed that and gc/fsck/fetch etc. learned to deal with that
content I don't see why we wouldn't e.g. default to not fsyncing loose
objects when invoked from say "fetch" by default, depending on FS/OS
detection, and if we couldn't say it was safe defaulted to some "POSIX"
mode that would be pedantic but slow and safe.
On Wed, Nov 10, 2021 at 09:23:04PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> Yes. I understand that we're not doing POSIX fsyncing().
>
> I'm asking about something else, i.e. with this not-like-POSIX-sync why
> it is that when you have a directory:
>
> 	A
>
> and files:
>
> 	A/{X,Y}
>
> That you'd write those two, and then proceed to do the "batch flush" by
> creating and fsync()-ing a:
>
> 	B/Z
>
> As opposed to either of:
>
> 	A/Z
>
> Or:
>
> 	Z
>
> I.e. why update .git/refs/* and create a flush file in .git/objects/* and
> not .git/refs/* or .git/*?
>
> Maybe the answer is just that this is WIP code copied from your
> .git/objects/ fsync() implementation, or maybe it's more subtle and I'm
> missing something.

It looks like I didn't really answer your actual question before. On the
filesystems which I'm familiar with at a code level (NTFS and ReFS on
Windows), fsyncing a file or dir anywhere on the filesystem should ensure
that metadata operations completed before the fsync starts are durable
when the fsync returns. So which specific directory we put the file in
shouldn't matter as long as it's on the same filesystem as the other
files we're interested in.

>
> *nod*. See this if you haven't yet:
> https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/T/#u

I'll respond on that thread with my opinion. Thanks for reviewing this
and pushing for a good holistic approach.

Thanks,
Neeraj
On Wed, Nov 10, 2021 at 12:41:03PM +0100, Patrick Steinhardt wrote:
>
> +static int sync_loose_ref(int fd)
> +{
> +	switch (fsync_ref_files) {
> +	case FSYNC_REF_FILES_OFF:
> +		return 0;
> +	case FSYNC_REF_FILES_ON:
> +		return git_fsync(fd, FSYNC_HARDWARE_FLUSH);
> +	case FSYNC_REF_FILES_BATCH:
> +		return git_fsync(fd, FSYNC_WRITEOUT_ONLY);
> +	default:
> +		BUG("invalid fsync mode %d", fsync_ref_files);
> +	}
> +}
> +
> +#define SYNC_LOOSE_REF_GITDIR (1 << 0)
> +#define SYNC_LOOSE_REF_COMMONDIR (1 << 1)
> +
> +static int sync_loose_refs_flags(const char *refname)
> +{
> +	switch (ref_type(refname)) {
> +	case REF_TYPE_PER_WORKTREE:
> +	case REF_TYPE_PSEUDOREF:
> +		return SYNC_LOOSE_REF_GITDIR;
> +	case REF_TYPE_MAIN_PSEUDOREF:
> +	case REF_TYPE_OTHER_PSEUDOREF:
> +	case REF_TYPE_NORMAL:
> +		return SYNC_LOOSE_REF_COMMONDIR;
> +	default:
> +		BUG("unknown ref type %d of ref %s",
> +		    ref_type(refname), refname);
> +	}
> +}
> +
> +static int sync_loose_refs(struct files_ref_store *refs,
> +			   int flags,
> +			   struct strbuf *err)
> +{
> +	if (fsync_ref_files != FSYNC_REF_FILES_BATCH)
> +		return 0;
> +
> +	if ((flags & SYNC_LOOSE_REF_GITDIR) &&
> +	    git_fsync_dir(refs->base.gitdir) < 0)
> +		return -1;
> +
> +	if ((flags & SYNC_LOOSE_REF_COMMONDIR) &&
> +	    git_fsync_dir(refs->gitcommondir) < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +

Now that I understand it better, I agree with Ævar's feedback. I think
only a single sync is needed to cover everything in .git/.

I'd also suggest renaming 'sync_loose_refs' to something which makes it
clearer that it's the batch-flush step. Maybe
s/do_batch_fsync/do_batch_objects_fsync and
s/sync_loose_refs/do_batch_refs_fsync.

Thanks,
Neeraj
On Wed, Nov 10, 2021 at 03:49:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Nov 10 2021, Patrick Steinhardt wrote:
>
> > [...]
> > Fix this by introducing a new configuration "core.fsyncRefFiles". This
> > config matches behaviour of "core.fsyncObjectFiles" in that it provides
> > three different modes:
> >
> >  - "off" disables calling fsync on ref files. This is the default
> >    behaviour previous to this change and remains the default after
> >    this change.
> >
> >  - "on" enables calling fsync on ref files, where each reference is
> >    flushed to disk before it is being committed. This is the safest
> >    setting, but may incur significant performance overhead.
> >
> >  - "batch" will flush the page cache of each file as it is written to
> >    ensure its data is persisted. After all refs have been written,
> >    the directories which host refs are flushed.
> >
> > With this change in place and when "core.fsyncRefFiles" is set to either
> > "on" or "batch", this kind of corruption shouldn't happen anymore.
> >
> > [1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
>
> With the understanding that my grokking of this approach is still
> somewhere between "uh, that works?" and "wow, voodoo FS magic!"...
>
> I haven't looked at these changes in much detail, or Neeraj's recent
> related changes, but...
>
> > +core.fsyncRefFiles::
> > +	A value indicating the level of effort Git will expend in trying to make
> > +	refs added to the repo durable in the event of an unclean system
> > +	shutdown. This setting currently only controls loose refs in the object
> > +	store, so updates to packed refs may not be equally durable. Takes the
> > +	same parameters as `core.fsyncObjectFiles`.
> > +
>
> ...my understanding of it is basically a way of going back to what Linus
> pointed out way back in aafe9fbaf4f (Add config option to enable
> 'fsync()' of object files, 2008-06-18).
>
> I.e. we've got files x and y. POSIX sayeth we'd need to fsync them all
> and the directory entry, but on some FS's it would be sufficient to
> fsync() just y if they're created in that order. It'll imply an fsync of
> both x and y, is that accurate?
>
> If not you can probably discard the rest, but forging on:
>
> Why do we then need to fsync() a "z" file in get_object_directory()
> (i.e. .git/objects) then? That's a new "z", wasn't flushing "y" enough?
>
> Or if you've written .git/objects/x and .git/refs/y I can imagine
> wanting to create and sync a .git/z if the FS's semantics are to then
> flush all remaining updates from that tree up, but here it's
> .git/objects, not .git. That also seems to contradict this above:
>
> > ensure its data is persisted. After all refs have been written,
> > the directories which host refs are flushed.
>
> I.e. that would be .git/refs (let's ignore .git/HEAD and the like for
> now), not .git/objects or .git?

Yeah, .git/refs. Note that we need to flush refs in two locations though
given that there are common refs shared across worktrees, and then there
are worktree-specific refs. These may not even share the filesystem, so
there's not much we can do about this. We could play games with the
fsid, but I'm not sure it's worth it.

> And again, forging on but more generally [continued below]...
>
> > +	if (!strcmp(var, "core.fsyncreffiles")) {
>
> UX side: now we've got a core.fsyncRefFiles and
> core.fsyncWhateverItWasCalled in Neeraj's series. Let's make sure those
> work together like say "fsck.*" and "fetch.fsck.*" do, i.e. you'd be
> able to configure this once for objects and refs, or in two variables,
> one for objects, one for refs...

Honestly, I'd prefer to just have a global variable with per-subsystem
overrides so that you can say "Please batch-sync all files by default,
but I really don't care for files in the worktree". Kind of goes into
the same direction as your RFC.

[snip]

> > @@ -2665,7 +2719,7 @@ static int files_transaction_prepare(struct ref_store *ref_store,
> >  		files_downcast(ref_store, REF_STORE_WRITE,
> >  			       "ref_transaction_prepare");
> >  	size_t i;
> > -	int ret = 0;
> > +	int ret = 0, sync_flags = 0;
> >  	struct string_list affected_refnames = STRING_LIST_INIT_NODUP;
> >  	char *head_ref = NULL;
> >  	int head_type;
> > @@ -2777,8 +2831,14 @@ static int files_transaction_prepare(struct ref_store *ref_store,
> >  					  &update->new_oid, NULL,
> >  					  NULL);
> >  		}
> > +
> > +		sync_flags |= sync_loose_refs_flags(update->refname);
> >  	}
> >
> > +	ret = sync_loose_refs(refs, sync_flags, err);
> > +	if (ret)
> > +		goto cleanup;
> > +
> >  	if (packed_transaction) {
> >  		if (packed_refs_lock(refs->packed_ref_store, 0, err)) {
> >  			ret = TRANSACTION_GENERIC_ERROR;
>
> ...[continued from above]: Again, per my potentially wrong understanding
> of syncing a "x" and "y" via an fsync of a subsequent "z" that's
> adjacent on the FS to those two.
>
> Isn't this setting us up for a really bad interaction between this
> series and Neeraj's work? Well "bad" as in "bad for performance".
>
> I.e. you'll turn on "use the batch thing for objects and refs" and we'll
> do two fsyncs, one for the object update, and one for refs. The common
> case is that we'll have both in play.
>
> So shouldn't this go to a higher level for both so we only create a "z"
> .git/sync-it-now-please.txt or whatever once we do all pending updates
> on the .git/ directory?

Good question. Taking a different stance would be to say "I don't ever
want to write a ref referring to an object which isn't yet guaranteed to
be persistent". In that case, writing and syncing objects first and then
updating refs would be the correct thing to do and wouldn't need
coordination on a higher level.

> I can also imagine that we'd want that at an even higher level, e.g. for
> "git pull" surely we'd want it not for refs or objects, but in
> builtin/pull.c somewhere because we'll be updating the .git/index after
> we do both refs and objects, and you'd want to fsync at the very end,
> no?
>
> None of the above should mean we can't pursue a more narrow approach for
> now. I'm just:
>
> 1) Seeing if I understand what we're trying to do here, maybe not.

Personally, my most important bullet point is that I want to get rid of
the corrupt refs we see in production every now and then. As Peff
pointed out in another email, it _might_ be sufficient to just flush out
the page cache for those without synchronizing any metadata at all. But
none of us seems to be sure that this really works the way we think it
does.

> 2) Encouraging you two to think about a holistic way to configure some
>    logical conclusion to this topic at large, so we won't end up with
>    core.fsyncConfigFiles, core.fsyncObjectFiles, core.fsyncIndexFile,
>    core.fsyncRefFiles, core.fsyncTheCrapRebaseWritesOutFiles etc. :)

Yeah, I agree with you on this point. We definitely don't want to end up
with twenty different knobs to sync objects, refs, packed-refs, configs,
reflogs et al, and we're currently steering into that direction.
Having a central knob core.fsync which directs defaults for all
subsystems would be very much preferable if you ask me, where specific
subsystems can then be changed as the user sees fit.

What I think is important though is that we should have the right
defaults in place. It doesn't help to say "fsync is slow" as an excuse
to not do them at all by default, where the result is that users get a
corrupt repository.

Patrick
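For illustration, such a schema might look as follows in config terms. None of this exists at the time of this discussion; the global core.fsync key in particular is hypothetical and only sketches the idea of a default plus per-subsystem overrides:

	[core]
		# hypothetical: one default for everything Git would fsync...
		fsync = batch
		# ...with per-subsystem overrides where the user disagrees:
		fsyncRefFiles = on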
On Wed, Nov 10, 2021 at 09:23:04PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Nov 10 2021, Neeraj Singh wrote:
>
> > On Wed, Nov 10, 2021 at 03:49:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> On Wed, Nov 10 2021, Patrick Steinhardt wrote:

[snip]

> >> ...[continued from above]: Again, per my potentially wrong understanding
> >> of syncing a "x" and "y" via an fsync of a subsequent "z" that's
> >> adjacent on the FS to those two.
> >
> > I suspect Patrick is concerned about the case where the worktree is on
> > a separate filesystem from the main repo, so there might be a motivation
> > to sync the worktree refs separately. Is that right, Patrick?
>
> But he's syncing .git/objects, not .git/worktrees/<NAME>/refs/, no?

That'd be a bug then ;) My intent was to sync .git/refs and the
per-worktree refs.

[snip]

> > In my view there are two separable concerns:
> >
> > 1) What durability do we want for the user's data when an entire 'git'
> > command completes? I.e. if I run 'git add <1000 files>; git commit' and the
> > system loses power after both commands return, should I see all of those files
> > in the committed state of the repo?
> >
> > 2) What internal consistency do we want in the git databases (ODB and REFS) if
> > we crash midway through a 'git' command? I.e. if 'git add <1000 files>' crashes
> > before returning, can there be an invalid object or a torn ref state?
> >
> > If we're only concerned with (1), then doing a single fsync at the end of the
> > high-level git command would be sufficient. However, (2) requires more fsyncs
> > to provide barriers between different phases internal to the git commands. For
> > instance, we need a barrier between creating the ODB state for a commit (blobs,
> > trees, commit obj) and the refs pointing to it.
> >
> > I am not concerned with a few additional fsyncs for (2). On recent mainstream
> > filesystems/SSDs each fsync may cost tens of milliseconds, so the difference
> > between 1 and 3 fsyncs would not be user perceptible. However, going from a
> > couple fsyncs to 1000 fsyncs (one per blob added) would become apparent to the
> > user.
> >
> > The more optimal way to handle consistency and durability is write-ahead
> > logging with log replay on crash recovery. That approach can reduce the number
> > of fsyncs in the active workload to at most two (one to sync the log with a
> > commit record and then one before truncating the log after updating the main
> > database). At that point, however, I think it would be best to use an existing
> > database engine with some modifications to create a good data layout for git.
>
> I think that git should safely store your data by default, we clearly
> don't do that well enough in some cases, and should fix it.
>
> But there's also cases where people would much rather have performance,
> and I think we should support that. E.g. running git in CI, doing a
> one-off import/fetch that you'll invoke "sync(1)" yourself after
> etc. This is the direction Eric Wong's patches are going in.
>
> I understand from his patches/comments that you're not correct that just
> 1-3 fsyncs are OK, i.e. for some use-cases 0 is OK, and any >0 is too
> many, since it'll force a flush of the whole disk or something.
>
> Even when git is operating in a completely safe mode I think we'd
> still have license to play it fast and loose with /some/ user data,
> because users don't really care about all of their data in the .git/
> directory.
I think we need to discern two important cases: the first case is us
losing data, and the second case is us leaving the repository in a
corrupt state. I'm okay-ish with the first case: if your machine crashes
it's not completely unexpected that losing data would be a natural
consequence (well, losing the data that's currently in transit, at
least). But we should make sure that we're not leaving the repo in a
corrupt state under any circumstance, and that's exactly what can happen
right now because we don't flush refs to disk before doing the renames.

Assuming we've got the semantics correct, we are thus forced to do a
page cache flush to make sure that data is on disk before doing a rename
so as not to corrupt the repo. But in the case where we're fine with
losing some data, we may skip doing the final fsync to also synchronize
the rename to disk.

Patrick

> I.e. you do care about the *.pack file, but an associated *.bitmap is a
> derived file, so we probably don't need to fsync that too. Is it worth
> worrying about? Probably not, but that's what I had in mind with the
> above-linked proposed config schema.
>
> Similarly for say writing 1k loose objects I think it would be
> completely fine to end up with a corrupt object during a "git fetch", as
> long as we also guarantee that we can gracefully recover from that
> corruption.
>
> I.e. 1) we didn't update the ref(s) involved 2) we know the FS has the
> sorts of properties you're aiming for with "batch".
>
> Now, as some patches I've had to object*.c recently show we don't handle
> those cases gracefully either. I.e. if we find a short loose object
> we'll panic, even if we'd be perfectly capable of getting it from
> elsewhere, and nothing otherwise references it.
>
> But if we fixed that and gc/fsck/fetch etc. learned to deal with that
> content I don't see why we wouldn't e.g. default to not fsyncing loose
> objects when invoked from say "fetch" by default, depending on FS/OS
> detection, and if we couldn't say it was safe defaulted to some "POSIX"
> mode that would be pedantic but slow and safe.
diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 200b4d9f06..e2fd0d8c90 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -572,6 +572,13 @@ core.fsyncObjectFiles::
 	stored on HFS+ or APFS filesystems and on Windows for repos stored
 	on NTFS or ReFS.
 
+core.fsyncRefFiles::
+	A value indicating the level of effort Git will expend in trying to make
+	refs added to the repo durable in the event of an unclean system
+	shutdown. This setting currently only controls loose refs in the object
+	store, so updates to packed refs may not be equally durable. Takes the
+	same parameters as `core.fsyncObjectFiles`.
+
 core.preloadIndex::
 	Enable parallel index preload for operations like 'git diff'
 +
diff --git a/cache.h b/cache.h
index 6d6e6770ec..14c8fab6b4 100644
--- a/cache.h
+++ b/cache.h
@@ -991,7 +991,14 @@ enum fsync_object_files_mode {
 	FSYNC_OBJECT_FILES_BATCH
 };
 
+enum fsync_ref_files_mode {
+	FSYNC_REF_FILES_OFF,
+	FSYNC_REF_FILES_ON,
+	FSYNC_REF_FILES_BATCH
+};
+
 extern enum fsync_object_files_mode fsync_object_files;
+extern enum fsync_ref_files_mode fsync_ref_files;
 extern int core_preload_index;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index 5eb36ecd77..4cbad5a29d 100644
--- a/config.c
+++ b/config.c
@@ -1500,6 +1500,16 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.fsyncreffiles")) {
+		if (value && !strcmp(value, "batch"))
+			fsync_ref_files = FSYNC_REF_FILES_BATCH;
+		else if (git_config_bool(var, value))
+			fsync_ref_files = FSYNC_REF_FILES_ON;
+		else
+			fsync_ref_files = FSYNC_REF_FILES_OFF;
+		return 0;
+	}
+
 	if (!strcmp(var, "core.preloadindex")) {
 		core_preload_index = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index aeafe80235..1514ac9384 100644
--- a/environment.c
+++ b/environment.c
@@ -43,6 +43,7 @@ const char *git_hooks_path;
 int zlib_compression_level = Z_BEST_SPEED;
 int pack_compression_level = Z_DEFAULT_COMPRESSION;
 enum fsync_object_files_mode fsync_object_files;
+enum fsync_ref_files_mode fsync_ref_files;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 size_t delta_base_cache_limit = 96 * 1024 * 1024;
diff --git a/refs/files-backend.c b/refs/files-backend.c
index 4b14f30d48..31d7456266 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -1360,6 +1360,57 @@ static int commit_ref_update(struct files_ref_store *refs,
 			     const struct object_id *oid, const char *logmsg,
 			     struct strbuf *err);
 
+static int sync_loose_ref(int fd)
+{
+	switch (fsync_ref_files) {
+	case FSYNC_REF_FILES_OFF:
+		return 0;
+	case FSYNC_REF_FILES_ON:
+		return git_fsync(fd, FSYNC_HARDWARE_FLUSH);
+	case FSYNC_REF_FILES_BATCH:
+		return git_fsync(fd, FSYNC_WRITEOUT_ONLY);
+	default:
+		BUG("invalid fsync mode %d", fsync_ref_files);
+	}
+}
+
+#define SYNC_LOOSE_REF_GITDIR (1 << 0)
+#define SYNC_LOOSE_REF_COMMONDIR (1 << 1)
+
+static int sync_loose_refs_flags(const char *refname)
+{
+	switch (ref_type(refname)) {
+	case REF_TYPE_PER_WORKTREE:
+	case REF_TYPE_PSEUDOREF:
+		return SYNC_LOOSE_REF_GITDIR;
+	case REF_TYPE_MAIN_PSEUDOREF:
+	case REF_TYPE_OTHER_PSEUDOREF:
+	case REF_TYPE_NORMAL:
+		return SYNC_LOOSE_REF_COMMONDIR;
+	default:
+		BUG("unknown ref type %d of ref %s",
+		    ref_type(refname), refname);
+	}
+}
+
+static int sync_loose_refs(struct files_ref_store *refs,
+			   int flags,
+			   struct strbuf *err)
+{
+	if (fsync_ref_files != FSYNC_REF_FILES_BATCH)
+		return 0;
+
+	if ((flags & SYNC_LOOSE_REF_GITDIR) &&
+	    git_fsync_dir(refs->base.gitdir) < 0)
+		return -1;
+
+	if ((flags & SYNC_LOOSE_REF_COMMONDIR) &&
+	    git_fsync_dir(refs->gitcommondir) < 0)
+		return -1;
+
+	return 0;
+}
+
 /*
  * Emit a better error message than lockfile.c's
  * unable_to_lock_message() would in case there is a D/F conflict with
@@ -1502,6 +1553,7 @@ static int files_copy_or_rename_ref(struct ref_store *ref_store,
 	oidcpy(&lock->old_oid, &orig_oid);
 
 	if (write_ref_to_lockfile(lock, &orig_oid, &err) ||
+	    sync_loose_refs(refs, sync_loose_refs_flags(newrefname), &err) ||
 	    commit_ref_update(refs, lock, &orig_oid, logmsg, &err)) {
 		error("unable to write current sha1 into %s: %s", newrefname, err.buf);
 		strbuf_release(&err);
@@ -1522,6 +1574,7 @@ static int files_copy_or_rename_ref(struct ref_store *ref_store,
 	flag = log_all_ref_updates;
 	log_all_ref_updates = LOG_REFS_NONE;
 	if (write_ref_to_lockfile(lock, &orig_oid, &err) ||
+	    sync_loose_refs(refs, sync_loose_refs_flags(newrefname), &err) ||
 	    commit_ref_update(refs, lock, &orig_oid, NULL, &err)) {
 		error("unable to write current sha1 into %s: %s", oldrefname, err.buf);
 		strbuf_release(&err);
@@ -1781,6 +1834,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
 	fd = get_lock_file_fd(&lock->lk);
 	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
 	    write_in_full(fd, &term, 1) < 0 ||
+	    sync_loose_ref(fd) < 0 ||
 	    close_ref_gently(lock) < 0) {
 		strbuf_addf(err,
 			    "couldn't write '%s'", get_lock_file_path(&lock->lk));
@@ -2665,7 +2719,7 @@ static int files_transaction_prepare(struct ref_store *ref_store,
 		files_downcast(ref_store, REF_STORE_WRITE,
 			       "ref_transaction_prepare");
 	size_t i;
-	int ret = 0;
+	int ret = 0, sync_flags = 0;
 	struct string_list affected_refnames = STRING_LIST_INIT_NODUP;
 	char *head_ref = NULL;
 	int head_type;
@@ -2777,8 +2831,14 @@ static int files_transaction_prepare(struct ref_store *ref_store,
 					  &update->new_oid, NULL,
 					  NULL);
 		}
+
+		sync_flags |= sync_loose_refs_flags(update->refname);
 	}
 
+	ret = sync_loose_refs(refs, sync_flags, err);
+	if (ret)
+		goto cleanup;
+
 	if (packed_transaction) {
 		if (packed_refs_lock(refs->packed_ref_store, 0, err)) {
 			ret = TRANSACTION_GENERIC_ERROR;
When writing loose refs, we first create a lockfile, write the new ref
into that lockfile, close it and then rename the lockfile into place
such that the actual update is atomic for that single ref. While this
works as intended under normal circumstances, at GitLab we infrequently
encounter corrupt loose refs in repositories after a machine encountered
a hard reset. The corruption is always of the same type: the ref has
been committed into place, but it is completely empty.

The root cause of this is likely that we don't sync contents of the
lockfile to disk before renaming it into place. As a result, it's not
guaranteed that the contents are properly persisted and one may observe
weird in-between states on hard resets. Quoting ext4 documentation [1]:

    Many broken applications don't use fsync() when replacing existing
    files via patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/
    rename("foo.new", "foo"), or worse yet, fd = open("foo",
    O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4
    will detect the replace-via-rename and replace-via-truncate patterns
    and force that any delayed allocation blocks are allocated such that
    at the next journal commit, in the default data=ordered mode, the
    data blocks of the new file are forced to disk before the rename()
    operation is committed. This provides roughly the same level of
    guarantees as ext3, and avoids the "zero-length" problem that can
    happen when a system crashes before the delayed allocation blocks
    are forced to disk.

This explicitly points out that one must call fsync(3P) before doing the
rename(3P) call, or otherwise data may not be correctly persisted to
disk.

Fix this by introducing a new configuration "core.fsyncRefFiles". This
config matches behaviour of "core.fsyncObjectFiles" in that it provides
three different modes:

 - "off" disables calling fsync on ref files. This is the default
   behaviour previous to this change and remains the default after
   this change.

 - "on" enables calling fsync on ref files, where each reference is
   flushed to disk before it is being committed. This is the safest
   setting, but may incur significant performance overhead.

 - "batch" will flush the page cache of each file as it is written to
   ensure its data is persisted. After all refs have been written,
   the directories which host refs are flushed.

With this change in place and when "core.fsyncRefFiles" is set to either
"on" or "batch", this kind of corruption shouldn't happen anymore.

[1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 Documentation/config/core.txt |  7 ++++
 cache.h                       |  7 ++++
 config.c                      | 10 ++++++
 environment.c                 |  1 +
 refs/files-backend.c          | 62 ++++++++++++++++++++++++++++++++++-
 5 files changed, 86 insertions(+), 1 deletion(-)
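Assuming the series lands in this form, opting into the new behaviour would presumably be a matter of:

	$ git config core.fsyncRefFiles batch

with "on" instead of "batch" for a full hardware flush per written ref.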