| Message ID | d1e68d4a2afc1d0ba74af64680bea09f412f21cc.1629856293.git.gitgitgadget@gmail.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Implement a bulk-checkin option for core.fsyncObjectFiles |
On Wed, Aug 25, 2021 at 01:51:32AM +0000, Neeraj Singh via GitGitGadget wrote:
> From: Neeraj Singh <neerajsi@microsoft.com>
>
> When adding many objects to a repo with core.fsyncObjectFiles set to
> true, the cost of fsync'ing each object file can become prohibitive.
>
> One major source of the cost of fsync is the implied flush of the
> hardware writeback cache within the disk drive. Fortunately, Windows,
> MacOS, and Linux each offer mechanisms to write data from the filesystem
> page cache without initiating a hardware flush.
>
> This patch introduces a new 'core.fsyncObjectFiles = 2' option that
> takes advantage of the bulk-checkin infrastructure to batch up hardware
> flushes.

Another interesting way to flush on Linux would be the syncfs call, which syncs all files on a file system. Once you write more than a handful or two of files, that tends to win out over a batch of fsync calls.
On Wed, Aug 25 2021, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> When adding many objects to a repo with core.fsyncObjectFiles set to
> true, the cost of fsync'ing each object file can become prohibitive.
>
> One major source of the cost of fsync is the implied flush of the
> hardware writeback cache within the disk drive. Fortunately, Windows,
> MacOS, and Linux each offer mechanisms to write data from the filesystem
> page cache without initiating a hardware flush.
>
> This patch introduces a new 'core.fsyncObjectFiles = 2' option that
> takes advantage of the bulk-checkin infrastructure to batch up hardware
> flushes.
>
> When the new mode is enabled we do the following for new objects:
>
> 1. Create a tmp_obj_XXXX file and write the object data to it.
> 2. Issue a pagecache writeback request and wait for it to complete.
> 3. Record the tmp name and the final name in the bulk-checkin state for
>    later rename.
>
> At the end of the entire transaction we:
> 1. Issue a fsync against the lock file to flush the hardware writeback
>    cache, which should by now have processed the tmp file writes.
> 2. Rename all of the temp files to their final names.
> 3. When updating the index and/or refs, we will issue another fsync
>    internal to that operation.
>
> On a filesystem with a singular journal that is updated during name
> operations (e.g. create, link, rename, etc), such as NTFS and HFS+, we
> would expect the fsync to trigger a journal writeout so that this
> sequence is enough to ensure that the user's data is durable by the time
> the git command returns.
>
> This change also updates the MacOS code to trigger a real hardware flush
> via fcntl(fd, F_FULLFSYNC) when fsync_or_die is called. Previously, on
> MacOS there was no guarantee of durability since a simple fsync(2) call
> does not flush any hardware caches.

Thanks for working on this, good to see fsync issues picked up after some on-list pause.
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index c04f62a54a1..3b672c2db67 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -548,12 +548,17 @@ core.whitespace::
>  	errors. The default tab width is 8. Allowed values are 1 to 63.
>
>  core.fsyncObjectFiles::
> -	This boolean will enable 'fsync()' when writing object files.
> -+
> -This is a total waste of time and effort on a filesystem that orders
> -data writes properly, but can be useful for filesystems that do not use
> -journalling (traditional UNIX filesystems) or that only journal metadata
> -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> +	A boolean value or the number '2', indicating the level of durability
> +	applied to object files.
> ++
> +This setting controls how much effort Git makes to ensure that data added to
> +the object store are durable in the case of an unclean system shutdown. If
> +'false', Git allows data to remain in file system caches according to operating
> +system policy, whence they may be lost if the system loses power or crashes. A
> +value of 'true' instructs Git to force objects to stable storage immediately
> +when they are added to the object store. The number '2' is an experimental
> +value that also preserves durability but tries to perform hardware flushes in a
> +batch.

Some feedback/thoughts:

0) Let's not expose "2" to users, but give it some friendly config name and just translate this to the enum internally.

1) Your commit message says "When updating the index and/or refs[...]" but we're changing core.fsyncObjectFiles here, I assume that's summarizing existing behavior then.

2) You say "when adding many [loose] objects to a repo[...]", and the test-case is "git stash push", but for e.g. accepting pushes we have transfer.unpackLimit. It would be interesting to see if/how this impacts performance there, and also if not that should at least be called out in documentation. I.e. users might want to set this differently on servers v.s. checkouts. But also, is this sort of thing something we could mitigate even more in commands like "git stash push" by just writing a pack instead of N loose objects? I don't think such questions should preclude changing the fsync approach, or offering more options, but they should inform our longer-term goals.

3) Re some of the musings about fsync() recently in https://lore.kernel.org/git/877dhs20x3.fsf@evledraar.gmail.com/; is this method of doing not-quite-an-fsync guaranteed by some OS's / POSIX etc, or is it more like the initial approach before core.fsyncObjectFiles, i.e. the happy-go-lucky approach described in the "[...]that orders data writes properly[...]" documentation you're removing.

4) While that documentation written by Linus long ago is rather flippant, I think just removing it and not replacing it with some discussion about how this is a trade-off v.s. real-world filesystem semantics isn't a good change.

5) On a similar thought as transfer.unpackLimit in #2, I wonder if this fsync() setting shouldn't really be something we should be splitting up. I.e. maybe handle batch loose object writes one way, ref updates another way etc. I think moving core.fsync* to a setting like what we have for fsck.* and <cmd>.fsck.* is probably a better thing to do in the longer term. I.e. being able to do things like:

    fsync.objectFiles = none
    fsync.refFiles = cache # or "hardware"
    receive.fsync.objectFiles = hardware
    receive.fsync.refFiles = hardware

Or whatever, i.e. we're using one hammer for all of these now, but I suspect most users who care about fsync care about /some/ fsync, not everything.

6) Inline comments below.
> +struct object_rename {
> +	char *src;
> +	char *dst;
> +};
> +
> +static struct bulk_rename_state {
> +	struct object_rename *renames;
> +	uint32_t alloc_renames;
> +	uint32_t nr_renames;
> +} bulk_rename_state;

In a crash of git itself it seems we're going to leave some litter behind in the object dir now, and "git gc" won't know how to clean it up. I think this is going to want to just use the tmp-objdir.[ch] API, which might or might not need to be extended for loose objects / some oddities of this use-case.

Also, if you have a pair of things like this the string-list API is much more pleasing to use than coming up with your own encapsulation.

> static struct bulk_checkin_state {
> 	unsigned plugged:1;
>
> @@ -21,13 +33,15 @@ static struct bulk_checkin_state {
> 	struct pack_idx_entry **written;
> 	uint32_t alloc_written;
> 	uint32_t nr_written;
> -} state;
> +
> +	free(state->renames);
> +	memset(state, 0, sizeof(*state));

So with this and other use of the "state" variable is this part of bulk-checkin going to become thread-unsafe, was that already the case?

> +static void add_rename_bulk_checkin(struct bulk_rename_state *state,
> +				    const char *src, const char *dst)
> +{
> +	struct object_rename *rename;
> +
> +	ALLOC_GROW(state->renames, state->nr_renames + 1, state->alloc_renames);
> +
> +	rename = &state->renames[state->nr_renames++];
> +	rename->src = xstrdup(src);
> +	rename->dst = xstrdup(dst);
> +}

All boilerplate duplicating things you'd get with a string-list for free...

> +	/*
> +	 * If we have a plugged bulk checkin, we issue a call that
> +	 * cleans the filesystem page cache but avoids a hardware flush
> +	 * command. Later on we will issue a single hardware flush
> +	 * before renaming files as part of do_sync_and_rename.
> +	 */

So this is the sort of thing I meant by extending Linus's docs, I know some FS's work this way, but not all do.
Also there's no guarantee in git that your .git is on one FS, so I think even for the FS's you have in mind this might not be an absolute guarantee...
On Tue, Aug 24, 2021 at 10:38 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Aug 25, 2021 at 01:51:32AM +0000, Neeraj Singh via GitGitGadget wrote:
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > When adding many objects to a repo with core.fsyncObjectFiles set to
> > true, the cost of fsync'ing each object file can become prohibitive.
> >
> > One major source of the cost of fsync is the implied flush of the
> > hardware writeback cache within the disk drive. Fortunately, Windows,
> > MacOS, and Linux each offer mechanisms to write data from the filesystem
> > page cache without initiating a hardware flush.
> >
> > This patch introduces a new 'core.fsyncObjectFiles = 2' option that
> > takes advantage of the bulk-checkin infrastructure to batch up hardware
> > flushes.
>
> Another interesting way to flush on Linux would be the syncfs call,
> which syncs all files on a file system. Once you write more than a
> handful or two of files that tends to win out over a batch of fsync
> calls.

I'd expect syncfs to suffer from the noisy-neighbor problem that Linus alluded to on the big thread you kicked off. The equivalent call on Windows currently requires administrative privileges (I'm really not sure exactly why, perhaps we should change that).

If someone adds a more targeted bulk sync interface to the Linux kernel, I'm sure Git could be changed to use it. Maybe an fcntl(2) interface that initiates writeback and registers completion with an eventfd.
Hi Neeraj,

continuing my review here, inlined.

On Wed, 25 Aug 2021, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> When adding many objects to a repo with core.fsyncObjectFiles set to
> true, the cost of fsync'ing each object file can become prohibitive.
>
> One major source of the cost of fsync is the implied flush of the
> hardware writeback cache within the disk drive. Fortunately, Windows,
> MacOS, and Linux each offer mechanisms to write data from the filesystem
> page cache without initiating a hardware flush.
>
> This patch introduces a new 'core.fsyncObjectFiles = 2' option that
> takes advantage of the bulk-checkin infrastructure to batch up hardware
> flushes.

It makes sense, but I would recommend using a more easily explained value than `2`. Maybe `delayed`? Or `bulk` or `batched`?

The way this would be implemented would look somewhat like the implementation for `core.abbrev`, which also accepts a string ("auto") or a Boolean (or even an integral number), see https://github.com/git/git/blob/v2.33.0/config.c#L1367-L1381:

	if (!strcmp(var, "core.abbrev")) {
		if (!value)
			return config_error_nonbool(var);
		if (!strcasecmp(value, "auto"))
			default_abbrev = -1;
		else if (!git_parse_maybe_bool_text(value))
			default_abbrev = the_hash_algo->hexsz;
		else {
			int abbrev = git_config_int(var, value);
			if (abbrev < minimum_abbrev || abbrev > the_hash_algo->hexsz)
				return error(_("abbrev length out of range: %d"), abbrev);
			default_abbrev = abbrev;
		}
		return 0;
	}

> When the new mode is enabled we do the following for new objects:
>
> 1. Create a tmp_obj_XXXX file and write the object data to it.
> 2. Issue a pagecache writeback request and wait for it to complete.
> 3. Record the tmp name and the final name in the bulk-checkin state for
>    later rename.
>
> At the end of the entire transaction we:
> 1. Issue a fsync against the lock file to flush the hardware writeback
>    cache, which should by now have processed the tmp file writes.
> 2. Rename all of the temp files to their final names.
> 3. When updating the index and/or refs, we will issue another fsync
>    internal to that operation.
>
> On a filesystem with a singular journal that is updated during name
> operations (e.g. create, link, rename, etc), such as NTFS and HFS+, we
> would expect the fsync to trigger a journal writeout so that this
> sequence is enough to ensure that the user's data is durable by the time
> the git command returns.
>
> This change also updates the MacOS code to trigger a real hardware flush
> via fcntl(fd, F_FULLFSYNC) when fsync_or_die is called. Previously, on
> MacOS there was no guarantee of durability since a simple fsync(2) call
> does not flush any hardware caches.

You included a very nice table with performance numbers in the cover letter. Maybe include that here, in the commit message?

> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  Documentation/config/core.txt |  17 ++++--
>  Makefile                      |   4 ++
>  builtin/add.c                 |   3 +-
>  builtin/update-index.c        |   3 +
>  bulk-checkin.c                | 105 +++++++++++++++++++++++++++++++---
>  bulk-checkin.h                |   4 +-
>  config.c                      |   4 +-
>  config.mak.uname              |   2 +
>  configure.ac                  |   8 +++
>  git-compat-util.h             |   7 +++
>  object-file.c                 |  12 +---
>  wrapper.c                     |  36 ++++++++++++
>  write-or-die.c                |   2 +-
>  13 files changed, 177 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index c04f62a54a1..3b672c2db67 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -548,12 +548,17 @@ core.whitespace::
>  	errors. The default tab width is 8. Allowed values are 1 to 63.
>
>  core.fsyncObjectFiles::
> -	This boolean will enable 'fsync()' when writing object files.
> -+
> -This is a total waste of time and effort on a filesystem that orders
> -data writes properly, but can be useful for filesystems that do not use
> -journalling (traditional UNIX filesystems) or that only journal metadata
> -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> +	A boolean value or the number '2', indicating the level of durability
> +	applied to object files.
> ++
> +This setting controls how much effort Git makes to ensure that data added to
> +the object store are durable in the case of an unclean system shutdown. If

In addition to the content, I also like a lot that this tempers down the language to be a lot more agreeable to read.

> +'false', Git allows data to remain in file system caches according to operating
> +system policy, whence they may be lost if the system loses power or crashes. A
> +value of 'true' instructs Git to force objects to stable storage immediately
> +when they are added to the object store. The number '2' is an experimental
> +value that also preserves durability but tries to perform hardware flushes in a
> +batch.
> > core.preloadIndex:: > Enable parallel index preload for operations like 'git diff' > diff --git a/Makefile b/Makefile > index 9573190f1d7..cb950ee43d3 100644 > --- a/Makefile > +++ b/Makefile > @@ -1896,6 +1896,10 @@ ifdef HAVE_CLOCK_MONOTONIC > BASIC_CFLAGS += -DHAVE_CLOCK_MONOTONIC > endif > > +ifdef HAVE_SYNC_FILE_RANGE > + BASIC_CFLAGS += -DHAVE_SYNC_FILE_RANGE > +endif > + > ifdef NEEDS_LIBRT > EXTLIBS += -lrt > endif > diff --git a/builtin/add.c b/builtin/add.c > index 09e684585d9..c58dfcd4bc3 100644 > --- a/builtin/add.c > +++ b/builtin/add.c > @@ -670,7 +670,8 @@ int cmd_add(int argc, const char **argv, const char *prefix) > > if (chmod_arg && pathspec.nr) > exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only); > - unplug_bulk_checkin(); > + > + unplug_bulk_checkin(&lock_file); > > finish: > if (write_locked_index(&the_index, &lock_file, > diff --git a/builtin/update-index.c b/builtin/update-index.c > index f1f16f2de52..64d025cf49e 100644 > --- a/builtin/update-index.c > +++ b/builtin/update-index.c > @@ -5,6 +5,7 @@ > */ > #define USE_THE_INDEX_COMPATIBILITY_MACROS > #include "cache.h" > +#include "bulk-checkin.h" > #include "config.h" > #include "lockfile.h" > #include "quote.h" > @@ -1152,6 +1153,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > struct strbuf unquoted = STRBUF_INIT; > > setup_work_tree(); > + plug_bulk_checkin(); > while (getline_fn(&buf, stdin) != EOF) { > char *p; > if (!nul_term_line && buf.buf[0] == '"') { > @@ -1166,6 +1168,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > chmod_path(set_executable_bit, p); > free(p); > } > + unplug_bulk_checkin(&lock_file); > strbuf_release(&unquoted); > strbuf_release(&buf); > } This change to `cmd_update_index()`, would it make sense to separate it out into its own commit? I think it would, as it is a slight change of behavior of the `--stdin` mode, no? 
> diff --git a/bulk-checkin.c b/bulk-checkin.c > index b023d9959aa..71004db863e 100644 > --- a/bulk-checkin.c > +++ b/bulk-checkin.c > @@ -3,6 +3,7 @@ > */ > #include "cache.h" > #include "bulk-checkin.h" > +#include "lockfile.h" > #include "repository.h" > #include "csum-file.h" > #include "pack.h" > @@ -10,6 +11,17 @@ > #include "packfile.h" > #include "object-store.h" > > +struct object_rename { > + char *src; > + char *dst; > +}; > + > +static struct bulk_rename_state { > + struct object_rename *renames; > + uint32_t alloc_renames; > + uint32_t nr_renames; > +} bulk_rename_state; > + > static struct bulk_checkin_state { > unsigned plugged:1; > > @@ -21,13 +33,15 @@ static struct bulk_checkin_state { > struct pack_idx_entry **written; > uint32_t alloc_written; > uint32_t nr_written; > -} state; > + > +} bulk_checkin_state; While it definitely looks better after this patch, having the new code _and_ the rename in the same set of changes makes it a bit harder to review and to spot bugs. Could I ask you to split this rename out into its own, preparatory patch ("preparatory" meaning that it should be ordered before the patch that adds support for the new fsync mode)? > > static void finish_bulk_checkin(struct bulk_checkin_state *state) > { > struct object_id oid; > struct strbuf packname = STRBUF_INIT; > int i; > + unsigned old_plugged; Since this variable is designed to hold the value of the `plugged` field of the `bulk_checkin_state`, which is declared as `unsigned plugged:1;`, we probably want a `:1` here, too. Also: is it really "old", rather than "orig"? I would have expected the name `orig_plugged` or `save_plugged`. > > if (!state->f) > return; > @@ -55,13 +69,42 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state) > > clear_exit: > free(state->written); > + old_plugged = state->plugged; > memset(state, 0, sizeof(*state)); > + state->plugged = old_plugged; Unfortunately, I lack the context to understand the purpose of this. 
Is the idea that `plugged` gives an indication whether we're still within that batch that should be fsync'ed all at once? I only see one caller where this would make a difference, and that caller is `deflate_to_pack()`. Maybe we should just start that function with `unsigned save_plugged:1 = state->plugged;` and restore it after the `while (1)` loop? > > strbuf_release(&packname); > /* Make objects we just wrote available to ourselves */ > reprepare_packed_git(the_repository); > } > > +static void do_sync_and_rename(struct bulk_rename_state *state, struct lock_file *lock_file) > +{ > + if (state->nr_renames) { > + int i; > + > + /* > + * Issue a full hardware flush against the lock file to ensure > + * that all objects are durable before any renames occur. > + * The code in fsync_and_close_loose_object_bulk_checkin has > + * already ensured that writeout has occurred, but it has not > + * flushed any writeback cache in the storage hardware. > + */ > + fsync_or_die(get_lock_file_fd(lock_file), get_lock_file_path(lock_file)); > + > + for (i = 0; i < state->nr_renames; i++) { > + if (finalize_object_file(state->renames[i].src, state->renames[i].dst)) > + die_errno(_("could not rename '%s'"), state->renames[i].src); > + > + free(state->renames[i].src); > + free(state->renames[i].dst); > + } > + > + free(state->renames); > + memset(state, 0, sizeof(*state)); Hmm. There is a lot of `memset()`ing going on, and I am not quite sure that I like what I am seeing. It does not help that there are now two very easily-confused structs: `bulk_rename_state` and `bulk_checkin_state`. Which made me worried at first that we might be resetting the `renames` field inadvertently in `finish_bulk_checkin()`. Maybe we can do this instead? 
FREE_AND_NULL(state->renames); state->nr_renames = state->alloc_renames = 0; > + } > +} > + > static int already_written(struct bulk_checkin_state *state, struct object_id *oid) > { > int i; > @@ -256,25 +299,69 @@ static int deflate_to_pack(struct bulk_checkin_state *state, > return 0; > } > > +static void add_rename_bulk_checkin(struct bulk_rename_state *state, > + const char *src, const char *dst) > +{ > + struct object_rename *rename; > + > + ALLOC_GROW(state->renames, state->nr_renames + 1, state->alloc_renames); > + > + rename = &state->renames[state->nr_renames++]; > + rename->src = xstrdup(src); > + rename->dst = xstrdup(dst); > +} > + > +int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile, > + const char *filename) > +{ > + if (fsync_object_files) { > + /* > + * If we have a plugged bulk checkin, we issue a call that > + * cleans the filesystem page cache but avoids a hardware flush > + * command. Later on we will issue a single hardware flush > + * before renaming files as part of do_sync_and_rename. 
> + */ > + if (bulk_checkin_state.plugged && > + fsync_object_files == 2 && > + git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) { > + add_rename_bulk_checkin(&bulk_rename_state, tmpfile, filename); > + if (close(fd)) > + die_errno(_("error when closing loose object file")); > + > + return 0; > + > + } else { > + fsync_or_die(fd, "loose object file"); > + } > + } > + > + if (close(fd)) > + die_errno(_("error when closing loose object file")); > + > + return finalize_object_file(tmpfile, filename); > +} > + > int index_bulk_checkin(struct object_id *oid, > int fd, size_t size, enum object_type type, > const char *path, unsigned flags) > { > - int status = deflate_to_pack(&state, oid, fd, size, type, > + int status = deflate_to_pack(&bulk_checkin_state, oid, fd, size, type, > path, flags); > - if (!state.plugged) > - finish_bulk_checkin(&state); > + if (!bulk_checkin_state.plugged) > + finish_bulk_checkin(&bulk_checkin_state); > return status; > } > > void plug_bulk_checkin(void) > { > - state.plugged = 1; > + bulk_checkin_state.plugged = 1; > } > > -void unplug_bulk_checkin(void) > +void unplug_bulk_checkin(struct lock_file *lock_file) > { > - state.plugged = 0; > - if (state.f) > - finish_bulk_checkin(&state); > + bulk_checkin_state.plugged = 0; > + if (bulk_checkin_state.f) > + finish_bulk_checkin(&bulk_checkin_state); > + > + do_sync_and_rename(&bulk_rename_state, lock_file); > } > diff --git a/bulk-checkin.h b/bulk-checkin.h > index b26f3dc3b74..8efb01ed669 100644 > --- a/bulk-checkin.h > +++ b/bulk-checkin.h > @@ -6,11 +6,13 @@ > > #include "cache.h" > > +int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile, const char *filename); > + > int index_bulk_checkin(struct object_id *oid, > int fd, size_t size, enum object_type type, > const char *path, unsigned flags); > > void plug_bulk_checkin(void); > -void unplug_bulk_checkin(void); > +void unplug_bulk_checkin(struct lock_file *); > > #endif > diff --git a/config.c b/config.c > index 
f33abeab851..375bdb24b0a 100644 > --- a/config.c > +++ b/config.c > @@ -1509,7 +1509,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb) > } > > if (!strcmp(var, "core.fsyncobjectfiles")) { > - fsync_object_files = git_config_bool(var, value); > + int is_bool; > + > + fsync_object_files = git_config_bool_or_int(var, value, &is_bool); > return 0; > } > > diff --git a/config.mak.uname b/config.mak.uname > index 69413fb3dc0..8c07f2265a8 100644 > --- a/config.mak.uname > +++ b/config.mak.uname > @@ -53,6 +53,7 @@ ifeq ($(uname_S),Linux) > HAVE_CLOCK_MONOTONIC = YesPlease > # -lrt is needed for clock_gettime on glibc <= 2.16 > NEEDS_LIBRT = YesPlease > + HAVE_SYNC_FILE_RANGE = YesPlease > HAVE_GETDELIM = YesPlease > SANE_TEXT_GREP=-a > FREAD_READS_DIRECTORIES = UnfortunatelyYes > @@ -133,6 +134,7 @@ ifeq ($(uname_S),Darwin) > COMPAT_OBJS += compat/precompose_utf8.o > BASIC_CFLAGS += -DPRECOMPOSE_UNICODE > BASIC_CFLAGS += -DPROTECT_HFS_DEFAULT=1 > + BASIC_CFLAGS += -DFSYNC_DOESNT_FLUSH=1 > HAVE_BSD_SYSCTL = YesPlease > FREAD_READS_DIRECTORIES = UnfortunatelyYes > HAVE_NS_GET_EXECUTABLE_PATH = YesPlease > diff --git a/configure.ac b/configure.ac > index 031e8d3fee8..c711037d625 100644 > --- a/configure.ac > +++ b/configure.ac > @@ -1090,6 +1090,14 @@ AC_COMPILE_IFELSE([CLOCK_MONOTONIC_SRC], > [AC_MSG_RESULT([no]) > HAVE_CLOCK_MONOTONIC=]) > GIT_CONF_SUBST([HAVE_CLOCK_MONOTONIC]) > + > +# > +# Define HAVE_SYNC_FILE_RANGE=YesPlease if sync_file_range is available. > +GIT_CHECK_FUNC(sync_file_range, > + [HAVE_SYNC_FILE_RANGE=YesPlease], > + [HAVE_SYNC_FILE_RANGE]) > +GIT_CONF_SUBST([HAVE_SYNC_FILE_RANGE]) > + > # > # Define NO_SETITIMER if you don't have setitimer. 
> GIT_CHECK_FUNC(setitimer, > diff --git a/git-compat-util.h b/git-compat-util.h > index b46605300ab..d14e2436276 100644 > --- a/git-compat-util.h > +++ b/git-compat-util.h > @@ -1210,6 +1210,13 @@ __attribute__((format (printf, 1, 2))) NORETURN > void BUG(const char *fmt, ...); > #endif > > +enum fsync_action { > + FSYNC_WRITEOUT_ONLY, > + FSYNC_HARDWARE_FLUSH > +}; > + > +int git_fsync(int fd, enum fsync_action action); > + > /* > * Preserves errno, prints a message, but gives no warning for ENOENT. > * Returns 0 on success, which includes trying to unlink an object that does > diff --git a/object-file.c b/object-file.c > index 607e9e2f80b..5f04143dde0 100644 > --- a/object-file.c > +++ b/object-file.c > @@ -1859,16 +1859,6 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf, > return 0; > } > > -/* Finalize a file on disk, and close it. */ > -static int close_loose_object(int fd, const char *tmpfile, const char *filename) > -{ > - if (fsync_object_files) > - fsync_or_die(fd, "loose object file"); > - if (close(fd) != 0) > - die_errno(_("error when closing loose object file")); > - return finalize_object_file(tmpfile, filename); > -} > - > /* Size of directory component, including the ending '/' */ > static inline int directory_size(const char *filename) > { > @@ -1982,7 +1972,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr, > warning_errno(_("failed futimes() on %s"), tmp_file.buf); > } > > - return close_loose_object(fd, tmp_file.buf, filename.buf); > + return fsync_and_close_loose_object_bulk_checkin(fd, tmp_file.buf, filename.buf); > } > > static int freshen_loose_object(const struct object_id *oid) > diff --git a/wrapper.c b/wrapper.c > index 563ad590df1..37a8b61a7df 100644 > --- a/wrapper.c > +++ b/wrapper.c > @@ -538,6 +538,42 @@ int xmkstemp_mode(char *filename_template, int mode) > return fd; > } > > +int git_fsync(int fd, enum fsync_action action) > +{ > + if (action == FSYNC_WRITEOUT_ONLY) { > +#ifdef 
__APPLE__ > + /* > + * on Mac OS X, fsync just causes filesystem cache writeback but does not > + * flush hardware caches. > + */ > + return fsync(fd); > +#endif > + > +#ifdef HAVE_SYNC_FILE_RANGE > + /* > + * On linux 2.6.17 and above, sync_file_range is the way to issue > + * a writeback without a hardware flush. An offset of 0 and size of 0 > + * indicates writeout of the entire file and the wait flags ensure that all > + * dirty data is written to the disk (potentially in a disk-side cache) > + * before we continue. > + */ > + > + return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE | > + SYNC_FILE_RANGE_WRITE | > + SYNC_FILE_RANGE_WAIT_AFTER); > +#endif > + > + errno = ENOSYS; > + return -1; > + } Hmm. I wonder whether we can do this more consistently with how Git usually does platform-specific things. In the 3rd patch, the one where you implemented Windows-specific support, in the Git for Windows PR at https://github.com/git-for-windows/git/pull/3391, you introduce a `mingw_fsync_no_flush()` function and define `fsync_no_flush` to expand to that function name. This is very similar to how Git does things. Take for example the `offset_1st_component` macro: https://github.com/git/git/blob/v2.33.0/git-compat-util.h#L386-L392 Unless defined in a platform-specific manner, it is defined in `git-compat-util.h`: #ifndef offset_1st_component static inline int git_offset_1st_component(const char *path) { return is_dir_sep(path[0]); } #define offset_1st_component git_offset_1st_component #endif And on Windows, it is defined as following (https://github.com/git/git/blob/v2.33.0/compat/win32/path-utils.h#L34-L35), before the lines quoted above: int win32_offset_1st_component(const char *path); #define offset_1st_component win32_offset_1st_component We could do the exact same thing here. Define a platform-specific `mingw_fsync_no_flush()` in `compat/mingw.h` and define the macro `fsync_no_flush` to point to it. 
In `git-compat-util.h`, in the `__APPLE__`-specific part, implement it via `fsync()`. And later, in the platform-independent part, _iff_ the macro has not yet been defined, implement an inline function that does that `HAVE_SYNC_FILE_RANGE` dance and falls back to `ENOSYS`. That would contain the platform-specific `#ifdef` blocks to `git-compat-util.h`, which is exactly where we want them. > + > +#ifdef __APPLE__ > + return fcntl(fd, F_FULLFSYNC); > +#else > + return fsync(fd); > +#endif Same thing here. We would probably want something like `fsync_with_flush` here. It is my hope that you find my comments and suggestions helpful. Thank you, Johannes > +} > + > static int warn_if_unremovable(const char *op, const char *file, int rc) > { > int err; > diff --git a/write-or-die.c b/write-or-die.c > index d33e68f6abb..8f53953d4ab 100644 > --- a/write-or-die.c > +++ b/write-or-die.c > @@ -57,7 +57,7 @@ void fprintf_or_die(FILE *f, const char *fmt, ...) > > void fsync_or_die(int fd, const char *msg) > { > - while (fsync(fd) < 0) { > + while (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0) { > if (errno != EINTR) > die_errno("fsync error on '%s'", msg); > } > -- > gitgitgadget >
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> It makes sense, but I would recommend using a more easily explained value
> than `2`. Maybe `delayed`? Or `bulk` or `batched`?

While we have less than 100% confidence in the implementation, it may make sense to have such a knob to choose between "do we fsync the old, known-safe but slow way, or do we fsync in batch" behaviours, and I agree that the knob should not be called a cryptic "2". But in a distant future when this new way of flushing proves to be stable, it would make sense if the new behaviour were triggered by the plain vanilla 'true', no?

In a sense, running fsync in a batch (or using syncfs) is an implementation detail of "we sync after writing out object files and before declaring success".

Thanks.
On Wed, Aug 25, 2021 at 9:31 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Wed, Aug 25 2021, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > When adding many objects to a repo with core.fsyncObjectFiles set to
> > true, the cost of fsync'ing each object file can become prohibitive.
> >
> > One major source of the cost of fsync is the implied flush of the
> > hardware writeback cache within the disk drive. Fortunately, Windows,
> > MacOS, and Linux each offer mechanisms to write data from the filesystem
> > page cache without initiating a hardware flush.
> >
> > This patch introduces a new 'core.fsyncObjectFiles = 2' option that
> > takes advantage of the bulk-checkin infrastructure to batch up hardware
> > flushes.
> >
> > When the new mode is enabled we do the following for new objects:
> >
> > 1. Create a tmp_obj_XXXX file and write the object data to it.
> > 2. Issue a pagecache writeback request and wait for it to complete.
> > 3. Record the tmp name and the final name in the bulk-checkin state for
> >    later rename.
> >
> > At the end of the entire transaction we:
> > 1. Issue a fsync against the lock file to flush the hardware writeback
> >    cache, which should by now have processed the tmp file writes.
> > 2. Rename all of the temp files to their final names.
> > 3. When updating the index and/or refs, we will issue another fsync
> >    internal to that operation.
> >
> > On a filesystem with a singular journal that is updated during name
> > operations (e.g. create, link, rename, etc), such as NTFS and HFS+, we
> > would expect the fsync to trigger a journal writeout so that this
> > sequence is enough to ensure that the user's data is durable by the time
> > the git command returns.
> >
> > This change also updates the MacOS code to trigger a real hardware flush
> > via fcntl(fd, F_FULLFSYNC) when fsync_or_die is called. Previously, on
> > MacOS there was no guarantee of durability since a simple fsync(2) call
> > does not flush any hardware caches.
>
> Thanks for working on this, good to see fsync issues picked up after some
> on-list pause.
>
> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> > index c04f62a54a1..3b672c2db67 100644
> > --- a/Documentation/config/core.txt
> > +++ b/Documentation/config/core.txt
> > @@ -548,12 +548,17 @@ core.whitespace::
> >  	errors. The default tab width is 8. Allowed values are 1 to 63.
> >
> >  core.fsyncObjectFiles::
> > -	This boolean will enable 'fsync()' when writing object files.
> > -+
> > -This is a total waste of time and effort on a filesystem that orders
> > -data writes properly, but can be useful for filesystems that do not use
> > -journalling (traditional UNIX filesystems) or that only journal metadata
> > -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> > +	A boolean value or the number '2', indicating the level of durability
> > +	applied to object files.
> > ++
> > +This setting controls how much effort Git makes to ensure that data added to
> > +the object store are durable in the case of an unclean system shutdown. If
> > +'false', Git allows data to remain in file system caches according to operating
> > +system policy, whence they may be lost if the system loses power or crashes. A
> > +value of 'true' instructs Git to force objects to stable storage immediately
> > +when they are added to the object store. The number '2' is an experimental
> > +value that also preserves durability but tries to perform hardware flushes in a
> > +batch.
>
> Some feedback/thoughts:
>
> 0) Let's not expose "2" to users, but give it some friendly config name
> and just translate this to the enum internally.

Agreed. I'll follow suggestions from here and elsewhere to make this a human-readable string. Is "core.fsyncObjectFiles=batch" acceptable?
> > 1) Your commit message says "When updating the index and/or refs[...]" > but we're changing core.fsyncObjectFiles here, I assume that's > summarizing existing behavior then That's what I intended. I'll make the patch description more explicit about that. In general, this patch is only concerned with loose object files. It's my assumption that other parts of the system (like the refs db) need to perform their own data consistency and must have their own durability. I do think that long-term, the Git community should think about having a general transaction mechanism with redo logging to have a consistent method for achieving durability. > > 2) You say "when adding many [loose] objects to a repo[...]", and the > test-case is "git stash push", but for e.g. accepting pushes we have > transfer.unpackLimit. > > It would be interesting to see if/how this impacts performance there, > and also if not that should at least be called out in > documentation. I.e. users might want to set this differently on servers > v.s. checkouts. > > But also, is this sort of thing something we could mitigate even more in > commands like "git stash push" by just writing a pack instead of N loose > objects? > > I don't think such questions should preclude changing the fsync > approach, or offering more options, but they should inform our > longer-term goals. Dealing only/mostly in packfiles would be a great approach. I'd hope that this fsyncing work would mostly be superseded if such a change is rolled out. I just read about the geometric repacking stuff, and it looks reminiscent of the LSM-tree approach to databases. > > 3) Re some of the musings about fsync() recently in > https://lore.kernel.org/git/877dhs20x3.fsf@evledraar.gmail.com/; is this > method of doing not-quite-an-fsync guaranteed by some OS's / POSIX etc, > or is it more like the initial approach before core.fsyncObjectFiles, > i.e. 
the happy-go-lucky approach described in the "[...]that orders data > writes properly[...]" documentation you're removing. I am confident about the validity of the batched approach on Windows when running on NTFS and ReFS, given my background as a Windows FS developer. We are unlikely to change any mainstream data consistency behavior of our filesystems to be weaker, given the type of errors such changes would cause. In Windows, we call the FS requirements to support this change "external metadata consistency", which states that all metadata operations that could have led to a state later FSYNCed must be visible after the fsync. macOS's fsync documentation indicates that they are likely to implement the required guarantees. They specifically say that fsync triggers writeback of data and metadata, but does not issue a hardware flush. Please see the doc at https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fsync.2.html. It is also notable that Apple SSDs have particularly bad performance for flush operations (we noticed this when booting Windows through BootCamp as well). Unfortunately my perusal of the man pages and documentation I could find doesn't give me this level of confidence on typical Linux filesystems. For instance, the notion of having to fsync the parent directory in order to render an inode's link findable eliminates a lot of the advantage of this change, though we could batch those and would have to do at most 256. This thread is somewhat instructive, but inconclusive: https://lwn.net/ml/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/. One conclusion from reviewing that thread is that as of then, sync_file_range isn't actually enough to make a hard guarantee about writeout occurring. See https://lore.kernel.org/linux-fsdevel/20190319204330.GY26298@dastard/. My hope is that the Linux FS developers have rectified that shortcoming by now. 
> > 4) While that documentation written by Linus long ago is rather > flippant, I think just removing it and not replacing it with some > discussion about how this is a trade-off v.s. real-world filesystem > semantics isn't a good change. I don't think the replaced documentation applies anymore to an ext4 or xfs system with delayed allocation. Those filesystems effectively have data=writeback semantics because they don't need data ordering to avoid exposing unwritten data, and so don't write the data at any particular syscall boundary or with any particular ordering. I think my updated version of the documentation for "= false" is accurate and more helpful from a user perspective ("up to OS policy when your data becomes durable in the event of an unclean shutdown"). "= true" also has a reasonable description, though I might add some verbiage indicating that this setting could be costly. I'll take a crack at improving the batched mode documentation. > > 5) On a similar thought as transfer.unpackLimit in #2, I wonder if this > fsync() setting shouldn't really be something we should be splitting > up. I.e. maybe handle batch loose object writes one way, ref updates > another way etc. I think moving core.fsync* to a setting like what we > have for fsck.* and <cmd>.fsck.* is probably a better thing to do in the > longer term. > > I.e. being able to do things like: > > fsync.objectFiles = none > fsync.refFiles = cache # or "hardware" > receive.fsync.objectFiles = hardware > receive.fsync.refFiles = hardware > > Or whatever, i.e. we're using one hammer for all of these now, but I > suspect most users who care about fsync care about /some/ fsync, not > everything. > I disagree. I believe Git should offer a consolidated config setting with two overall goals: 1) A consistent high-integrity setting across the entire git index/object/ref state, primarily for people using a repo for active development of changes. 
This should roughly guarantee that when a git command that adds data to the repo completes, the data is durable within git, including the refs needed to find it. 2) A lower-integrity setting useful for build/CI, maintainers who are applying lots of patches, etc, where it is expected that the data is available elsewhere and can be recovered easily. > 6) Inline comments below. > > > +struct object_rename { > > + char *src; > > + char *dst; > > +}; > > + > > +static struct bulk_rename_state { > > + struct object_rename *renames; > > + uint32_t alloc_renames; > > + uint32_t nr_renames; > > +} bulk_rename_state; > > In a crash of git itself it seems we're going to leave some litter > behind in the object dir now, and "git gc" won't know how to clean it > up. I think this is going to want to just use the tmp-objdir.[ch] API, > which might or might not need to be extended for loose objects / some > oddities of this use-case. > It appears that "git prune" would take care of these files. > Also, if you have a pair of things like this the string-list API is much > more pleasing to use than coming up with your own encapsulation. Thanks, I'll update to use the string-list API. > > static struct bulk_checkin_state { > > unsigned plugged:1; > > > > @@ -21,13 +33,15 @@ static struct bulk_checkin_state { > > struct pack_idx_entry **written; > > uint32_t alloc_written; > > uint32_t nr_written; > > -} state; > > > > + > > + free(state->renames); > > + memset(state, 0, sizeof(*state)); > > So with this and other use of the "state" variable is this part of > bulk-checkin going to become thread-unsafe, was that already the case? Yes, this code was already thread-unsafe if we're in the "bulk checkin plugged" mode and that hasn't changed. Is it worth fixing this right now? Is there a preexisting example of code that uses thread-local-storage inside a library function and then merges the state later? 
Bonus points for only doing the thread-local stuff if alternate threads are actually active. > > +static void add_rename_bulk_checkin(struct bulk_rename_state *state, > > + const char *src, const char *dst) > > +{ > > + struct object_rename *rename; > > + > > + ALLOC_GROW(state->renames, state->nr_renames + 1, state->alloc_renames); > > + > > + rename = &state->renames[state->nr_renames++]; > > + rename->src = xstrdup(src); > > + rename->dst = xstrdup(dst); > > +} > > All boilerplate duplicating things you'd get with a string-list for free... Will fix. Thanks. > > > + /* > > + * If we have a plugged bulk checkin, we issue a call that > > + * cleans the filesystem page cache but avoids a hardware flush > > + * command. Later on we will issue a single hardware flush > > + * before renaming files as part of do_sync_and_rename. > > + */ > > So this is the sort of thing I meant by extending Linus's docs, I know > some FS's work this way, but not all do. > > Also there's no guarantee in git that your .git is on one FS, so I think > even for the FS's you have in mind this might not be an absolute > guarantee... This is unfortunately an issue that isn't resolvable within Git. I think there's value in supporting batched mode for the common systems and filesystems that do support the guarantee. I'd like to be able to set batched mode as the default on Windows eventually. It also looks like it can be the default on macOS. I think Linux might be able to get to the desired semantics for default distro filesystems with minimal changes. Perhaps this new mode in Git would provide them with a motivation to do so.
On Wed, Aug 25, 2021 at 11:52 AM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > On Wed, 25 Aug 2021, Neeraj Singh via GitGitGadget wrote: > > > From: Neeraj Singh <neerajsi@microsoft.com> > > > > When adding many objects to a repo with core.fsyncObjectFiles set to > > true, the cost of fsync'ing each object file can become prohibitive. > > > > One major source of the cost of fsync is the implied flush of the > > hardware writeback cache within the disk drive. Fortunately, Windows, > > MacOS, and Linux each offer mechanisms to write data from the filesystem > > page cache without initiating a hardware flush. > > > > This patch introduces a new 'core.fsyncObjectFiles = 2' option that > > takes advantage of the bulk-checkin infrastructure to batch up hardware > > flushes. > > It makes sense, but I would recommend using a more easily explained value > than `2`. Maybe `delayed`? Or `bulk` or `batched`? > > The way this would be implemented would look somewhat like the > implementation for `core.abbrev`, which also accepts a string ("auto") or > a Boolean (or even an integral number), see > https://github.com/git/git/blob/v2.33.0/config.c#L1367-L1381: > > if (!strcmp(var, "core.abbrev")) { > if (!value) > return config_error_nonbool(var); > if (!strcasecmp(value, "auto")) > default_abbrev = -1; > else if (!git_parse_maybe_bool_text(value)) > default_abbrev = the_hash_algo->hexsz; > else { > int abbrev = git_config_int(var, value); > if (abbrev < minimum_abbrev || abbrev > the_hash_algo->hexsz) > return error(_("abbrev length out of range: %d"), abbrev); > default_abbrev = abbrev; > } > return 0; > } > Thanks for the code example. I'll follow something similar. I'll prefer the name "batch" for the new fsync mode. > > When the new mode is enabled we do the following for new objects: > > > > 1. Create a tmp_obj_XXXX file and write the object data to it. > > 2. Issue a pagecache writeback request and wait for it to complete. > > 3. 
Record the tmp name and the final name in the bulk-checkin state for > > later rename. > > > > At the end of the entire transaction we: > > 1. Issue a fsync against the lock file to flush the hardware writeback > > cache, which should by now have processed the tmp file writes. > > 2. Rename all of the temp files to their final names. > > 3. When updating the index and/or refs, we will issue another fsync > > internal to that operation. > > > > On a filesystem with a singular journal that is updated during name > > operations (e.g. create, link, rename, etc), such as NTFS and HFS+, we > > would expect the fsync to trigger a journal writeout so that this > > sequence is enough to ensure that the user's data is durable by the time > > the git command returns. > > > > This change also updates the MacOS code to trigger a real hardware flush > > via fcntl(fd, F_FULLFSYNC) when fsync_or_die is called. Previously, on > > MacOS there was no guarantee of durability since a simple fsync(2) call > > does not flush any hardware caches. > > You included a very nice table with performance numbers in the cover > letter. Maybe include that here, in the commit message? Will do. > > > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> > > --- > > Documentation/config/core.txt | 17 ++++-- > > Makefile | 4 ++ > > builtin/add.c | 3 +- > > builtin/update-index.c | 3 + > > bulk-checkin.c | 105 +++++++++++++++++++++++++++++++--- > > bulk-checkin.h | 4 +- > > config.c | 4 +- > > config.mak.uname | 2 + > > configure.ac | 8 +++ > > git-compat-util.h | 7 +++ > > object-file.c | 12 +--- > > wrapper.c | 36 ++++++++++++ > > write-or-die.c | 2 +- > > 13 files changed, 177 insertions(+), 30 deletions(-) > > > > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt > > index c04f62a54a1..3b672c2db67 100644 > > --- a/Documentation/config/core.txt > > +++ b/Documentation/config/core.txt > > @@ -548,12 +548,17 @@ core.whitespace:: > > errors. The default tab width is 8. 
Allowed values are 1 to 63. > > > > core.fsyncObjectFiles:: > > - This boolean will enable 'fsync()' when writing object files. > > -+ > > -This is a total waste of time and effort on a filesystem that orders > > -data writes properly, but can be useful for filesystems that do not use > > -journalling (traditional UNIX filesystems) or that only journal metadata > > -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback"). > > + A boolean value or the number '2', indicating the level of durability > > + applied to object files. > > ++ > > +This setting controls how much effort Git makes to ensure that data added to > > +the object store are durable in the case of an unclean system shutdown. If > > In addition to the content, I also like a lot that this tempers down the > language to be a lot more agreeable to read. > > > +'false', Git allows data to remain in file system caches according to operating > > +system policy, whence they may be lost if the system loses power or crashes. A > > +value of 'true' instructs Git to force objects to stable storage immediately > > +when they are added to the object store. The number '2' is an experimental > > +value that also preserves durability but tries to perform hardware flushes in a > > +batch. I'll be revising this text a little bit to respond to avarab's feedback. I'm guessing that this will need a few more rounds of tweaking. 
> > > > core.preloadIndex:: > > Enable parallel index preload for operations like 'git diff' > > diff --git a/Makefile b/Makefile > > index 9573190f1d7..cb950ee43d3 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -1896,6 +1896,10 @@ ifdef HAVE_CLOCK_MONOTONIC > > BASIC_CFLAGS += -DHAVE_CLOCK_MONOTONIC > > endif > > > > +ifdef HAVE_SYNC_FILE_RANGE > > + BASIC_CFLAGS += -DHAVE_SYNC_FILE_RANGE > > +endif > > + > > ifdef NEEDS_LIBRT > > EXTLIBS += -lrt > > endif > > diff --git a/builtin/add.c b/builtin/add.c > > index 09e684585d9..c58dfcd4bc3 100644 > > --- a/builtin/add.c > > +++ b/builtin/add.c > > @@ -670,7 +670,8 @@ int cmd_add(int argc, const char **argv, const char *prefix) > > > > if (chmod_arg && pathspec.nr) > > exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only); > > - unplug_bulk_checkin(); > > + > > + unplug_bulk_checkin(&lock_file); > > > > finish: > > if (write_locked_index(&the_index, &lock_file, > > diff --git a/builtin/update-index.c b/builtin/update-index.c > > index f1f16f2de52..64d025cf49e 100644 > > --- a/builtin/update-index.c > > +++ b/builtin/update-index.c > > @@ -5,6 +5,7 @@ > > */ > > #define USE_THE_INDEX_COMPATIBILITY_MACROS > > #include "cache.h" > > +#include "bulk-checkin.h" > > #include "config.h" > > #include "lockfile.h" > > #include "quote.h" > > @@ -1152,6 +1153,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > > struct strbuf unquoted = STRBUF_INIT; > > > > setup_work_tree(); > > + plug_bulk_checkin(); > > while (getline_fn(&buf, stdin) != EOF) { > > char *p; > > if (!nul_term_line && buf.buf[0] == '"') { > > @@ -1166,6 +1168,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > > chmod_path(set_executable_bit, p); > > free(p); > > } > > + unplug_bulk_checkin(&lock_file); > > strbuf_release(&unquoted); > > strbuf_release(&buf); > > } > > This change to `cmd_update_index()`, would it make sense to separate it > out into its own commit? 
I think it would, as it is a slight change of > behavior of the `--stdin` mode, no? Will do. This makes me think that there is some risk here if someone launches an update-index and then attempts to use the index to find a newly-added OID without completing the update-index invocation. It's possible to construct such a scenario if someone's using the "--verbose" scenario to find out about actions taken by the subprocess. Do you think I should add a flag to conditionally enable bulk-checkin to avoid regression for this (I expect unlikely) case? > > diff --git a/bulk-checkin.c b/bulk-checkin.c > > index b023d9959aa..71004db863e 100644 > > --- a/bulk-checkin.c > > +++ b/bulk-checkin.c > > @@ -3,6 +3,7 @@ > > */ > > #include "cache.h" > > #include "bulk-checkin.h" > > +#include "lockfile.h" > > #include "repository.h" > > #include "csum-file.h" > > #include "pack.h" > > @@ -10,6 +11,17 @@ > > #include "packfile.h" > > #include "object-store.h" > > > > +struct object_rename { > > + char *src; > > + char *dst; > > +}; > > + > > +static struct bulk_rename_state { > > + struct object_rename *renames; > > + uint32_t alloc_renames; > > + uint32_t nr_renames; > > +} bulk_rename_state; > > + > > static struct bulk_checkin_state { > > unsigned plugged:1; > > > > @@ -21,13 +33,15 @@ static struct bulk_checkin_state { > > struct pack_idx_entry **written; > > uint32_t alloc_written; > > uint32_t nr_written; > > -} state; > > + > > +} bulk_checkin_state; > > While it definitely looks better after this patch, having the new code > _and_ the rename in the same set of changes makes it a bit harder to > review and to spot bugs. > > Could I ask you to split this rename out into its own, preparatory patch > ("preparatory" meaning that it should be ordered before the patch that > adds support for the new fsync mode)? Will do. I'm going to change a few things as I'll mention below. 
> > > > static void finish_bulk_checkin(struct bulk_checkin_state *state) > > { > > struct object_id oid; > > struct strbuf packname = STRBUF_INIT; > > int i; > > + unsigned old_plugged; > > Since this variable is designed to hold the value of the `plugged` field > of the `bulk_checkin_state`, which is declared as `unsigned plugged:1;`, > we probably want a `:1` here, too. > > Also: is it really "old", rather than "orig"? I would have expected the > name `orig_plugged` or `save_plugged`. > > > > > if (!state->f) > > return; > > @@ -55,13 +69,42 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state) > > > > clear_exit: > > free(state->written); > > + old_plugged = state->plugged; > > memset(state, 0, sizeof(*state)); > > + state->plugged = old_plugged; > > Unfortunately, I lack the context to understand the purpose of this. Is > the idea that `plugged` gives an indication whether we're still within > that batch that should be fsync'ed all at once? > > I only see one caller where this would make a difference, and that caller > is `deflate_to_pack()`. Maybe we should just start that function with > `unsigned save_plugged:1 = state->plugged;` and restore it after the > `while (1)` loop? > The problem being solved here is that in some (rare) circumstances it looks like we could unplug the state before an explicit unplug call. That would be fatal for the bulk-fsync code, and is probably suboptimal for the bulk checkin code. I'm going to separate the boolean out into a different variable in my preparatory patch so that this fragility goes away. 
> > > > strbuf_release(&packname); > > /* Make objects we just wrote available to ourselves */ > > reprepare_packed_git(the_repository); > > } > > > > +static void do_sync_and_rename(struct bulk_rename_state *state, struct lock_file *lock_file) > > +{ > > + if (state->nr_renames) { > > + int i; > > + > > + /* > > + * Issue a full hardware flush against the lock file to ensure > > + * that all objects are durable before any renames occur. > > + * The code in fsync_and_close_loose_object_bulk_checkin has > > + * already ensured that writeout has occurred, but it has not > > + * flushed any writeback cache in the storage hardware. > > + */ > > + fsync_or_die(get_lock_file_fd(lock_file), get_lock_file_path(lock_file)); > > + > > + for (i = 0; i < state->nr_renames; i++) { > > + if (finalize_object_file(state->renames[i].src, state->renames[i].dst)) > > + die_errno(_("could not rename '%s'"), state->renames[i].src); > > + > > + free(state->renames[i].src); > > + free(state->renames[i].dst); > > + } > > + > > + free(state->renames); > > + memset(state, 0, sizeof(*state)); > > Hmm. There is a lot of `memset()`ing going on, and I am not quite sure > that I like what I am seeing. It does not help that there are now two very > easily-confused structs: `bulk_rename_state` and `bulk_checkin_state`. > Which made me worried at first that we might be resetting the `renames` > field inadvertently in `finish_bulk_checkin()`. Yeah, the problem I was trying to avoid was an issue with resetting the rename state during the "too big pack" case. What do you think about a "bulk_pack_state" and a "bulk_fsync_state" variable? One side advantage of the "s/state/bulk_rename_state" change is that the variable name is no longer ambiguous in the debugger across different object files. > > Maybe we can do this instead? 
> > FREE_AND_NULL(state->renames); > state->nr_renames = state->alloc_renames = 0; > I wrote it that way originally, but saw an error with coccinelle and then decided to follow the convention from elsewhere in the file. That's when I noticed the nasty "early unplug" case, so the Github actions certainly saved me from an unfortunate bug. Thanks for setting them up! I'll go toward your suggestion. > > + } > > +} > > + > > static int already_written(struct bulk_checkin_state *state, struct object_id *oid) > > { > > int i; > > @@ -256,25 +299,69 @@ static int deflate_to_pack(struct bulk_checkin_state *state, > > return 0; > > } > > > > +static void add_rename_bulk_checkin(struct bulk_rename_state *state, > > + const char *src, const char *dst) > > +{ > > + struct object_rename *rename; > > + > > + ALLOC_GROW(state->renames, state->nr_renames + 1, state->alloc_renames); > > + > > + rename = &state->renames[state->nr_renames++]; > > + rename->src = xstrdup(src); > > + rename->dst = xstrdup(dst); > > +} > > + > > +int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile, > > + const char *filename) > > +{ > > + if (fsync_object_files) { > > + /* > > + * If we have a plugged bulk checkin, we issue a call that > > + * cleans the filesystem page cache but avoids a hardware flush > > + * command. Later on we will issue a single hardware flush > > + * before renaming files as part of do_sync_and_rename. 
> > + */ > > + if (bulk_checkin_state.plugged && > > + fsync_object_files == 2 && > > + git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) { > > + add_rename_bulk_checkin(&bulk_rename_state, tmpfile, filename); > > + if (close(fd)) > > + die_errno(_("error when closing loose object file")); > > + > > + return 0; > > + > > + } else { > > + fsync_or_die(fd, "loose object file"); > > + } > > + } > > + > > + if (close(fd)) > > + die_errno(_("error when closing loose object file")); > > + > > + return finalize_object_file(tmpfile, filename); > > +} > > + > > int index_bulk_checkin(struct object_id *oid, > > int fd, size_t size, enum object_type type, > > const char *path, unsigned flags) > > { > > - int status = deflate_to_pack(&state, oid, fd, size, type, > > + int status = deflate_to_pack(&bulk_checkin_state, oid, fd, size, type, > > path, flags); > > - if (!state.plugged) > > - finish_bulk_checkin(&state); > > + if (!bulk_checkin_state.plugged) > > + finish_bulk_checkin(&bulk_checkin_state); > > return status; > > } > > > > void plug_bulk_checkin(void) > > { > > - state.plugged = 1; > > + bulk_checkin_state.plugged = 1; > > } > > > > -void unplug_bulk_checkin(void) > > +void unplug_bulk_checkin(struct lock_file *lock_file) > > { > > - state.plugged = 0; > > - if (state.f) > > - finish_bulk_checkin(&state); > > + bulk_checkin_state.plugged = 0; > > + if (bulk_checkin_state.f) > > + finish_bulk_checkin(&bulk_checkin_state); > > + > > + do_sync_and_rename(&bulk_rename_state, lock_file); > > } > > diff --git a/bulk-checkin.h b/bulk-checkin.h > > index b26f3dc3b74..8efb01ed669 100644 > > --- a/bulk-checkin.h > > +++ b/bulk-checkin.h > > @@ -6,11 +6,13 @@ > > > > #include "cache.h" > > > > +int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile, const char *filename); > > + > > int index_bulk_checkin(struct object_id *oid, > > int fd, size_t size, enum object_type type, > > const char *path, unsigned flags); > > > > void plug_bulk_checkin(void); > > -void 
unplug_bulk_checkin(void); > > +void unplug_bulk_checkin(struct lock_file *); > > > > #endif > > diff --git a/config.c b/config.c > > index f33abeab851..375bdb24b0a 100644 > > --- a/config.c > > +++ b/config.c > > @@ -1509,7 +1509,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb) > > } > > > > if (!strcmp(var, "core.fsyncobjectfiles")) { > > - fsync_object_files = git_config_bool(var, value); > > + int is_bool; > > + > > + fsync_object_files = git_config_bool_or_int(var, value, &is_bool); > > return 0; > > } > > > > diff --git a/config.mak.uname b/config.mak.uname > > index 69413fb3dc0..8c07f2265a8 100644 > > --- a/config.mak.uname > > +++ b/config.mak.uname > > @@ -53,6 +53,7 @@ ifeq ($(uname_S),Linux) > > HAVE_CLOCK_MONOTONIC = YesPlease > > # -lrt is needed for clock_gettime on glibc <= 2.16 > > NEEDS_LIBRT = YesPlease > > + HAVE_SYNC_FILE_RANGE = YesPlease > > HAVE_GETDELIM = YesPlease > > SANE_TEXT_GREP=-a > > FREAD_READS_DIRECTORIES = UnfortunatelyYes > > @@ -133,6 +134,7 @@ ifeq ($(uname_S),Darwin) > > COMPAT_OBJS += compat/precompose_utf8.o > > BASIC_CFLAGS += -DPRECOMPOSE_UNICODE > > BASIC_CFLAGS += -DPROTECT_HFS_DEFAULT=1 > > + BASIC_CFLAGS += -DFSYNC_DOESNT_FLUSH=1 > > HAVE_BSD_SYSCTL = YesPlease > > FREAD_READS_DIRECTORIES = UnfortunatelyYes > > HAVE_NS_GET_EXECUTABLE_PATH = YesPlease > > diff --git a/configure.ac b/configure.ac > > index 031e8d3fee8..c711037d625 100644 > > --- a/configure.ac > > +++ b/configure.ac > > @@ -1090,6 +1090,14 @@ AC_COMPILE_IFELSE([CLOCK_MONOTONIC_SRC], > > [AC_MSG_RESULT([no]) > > HAVE_CLOCK_MONOTONIC=]) > > GIT_CONF_SUBST([HAVE_CLOCK_MONOTONIC]) > > + > > +# > > +# Define HAVE_SYNC_FILE_RANGE=YesPlease if sync_file_range is available. > > +GIT_CHECK_FUNC(sync_file_range, > > + [HAVE_SYNC_FILE_RANGE=YesPlease], > > + [HAVE_SYNC_FILE_RANGE]) > > +GIT_CONF_SUBST([HAVE_SYNC_FILE_RANGE]) > > + > > # > > # Define NO_SETITIMER if you don't have setitimer. 
> > GIT_CHECK_FUNC(setitimer, > > diff --git a/git-compat-util.h b/git-compat-util.h > > index b46605300ab..d14e2436276 100644 > > --- a/git-compat-util.h > > +++ b/git-compat-util.h > > @@ -1210,6 +1210,13 @@ __attribute__((format (printf, 1, 2))) NORETURN > > void BUG(const char *fmt, ...); > > #endif > > > > +enum fsync_action { > > + FSYNC_WRITEOUT_ONLY, > > + FSYNC_HARDWARE_FLUSH > > +}; > > + > > +int git_fsync(int fd, enum fsync_action action); > > + > > /* > > * Preserves errno, prints a message, but gives no warning for ENOENT. > > * Returns 0 on success, which includes trying to unlink an object that does > > diff --git a/object-file.c b/object-file.c > > index 607e9e2f80b..5f04143dde0 100644 > > --- a/object-file.c > > +++ b/object-file.c > > @@ -1859,16 +1859,6 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf, > > return 0; > > } > > > > -/* Finalize a file on disk, and close it. */ > > -static int close_loose_object(int fd, const char *tmpfile, const char *filename) > > -{ > > - if (fsync_object_files) > > - fsync_or_die(fd, "loose object file"); > > - if (close(fd) != 0) > > - die_errno(_("error when closing loose object file")); > > - return finalize_object_file(tmpfile, filename); > > -} > > - > > /* Size of directory component, including the ending '/' */ > > static inline int directory_size(const char *filename) > > { > > @@ -1982,7 +1972,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr, > > warning_errno(_("failed futimes() on %s"), tmp_file.buf); > > } > > > > - return close_loose_object(fd, tmp_file.buf, filename.buf); > > + return fsync_and_close_loose_object_bulk_checkin(fd, tmp_file.buf, filename.buf); > > } > > > > static int freshen_loose_object(const struct object_id *oid) > > diff --git a/wrapper.c b/wrapper.c > > index 563ad590df1..37a8b61a7df 100644 > > --- a/wrapper.c > > +++ b/wrapper.c > > @@ -538,6 +538,42 @@ int xmkstemp_mode(char *filename_template, int mode) > > return fd; > > } 
> > > > +int git_fsync(int fd, enum fsync_action action) > > +{ > > + if (action == FSYNC_WRITEOUT_ONLY) { > > +#ifdef __APPLE__ > > + /* > > + * on Mac OS X, fsync just causes filesystem cache writeback but does not > > + * flush hardware caches. > > + */ > > + return fsync(fd); > > +#endif > > + > > +#ifdef HAVE_SYNC_FILE_RANGE > > + /* > > + * On linux 2.6.17 and above, sync_file_range is the way to issue > > + * a writeback without a hardware flush. An offset of 0 and size of 0 > > + * indicates writeout of the entire file and the wait flags ensure that all > > + * dirty data is written to the disk (potentially in a disk-side cache) > > + * before we continue. > > + */ > > + > > + return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE | > > + SYNC_FILE_RANGE_WRITE | > > + SYNC_FILE_RANGE_WAIT_AFTER); > > +#endif > > + > > + errno = ENOSYS; > > + return -1; > > + } > > Hmm. I wonder whether we can do this more consistently with how Git > usually does platform-specific things. > > In the 3rd patch, the one where you implemented Windows-specific support, > in the Git for Windows PR at > https://github.com/git-for-windows/git/pull/3391, you introduce a > `mingw_fsync_no_flush()` function and define `fsync_no_flush` to expand to > that function name. > > This is very similar to how Git does things. 
> Take for example the `offset_1st_component` macro:
> https://github.com/git/git/blob/v2.33.0/git-compat-util.h#L386-L392
>
> Unless defined in a platform-specific manner, it is defined in
> `git-compat-util.h`:
>
> 	#ifndef offset_1st_component
> 	static inline int git_offset_1st_component(const char *path)
> 	{
> 		return is_dir_sep(path[0]);
> 	}
> 	#define offset_1st_component git_offset_1st_component
> 	#endif
>
> And on Windows, it is defined as follows
> (https://github.com/git/git/blob/v2.33.0/compat/win32/path-utils.h#L34-L35),
> before the lines quoted above:
>
> 	int win32_offset_1st_component(const char *path);
> 	#define offset_1st_component win32_offset_1st_component
>
> We could do the exact same thing here. Define a platform-specific
> `mingw_fsync_no_flush()` in `compat/mingw.h` and define the macro
> `fsync_no_flush` to point to it. In `git-compat-util.h`, in the
> `__APPLE__`-specific part, implement it via `fsync()`. And later, in the
> platform-independent part, _iff_ the macro has not yet been defined,
> implement an inline function that does that `HAVE_SYNC_FILE_RANGE` dance
> and falls back to `ENOSYS`.
>
> That would contain the platform-specific `#ifdef` blocks to
> `git-compat-util.h`, which is exactly where we want them.
>
> > +
> > +#ifdef __APPLE__
> > +	return fcntl(fd, F_FULLFSYNC);
> > +#else
> > +	return fsync(fd);
> > +#endif
>
> Same thing here. We would probably want something like `fsync_with_flush`
> here.

I thought about doing it that way originally, but there's the unfortunate
fact that I'd have to alias fsync_no_flush to fsync on macOS and then
fsync to fcntl(F_FULLFSYNC).  I felt it would be clearer to someone
reviewing this functionality if we provided a very explicit git_fsync API
with a well-named flag and documented the OS-specific craziness in the C
file rather than through a layer of macros in the header file.

Given that, are you okay with keeping this code layout in the C file,
potentially with more local modifications?
> It is my hope that you find my comments and suggestions helpful.
>
> Thank you,
> Johannes

Very much so!  Again, thanks for the review.
-Neeraj
On Wed, Aug 25, 2021 at 05:49:45PM -0700, Neeraj Singh wrote:
> Unfortunately my perusal of the man pages and documentation I could find
> doesn't give me this level of confidence on typical Linux filesystems.
> For instance, the notion of having to fsync the parent directory in
> order to render an inode's link findable eliminates a lot of the
> advantage of this change, though we could batch those and would have to
> do at most 256.
>
> This thread is somewhat instructive, but inconclusive:
> https://lwn.net/ml/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/.

fsync/fdatasync only guarantee consistency for the file handle they are
called on.  The first linked document mentions an implementation
artifact: file systems with metadata logging tend to force their log out
up to the last modified transaction, and thus also force out metadata
changes done earlier.  This won't help with actual data writes at all,
as for them the act of writing back data will often generate new
metadata changes, and in general it is not a property to rely on if you
care about data integrity.  It is nice to optimize the order of the
fsync calls for metadata-only workloads, as then the later fsync calls
on earlier-modified file handles will often be no-ops.

> One conclusion from reviewing that thread is that as of then,
> sync_file_range isn't actually enough to make a hard guarantee about
> writeout occurring.  See
> https://lore.kernel.org/linux-fsdevel/20190319204330.GY26298@dastard/.
> My hope is that the Linux FS developers have rectified that shortcoming
> by now.

I'm not sure what shortcoming you mean.  sync_file_range is a system
call that only causes data writeback.  It never performs metadata
writeback and thus is not an integrity operation at all.  That is also
very clearly documented in the man page.
> I think my updated version of the documentation for "= false" is
> accurate and more helpful from a user perspective ("up to OS policy
> when your data becomes durable in the event of an unclean shutdown").
> "= true" also has a reasonable description, though I might add some
> verbiage indicating that this setting could be costly.

Your version is much better.  In fact it is still almost too nice, as in
general the data will not be durable and you do end up with a corrupted
repository in that case.  Note that even for bad old ext3 that was
usually the case.
On Wed, Aug 25, 2021 at 10:40:53AM -0700, Neeraj Singh wrote:
> I'd expect syncfs to suffer from the noisy-neighbor problem that Linus
> alluded to on the big thread you kicked off.

It does.  That being said, I suspect that in most developer-workstation
use cases it will still be a win.  Maybe I'll look into implementing it
after your series lands.

> If someone adds a more targeted bulk sync interface to the Linux
> kernel, I'm sure Git could be changed to use it.  Maybe an fcntl(2)
> interface that initiates writeback and registers completion with an
> eventfd.

That is in general very hard to do with how the VM-level writeback
occurs.  In the file system itself it could work much better; e.g. for
XFS we write the log up to a specific sequence number and could notify
when doing that.
On Wed, Aug 25, 2021 at 06:11:13PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 3) Re some of the musings about fsync() recently in
> https://lore.kernel.org/git/877dhs20x3.fsf@evledraar.gmail.com/; is this
> method of doing not-quite-an-fsync guaranteed by some OS's / POSIX etc,
> or is it more like the initial approach before core.fsyncObjectFiles,
> i.e. the happy-go-lucky approach described in the "[...]that orders data
> writes properly[...]" documentation you're removing.

Except for the now-removed ext3 filesystem in Linux, which basically
turned every fsync into syncfs (that is, a filesystem-wide sync), I've
never heard of such behavior for data writeback.  Many file systems will
sometimes or always behave like that for metadata writeback, but there
are no guarantees you could rely on for that.
On Wed, Aug 25, 2021 at 10:50 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Aug 25, 2021 at 05:49:45PM -0700, Neeraj Singh wrote:
> > One conclusion from reviewing that thread is that as of then,
> > sync_file_range isn't actually enough to make a hard guarantee about
> > writeout occurring.  See
> > https://lore.kernel.org/linux-fsdevel/20190319204330.GY26298@dastard/.
> > My hope is that the Linux FS developers have rectified that
> > shortcoming by now.
>
> I'm not sure what shortcoming you mean.  sync_file_range is a system
> call that only causes data writeback.  It never performs metadata
> writeback and thus is not an integrity operation at all.  That is also
> very clearly documented in the man page.
>

You're right.  On re-reading the man page, sync_file_range is listed as
an "extremely dangerous" system call.  The opportunity in the Linux
kernel is to offer an alternative set of flags or a separate API that
allows an application like Git to separate a metadata writeback request
from the disk flush.

Separately, I'm hoping I can push from the Windows filesystem side to
get a barrier primitive put into the NVMe standard so that we can offer
more useful behavior to applications rather than these painful hardware
flushes.
On Fri, Aug 27, 2021 at 05:20:44PM -0700, Neeraj Singh wrote:
> You're right.  On re-reading the man page, sync_file_range is listed as
> an "extremely dangerous" system call.  The opportunity in the Linux
> kernel is to offer an alternative set of flags or a separate API that
> allows an application like Git to separate a metadata writeback request
> from the disk flush.

How do you want to do that?  Metadata writeback without a cache flush is
worse than useless; in fact it is generally actively harmful.

To take XFS as an example, fsync and fdatasync do the following things:

 1) write back all dirty data for the file to the data device
 2) flush the write cache of the data device to ensure the data is
    really on disk before writing back the metadata referring to it
 3) write out the log up until the log sequence that contained the last
    modifications to the file
 4) flush the cache for the log device.  If the data device and the log
    device are the same (they usually are for common setups) and the log
    device supports the FUA bit that writes through the cache, the log
    writes use that bit and this step can be skipped.

So in general there are very few metadata writes, and it is absolutely
essential to flush the cache before them, because otherwise your
metadata could point to data that might not actually have made it to
disk.

The best way to optimize such a workload is by first batching all the
data writeout for multiple files in step one, then doing only one cache
flush and one log force (as we call it) to cover all the files.  syncfs
will do that, but without a good way to pick individual files.

> Separately, I'm hoping I can push from the Windows filesystem side to
> get a barrier primitive put into the NVMe standard so that we can offer
> more useful behavior to applications rather than these painful hardware
> flushes.
I'm not sure what you mean by barriers, but if you mean the concept of
imposing a global ordering on I/Os, as we did in Linux back in the bad
old days of the barrier bio flag, or as badly reinvented by this paper:

    https://www.usenix.org/conference/fast18/presentation/won

they might help a little bit with single-threaded operations, but will
heavily degrade I/O performance for multithreaded workloads.  As an
active member of (but not speaking for) the NVMe technical working
group, with a bit of knowledge of SSD internals, I also doubt it will be
very well received there.
On Fri, Aug 27, 2021 at 11:57 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Aug 27, 2021 at 05:20:44PM -0700, Neeraj Singh wrote:
> > You're right.  On re-reading the man page, sync_file_range is listed
> > as an "extremely dangerous" system call.  The opportunity in the
> > Linux kernel is to offer an alternative set of flags or a separate
> > API that allows an application like Git to separate a metadata
> > writeback request from the disk flush.
>
> How do you want to do that?  Metadata writeback without a cache flush
> is worse than useless; in fact it is generally actively harmful.
>
> To take XFS as an example, fsync and fdatasync do the following things:
>
>  1) write back all dirty data for the file to the data device
>  2) flush the write cache of the data device to ensure the data is
>     really on disk before writing back the metadata referring to it
>  3) write out the log up until the log sequence that contained the
>     last modifications to the file
>  4) flush the cache for the log device.  If the data device and the
>     log device are the same (they usually are for common setups) and
>     the log device supports the FUA bit that writes through the cache,
>     the log writes use that bit and this step can be skipped.
>
> So in general there are very few metadata writes, and it is absolutely
> essential to flush the cache before them, because otherwise your
> metadata could point to data that might not actually have made it to
> disk.
>
> The best way to optimize such a workload is by first batching all the
> data writeout for multiple files in step one, then doing only one
> cache flush and one log force (as we call it) to cover all the files.
> syncfs will do that, but without a good way to pick individual files.

Yes, I think we want to do step (1) of your sequence for all of the
files, then issue steps (2-4) for all files as a group.  Of course, if
the log fills up then we can flush the intermediate steps.
The unfortunate thing is that there's no Linux interface to do step (1)
and also ensure that the relevant data is in the log stream or is
otherwise available to be part of the durable metadata.  It seems to me
that XFS would be compatible with this sequence if the appropriate
kernel API existed.

> > Separately, I'm hoping I can push from the Windows filesystem side
> > to get a barrier primitive put into the NVMe standard so that we can
> > offer more useful behavior to applications rather than these painful
> > hardware flushes.
>
> I'm not sure what you mean by barriers, but if you mean the concept of
> imposing a global ordering on I/Os, as we did in Linux back in the bad
> old days of the barrier bio flag, or as badly reinvented by this
> paper:
>
>     https://www.usenix.org/conference/fast18/presentation/won
>
> they might help a little bit with single-threaded operations, but will
> heavily degrade I/O performance for multithreaded workloads.  As an
> active member of (but not speaking for) the NVMe technical working
> group, with a bit of knowledge of SSD internals, I also doubt it will
> be very well received there.

I looked at that paper and definitely agree with you about the
questionable implementation strategy they picked.  I don't (yet) have
detailed knowledge of SSD internals, but it's surprising to me that
there is so little value to barrier semantics within the drive as
opposed to a full durability sync.  At least for Windows, we have a
database (the Registry) for which any single-threaded latency
improvement would be welcome.
On Tue, Aug 31, 2021 at 12:59:14PM -0700, Neeraj Singh wrote:
> > So in general there are very few metadata writes, and it is
> > absolutely essential to flush the cache before them, because
> > otherwise your metadata could point to data that might not actually
> > have made it to disk.
> >
> > The best way to optimize such a workload is by first batching all
> > the data writeout for multiple files in step one, then doing only
> > one cache flush and one log force (as we call it) to cover all the
> > files.  syncfs will do that, but without a good way to pick
> > individual files.
>
> Yes, I think we want to do step (1) of your sequence for all of the
> files, then issue steps (2-4) for all files as a group.  Of course, if
> the log fills up then we can flush the intermediate steps.  The
> unfortunate thing is that there's no Linux interface to do step (1)
> and also ensure that the relevant data is in the log stream or is
> otherwise available to be part of the durable metadata.

There is also no interface to do steps 2-4 separately, mostly because
they are so hard to separate.  The only API I could envision is one that
takes an array of file descriptors and has the semantics of doing an
fsync/fdatasync for all of them, allowing the implementation to optimize
the order.  It would be implementable, but not quite as efficient as
syncfs.  I'm also pretty sure we've seen a few attempts at it in the
past that ran into various issues and didn't really make it far.

> I looked at that paper and definitely agree with you about the
> questionable implementation strategy they picked.  I don't (yet) have
> detailed knowledge of SSD internals, but it's surprising to me that
> there is so little value to barrier semantics within the drive as
> opposed to a full durability sync.  At least for Windows, we have a
> database (the Registry) for which any single-threaded latency
> improvement would be welcome.
The major issue with barriers, as in the paper above or as historic
Linux 2.6 kernels had them, is that they enforce a global ordering.  For
a software-only implementation like the Linux one this was already bad
enough, but a hardware/firmware implementation in NVMe would mean you'd
have to add global serialization to a storage interface standard and its
implementations, while these are very much about offering parallelism.
In fact we'd also have very similar issues with the modern Linux block
layer, which has applied some similar ideas.
diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..3b672c2db67 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -548,12 +548,17 @@ core.whitespace::
 	errors. The default tab width is 8. Allowed values are 1 to 63.
 
 core.fsyncObjectFiles::
-	This boolean will enable 'fsync()' when writing object files.
-+
-This is a total waste of time and effort on a filesystem that orders
-data writes properly, but can be useful for filesystems that do not use
-journalling (traditional UNIX filesystems) or that only journal metadata
-and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
+	A boolean value or the number '2', indicating the level of durability
+	applied to object files.
++
+This setting controls how much effort Git makes to ensure that data added to
+the object store are durable in the case of an unclean system shutdown. If
+'false', Git allows data to remain in file system caches according to operating
+system policy, whence they may be lost if the system loses power or crashes. A
+value of 'true' instructs Git to force objects to stable storage immediately
+when they are added to the object store. The number '2' is an experimental
+value that also preserves durability but tries to perform hardware flushes in a
+batch.
 
 core.preloadIndex::
 	Enable parallel index preload for operations like 'git diff'
diff --git a/Makefile b/Makefile
index 9573190f1d7..cb950ee43d3 100644
--- a/Makefile
+++ b/Makefile
@@ -1896,6 +1896,10 @@ ifdef HAVE_CLOCK_MONOTONIC
 	BASIC_CFLAGS += -DHAVE_CLOCK_MONOTONIC
 endif
 
+ifdef HAVE_SYNC_FILE_RANGE
+	BASIC_CFLAGS += -DHAVE_SYNC_FILE_RANGE
+endif
+
 ifdef NEEDS_LIBRT
 	EXTLIBS += -lrt
 endif
diff --git a/builtin/add.c b/builtin/add.c
index 09e684585d9..c58dfcd4bc3 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -670,7 +670,8 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 	if (chmod_arg && pathspec.nr)
 		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
-	unplug_bulk_checkin();
+
+	unplug_bulk_checkin(&lock_file);
 
 finish:
 	if (write_locked_index(&the_index, &lock_file,
diff --git a/builtin/update-index.c b/builtin/update-index.c
index f1f16f2de52..64d025cf49e 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -5,6 +5,7 @@
  */
 #define USE_THE_INDEX_COMPATIBILITY_MACROS
 #include "cache.h"
+#include "bulk-checkin.h"
 #include "config.h"
 #include "lockfile.h"
 #include "quote.h"
@@ -1152,6 +1153,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 		struct strbuf unquoted = STRBUF_INIT;
 
 		setup_work_tree();
+		plug_bulk_checkin();
 		while (getline_fn(&buf, stdin) != EOF) {
 			char *p;
 			if (!nul_term_line && buf.buf[0] == '"') {
@@ -1166,6 +1168,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 				chmod_path(set_executable_bit, p);
 			free(p);
 		}
+		unplug_bulk_checkin(&lock_file);
 		strbuf_release(&unquoted);
 		strbuf_release(&buf);
 	}
diff --git a/bulk-checkin.c b/bulk-checkin.c
index b023d9959aa..71004db863e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -3,6 +3,7 @@
  */
 #include "cache.h"
 #include "bulk-checkin.h"
+#include "lockfile.h"
 #include "repository.h"
 #include "csum-file.h"
 #include "pack.h"
@@ -10,6 +11,17 @@
 #include "packfile.h"
 #include "object-store.h"
 
+struct object_rename {
+	char *src;
+	char *dst;
+};
+
+static struct bulk_rename_state {
+	struct object_rename *renames;
+	uint32_t alloc_renames;
+	uint32_t nr_renames;
+} bulk_rename_state;
+
 static struct bulk_checkin_state {
 	unsigned plugged:1;
 
@@ -21,13 +33,15 @@ static struct bulk_checkin_state {
 	struct pack_idx_entry **written;
 	uint32_t alloc_written;
 	uint32_t nr_written;
-} state;
+
+} bulk_checkin_state;
 
 static void finish_bulk_checkin(struct bulk_checkin_state *state)
 {
 	struct object_id oid;
 	struct strbuf packname = STRBUF_INIT;
 	int i;
+	unsigned old_plugged;
 
 	if (!state->f)
 		return;
@@ -55,13 +69,42 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state)
 
 clear_exit:
 	free(state->written);
+	old_plugged = state->plugged;
 	memset(state, 0, sizeof(*state));
+	state->plugged = old_plugged;
 
 	strbuf_release(&packname);
 	/* Make objects we just wrote available to ourselves */
 	reprepare_packed_git(the_repository);
 }
 
+static void do_sync_and_rename(struct bulk_rename_state *state, struct lock_file *lock_file)
+{
+	if (state->nr_renames) {
+		int i;
+
+		/*
+		 * Issue a full hardware flush against the lock file to ensure
+		 * that all objects are durable before any renames occur.
+		 * The code in fsync_and_close_loose_object_bulk_checkin has
+		 * already ensured that writeout has occurred, but it has not
+		 * flushed any writeback cache in the storage hardware.
+		 */
+		fsync_or_die(get_lock_file_fd(lock_file), get_lock_file_path(lock_file));
+
+		for (i = 0; i < state->nr_renames; i++) {
+			if (finalize_object_file(state->renames[i].src, state->renames[i].dst))
+				die_errno(_("could not rename '%s'"), state->renames[i].src);
+
+			free(state->renames[i].src);
+			free(state->renames[i].dst);
+		}
+
+		free(state->renames);
+		memset(state, 0, sizeof(*state));
+	}
+}
+
 static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
 {
 	int i;
@@ -256,25 +299,69 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 	return 0;
 }
 
+static void add_rename_bulk_checkin(struct bulk_rename_state *state,
+				    const char *src, const char *dst)
+{
+	struct object_rename *rename;
+
+	ALLOC_GROW(state->renames, state->nr_renames + 1, state->alloc_renames);
+
+	rename = &state->renames[state->nr_renames++];
+	rename->src = xstrdup(src);
+	rename->dst = xstrdup(dst);
+}
+
+int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile,
+					      const char *filename)
+{
+	if (fsync_object_files) {
+		/*
+		 * If we have a plugged bulk checkin, we issue a call that
+		 * cleans the filesystem page cache but avoids a hardware flush
+		 * command. Later on we will issue a single hardware flush
+		 * before renaming files as part of do_sync_and_rename.
+		 */
+		if (bulk_checkin_state.plugged &&
+		    fsync_object_files == 2 &&
+		    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
+			add_rename_bulk_checkin(&bulk_rename_state, tmpfile, filename);
+			if (close(fd))
+				die_errno(_("error when closing loose object file"));
+
+			return 0;
+
+		} else {
+			fsync_or_die(fd, "loose object file");
+		}
+	}
+
+	if (close(fd))
+		die_errno(_("error when closing loose object file"));
+
+	return finalize_object_file(tmpfile, filename);
+}
+
 int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags)
 {
-	int status = deflate_to_pack(&state, oid, fd, size, type,
+	int status = deflate_to_pack(&bulk_checkin_state, oid, fd, size, type,
 				     path, flags);
-	if (!state.plugged)
-		finish_bulk_checkin(&state);
+	if (!bulk_checkin_state.plugged)
+		finish_bulk_checkin(&bulk_checkin_state);
 	return status;
 }
 
 void plug_bulk_checkin(void)
 {
-	state.plugged = 1;
+	bulk_checkin_state.plugged = 1;
 }
 
-void unplug_bulk_checkin(void)
+void unplug_bulk_checkin(struct lock_file *lock_file)
 {
-	state.plugged = 0;
-	if (state.f)
-		finish_bulk_checkin(&state);
+	bulk_checkin_state.plugged = 0;
+	if (bulk_checkin_state.f)
+		finish_bulk_checkin(&bulk_checkin_state);
+
+	do_sync_and_rename(&bulk_rename_state, lock_file);
 }
diff --git a/bulk-checkin.h b/bulk-checkin.h
index b26f3dc3b74..8efb01ed669 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -6,11 +6,13 @@
 
 #include "cache.h"
 
+int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile, const char *filename);
+
 int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags);
 
 void plug_bulk_checkin(void);
-void unplug_bulk_checkin(void);
+void unplug_bulk_checkin(struct lock_file *);
 
 #endif
diff --git a/config.c b/config.c
index f33abeab851..375bdb24b0a 100644
--- a/config.c
+++ b/config.c
@@ -1509,7 +1509,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 	}
 
 	if (!strcmp(var, "core.fsyncobjectfiles")) {
-		fsync_object_files = git_config_bool(var, value);
+		int is_bool;
+
+		fsync_object_files = git_config_bool_or_int(var, value, &is_bool);
 		return 0;
 	}
diff --git a/config.mak.uname b/config.mak.uname
index 69413fb3dc0..8c07f2265a8 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -53,6 +53,7 @@ ifeq ($(uname_S),Linux)
 	HAVE_CLOCK_MONOTONIC = YesPlease
 	# -lrt is needed for clock_gettime on glibc <= 2.16
 	NEEDS_LIBRT = YesPlease
+	HAVE_SYNC_FILE_RANGE = YesPlease
 	HAVE_GETDELIM = YesPlease
 	SANE_TEXT_GREP=-a
 	FREAD_READS_DIRECTORIES = UnfortunatelyYes
@@ -133,6 +134,7 @@ ifeq ($(uname_S),Darwin)
 	COMPAT_OBJS += compat/precompose_utf8.o
 	BASIC_CFLAGS += -DPRECOMPOSE_UNICODE
 	BASIC_CFLAGS += -DPROTECT_HFS_DEFAULT=1
+	BASIC_CFLAGS += -DFSYNC_DOESNT_FLUSH=1
 	HAVE_BSD_SYSCTL = YesPlease
 	FREAD_READS_DIRECTORIES = UnfortunatelyYes
 	HAVE_NS_GET_EXECUTABLE_PATH = YesPlease
diff --git a/configure.ac b/configure.ac
index 031e8d3fee8..c711037d625 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1090,6 +1090,14 @@ AC_COMPILE_IFELSE([CLOCK_MONOTONIC_SRC],
 	[AC_MSG_RESULT([no])
 	HAVE_CLOCK_MONOTONIC=])
 GIT_CONF_SUBST([HAVE_CLOCK_MONOTONIC])
+
+#
+# Define HAVE_SYNC_FILE_RANGE=YesPlease if sync_file_range is available.
+GIT_CHECK_FUNC(sync_file_range,
+	[HAVE_SYNC_FILE_RANGE=YesPlease],
+	[HAVE_SYNC_FILE_RANGE])
+GIT_CONF_SUBST([HAVE_SYNC_FILE_RANGE])
+
 #
 # Define NO_SETITIMER if you don't have setitimer.
 GIT_CHECK_FUNC(setitimer,
diff --git a/git-compat-util.h b/git-compat-util.h
index b46605300ab..d14e2436276 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1210,6 +1210,13 @@ __attribute__((format (printf, 1, 2))) NORETURN
 void BUG(const char *fmt, ...);
 #endif
 
+enum fsync_action {
+	FSYNC_WRITEOUT_ONLY,
+	FSYNC_HARDWARE_FLUSH
+};
+
+int git_fsync(int fd, enum fsync_action action);
+
 /*
  * Preserves errno, prints a message, but gives no warning for ENOENT.
  * Returns 0 on success, which includes trying to unlink an object that does
diff --git a/object-file.c b/object-file.c
index 607e9e2f80b..5f04143dde0 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1859,16 +1859,6 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf,
 	return 0;
 }
 
-/* Finalize a file on disk, and close it. */
-static int close_loose_object(int fd, const char *tmpfile, const char *filename)
-{
-	if (fsync_object_files)
-		fsync_or_die(fd, "loose object file");
-	if (close(fd) != 0)
-		die_errno(_("error when closing loose object file"));
-	return finalize_object_file(tmpfile, filename);
-}
-
 /* Size of directory component, including the ending '/' */
 static inline int directory_size(const char *filename)
 {
@@ -1982,7 +1972,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 			warning_errno(_("failed futimes() on %s"), tmp_file.buf);
 	}
 
-	return close_loose_object(fd, tmp_file.buf, filename.buf);
+	return fsync_and_close_loose_object_bulk_checkin(fd, tmp_file.buf, filename.buf);
 }
 
 static int freshen_loose_object(const struct object_id *oid)
diff --git a/wrapper.c b/wrapper.c
index 563ad590df1..37a8b61a7df 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -538,6 +538,42 @@ int xmkstemp_mode(char *filename_template, int mode)
 	return fd;
 }
 
+int git_fsync(int fd, enum fsync_action action)
+{
+	if (action == FSYNC_WRITEOUT_ONLY) {
+#ifdef __APPLE__
+		/*
+		 * on Mac OS X, fsync just causes filesystem cache writeback but does not
+		 * flush hardware caches.
+		 */
+		return fsync(fd);
+#endif
+
+#ifdef HAVE_SYNC_FILE_RANGE
+		/*
+		 * On linux 2.6.17 and above, sync_file_range is the way to issue
+		 * a writeback without a hardware flush. An offset of 0 and size of 0
+		 * indicates writeout of the entire file and the wait flags ensure that all
+		 * dirty data is written to the disk (potentially in a disk-side cache)
+		 * before we continue.
+		 */
+
+		return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
+						 SYNC_FILE_RANGE_WRITE |
+						 SYNC_FILE_RANGE_WAIT_AFTER);
+#endif
+
+		errno = ENOSYS;
+		return -1;
+	}
+
+#ifdef __APPLE__
+	return fcntl(fd, F_FULLFSYNC);
+#else
+	return fsync(fd);
+#endif
+}
+
 static int warn_if_unremovable(const char *op, const char *file, int rc)
 {
 	int err;
diff --git a/write-or-die.c b/write-or-die.c
index d33e68f6abb..8f53953d4ab 100644
--- a/write-or-die.c
+++ b/write-or-die.c
@@ -57,7 +57,7 @@ void fprintf_or_die(FILE *f, const char *fmt, ...)
 
 void fsync_or_die(int fd, const char *msg)
 {
-	while (fsync(fd) < 0) {
+	while (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0) {
 		if (errno != EINTR)
 			die_errno("fsync error on '%s'", msg);
 	}