Message ID | 3ed1dcd9b9ba9b34f26b3012eaba8da0269ee842.1647760560.git.gitgitgadget@gmail.com
---|---
State | Superseded
Series | core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects
On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
> [...]
> +	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> +		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> +		if (!bulk_fsync_objdir)
> +			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));

Should camel-case the config var, and we should have a die_errno() here
which tells us why we couldn't create it (possibly needing to ferry it up
from the tempfile API...)
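Concretely, the suggested form might look like the sketch below (an
illustration only, and it assumes tmp_objdir_create() leaves a meaningful
errno behind on failure, which is exactly what the downthread discussion
questions):

	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
		if (!bulk_fsync_objdir)
			/* reports strerror(errno), unlike plain die() */
			die_errno(_("could not create temporary object directory "
				    "for core.fsyncMethod=batch"));
	}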
On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote: > From: Neeraj Singh <neerajsi@microsoft.com> > > One major source of the cost of fsync is the implied flush of the > hardware writeback cache within the disk drive. This commit introduces > a new `core.fsyncMethod=batch` option that batches up hardware flushes. > It hooks into the bulk-checkin plugging and unplugging functionality, > takes advantage of tmp-objdir, and uses the writeout-only support code. > > When the new mode is enabled, we do the following for each new object: > 1. Create the object in a tmp-objdir. > 2. Issue a pagecache writeback request and wait for it to complete. > > At the end of the entire transaction when unplugging bulk checkin: > 1. Issue an fsync against a dummy file to flush the hardware writeback > cache, which should by now have seen the tmp-objdir writes. > 2. Rename all of the tmp-objdir files to their final names. > 3. When updating the index and/or refs, we assume that Git will issue > another fsync internal to that operation. This is not the default > today, but the user now has the option of syncing the index and there > is a separate patch series to implement syncing of refs. Re my question in https://lore.kernel.org/git/220310.86r179ki38.gmgdl@evledraar.gmail.com/ (which you *partially* replied to per my reading, i.e. not the fsync_nth() question) I still don't get why the tmp-objdir part of this is needed. For "git stash" which is one thing sped up by this let's go over what commands/FS ops we do. I changed the test like this: diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh index 3fc16944e9e..479a495c68c 100755 --- a/t/t3903-stash.sh +++ b/t/t3903-stash.sh @@ -1383,7 +1383,7 @@ BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch' test_expect_success 'stash with core.fsyncmethod=batch' " test_create_unique_files 2 4 fsync-files && - git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ && + strace -f git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ && rm -f fsynced_files && # The files were untracked, so use the third parent, Then we get this output, with my comments, and I snipped some output: $ ./t3903-stash.sh --run=1-4,114 -vixd 2>&1|grep --color -e 89772c935031c228ed67890f9 -e .git/stash -e bulk_fsync -e .git/index [pid 14703] access(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory) [pid 14703] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory) [pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/tmp_obj_bdUlzu", ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0 Here we're creating the tmp_objdir() files. We then sync_file_range() and close() this. [pid 14703] openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7", O_RDWR|O_CREAT|O_EXCL, 0600) = 9 [pid 14703] unlink("/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7") = 0 This is the flushing of the "cookie" in do_batch_fsync(). 
[pid 14703] newfstatat(AT_FDCWD, ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, 0) = 0 [pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0 Here we're going through the object dir migration with unplug_bulk_checkin(). [pid 14703] unlink(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0 newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0 [pid 14705] access(".git/objects/tmp_objdir-bulk-fsync-0F7DGy/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory) [pid 14705] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = 0 [pid 14705] utimensat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", NULL, 0) = 0 [pid 14707] openat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", O_RDONLY|O_CLOEXEC) = 9 We then update the index itself, first a temporary index.stash : openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 8 openat(AT_FDCWD, ".git/index.stash.19141", O_RDONLY) = 9 newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0 newfstatat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", {st_mode=S_IFREG|0644, st_size=927, ...}, 0) = 0 rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141") = 0 unlink(".git/index.stash.19141") = 0 Followed by the same and a later rename of the actual index: [pid 19146] rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index") = 0 So, my question is still why the temporary object dir migration part of this is needed. We are writing N loose object files, and we write those to temporary names already. AFAIKT we could do all of this by doing the same tmp/rename/sync_file_range dance on the main object store. Then instead of the "bulk_fsync" cookie file don't close() the last file object file we write until we issue the fsync on it. But maybe this is all needed, I just can't understand from the commit message why the "bulk checkin" part is being done. I think since we've been over this a few times without any success it would really help to have some example of the smallest set of syscalls to write a file like this safely. I.e. this is doing (pseudocode): /* first the bulk path */ open("bulk/x.tmp"); write("bulk/x.tmp"); sync_file_range("bulk/x.tmp"); close("bulk/x.tmp"); rename("bulk/x.tmp", "bulk/x"); open("bulk/y.tmp"); write("bulk/y.tmp"); sync_file_range("bulk/y.tmp"); close("bulk/y.tmp"); rename("bulk/y.tmp", "bulk/y"); /* Rename to "real" */ rename("bulk/x", x"); rename("bulk/y", y"); /* sync a cookie */ fsync("cookie"); And I'm asking why it's not: /* Rename to "real" as we go */ open("x.tmp"); write("x.tmp"); sync_file_range("x.tmp"); close("x.tmp"); rename("x.tmp", "x"); last_fd = open("y.tmp"); /* don't close() the last one yet */ write("y.tmp"); sync_file_range("y.tmp"); rename("y.tmp", "y"); /* sync a cookie */ fsync(last_fd); Which I guess is two questions: A. 
do we need the cookie, or can we re-use the fd of the last thing we
write?

B. Is the bulk indirection needed?

> +		fsync_or_die(fd, "loose object file");

Unrelated nit: this API is producing sentence lego unfriendly to
translators.

Should be made to take an enum or something, so we can emit the relevant
translated message in fsync_or_die(). Imagine getting:

	fsync error on '日本語は話せません'

Which this will do, just the other way around for non-English speakers
using the translation.

(The solution is also not to add _() here, since translators will want
to control the word order.)

> diff --git a/cache.h b/cache.h
> index 3160bc1e489..d1ae51388c9 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1040,7 +1040,8 @@ extern int use_fsync;
>
>  enum fsync_method {
>  	FSYNC_METHOD_FSYNC,
> -	FSYNC_METHOD_WRITEOUT_ONLY
> +	FSYNC_METHOD_WRITEOUT_ONLY,
> +	FSYNC_METHOD_BATCH
>  };
>
>  extern enum fsync_method fsync_method;
> @@ -1767,6 +1768,11 @@ void fsync_or_die(int fd, const char *);
>  int fsync_component(enum fsync_component component, int fd);
>  void fsync_component_or_die(enum fsync_component component, int fd, const char *msg);
>
> +static inline int batch_fsync_enabled(enum fsync_component component)
> +{
> +	return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH);
> +}
> +
>  ssize_t read_in_full(int fd, void *buf, size_t count);
>  ssize_t write_in_full(int fd, const void *buf, size_t count);
>  ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
> diff --git a/config.c b/config.c
> index 261ee7436e0..0b28f90de8b 100644
> --- a/config.c
> +++ b/config.c
> @@ -1688,6 +1688,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>  		fsync_method = FSYNC_METHOD_FSYNC;
>  	else if (!strcmp(value, "writeout-only"))
>  		fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
> +	else if (!strcmp(value, "batch"))
> +		fsync_method = FSYNC_METHOD_BATCH;
>  	else
>  		warning(_("ignoring unknown core.fsyncMethod value '%s'"), value);
>
> diff --git a/object-file.c b/object-file.c
> index 5258d9ed827..bdb0a38328f 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1895,6 +1895,8 @@ static void close_loose_object(int fd)
>
>  	if (fsync_object_files > 0)
>  		fsync_or_die(fd, "loose object file");
> +	else if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> +		fsync_loose_object_bulk_checkin(fd);
>  	else
>  		fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
>  				       "loose object file");

This is related to the above comments about what minimum set of syscalls
are needed to trigger this "bulk" behavior, but it seems to me that this
whole API is avoiding just passing some new flags down to object-file.c
and friends.

For e.g. update-index that results in e.g. the "plug bulk" not being
aware of HASH_WRITE_OBJECT, so with dry-run writes and the like we'll do
the whole setup/teardown for nothing.

Which is another reason I wondered why this couldn't be a flag passed
down to the object writing...
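The "sentence lego" nit above could be addressed with something along
these lines (the enum name and members are illustrative, not the series'
API, and Git's real fsync_or_die() also retries on some errors, which is
omitted here):

	enum fsync_target {
		FSYNC_TARGET_LOOSE_OBJECT,
		FSYNC_TARGET_PACKFILE,
	};

	static void fsync_or_die(int fd, enum fsync_target target)
	{
		if (fsync(fd) < 0) {
			switch (target) {
			case FSYNC_TARGET_LOOSE_OBJECT:
				die_errno(_("fsync error on loose object file"));
			case FSYNC_TARGET_PACKFILE:
				die_errno(_("fsync error on packfile"));
			}
		}
	}

Each message is then a complete translatable sentence, so translators
control the word order instead of having an untranslated English fragment
spliced into a translated template.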
"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > +* `batch` enables a mode that uses writeout-only flushes to stage multiple > + updates in the disk writeback cache and then does a single full fsync of > + a dummy file to trigger the disk cache flush at the end of the operation. It is unfortunate that we have a rather independent "unplug" that is not tied to the "this is the last operation in the batch"---if there were we didn't have to invent a dummy but a single full sync on the real file who happened to be the last one in the batch would be sufficient. It would not matter, if the batch is any meaningful size, hopefully. > +/* > + * Cleanup after batch-mode fsync_object_files. > + */ > +static void do_batch_fsync(void) > +{ > + /* > + * Issue a full hardware flush against a temporary file to ensure > + * that all objects are durable before any renames occur. The code in > + * fsync_loose_object_bulk_checkin has already issued a writeout > + * request, but it has not flushed any writeback cache in the storage > + * hardware. > + */ > + > + if (needs_batch_fsync) { > + struct strbuf temp_path = STRBUF_INIT; > + struct tempfile *temp; > + > + strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory()); > + temp = xmks_tempfile(temp_path.buf); > + fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp)); > + delete_tempfile(&temp); > + strbuf_release(&temp_path); > + needs_batch_fsync = 0; > + } > + > + if (bulk_fsync_objdir) { > + tmp_objdir_migrate(bulk_fsync_objdir); > + bulk_fsync_objdir = NULL; The struct obtained from tmp_objdir_create() is consumed by tmp_objdir_migrate() so the only clean-up left for the caller to do is to clear it to NULL. OK. > + } This initially made me wonder why we need two independent flags. After applying this patch but not any later steps, upon plugging, we create the tentative object directory, and any loose object will be created there, but because nobody calls the writeout-only variant via fsync_loose_object_bulk_checkin() yet, needs_batch_fsync may not be turned on. But even in that case, any new loose objects are in the tentative object directory and need to be migrated to the real place. And we may not cover all the existing code paths at the end of the series, or any new code paths right away after they get introduced, to be aware of the fsync_loose_object_bulk_checkin() when they create a loose object file, so it is most likely that these two if statements will be with us forever. OK. > @@ -274,6 +311,24 @@ static int deflate_to_pack(struct bulk_checkin_state *state, > return 0; > } > > +void fsync_loose_object_bulk_checkin(int fd) > +{ > + /* > + * If we have a plugged bulk checkin, we issue a call that > + * cleans the filesystem page cache but avoids a hardware flush > + * command. Later on we will issue a single hardware flush > + * before as part of do_batch_fsync. > + */ > + if (bulk_checkin_plugged && > + git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) { > + assert(bulk_fsync_objdir); > + if (!needs_batch_fsync) > + needs_batch_fsync = 1; Except for when we unplug, do we ever flip needs_batch_fsync bit off, once it is set? If the answer is no, wouldn't it be clearer to unconditionally set it, instead of "set it only for the first time"? 
> + } else { > + fsync_or_die(fd, "loose object file"); > + } > +} > + > int index_bulk_checkin(struct object_id *oid, > int fd, size_t size, enum object_type type, > const char *path, unsigned flags) > @@ -288,6 +343,19 @@ int index_bulk_checkin(struct object_id *oid, > void plug_bulk_checkin(void) > { > assert(!bulk_checkin_plugged); > + > + /* > + * A temporary object directory is used to hold the files > + * while they are not fsynced. > + */ > + if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) { > + bulk_fsync_objdir = tmp_objdir_create("bulk-fsync"); > + if (!bulk_fsync_objdir) > + die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch")); > + > + tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0); > + } > + > bulk_checkin_plugged = 1; > } > > @@ -297,4 +365,6 @@ void unplug_bulk_checkin(void) > bulk_checkin_plugged = 0; > if (bulk_checkin_state.f) > finish_bulk_checkin(&bulk_checkin_state); > + > + do_batch_fsync(); > } > diff --git a/bulk-checkin.h b/bulk-checkin.h > index b26f3dc3b74..08f292379b6 100644 > --- a/bulk-checkin.h > +++ b/bulk-checkin.h > @@ -6,6 +6,8 @@ > > #include "cache.h" > > +void fsync_loose_object_bulk_checkin(int fd); > + > int index_bulk_checkin(struct object_id *oid, > int fd, size_t size, enum object_type type, > const char *path, unsigned flags); > diff --git a/cache.h b/cache.h > index 3160bc1e489..d1ae51388c9 100644 > --- a/cache.h > +++ b/cache.h > @@ -1040,7 +1040,8 @@ extern int use_fsync; > > enum fsync_method { > FSYNC_METHOD_FSYNC, > - FSYNC_METHOD_WRITEOUT_ONLY > + FSYNC_METHOD_WRITEOUT_ONLY, > + FSYNC_METHOD_BATCH > }; Style. These days we allow trailing comma to enum definitions. Perhaps give a trailing comma after _BATCH so that the next update patch will become less noisy? Thanks.
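Concretely, the trailing-comma style Junio is suggesting means the next
addition to the enum becomes a one-line diff (illustrative):

	enum fsync_method {
		FSYNC_METHOD_FSYNC,
		FSYNC_METHOD_WRITEOUT_ONLY,
		FSYNC_METHOD_BATCH,	/* trailing comma: adding the next value touches one line */
	};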
On Mon, Mar 21, 2022 at 7:43 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> > [...]
> > +	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> > +		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> > +		if (!bulk_fsync_objdir)
> > +			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
>
> Should camel-case the config var, and we should have a die_errno() here
> which tells us why we couldn't create it (possibly needing to ferry it up
> from the tempfile API...)

Thanks for noticing the camelCasing. The config var name was also wrong.
Now it will read:

> > +			die(_("Could not create temporary object directory for core.fsyncMethod=batch"));

Do you have any recommendations on how to easily ferry the correct errno
out of tmp_objdir_create()? It looks like the remerge-diff usage has the
same die-without-errno behavior, and the builtin/receive-pack.c usage
doesn't die, but also loses the errno. I'm concerned about preserving the
errno across the tmp_objdir_destroy() calls. I could introduce a temporary
errno variable to preserve it across those. Is that what you had in mind?

Thanks,
Neeraj
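One shape the temporary-errno idea could take, inside tmp_objdir_create()'s
failure path (a sketch; the failure condition and local are illustrative):

	if (setup_failed) {		/* illustrative: e.g. mkdtemp() failed */
		int saved_errno = errno;

		tmp_objdir_destroy(t);	/* cleanup may itself clobber errno */
		errno = saved_errno;	/* so the caller's die_errno() stays accurate */
		return NULL;
	}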
On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote: > > > From: Neeraj Singh <neerajsi@microsoft.com> > > > > One major source of the cost of fsync is the implied flush of the > > hardware writeback cache within the disk drive. This commit introduces > > a new `core.fsyncMethod=batch` option that batches up hardware flushes. > > It hooks into the bulk-checkin plugging and unplugging functionality, > > takes advantage of tmp-objdir, and uses the writeout-only support code. > > > > When the new mode is enabled, we do the following for each new object: > > 1. Create the object in a tmp-objdir. > > 2. Issue a pagecache writeback request and wait for it to complete. > > > > At the end of the entire transaction when unplugging bulk checkin: > > 1. Issue an fsync against a dummy file to flush the hardware writeback > > cache, which should by now have seen the tmp-objdir writes. > > 2. Rename all of the tmp-objdir files to their final names. > > 3. When updating the index and/or refs, we assume that Git will issue > > another fsync internal to that operation. This is not the default > > today, but the user now has the option of syncing the index and there > > is a separate patch series to implement syncing of refs. > > Re my question in > https://lore.kernel.org/git/220310.86r179ki38.gmgdl@evledraar.gmail.com/ > (which you *partially* replied to per my reading, i.e. not the > fsync_nth() question) I still don't get why the tmp-objdir part of this > is needed. > Sorry for not fully answering your question. I think part of the issue might be background, where it's not clear to me what's different between your understanding and mine, so may not have included something that's questionable to you but not to me. Your syscall description below makes the issues very concrete, so I think we'll get it this round :). > For "git stash" which is one thing sped up by this let's go over what > commands/FS ops we do. I changed the test like this: > > diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh > index 3fc16944e9e..479a495c68c 100755 > --- a/t/t3903-stash.sh > +++ b/t/t3903-stash.sh > @@ -1383,7 +1383,7 @@ BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch' > > test_expect_success 'stash with core.fsyncmethod=batch' " > test_create_unique_files 2 4 fsync-files && > - git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ && > + strace -f git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ && > rm -f fsynced_files && > > # The files were untracked, so use the third parent, > > Then we get this output, with my comments, and I snipped some output: > > $ ./t3903-stash.sh --run=1-4,114 -vixd 2>&1|grep --color -e 89772c935031c228ed67890f9 -e .git/stash -e bulk_fsync -e .git/index > [pid 14703] access(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory) > [pid 14703] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory) > [pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/tmp_obj_bdUlzu", ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0 > > Here we're creating the tmp_objdir() files. We then sync_file_range() > and close() this. 
> > [pid 14703] openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7", O_RDWR|O_CREAT|O_EXCL, 0600) = 9 > [pid 14703] unlink("/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7") = 0 > > This is the flushing of the "cookie" in do_batch_fsync(). > > [pid 14703] newfstatat(AT_FDCWD, ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, 0) = 0 > [pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0 > > Here we're going through the object dir migration with > unplug_bulk_checkin(). > > [pid 14703] unlink(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0 > newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0 > [pid 14705] access(".git/objects/tmp_objdir-bulk-fsync-0F7DGy/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory) > [pid 14705] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = 0 > [pid 14705] utimensat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", NULL, 0) = 0 > [pid 14707] openat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", O_RDONLY|O_CLOEXEC) = 9 > > We then update the index itself, first a temporary index.stash : > > openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 8 > openat(AT_FDCWD, ".git/index.stash.19141", O_RDONLY) = 9 > newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0 > newfstatat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", {st_mode=S_IFREG|0644, st_size=927, ...}, 0) = 0 > rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141") = 0 > unlink(".git/index.stash.19141") = 0 > > Followed by the same and a later rename of the actual index: > > [pid 19146] rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index") = 0 > > So, my question is still why the temporary object dir migration part of > this is needed. > > We are writing N loose object files, and we write those to temporary > names already. > > AFAIKT we could do all of this by doing the same > tmp/rename/sync_file_range dance on the main object store. > Why not the main object store? We want to maintain the invariant that any name in the main object store refers to a file that durably has the correct contents. If we do sync_file_range and then rename, and then crash, we now have a file in the main object store with some SHA name, whose contents may or may not match the SHA. However, if we ensure an fsync happens before the rename, a crash at any point will leave us either with no file in the main object store or with a file that is durable on the disk. > Then instead of the "bulk_fsync" cookie file don't close() the last file > object file we write until we issue the fsync on it. 
> > But maybe this is all needed, I just can't understand from the commit > message why the "bulk checkin" part is being done. > > I think since we've been over this a few times without any success it > would really help to have some example of the smallest set of syscalls > to write a file like this safely. I.e. this is doing (pseudocode): > > /* first the bulk path */ > open("bulk/x.tmp"); > write("bulk/x.tmp"); > sync_file_range("bulk/x.tmp"); > close("bulk/x.tmp"); > rename("bulk/x.tmp", "bulk/x"); > open("bulk/y.tmp"); > write("bulk/y.tmp"); > sync_file_range("bulk/y.tmp"); > close("bulk/y.tmp"); > rename("bulk/y.tmp", "bulk/y"); > /* Rename to "real" */ > rename("bulk/x", x"); > rename("bulk/y", y"); > /* sync a cookie */ > fsync("cookie"); > The '/* Rename to "real" */' and '/* sync a cookie */' steps are reversed in your above sequence. It should be 1: (for each file) a) open b) write c) sync_file_range d) close e) rename in tmp_objdir -- The rename step is not required for bulk-fsync. An earlier version of this series didn't do it, but Jeff King pointed out that it was required for concurrency: https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/ 2: fsync something on the same volume to flush the filesystem log and disk cache. This functions as a "barrier". 3: Rename to final names. At this point we know that the "contents" are durable, so if the final name exists, we can read through it to get the data. > And I'm asking why it's not: > > /* Rename to "real" as we go */ > open("x.tmp"); > write("x.tmp"); > sync_file_range("x.tmp"); > close("x.tmp"); > rename("x.tmp", "x"); > last_fd = open("y.tmp"); /* don't close() the last one yet */ > write("y.tmp"); > sync_file_range("y.tmp"); > rename("y.tmp", "y"); > /* sync a cookie */ > fsync(last_fd); > > Which I guess is two questions: > > A. do we need the cookie, or can we re-use the fd of the last thing we > write? We can re-use the FD of the last thing we write, but that results in a tricker API which is more intrusive on callers. I was originally using a lockfile, but found a usage where there was no lockfile in unpack-objects. > B. Is the bulk indirection needed? > Hopefully the explanation above makes it clear why we need the indirection. To state it again, we need a real fsync before creating the final name in the objdir, otherwise on a crash a name could exist that points at contents which could have been lost, since they weren't durable. I updated the comment in do_batch_fsync to make this a little clearer. > > + fsync_or_die(fd, "loose object file"); > > Unrelated nit: this API is producing sentence lego unfriendly to > translators. > > Should be made to take an enum or something, so we can emit the relevant > translated message in fsync_or_die(). Imagine getting: > > fsync error on '日本語は話せません' > > Which this will do, just the other way around for non-English speakers > using the translation. > > (The solution is also not to add _() here, since translators will want > to control the word order.) This line is copied from the preexisting version of the same code in close_loose_object. If I'm understanding it correctly, the entire chain of messages is untranslated and would remain as english. fsync_or_die doesn't have a _(). Can we just leave it that way, since this is not a situation that should actually happen to many users? Alternatively, I think it would be pretty trivial to just pass through the file name, so I'll just do that. 
> > diff --git a/cache.h b/cache.h > > index 3160bc1e489..d1ae51388c9 100644 > > --- a/cache.h > > +++ b/cache.h > > @@ -1040,7 +1040,8 @@ extern int use_fsync; > > > > enum fsync_method { > > FSYNC_METHOD_FSYNC, > > - FSYNC_METHOD_WRITEOUT_ONLY > > + FSYNC_METHOD_WRITEOUT_ONLY, > > + FSYNC_METHOD_BATCH > > }; > > > > extern enum fsync_method fsync_method; > > @@ -1767,6 +1768,11 @@ void fsync_or_die(int fd, const char *); > > int fsync_component(enum fsync_component component, int fd); > > void fsync_component_or_die(enum fsync_component component, int fd, const char *msg); > > > > +static inline int batch_fsync_enabled(enum fsync_component component) > > +{ > > + return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH); > > +} > > + > > ssize_t read_in_full(int fd, void *buf, size_t count); > > ssize_t write_in_full(int fd, const void *buf, size_t count); > > ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset); > > diff --git a/config.c b/config.c > > index 261ee7436e0..0b28f90de8b 100644 > > --- a/config.c > > +++ b/config.c > > @@ -1688,6 +1688,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb) > > fsync_method = FSYNC_METHOD_FSYNC; > > else if (!strcmp(value, "writeout-only")) > > fsync_method = FSYNC_METHOD_WRITEOUT_ONLY; > > + else if (!strcmp(value, "batch")) > > + fsync_method = FSYNC_METHOD_BATCH; > > else > > warning(_("ignoring unknown core.fsyncMethod value '%s'"), value); > > > > diff --git a/object-file.c b/object-file.c > > index 5258d9ed827..bdb0a38328f 100644 > > --- a/object-file.c > > +++ b/object-file.c > > @@ -1895,6 +1895,8 @@ static void close_loose_object(int fd) > > > > if (fsync_object_files > 0) > > fsync_or_die(fd, "loose object file"); > > + else if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) > > + fsync_loose_object_bulk_checkin(fd); > > else > > fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd, > > "loose object file"); > > This is related to the above comments about what minimum set of syscalls > are needed to trigger this "bulk" behavior, but it seems to me that this > whole API is avoiding just passing some new flags down to object-file.c > and friends. > > For e.g. update-index that results in e.g. the "plug bulk" not being > aware of HASH_WRITE_OBJECT, so with dry-run writes and the like we'll do > the whole setup/teardown for nothing. > > Which is another reason I wondered why this couldn't be a flagged passed > down to the object writing... In the original implementation [1], I did some custom thing for renaming the files rather than tmp_objdir. But you suggested at the time that I use tmp_objdir, which was a good decision, since it made access to the objects possible in-process and for descendents in the middle of the transaction. It sounds to me like I just shouldn't plug the bulk checkin for cases where we're not going to add to the ODB. Plugging the bulk checkin is always optional. But when I wrote the code, I didn't love the result, since it makes arbitrary callers harder. So I changed the code to lazily create the tmp objdir the first time an object shows up, which has the same effect of avoiding the cost when we aren't adding any objects. This also avoids the need to write an error message, since failing to create the tmp objdir will just result in a fsync. The main downside here is that it's another thing that will have to change if we want to make adding to the ODB multithreaded. 
Thanks, Neeraj [1] https://lore.kernel.org/all/12cad737635663ed596e52f89f0f4f22f58bfe38.1632176111.git.gitgitgadget@gmail.com/
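For reference, the three-phase sequence described above, compressed into a
minimal plain-C sketch (Linux-only sync_file_range(), hypothetical path
names, all error handling omitted; Git's real code goes through git_fsync()
and the tmp-objdir API):

	#define _GNU_SOURCE		/* for sync_file_range() */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Phase 1, per object: push the pagecache out, no hardware flush. */
	static void stage_object(const char *tmp, const char *staged)
	{
		int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0444);
		/* ... write(fd, contents, len) ... */
		sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE |
				SYNC_FILE_RANGE_WAIT_AFTER);
		close(fd);
		rename(tmp, staged);	/* both names live inside the tmp objdir */
	}

	/* Phase 2: one full flush on the same volume acts as the barrier. */
	static void flush_batch(const char *cookie)
	{
		int fd = open(cookie, O_RDWR | O_CREAT | O_EXCL, 0600);
		fsync(fd);	/* everything staged above is now durable */
		close(fd);
		unlink(cookie);
	}

	/* Phase 3: only now publish the content-addressed names, i.e. what
	 * tmp_objdir_migrate() does by renaming into the main object store. */

The barrier in phase 2 must come before any phase-3 rename; swapping them
reintroduces the window where a final object name can point at contents
that were never made durable.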
On Mon, Mar 21 2022, Neeraj Singh wrote: [Don't have time for a full reply, sorry, just something quick] > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason > [...] >> So, my question is still why the temporary object dir migration part of >> this is needed. >> >> We are writing N loose object files, and we write those to temporary >> names already. >> >> AFAIKT we could do all of this by doing the same >> tmp/rename/sync_file_range dance on the main object store. >> > > Why not the main object store? We want to maintain the invariant that any > name in the main object store refers to a file that durably has the > correct contents. > If we do sync_file_range and then rename, and then crash, we now have a file > in the main object store with some SHA name, whose contents may or may not > match the SHA. However, if we ensure an fsync happens before the rename, > a crash at any point will leave us either with no file in the main > object store or > with a file that is durable on the disk. Ah, I see. Why does that matter? If the "bulk" mode works as advertised we might have such a corrupt loose or pack file, but we won't have anything referring to it as far as reachability goes. I'm aware that the various code paths that handle OID writing don't deal too well with it in practice to say the least, which one can try with say: $ echo foo | git hash-object -w --stdin 45b983be36b73c0788dc9cbcb76cbb80fc7bb057 $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057 I.e. "fsck", "show" etc. will all scream bloddy murder, and re-running that hash-object again even returns successful (we see it's there already, and think it's OK). But in any case, I think it would me much easier to both review and reason about this code if these concerns were split up. I.e. things that want no fsync at all (I'd think especially so) might want to have such updates serialized in this manner, and as Junio pointed out making these things inseparable as you've done creates API concerns & fallout that's got nothing to do with what we need for the performance gains of the bulk checkin fsyncing technique, e.g. concurrent "update-index" consumers not being able to assume reported objects exist as soon as they're reported. >> Then instead of the "bulk_fsync" cookie file don't close() the last file >> object file we write until we issue the fsync on it. >> >> But maybe this is all needed, I just can't understand from the commit >> message why the "bulk checkin" part is being done. >> >> I think since we've been over this a few times without any success it >> would really help to have some example of the smallest set of syscalls >> to write a file like this safely. I.e. this is doing (pseudocode): >> >> /* first the bulk path */ >> open("bulk/x.tmp"); >> write("bulk/x.tmp"); >> sync_file_range("bulk/x.tmp"); >> close("bulk/x.tmp"); >> rename("bulk/x.tmp", "bulk/x"); >> open("bulk/y.tmp"); >> write("bulk/y.tmp"); >> sync_file_range("bulk/y.tmp"); >> close("bulk/y.tmp"); >> rename("bulk/y.tmp", "bulk/y"); >> /* Rename to "real" */ >> rename("bulk/x", x"); >> rename("bulk/y", y"); >> /* sync a cookie */ >> fsync("cookie"); >> > > The '/* Rename to "real" */' and '/* sync a cookie */' steps are > reversed in your above sequence. It should be Sorry. > 1: (for each file) > a) open > b) write > c) sync_file_range > d) close > e) rename in tmp_objdir -- The rename step is not required for > bulk-fsync. 
An earlier version of this series didn't do it, but > Jeff King pointed out that it was required for concurrency: > https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/ Yes we definitely need the rename, I was wondering about why we needed it 2x for each file, but that was answered above. >> And I'm asking why it's not: >> >> /* Rename to "real" as we go */ >> open("x.tmp"); >> write("x.tmp"); >> sync_file_range("x.tmp"); >> close("x.tmp"); >> rename("x.tmp", "x"); >> last_fd = open("y.tmp"); /* don't close() the last one yet */ >> write("y.tmp"); >> sync_file_range("y.tmp"); >> rename("y.tmp", "y"); >> /* sync a cookie */ >> fsync(last_fd); >> >> Which I guess is two questions: >> >> A. do we need the cookie, or can we re-use the fd of the last thing we >> write? > > We can re-use the FD of the last thing we write, but that results in a > tricker API which > is more intrusive on callers. I was originally using a lockfile, but > found a usage where > there was no lockfile in unpack-objects. Ok, so it's something we could do, but passing down 2-3 functions to object-file.c was a hassle. I tried to hack that up earlier and found that it wasn't *too bad*. I.e. we'd pass some "flags" about our intent, and amend various functions to take "don't close this one" and pass up the fd (or even do that as a global). In any case, having the commit message clearly document what's needed for what & what's essential & just shortcut taken for the convenience of the current implementation would be really useful. Then we can always e.g. change this later to just do the the fsync() on the last of N we write. [Ran out of time, sorry]
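To make the alternative concrete: reusing the Nth file's descriptor as the
barrier might look roughly like this (is_last is a hypothetical flag
threaded down from the caller, not the series' API):

	static int writeout_staged_object(int fd, int is_last)
	{
		if (is_last)
			return fsync(fd);	/* full flush doubles as the batch barrier */
		/* cheap pagecache writeout for every earlier object */
		return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
				       SYNC_FILE_RANGE_WRITE |
				       SYNC_FILE_RANGE_WAIT_AFTER);
	}

The trade-off discussed above is that every caller then has to know which
write is the last one, which the cookie file avoids; and the staged files
would still only be renamed into the main object store after that final
flush.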
On Mon, Mar 21, 2022 at 10:30 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> > +  updates in the disk writeback cache and then does a single full fsync of
> > +  a dummy file to trigger the disk cache flush at the end of the operation.
>
> It is unfortunate that we have a rather independent "unplug" that is
> not tied to the "this is the last operation in the batch"---if there
> were we didn't have to invent a dummy but a single full sync on the
> real file who happened to be the last one in the batch would be
> sufficient. It would not matter, if the batch is any meaningful
> size, hopefully.
>

I'm banking on a large batch size, or on the additional cost of creating
and syncing an empty file being so small that it wouldn't be noticeable
even for small batches. The current unfortunate scheme at least has a very
simple API that's easy to apply to any other operation going forward. For
instance builtin/hash-object.c might be another good operation, but it
wasn't clear to me if it's used for any mainline scenario.

> > +/*
> > + * Cleanup after batch-mode fsync_object_files.
> > + */
> > +static void do_batch_fsync(void)
> > +{
> > +	/*
> > +	 * Issue a full hardware flush against a temporary file to ensure
> > +	 * that all objects are durable before any renames occur. The code in
> > +	 * fsync_loose_object_bulk_checkin has already issued a writeout
> > +	 * request, but it has not flushed any writeback cache in the storage
> > +	 * hardware.
> > +	 */
> > +
> > +	if (needs_batch_fsync) {
> > +		struct strbuf temp_path = STRBUF_INIT;
> > +		struct tempfile *temp;
> > +
> > +		strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> > +		temp = xmks_tempfile(temp_path.buf);
> > +		fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> > +		delete_tempfile(&temp);
> > +		strbuf_release(&temp_path);
> > +		needs_batch_fsync = 0;
> > +	}
> > +
> > +	if (bulk_fsync_objdir) {
> > +		tmp_objdir_migrate(bulk_fsync_objdir);
> > +		bulk_fsync_objdir = NULL;
>
> The struct obtained from tmp_objdir_create() is consumed by
> tmp_objdir_migrate() so the only clean-up left for the caller to do
> is to clear it to NULL.  OK.
>
> > +	}
>
> This initially made me wonder why we need two independent flags.
> After applying this patch but not any later steps, upon plugging, we
> create the tentative object directory, and any loose object will be
> created there, but because nobody calls the writeout-only variant
> via fsync_loose_object_bulk_checkin() yet, needs_batch_fsync may not
> be turned on. But even in that case, any new loose objects are in
> the tentative object directory and need to be migrated to the real
> place.
>
> And we may not cover all the existing code paths at the end of the
> series, or any new code paths right away after they get introduced,
> to be aware of the fsync_loose_object_bulk_checkin() when they
> create a loose object file, so it is most likely that these two if
> statements will be with us forever.
>
> OK.

After Ævar's last feedback, I've changed this to lazily create the objdir,
so the existence of an objdir is a suitable proxy for there being
something worth syncing. The potential downside is that the lazy-creation
would need to be synchronized if the ODB becomes multithreaded.
> > > @@ -274,6 +311,24 @@ static int deflate_to_pack(struct bulk_checkin_state *state, > > return 0; > > } > > > > +void fsync_loose_object_bulk_checkin(int fd) > > +{ > > + /* > > + * If we have a plugged bulk checkin, we issue a call that > > + * cleans the filesystem page cache but avoids a hardware flush > > + * command. Later on we will issue a single hardware flush > > + * before as part of do_batch_fsync. > > + */ > > + if (bulk_checkin_plugged && > > + git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) { > > + assert(bulk_fsync_objdir); > > + if (!needs_batch_fsync) > > + needs_batch_fsync = 1; > > Except for when we unplug, do we ever flip needs_batch_fsync bit > off, once it is set? If the answer is no, wouldn't it be clearer to > unconditionally set it, instead of "set it only for the first time"? > This code is now gone. I was stupidly optimizing for a future multithreaded world which might never come. > > + } else { > > + fsync_or_die(fd, "loose object file"); > > + } > > +} > > + > > int index_bulk_checkin(struct object_id *oid, > > int fd, size_t size, enum object_type type, > > const char *path, unsigned flags) > > @@ -288,6 +343,19 @@ int index_bulk_checkin(struct object_id *oid, > > void plug_bulk_checkin(void) > > { > > assert(!bulk_checkin_plugged); > > + > > + /* > > + * A temporary object directory is used to hold the files > > + * while they are not fsynced. > > + */ > > + if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) { > > + bulk_fsync_objdir = tmp_objdir_create("bulk-fsync"); > > + if (!bulk_fsync_objdir) > > + die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch")); > > + > > + tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0); > > + } > > + > > bulk_checkin_plugged = 1; > > } > > > > @@ -297,4 +365,6 @@ void unplug_bulk_checkin(void) > > bulk_checkin_plugged = 0; > > if (bulk_checkin_state.f) > > finish_bulk_checkin(&bulk_checkin_state); > > + > > + do_batch_fsync(); > > } > > diff --git a/bulk-checkin.h b/bulk-checkin.h > > index b26f3dc3b74..08f292379b6 100644 > > --- a/bulk-checkin.h > > +++ b/bulk-checkin.h > > @@ -6,6 +6,8 @@ > > > > #include "cache.h" > > > > +void fsync_loose_object_bulk_checkin(int fd); > > + > > int index_bulk_checkin(struct object_id *oid, > > int fd, size_t size, enum object_type type, > > const char *path, unsigned flags); > > diff --git a/cache.h b/cache.h > > index 3160bc1e489..d1ae51388c9 100644 > > --- a/cache.h > > +++ b/cache.h > > @@ -1040,7 +1040,8 @@ extern int use_fsync; > > > > enum fsync_method { > > FSYNC_METHOD_FSYNC, > > - FSYNC_METHOD_WRITEOUT_ONLY > > + FSYNC_METHOD_WRITEOUT_ONLY, > > + FSYNC_METHOD_BATCH > > }; > > Style. > > These days we allow trailing comma to enum definitions. Perhaps > give a trailing comma after _BATCH so that the next update patch > will become less noisy? > Fixed. > Thanks. Thanks! -Neeraj
On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Mon, Mar 21 2022, Neeraj Singh wrote: > > [Don't have time for a full reply, sorry, just something quick] > > > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason > > [...] > >> So, my question is still why the temporary object dir migration part of > >> this is needed. > >> > >> We are writing N loose object files, and we write those to temporary > >> names already. > >> > >> AFAIKT we could do all of this by doing the same > >> tmp/rename/sync_file_range dance on the main object store. > >> > > > > Why not the main object store? We want to maintain the invariant that any > > name in the main object store refers to a file that durably has the > > correct contents. > > If we do sync_file_range and then rename, and then crash, we now have a file > > in the main object store with some SHA name, whose contents may or may not > > match the SHA. However, if we ensure an fsync happens before the rename, > > a crash at any point will leave us either with no file in the main > > object store or > > with a file that is durable on the disk. > > Ah, I see. > > Why does that matter? If the "bulk" mode works as advertised we might > have such a corrupt loose or pack file, but we won't have anything > referring to it as far as reachability goes. > > I'm aware that the various code paths that handle OID writing don't deal > too well with it in practice to say the least, which one can try with > say: > > $ echo foo | git hash-object -w --stdin > 45b983be36b73c0788dc9cbcb76cbb80fc7bb057 > $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057 > > I.e. "fsck", "show" etc. will all scream bloddy murder, and re-running > that hash-object again even returns successful (we see it's there > already, and think it's OK). > I was under the impression that in-practice a corrupt loose-object can create persistent problems in the repo for future commands, since we might not aggressively verify that an existing file with a certain OID really is valid when adding a new instance of the data with the same OID. If you don't have an fsync barrier before producing the final content-addressable name, you can't reason about "this operation happened before that operation," so it wouldn't really be valid to say that "we won't have anything referring to it as far as reachability goes." It's entirely possible that you'd have trees pointing to other trees or blobs that aren't valid, since data writes can be durable in any order. At this point, future attempts add the same blobs or trees might silently drop the updates. I'm betting that's why core.fsyncObjectFiles was added in the first place, since someone observed severe persistent consequences for this form of corruption. > But in any case, I think it would me much easier to both review and > reason about this code if these concerns were split up. > > I.e. things that want no fsync at all (I'd think especially so) might > want to have such updates serialized in this manner, and as Junio > pointed out making these things inseparable as you've done creates API > concerns & fallout that's got nothing to do with what we need for the > performance gains of the bulk checkin fsyncing technique, > e.g. concurrent "update-index" consumers not being able to assume > reported objects exist as soon as they're reported. > I want to explicitly not respond to this concern. I don't believe this 100 line patch can be usefully split. 
> >> Then instead of the "bulk_fsync" cookie file don't close() the last file > >> object file we write until we issue the fsync on it. > >> > >> But maybe this is all needed, I just can't understand from the commit > >> message why the "bulk checkin" part is being done. > >> > >> I think since we've been over this a few times without any success it > >> would really help to have some example of the smallest set of syscalls > >> to write a file like this safely. I.e. this is doing (pseudocode): > >> > >> /* first the bulk path */ > >> open("bulk/x.tmp"); > >> write("bulk/x.tmp"); > >> sync_file_range("bulk/x.tmp"); > >> close("bulk/x.tmp"); > >> rename("bulk/x.tmp", "bulk/x"); > >> open("bulk/y.tmp"); > >> write("bulk/y.tmp"); > >> sync_file_range("bulk/y.tmp"); > >> close("bulk/y.tmp"); > >> rename("bulk/y.tmp", "bulk/y"); > >> /* Rename to "real" */ > >> rename("bulk/x", x"); > >> rename("bulk/y", y"); > >> /* sync a cookie */ > >> fsync("cookie"); > >> > > > > The '/* Rename to "real" */' and '/* sync a cookie */' steps are > > reversed in your above sequence. It should be > > Sorry. > > > 1: (for each file) > > a) open > > b) write > > c) sync_file_range > > d) close > > e) rename in tmp_objdir -- The rename step is not required for > > bulk-fsync. An earlier version of this series didn't do it, but > > Jeff King pointed out that it was required for concurrency: > > https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/ > > Yes we definitely need the rename, I was wondering about why we needed > it 2x for each file, but that was answered above. > > >> And I'm asking why it's not: > >> > >> /* Rename to "real" as we go */ > >> open("x.tmp"); > >> write("x.tmp"); > >> sync_file_range("x.tmp"); > >> close("x.tmp"); > >> rename("x.tmp", "x"); > >> last_fd = open("y.tmp"); /* don't close() the last one yet */ > >> write("y.tmp"); > >> sync_file_range("y.tmp"); > >> rename("y.tmp", "y"); > >> /* sync a cookie */ > >> fsync(last_fd); > >> > >> Which I guess is two questions: > >> > >> A. do we need the cookie, or can we re-use the fd of the last thing we > >> write? > > > > We can re-use the FD of the last thing we write, but that results in a > > tricker API which > > is more intrusive on callers. I was originally using a lockfile, but > > found a usage where > > there was no lockfile in unpack-objects. > > Ok, so it's something we could do, but passing down 2-3 functions to > object-file.c was a hassle. > > I tried to hack that up earlier and found that it wasn't *too > bad*. I.e. we'd pass some "flags" about our intent, and amend various > functions to take "don't close this one" and pass up the fd (or even do > that as a global). > > In any case, having the commit message clearly document what's needed > for what & what's essential & just shortcut taken for the convenience of > the current implementation would be really useful. > > Then we can always e.g. change this later to just do the the fsync() on > the last of N we write. > I left a comment in the (now very long) commit message that indicates the dummy file is there to make the API simpler. Thanks, Neeraj
On Mon, Mar 21 2022, Neeraj Singh wrote: > On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> >> On Mon, Mar 21 2022, Neeraj Singh wrote: >> >> [Don't have time for a full reply, sorry, just something quick] >> >> > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason >> > [...] >> >> So, my question is still why the temporary object dir migration part of >> >> this is needed. >> >> >> >> We are writing N loose object files, and we write those to temporary >> >> names already. >> >> >> >> AFAIKT we could do all of this by doing the same >> >> tmp/rename/sync_file_range dance on the main object store. >> >> >> > >> > Why not the main object store? We want to maintain the invariant that any >> > name in the main object store refers to a file that durably has the >> > correct contents. >> > If we do sync_file_range and then rename, and then crash, we now have a file >> > in the main object store with some SHA name, whose contents may or may not >> > match the SHA. However, if we ensure an fsync happens before the rename, >> > a crash at any point will leave us either with no file in the main >> > object store or >> > with a file that is durable on the disk. >> >> Ah, I see. >> >> Why does that matter? If the "bulk" mode works as advertised we might >> have such a corrupt loose or pack file, but we won't have anything >> referring to it as far as reachability goes. >> >> I'm aware that the various code paths that handle OID writing don't deal >> too well with it in practice to say the least, which one can try with >> say: >> >> $ echo foo | git hash-object -w --stdin >> 45b983be36b73c0788dc9cbcb76cbb80fc7bb057 >> $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057 >> >> I.e. "fsck", "show" etc. will all scream bloddy murder, and re-running >> that hash-object again even returns successful (we see it's there >> already, and think it's OK). >> > > I was under the impression that in-practice a corrupt loose-object can create > persistent problems in the repo for future commands, since we might not > aggressively verify that an existing file with a certain OID really is > valid when > adding a new instance of the data with the same OID. Yes, it can. As the hash-object case shows we don't even check at all. For "incoming push" we *will* notice, but will just uselessly error out. I actually had some patches a while ago to turn off our own home-grown SHA-1 collision checking. It had the nice side effect of making it easier to recover from loose object corruption, since you could (re-)push the corrupted OID as a PACK, we wouldn't check (and die) on the bad loose object, and since we take a PACK over LOOSE we'd recover: https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/ > If you don't have an fsync barrier before producing the final > content-addressable > name, you can't reason about "this operation happened before that operation," > so it wouldn't really be valid to say that "we won't have anything > referring to it as far > as reachability goes." That's correct, but we're discussing a feature that *does have* that fsync barrier. So if we get an error while writing the loose objects before the "cookie" fsync we'll presumably error out. That'll then be followed by an fsync() of whatever makes the objects reachable. > It's entirely possible that you'd have trees pointing to other trees > or blobs that aren't > valid, since data writes can be durable in any order. 
At this point, > future attempts add > the same blobs or trees might silently drop the updates. I'm betting that's why > core.fsyncObjectFiles was added in the first place, since someone > observed severe > persistent consequences for this form of corruption. Well, you can see Linus's original rant-as-documentation for why we added it :) I.e. the original git implementation made some heavy linux-FS assumption about the order of writes and an fsync() flushing any previous writes, which wasn't portable. >> But in any case, I think it would me much easier to both review and >> reason about this code if these concerns were split up. >> >> I.e. things that want no fsync at all (I'd think especially so) might >> want to have such updates serialized in this manner, and as Junio >> pointed out making these things inseparable as you've done creates API >> concerns & fallout that's got nothing to do with what we need for the >> performance gains of the bulk checkin fsyncing technique, >> e.g. concurrent "update-index" consumers not being able to assume >> reported objects exist as soon as they're reported. >> > > I want to explicitly not respond to this concern. I don't believe this > 100 line patch > can be usefully split. Leaving "usefully" aside for a second (since that's subjective), it clearly "can". I just tried this on top of "seen": diff --git a/bulk-checkin.c b/bulk-checkin.c index a702e0ff203..9e994c4d6ae 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -9,15 +9,12 @@ #include "pack.h" #include "strbuf.h" #include "string-list.h" -#include "tmp-objdir.h" #include "packfile.h" #include "object-store.h" static int bulk_checkin_plugged; static int needs_batch_fsync; -static struct tmp_objdir *bulk_fsync_objdir; - static struct bulk_checkin_state { char *pack_tmp_name; struct hashfile *f; @@ -110,11 +107,6 @@ static void do_batch_fsync(void) strbuf_release(&temp_path); needs_batch_fsync = 0; } - - if (bulk_fsync_objdir) { - tmp_objdir_migrate(bulk_fsync_objdir); - bulk_fsync_objdir = NULL; - } } static int already_written(struct bulk_checkin_state *state, struct object_id *oid) @@ -321,7 +313,6 @@ void fsync_loose_object_bulk_checkin(int fd) */ if (bulk_checkin_plugged && git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) { - assert(bulk_fsync_objdir); if (!needs_batch_fsync) needs_batch_fsync = 1; } else { @@ -343,19 +334,6 @@ int index_bulk_checkin(struct object_id *oid, void plug_bulk_checkin(void) { assert(!bulk_checkin_plugged); - - /* - * A temporary object directory is used to hold the files - * while they are not fsynced. - */ - if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) { - bulk_fsync_objdir = tmp_objdir_create("bulk-fsync"); - if (!bulk_fsync_objdir) - die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch")); - - tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0); - } - bulk_checkin_plugged = 1; } And then tried running: $ GIT_PERF_MAKE_OPTS='CFLAGS=-O3' ./run HEAD~ HEAD -- p3900-stash.sh And got: Test HEAD~ HEAD -------------------------------------------------------------------------------------------- 3900.2: stash 500 files (object_fsyncing=false) 0.56(0.08+0.09) 0.60(0.08+0.08) +7.1% 3900.4: stash 500 files (object_fsyncing=true) 14.50(0.07+0.15) 17.13(0.10+0.12) +18.1% 3900.6: stash 500 files (object_fsyncing=batch) 1.14(0.08+0.11) 1.03(0.08+0.10) -9.6% Now, I really don't trust that perf run to say anything except these being in the same ballpark, but it's clearly going to be a *bit* faster since we'll be doing fewer IOps. 
As to "usefully" I really do get what you're saying that you only find these useful when you combine the two because you'd like to have 100% safety, and that's fair enough. But since we are going to have a knob to turn off fsyncing entirely, and we have this "bulk" mode which requires you to carefully reason about your FS semantics to ascertain safety the performance/safety trade-off is clearly something that's useful to have tweaks for. And with "bulk" the concern about leaving behind stray corrupt objects is entirely orthagonal to corcerns about losing a ref update, which is the main thing we're worried about. I also don't see how even if you're arguing that nobody would want one without the other because everyone who cares about "bulk" cares about this stray-corrupt-loose-but-no-ref-update case, how it has any business being tied up in the "bulk" mode as far as the implementation goes. That's because the same edge case is exposed by core.fsyncObjectFiles=false for those who are assuming the initial "ordered" semantics. I.e. if we're saying that before we write the ref we'd like to not expose the WIP objects in the primary object store because they're not fsync'd yet, how is that mode different than "bulk" if we crash while doing that operation (before the eventual fsync()). So I really think it's much better to split these concerns up. I think even if you insist on the same end-state it makes the patch progression much *easier* to reason about. We'd then solve one problem at a time, and start with a commit where just the semantics that are unique to "bulk" are implemented, with nothing else conflated with those. > [...] >> Ok, so it's something we could do, but passing down 2-3 functions to >> object-file.c was a hassle. >> >> I tried to hack that up earlier and found that it wasn't *too >> bad*. I.e. we'd pass some "flags" about our intent, and amend various >> functions to take "don't close this one" and pass up the fd (or even do >> that as a global). >> >> In any case, having the commit message clearly document what's needed >> for what & what's essential & just shortcut taken for the convenience of >> the current implementation would be really useful. >> >> Then we can always e.g. change this later to just do the the fsync() on >> the last of N we write. >> > > I left a comment in the (now very long) commit message that indicates the > dummy file is there to make the API simpler. In terms of more understandable progression I also think this series would be much easier to understand if it converted one caller without needing the "cookie" where doing so is easy, e.g. the unpack-objects.c caller where we're processing nr_objects, so we can just pass down a flag to do the fsync() for i == nr_objects. That'll then clearly show that the whole business of having the global state on the side is just a replacement for passing down such a flag.
On Tue, Mar 22, 2022 at 2:29 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Mar 21 2022, Neeraj Singh wrote:
>
> > On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >>
> >> On Mon, Mar 21 2022, Neeraj Singh wrote:
> >>
> >> [Don't have time for a full reply, sorry, just something quick]
> >>
> >> > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
> >> > [...]
> >> >> So, my question is still why the temporary object dir migration part of
> >> >> this is needed.
> >> >>
> >> >> We are writing N loose object files, and we write those to temporary
> >> >> names already.
> >> >>
> >> >> AFAICT we could do all of this by doing the same
> >> >> tmp/rename/sync_file_range dance on the main object store.
> >> >>
> >> >
> >> > Why not the main object store? We want to maintain the invariant that any
> >> > name in the main object store refers to a file that durably has the
> >> > correct contents.
> >> > If we do sync_file_range and then rename, and then crash, we now have a file
> >> > in the main object store with some SHA name, whose contents may or may not
> >> > match the SHA. However, if we ensure an fsync happens before the rename,
> >> > a crash at any point will leave us either with no file in the main
> >> > object store or
> >> > with a file that is durable on the disk.
> >>
> >> Ah, I see.
> >>
> >> Why does that matter? If the "bulk" mode works as advertised we might
> >> have such a corrupt loose or pack file, but we won't have anything
> >> referring to it as far as reachability goes.
> >>
> >> I'm aware that the various code paths that handle OID writing don't deal
> >> too well with it in practice, to say the least, which one can try with,
> >> say:
> >>
> >>     $ echo foo | git hash-object -w --stdin
> >>     45b983be36b73c0788dc9cbcb76cbb80fc7bb057
> >>     $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057
> >>
> >> I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
> >> that hash-object again even returns successful (we see it's there
> >> already, and think it's OK).
> >>
> >
> > I was under the impression that, in practice, a corrupt loose object can create
> > persistent problems in the repo for future commands, since we might not
> > aggressively verify that an existing file with a certain OID really is
> > valid when
> > adding a new instance of the data with the same OID.
>
> Yes, it can. As the hash-object case shows we don't even check at all.
>
> For "incoming push" we *will* notice, but will just uselessly error
> out.
>
> I actually had some patches a while ago to turn off our own home-grown
> SHA-1 collision checking.
>
> It had the nice side effect of making it easier to recover from loose
> object corruption, since you could (re-)push the corrupted OID as a
> PACK, we wouldn't check (and die) on the bad loose object, and since we
> take a PACK over LOOSE we'd recover:
> https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/
>
> > If you don't have an fsync barrier before producing the final
> > content-addressable
> > name, you can't reason about "this operation happened before that operation,"
> > so it wouldn't really be valid to say that "we won't have anything
> > referring to it as far
> > as reachability goes."
>
> That's correct, but we're discussing a feature that *does have* that
> fsync barrier. So if we get an error while writing the loose objects
> before the "cookie" fsync we'll presumably error out.
> That'll then be
> followed by an fsync() of whatever makes the objects reachable.
>

Because we have a content-addressable store which generally trusts that
its contents are valid (at least when adding new instances of the same
content), the mere existence of a loose object with a certain name is
enough to make it "reachable" to future operations, even if there are
no other immediate ways to get to that object.

> > It's entirely possible that you'd have trees pointing to other trees
> > or blobs that aren't
> > valid, since data writes can be durable in any order. At this point,
> > future attempts to add
> > the same blobs or trees might silently drop the updates. I'm betting that's why
> > core.fsyncObjectFiles was added in the first place, since someone
> > observed severe
> > persistent consequences for this form of corruption.
>
> Well, you can see Linus's original rant-as-documentation for why we
> added it :) I.e. the original git implementation made some heavy
> linux-FS assumptions about the order of writes and an fsync() flushing
> any previous writes, which wasn't portable.
>
> >> But in any case, I think it would be much easier to both review and
> >> reason about this code if these concerns were split up.
> >>
> >> I.e. things that want no fsync at all (I'd think especially so) might
> >> want to have such updates serialized in this manner, and as Junio
> >> pointed out making these things inseparable as you've done creates API
> >> concerns & fallout that's got nothing to do with what we need for the
> >> performance gains of the bulk checkin fsyncing technique,
> >> e.g. concurrent "update-index" consumers not being able to assume
> >> reported objects exist as soon as they're reported.
> >>
> >
> > I want to explicitly not respond to this concern. I don't believe this
> > 100 line patch
> > can be usefully split.
>
> Leaving "usefully" aside for a second (since that's subjective), it
> clearly "can". I just tried this on top of "seen":
>
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index a702e0ff203..9e994c4d6ae 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -9,15 +9,12 @@
>  #include "pack.h"
>  #include "strbuf.h"
>  #include "string-list.h"
> -#include "tmp-objdir.h"
>  #include "packfile.h"
>  #include "object-store.h"
>
>  static int bulk_checkin_plugged;
>  static int needs_batch_fsync;
>
> -static struct tmp_objdir *bulk_fsync_objdir;
> -
>  static struct bulk_checkin_state {
>  	char *pack_tmp_name;
>  	struct hashfile *f;
> @@ -110,11 +107,6 @@ static void do_batch_fsync(void)
>  		strbuf_release(&temp_path);
>  		needs_batch_fsync = 0;
>  	}
> -
> -	if (bulk_fsync_objdir) {
> -		tmp_objdir_migrate(bulk_fsync_objdir);
> -		bulk_fsync_objdir = NULL;
> -	}
>  }
>
>  static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
> @@ -321,7 +313,6 @@ void fsync_loose_object_bulk_checkin(int fd)
>  	 */
>  	if (bulk_checkin_plugged &&
>  	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
> -		assert(bulk_fsync_objdir);
>  		if (!needs_batch_fsync)
>  			needs_batch_fsync = 1;
>  	} else {
> @@ -343,19 +334,6 @@ int index_bulk_checkin(struct object_id *oid,
>  void plug_bulk_checkin(void)
>  {
>  	assert(!bulk_checkin_plugged);
> -
> -	/*
> -	 * A temporary object directory is used to hold the files
> -	 * while they are not fsynced.
> -	 */
> -	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> -		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> -		if (!bulk_fsync_objdir)
> -			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
> -
> -		tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
> -	}
> -
>  	bulk_checkin_plugged = 1;
>  }
>
> And then tried running:
>
>     $ GIT_PERF_MAKE_OPTS='CFLAGS=-O3' ./run HEAD~ HEAD -- p3900-stash.sh
>
> And got:
>
>     Test                                             HEAD~             HEAD
>     --------------------------------------------------------------------------------------------
>     3900.2: stash 500 files (object_fsyncing=false)  0.56(0.08+0.09)   0.60(0.08+0.08) +7.1%
>     3900.4: stash 500 files (object_fsyncing=true)   14.50(0.07+0.15)  17.13(0.10+0.12) +18.1%
>     3900.6: stash 500 files (object_fsyncing=batch)  1.14(0.08+0.11)   1.03(0.08+0.10) -9.6%
>
> Now, I really don't trust that perf run to say anything except these
> being in the same ballpark, but it's clearly going to be a *bit* faster
> since we'll be doing fewer IOps.
>
> As to "usefully" I really do get what you're saying that you only find
> these useful when you combine the two, because you'd like to have 100%
> safety, and that's fair enough.
>
> But since we are going to have a knob to turn off fsyncing entirely, and
> we have this "bulk" mode which requires you to carefully reason about
> your FS semantics to ascertain safety, the performance/safety trade-off
> is clearly something that's useful to have tweaks for.
>
> And with "bulk" the concern about leaving behind stray corrupt objects
> is entirely orthogonal to concerns about losing a ref update, which is
> the main thing we're worried about.
>
> I also don't see how, even if you're arguing that nobody would want one
> without the other because everyone who cares about "bulk" cares about
> this stray-corrupt-loose-but-no-ref-update case, it has any business
> being tied up in the "bulk" mode as far as the implementation goes.
>
> That's because the same edge case is exposed by
> core.fsyncObjectFiles=false for those who are assuming the initial
> "ordered" semantics.
>
> I.e. if we're saying that before we write the ref we'd like to not
> expose the WIP objects in the primary object store because they're not
> fsync'd yet, how is that mode different from "bulk" if we crash while
> doing that operation (before the eventual fsync())?
>
> So I really think it's much better to split these concerns up.
>
> I think even if you insist on the same end-state it makes the patch
> progression much *easier* to reason about. We'd then solve one problem
> at a time, and start with a commit where just the semantics that are
> unique to "bulk" are implemented, with nothing else conflated with
> those.

On Windows, where we want to have a consistent ODB by default, I'm
adding a faster way to achieve that safety. No user is asking for a
world where we do half the work to make a consistent ODB but not the
other half. This one patch works holistically to provide the full batch
safety feature, and splitting it into two patches (which in the new
version wouldn't be as clean as what you've done above) doesn't make
the correctness of the whole thing more reviewable. In fact it's less
reviewable, since the fsync and the objdir migration would then be in
two separate patches, and a future historian wouldn't get as clear a
picture of the whole mechanism.

> > [...]
> >> Ok, so it's something we could do, but passing down 2-3 functions to
> >> object-file.c was a hassle.
> >>
> >> I tried to hack that up earlier and found that it wasn't *too
> >> bad*. I.e. we'd pass some "flags" about our intent, and amend various
> >> functions to take "don't close this one" and pass up the fd (or even do
> >> that as a global).
> >>
> >> In any case, having the commit message clearly document what's needed
> >> for what & what's essential & just a shortcut taken for the convenience of
> >> the current implementation would be really useful.
> >>
> >> Then we can always e.g. change this later to just do the fsync() on
> >> the last of N we write.
> >>
> >
> > I left a comment in the (now very long) commit message that indicates the
> > dummy file is there to make the API simpler.
>
> In terms of a more understandable progression I also think this series
> would be much easier to understand if it converted one caller without
> needing the "cookie" where doing so is easy, e.g. the unpack-objects.c
> caller where we're processing nr_objects, so we can just pass down a
> flag to do the fsync() for i == nr_objects.
>
> That'll then clearly show that the whole business of having the global
> state on the side is just a replacement for passing down such a flag.

That seems appropriate for our mailing list discussion, but I don't see
how it helps the patch series, because we'd be doing work to fsync the
final object and then reversing that work when producing the final end
state, which uses the dummy file.

Thanks,
Neeraj
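Stepping back, the scheme the two sides are debating can be summarized
in plain POSIX calls. The following is a condensed editorial sketch
with hypothetical paths, no hashing, and no error handling; the series'
actual code goes through the tempfile, tmp-objdir, and bulk-checkin
APIs:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* hypothetical name helpers, standing in for tmp-objdir paths */
static const char *tmp_obj_name(size_t i)
{
	static char buf[64];
	snprintf(buf, sizeof(buf), "objects/tmp-bulk/%zu", i);
	return buf;
}

static const char *final_obj_name(size_t i)
{
	static char buf[64];
	snprintf(buf, sizeof(buf), "objects/%zu", i);
	return buf;
}

void batch_write_sketch(const char *const *blobs, size_t n)
{
	/* 1. write every object, asking only for pagecache writeback */
	for (size_t i = 0; i < n; i++) {
		int fd = open(tmp_obj_name(i), O_WRONLY | O_CREAT | O_EXCL, 0444);
		write(fd, blobs[i], strlen(blobs[i]));
		/* push dirty pages to the device; no disk-cache flush yet */
		sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
		close(fd);
	}

	/*
	 * 2. one full fsync() against a dummy "cookie" file is the
	 *    barrier that makes all of the above durable at once
	 *    (relying on fsync triggering a full device cache flush,
	 *    which is FS- and OS-dependent)
	 */
	int cookie = open("objects/tmp-bulk/cookie", O_RDWR | O_CREAT | O_EXCL, 0600);
	fsync(cookie);
	close(cookie);
	unlink("objects/tmp-bulk/cookie");

	/* 3. only now expose the objects under their final names */
	for (size_t i = 0; i < n; i++)
		rename(tmp_obj_name(i), final_obj_name(i));
}

Whether step 3 runs out of a separate temporary object directory
(Neeraj's patch) or the writes in step 1 go straight into the main
object store (Ævar's suggestion) is precisely the point of contention
above.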
On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
> [...]
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index 889522956e4..a3798dfc334 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -628,6 +628,13 @@ core.fsyncMethod::
>  * `writeout-only` issues pagecache writeback requests, but depending on the
>    filesystem and storage hardware, data added to the repository may not be
>    durable in the event of a system crash. This is the default mode on macOS.
> +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> +  updates in the disk writeback cache and then does a single full fsync of
> +  a dummy file to trigger the disk cache flush at the end of the operation.

I think adding a \n\n here would help make this more readable & break
the flow a bit. I.e. just add a "+" on its own line, followed by
"Currently...

> +  Currently `batch` mode only applies to loose-object files. Other repository
> +  data is made durable as if `fsync` was specified. This mode is expected to
> +  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
> +  and on Windows for repos stored on NTFS or ReFS filesystems.
On Wed, Mar 23, 2022 at 6:27 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> > [...]
> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> > index 889522956e4..a3798dfc334 100644
> > --- a/Documentation/config/core.txt
> > +++ b/Documentation/config/core.txt
> > @@ -628,6 +628,13 @@ core.fsyncMethod::
> >  * `writeout-only` issues pagecache writeback requests, but depending on the
> >    filesystem and storage hardware, data added to the repository may not be
> >    durable in the event of a system crash. This is the default mode on macOS.
> > +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> > +  updates in the disk writeback cache and then does a single full fsync of
> > +  a dummy file to trigger the disk cache flush at the end of the operation.
>
> I think adding a \n\n here would help make this more readable & break
> the flow a bit. I.e. just add a "+" on its own line, followed by
> "Currently...
>
> > +  Currently `batch` mode only applies to loose-object files. Other repository
> > +  data is made durable as if `fsync` was specified. This mode is expected to
> > +  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
> > +  and on Windows for repos stored on NTFS or ReFS filesystems.

Thanks, will fix.
diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 889522956e4..a3798dfc334 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -628,6 +628,13 @@ core.fsyncMethod::
 * `writeout-only` issues pagecache writeback requests, but depending on the
   filesystem and storage hardware, data added to the repository may not be
   durable in the event of a system crash. This is the default mode on macOS.
+* `batch` enables a mode that uses writeout-only flushes to stage multiple
+  updates in the disk writeback cache and then does a single full fsync of
+  a dummy file to trigger the disk cache flush at the end of the operation.
+  Currently `batch` mode only applies to loose-object files. Other repository
+  data is made durable as if `fsync` was specified. This mode is expected to
+  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
+  and on Windows for repos stored on NTFS or ReFS filesystems.
 
 core.fsyncObjectFiles::
 	This boolean will enable 'fsync()' when writing object files.
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 93b1dc5138a..a702e0ff203 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -3,14 +3,20 @@
  */
 #include "cache.h"
 #include "bulk-checkin.h"
+#include "lockfile.h"
 #include "repository.h"
 #include "csum-file.h"
 #include "pack.h"
 #include "strbuf.h"
+#include "string-list.h"
+#include "tmp-objdir.h"
 #include "packfile.h"
 #include "object-store.h"
 
 static int bulk_checkin_plugged;
+static int needs_batch_fsync;
+
+static struct tmp_objdir *bulk_fsync_objdir;
 
 static struct bulk_checkin_state {
 	char *pack_tmp_name;
	struct hashfile *f;
@@ -80,6 +86,37 @@ clear_exit:
 	reprepare_packed_git(the_repository);
 }
 
+/*
+ * Cleanup after batch-mode fsync_object_files.
+ */
+static void do_batch_fsync(void)
+{
+	/*
+	 * Issue a full hardware flush against a temporary file to ensure
+	 * that all objects are durable before any renames occur. The code in
+	 * fsync_loose_object_bulk_checkin has already issued a writeout
+	 * request, but it has not flushed any writeback cache in the storage
+	 * hardware.
+	 */
+
+	if (needs_batch_fsync) {
+		struct strbuf temp_path = STRBUF_INIT;
+		struct tempfile *temp;
+
+		strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
+		temp = xmks_tempfile(temp_path.buf);
+		fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
+		delete_tempfile(&temp);
+		strbuf_release(&temp_path);
+		needs_batch_fsync = 0;
+	}
+
+	if (bulk_fsync_objdir) {
+		tmp_objdir_migrate(bulk_fsync_objdir);
+		bulk_fsync_objdir = NULL;
+	}
+}
+
 static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
 {
 	int i;
@@ -274,6 +311,24 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 	return 0;
 }
 
+void fsync_loose_object_bulk_checkin(int fd)
+{
+	/*
+	 * If we have a plugged bulk checkin, we issue a call that
+	 * cleans the filesystem page cache but avoids a hardware flush
+	 * command. Later on we will issue a single hardware flush
+	 * as part of do_batch_fsync.
+	 */
+	if (bulk_checkin_plugged &&
+	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
+		assert(bulk_fsync_objdir);
+		if (!needs_batch_fsync)
+			needs_batch_fsync = 1;
+	} else {
+		fsync_or_die(fd, "loose object file");
+	}
+}
+
 int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags)
@@ -288,6 +343,19 @@ int index_bulk_checkin(struct object_id *oid,
 void plug_bulk_checkin(void)
 {
 	assert(!bulk_checkin_plugged);
+
+	/*
+	 * A temporary object directory is used to hold the files
+	 * while they are not fsynced.
+	 */
+	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
+		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
+		if (!bulk_fsync_objdir)
+			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
+
+		tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
+	}
+
 	bulk_checkin_plugged = 1;
 }
 
@@ -297,4 +365,6 @@ void unplug_bulk_checkin(void)
 	bulk_checkin_plugged = 0;
 	if (bulk_checkin_state.f)
 		finish_bulk_checkin(&bulk_checkin_state);
+
+	do_batch_fsync();
 }
diff --git a/bulk-checkin.h b/bulk-checkin.h
index b26f3dc3b74..08f292379b6 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -6,6 +6,8 @@
 
 #include "cache.h"
 
+void fsync_loose_object_bulk_checkin(int fd);
+
 int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags);
diff --git a/cache.h b/cache.h
index 3160bc1e489..d1ae51388c9 100644
--- a/cache.h
+++ b/cache.h
@@ -1040,7 +1040,8 @@ extern int use_fsync;
 
 enum fsync_method {
 	FSYNC_METHOD_FSYNC,
-	FSYNC_METHOD_WRITEOUT_ONLY
+	FSYNC_METHOD_WRITEOUT_ONLY,
+	FSYNC_METHOD_BATCH
 };
 
 extern enum fsync_method fsync_method;
@@ -1767,6 +1768,11 @@ void fsync_or_die(int fd, const char *);
 int fsync_component(enum fsync_component component, int fd);
 void fsync_component_or_die(enum fsync_component component, int fd, const char *msg);
 
+static inline int batch_fsync_enabled(enum fsync_component component)
+{
+	return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH);
+}
+
 ssize_t read_in_full(int fd, void *buf, size_t count);
 ssize_t write_in_full(int fd, const void *buf, size_t count);
 ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
diff --git a/config.c b/config.c
index 261ee7436e0..0b28f90de8b 100644
--- a/config.c
+++ b/config.c
@@ -1688,6 +1688,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 			fsync_method = FSYNC_METHOD_FSYNC;
 		else if (!strcmp(value, "writeout-only"))
 			fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
+		else if (!strcmp(value, "batch"))
+			fsync_method = FSYNC_METHOD_BATCH;
 		else
 			warning(_("ignoring unknown core.fsyncMethod value '%s'"), value);
 
diff --git a/object-file.c b/object-file.c
index 5258d9ed827..bdb0a38328f 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1895,6 +1895,8 @@ static void close_loose_object(int fd)
 
 	if (fsync_object_files > 0)
 		fsync_or_die(fd, "loose object file");
+	else if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
+		fsync_loose_object_bulk_checkin(fd);
 	else
 		fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
 				       "loose object file");
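For reference, a sketch of the calling convention the patch
establishes. The write_one_loose_object() helper below is a
hypothetical stand-in for the object-file.c code paths whose
close_loose_object() now routes to fsync_loose_object_bulk_checkin()
in batch mode:

#include <stddef.h>

/* real prototypes, from bulk-checkin.h as amended by this patch */
void plug_bulk_checkin(void);
void unplug_bulk_checkin(void);

/* hypothetical stand-in for the object-file.c write paths */
static void write_one_loose_object(size_t i) { (void)i; }

static void add_many_objects(size_t n)
{
	plug_bulk_checkin();		/* batch mode: creates the bulk-fsync tmp-objdir */
	for (size_t i = 0; i < n; i++)
		write_one_loose_object(i);	/* each gets a writeout-only flush */
	unplug_bulk_checkin();		/* one cookie fsync, then tmp-objdir migration */
}

With core.fsyncMethod=batch and the loose-object component enabled in
core.fsync, everything written between the plug and the unplug gets the
writeout-only treatment, and the unplug pays for a single hardware
flush before the temporary object directory's contents are migrated
into the main object store.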