Message ID | f7f756f3932cdbca587de397598758c685bac29a.1632176111.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Implement a batched fsync option for core.fsyncObjectFiles | expand |
On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote: > From: Neeraj Singh <neerajsi@microsoft.com> > > The update-index functionality is used internally by 'git stash push' to > setup the internal stashed commit. > > This change enables bulk-checkin for update-index infrastructure to > speed up adding new objects to the object database by leveraging the > pack functionality and the new bulk-fsync functionality. This mode > is enabled when passing paths to update-index via the --stdin flag, > as is done by 'git stash'. > > There is some risk with this change, since under batch fsync, the object > files will not be available until the update-index is entirely complete. > This usage is unlikely, since any tool invoking update-index and > expecting to see objects would have to snoop the output of --verbose to > find out when update-index has actually processed a given path. > Additionally the index is locked for the duration of the update. Would you really need to sniff the verbose output? If I'm streaming data to update-index now it looks like I could assume before that update-index would have done the work if I managed to fflush() to it, since it's processing a line at a time and doing the work in that line-at-a-time loop. I.e. you could print lines to it, and then do concurrent object lookups knowing the data was written already... I think this is probably fine, but that case seems way likelier than someone sniffing back the verbose output, presumably for the "add" in update_one(), but that's called in the getline_fn() loop... All of this makes me wonder why this isn't using tmp-objdir.c, i.e. we could have our cake and eat it too by writing the "real" objects, and then just renaming them between directories instead. But perhaps the answer has something to do with the metadata issues I raised. And well, tmp-objdir.c isn't going to help someone in practice that's relying on this "update-index --stdin" behavior, as they won't know where we staged the temporary files...
On Tue, Sep 21, 2021 at 4:53 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote: > > > From: Neeraj Singh <neerajsi@microsoft.com> > > > > The update-index functionality is used internally by 'git stash push' to > > setup the internal stashed commit. > > > > This change enables bulk-checkin for update-index infrastructure to > > speed up adding new objects to the object database by leveraging the > > pack functionality and the new bulk-fsync functionality. This mode > > is enabled when passing paths to update-index via the --stdin flag, > > as is done by 'git stash'. > > > > There is some risk with this change, since under batch fsync, the object > > files will not be available until the update-index is entirely complete. > > This usage is unlikely, since any tool invoking update-index and > > expecting to see objects would have to snoop the output of --verbose to > > find out when update-index has actually processed a given path. > > Additionally the index is locked for the duration of the update. > > Would you really need to sniff the verbose output? If I'm streaming data > to update-index now it looks like I could assume before that > update-index would have done the work if I managed to fflush() to it, > since it's processing a line at a time and doing the work in that > line-at-a-time loop. > > I.e. you could print lines to it, and then do concurrent object lookups > knowing the data was written already... > > I think this is probably fine, but that case seems way likelier than > someone sniffing back the verbose output, presumably for the "add" in > update_one(), but that's called in the getline_fn() loop... Does fflush really guarantee that the reader has picked up the input from a pipe across all environments? Even if a reader picks up the input, does that mean that the reader is done processing it? Do you think I really need to revise this comment? Maybe leave a terser, 'this usage is thought to be unlikely'? > > All of this makes me wonder why this isn't using tmp-objdir.c, i.e. we > could have our cake and eat it too by writing the "real" objects, and > then just renaming them between directories instead. But perhaps the > answer has something to do with the metadata issues I raised. > > And well, tmp-objdir.c isn't going to help someone in practice that's > relying on this "update-index --stdin" behavior, as they won't know > where we staged the temporary files... > One motivation of the current design behind renaming the files is that some networked filesystems don't seem to like cross-directory renames much. It also so happens that ReFS on Windows also prefers renames to stay within the directory. Actually any filesystem would likely be slightly faster, since fewer objects are being modified (one dir versus two).
On Tue, Sep 21, 2021 at 6:27 PM Neeraj Singh <nksingh85@gmail.com> wrote: > > On Tue, Sep 21, 2021 at 4:53 PM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: > > > > All of this makes me wonder why this isn't using tmp-objdir.c, i.e. we > > could have our cake and eat it too by writing the "real" objects, and > > then just renaming them between directories instead. But perhaps the > > answer has something to do with the metadata issues I raised. > > > > And well, tmp-objdir.c isn't going to help someone in practice that's > > relying on this "update-index --stdin" behavior, as they won't know > > where we staged the temporary files... > > > > One motivation of the current design behind renaming the files is that > some networked filesystems don't seem to like cross-directory renames > much. It also so happens that ReFS on Windows also prefers renames to > stay within the directory. Actually any filesystem would likely be > slightly faster, > since fewer objects are being modified (one dir versus two). Whelp, as part of v5 I tried to make unpack-objects.c use the batch fsync mode and now I see a strong reason to take your tmp-objdir suggestion. As part of OBJ_REF_DELTA unpacking, we need access to the object while we're in the plugged state. I didn't notice this at first, but got lucky that I tested that case first and hit an error. V5 will create a tmp-objdir and add a new interface to install it as the primary objdir.
diff --git a/builtin/update-index.c b/builtin/update-index.c index 187203e8bb5..b0689f2cdf6 100644 --- a/builtin/update-index.c +++ b/builtin/update-index.c @@ -5,6 +5,7 @@ */ #define USE_THE_INDEX_COMPATIBILITY_MACROS #include "cache.h" +#include "bulk-checkin.h" #include "config.h" #include "lockfile.h" #include "quote.h" @@ -1150,6 +1151,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) struct strbuf unquoted = STRBUF_INIT; setup_work_tree(); + plug_bulk_checkin(); while (getline_fn(&buf, stdin) != EOF) { char *p; if (!nul_term_line && buf.buf[0] == '"') { @@ -1164,6 +1166,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) chmod_path(set_executable_bit, p); free(p); } + unplug_bulk_checkin(&lock_file); strbuf_release(&unquoted); strbuf_release(&buf); }