Message ID | 54797dbc52060b7fa913642cd5266f7e159a5bc9.1647760561.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects | expand |
On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote: > From: Neeraj Singh <neerajsi@microsoft.com> > > The update-index functionality is used internally by 'git stash push' to > setup the internal stashed commit. > > This change enables bulk-checkin for update-index infrastructure to > speed up adding new objects to the object database by leveraging the > batch fsync functionality. > > There is some risk with this change, since under batch fsync, the object > files will be in a tmp-objdir until update-index is complete. This > usage is unlikely, since any tool invoking update-index and expecting to > see objects would have to synchronize with the update-index process > after passing it a file path. > > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> > --- > builtin/update-index.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/builtin/update-index.c b/builtin/update-index.c > index 75d646377cc..38e9d7e88cb 100644 > --- a/builtin/update-index.c > +++ b/builtin/update-index.c > @@ -5,6 +5,7 @@ > */ > #define USE_THE_INDEX_COMPATIBILITY_MACROS > #include "cache.h" > +#include "bulk-checkin.h" > #include "config.h" > #include "lockfile.h" > #include "quote.h" > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > > the_index.updated_skipworktree = 1; > > + /* we might be adding many objects to the object database */ > + plug_bulk_checkin(); > + Shouldn't this be after parse_options_start()?
"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > diff --git a/builtin/update-index.c b/builtin/update-index.c > index 75d646377cc..38e9d7e88cb 100644 > --- a/builtin/update-index.c > +++ b/builtin/update-index.c > @@ -5,6 +5,7 @@ > */ > #define USE_THE_INDEX_COMPATIBILITY_MACROS > #include "cache.h" > +#include "bulk-checkin.h" > #include "config.h" > #include "lockfile.h" > #include "quote.h" > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > > the_index.updated_skipworktree = 1; > > + /* we might be adding many objects to the object database */ > + plug_bulk_checkin(); > + > /* > * Custom copy of parse_options() because we want to handle > * filename arguments as they come. > @@ -1190,6 +1194,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > strbuf_release(&buf); > } > > + /* by now we must have added all of the new objects */ > + unplug_bulk_checkin(); I understand read-from-stdin code path would be worth plugging, but the list of paths on the command line? How many of them would one fit? Of course, the feeder may be expecting for the objects to appear in the object store as it feeds the paths and will be utterly broken by this change, as you mentioned in the proposed log message. The existing plug/unplug will change the behaviour by making the objects sent to the packfile available only after getting unplugged. This series makes it even worse by making loose objects also unavailable until unplug is called. So, it probably is safer and more sensible approach to introduce a new command line option to allow the bulk checkin, and those who do not care about the intermediate state to opt into the new feature.
On Mon, Mar 21, 2022 at 8:04 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote: > > > From: Neeraj Singh <neerajsi@microsoft.com> > > > > The update-index functionality is used internally by 'git stash push' to > > setup the internal stashed commit. > > > > This change enables bulk-checkin for update-index infrastructure to > > speed up adding new objects to the object database by leveraging the > > batch fsync functionality. > > > > There is some risk with this change, since under batch fsync, the object > > files will be in a tmp-objdir until update-index is complete. This > > usage is unlikely, since any tool invoking update-index and expecting to > > see objects would have to synchronize with the update-index process > > after passing it a file path. > > > > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> > > --- > > builtin/update-index.c | 6 ++++++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/builtin/update-index.c b/builtin/update-index.c > > index 75d646377cc..38e9d7e88cb 100644 > > --- a/builtin/update-index.c > > +++ b/builtin/update-index.c > > @@ -5,6 +5,7 @@ > > */ > > #define USE_THE_INDEX_COMPATIBILITY_MACROS > > #include "cache.h" > > +#include "bulk-checkin.h" > > #include "config.h" > > #include "lockfile.h" > > #include "quote.h" > > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > > > > the_index.updated_skipworktree = 1; > > > > + /* we might be adding many objects to the object database */ > > + plug_bulk_checkin(); > > + > > Shouldn't this be after parse_options_start()? Does it make a difference? Especially if we do the object dir creation lazily?
On Mon, Mar 21, 2022 at 10:50 AM Junio C Hamano <gitster@pobox.com> wrote: > > "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > diff --git a/builtin/update-index.c b/builtin/update-index.c > > index 75d646377cc..38e9d7e88cb 100644 > > --- a/builtin/update-index.c > > +++ b/builtin/update-index.c > > @@ -5,6 +5,7 @@ > > */ > > #define USE_THE_INDEX_COMPATIBILITY_MACROS > > #include "cache.h" > > +#include "bulk-checkin.h" > > #include "config.h" > > #include "lockfile.h" > > #include "quote.h" > > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > > > > the_index.updated_skipworktree = 1; > > > > + /* we might be adding many objects to the object database */ > > + plug_bulk_checkin(); > > + > > /* > > * Custom copy of parse_options() because we want to handle > > * filename arguments as they come. > > @@ -1190,6 +1194,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) > > strbuf_release(&buf); > > } > > > > + /* by now we must have added all of the new objects */ > > + unplug_bulk_checkin(); > > I understand read-from-stdin code path would be worth plugging, but > the list of paths on the command line? How many of them would one > fit? > do_reupdate could touch all the files in the index. Also one can pass a directory, and re-add all files under the directory. > Of course, the feeder may be expecting for the objects to appear in > the object store as it feeds the paths and will be utterly broken by > this change, as you mentioned in the proposed log message. The > existing plug/unplug will change the behaviour by making the objects > sent to the packfile available only after getting unplugged. This > series makes it even worse by making loose objects also unavailable > until unplug is called. > > So, it probably is safer and more sensible approach to introduce a > new command line option to allow the bulk checkin, and those who do > not care about the intermediate state to opt into the new feature. > I don't believe this usage is likely today. How would the feeder know when it can expect to find an object in the object directory after passing something on stdin? When fed via stdin, git-update-index will asynchronously add that object to the object database, leaving no indication to the feeder of when it actually happens, aside from it happening before the git-update-index process terminates. I used to have a comment here about the feeder being able to parse the --verbose output to get feedback from git-update-index, which would be quite tricky. I thought it was unnecessarily detailed. Thanks, Neeraj
On Mon, Mar 21 2022, Neeraj Singh wrote: > On Mon, Mar 21, 2022 at 8:04 AM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> >> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote: >> >> > From: Neeraj Singh <neerajsi@microsoft.com> >> > >> > The update-index functionality is used internally by 'git stash push' to >> > setup the internal stashed commit. >> > >> > This change enables bulk-checkin for update-index infrastructure to >> > speed up adding new objects to the object database by leveraging the >> > batch fsync functionality. >> > >> > There is some risk with this change, since under batch fsync, the object >> > files will be in a tmp-objdir until update-index is complete. This >> > usage is unlikely, since any tool invoking update-index and expecting to >> > see objects would have to synchronize with the update-index process >> > after passing it a file path. >> > >> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> >> > --- >> > builtin/update-index.c | 6 ++++++ >> > 1 file changed, 6 insertions(+) >> > >> > diff --git a/builtin/update-index.c b/builtin/update-index.c >> > index 75d646377cc..38e9d7e88cb 100644 >> > --- a/builtin/update-index.c >> > +++ b/builtin/update-index.c >> > @@ -5,6 +5,7 @@ >> > */ >> > #define USE_THE_INDEX_COMPATIBILITY_MACROS >> > #include "cache.h" >> > +#include "bulk-checkin.h" >> > #include "config.h" >> > #include "lockfile.h" >> > #include "quote.h" >> > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) >> > >> > the_index.updated_skipworktree = 1; >> > >> > + /* we might be adding many objects to the object database */ >> > + plug_bulk_checkin(); >> > + >> >> Shouldn't this be after parse_options_start()? > > Does it make a difference? Especially if we do the object dir creation lazily? I think it won't matter for the machine, but it helps with readability to keep code like this as close to where it's used as possible. Close enough and we'd also spot the other bug I mentioned here, i.e. that we're setting this up where we're not writing objects at all :)
diff --git a/builtin/update-index.c b/builtin/update-index.c index 75d646377cc..38e9d7e88cb 100644 --- a/builtin/update-index.c +++ b/builtin/update-index.c @@ -5,6 +5,7 @@ */ #define USE_THE_INDEX_COMPATIBILITY_MACROS #include "cache.h" +#include "bulk-checkin.h" #include "config.h" #include "lockfile.h" #include "quote.h" @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) the_index.updated_skipworktree = 1; + /* we might be adding many objects to the object database */ + plug_bulk_checkin(); + /* * Custom copy of parse_options() because we want to handle * filename arguments as they come. @@ -1190,6 +1194,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) strbuf_release(&buf); } + /* by now we must have added all of the new objects */ + unplug_bulk_checkin(); if (split_index > 0) { if (git_config_get_split_index() == 0) warning(_("core.splitIndex is set to false; "