Message ID | 53261f0099d53524155464fe79d10f9605fe93aa.1648097906.git.gitgitgadget@gmail.com (mailing list archive)
---|---
State | Superseded
Series | core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects
On Thu, Mar 24 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> Make it clearer in the naming and documentation of the plug_bulk_checkin
> and unplug_bulk_checkin APIs that they can be thought of as
> a "transaction" to optimize operations on the object database.
>
> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  builtin/add.c  |  4 ++--
>  bulk-checkin.c |  4 ++--
>  bulk-checkin.h | 14 ++++++++++++--
>  3 files changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/add.c b/builtin/add.c
> index 3ffb86a4338..9bf37ceae8e 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  		string_list_clear(&only_match_skip_worktree, 0);
>  	}
>
> -	plug_bulk_checkin();
> +	begin_odb_transaction();
>
>  	if (add_renormalize)
>  		exit_status |= renormalize_tracked_files(&pathspec, flags);
> @@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>
>  	if (chmod_arg && pathspec.nr)
>  		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
> -	unplug_bulk_checkin();
> +	end_odb_transaction();

Aside from anything else we've (dis)agreed on, I found this part really
odd when hacking on my RFC-on-top, i.e. originally I (wrongly) thought
the plug_bulk_checkin() was something that originated with this series
which adds the "bulk" mode.

But no, on second inspection it's a thing Junio added a long time ago so
that in this case we "stream to a pack" where we'd otherwise add N loose
objects.

Which (and I think Junio brought this up in an earlier round, but I
didn't fully understand it at the time) makes this whole thing quite odd
to me.

So first, shouldn't we add this begin_odb_transaction() as a new thing?
I.e. surely wanting to do that object target redirection within a given
begin/end "scope" should be orthogonal to how fsync() happens within
that "scope", though in this case that happens to correspond.

And secondly, per the commit message and comment when it was added in
568508e7657 (bulk-checkin: replace fast-import based implementation,
2011-10-28): is it something we need *for that purpose* with the series
to unpack-objects without malloc()ing the size of the blob[1]?

And, if so, and orthogonal to that: if we know how to either stream N
objects to a PACK (as fast-import does), *and* we now (or SOON) know how
to stream loose objects without using size(blob) amounts of memory,
doesn't the "optimize fsync()" case rather want to make use of the
stream-to-pack approach?

I.e. have you tried, for the cases where we create say 1k objects for
"git stash", to stream those to a pack? How does that compare (both
with/without the fsync changes)?

I.e. I do worry (also per [2]) that while the whole "bulk fsync" is neat
(and I think we can use it in either case, to defer object syncs until
the "index" or "ref" sync, as my RFC does), we're adding a bunch of
configuration and complexity for something that:

 1. Ultimately isn't all that important, as for part of it we can
    already mostly configure it away (i.e. "git-unpack-objects" vs.
    writing a pack, cf. transfer.unpackLimit).

 2. We don't have #1 for "add" and "update-index", but if we stream to
    packs, is there any remaining benefit in practice?

1. https://lore.kernel.org/git/cover-v11-0.8-00000000000-20220319T001411Z-avarab@gmail.com/
2. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/
On Thu, Mar 24, 2022 at 9:24 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Thu, Mar 24 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > Make it clearer in the naming and documentation of the plug_bulk_checkin
> > and unplug_bulk_checkin APIs that they can be thought of as
> > a "transaction" to optimize operations on the object database.
> >
> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> > ---
> >  builtin/add.c  |  4 ++--
> >  bulk-checkin.c |  4 ++--
> >  bulk-checkin.h | 14 ++++++++++++--
> >  3 files changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/builtin/add.c b/builtin/add.c
> > index 3ffb86a4338..9bf37ceae8e 100644
> > --- a/builtin/add.c
> > +++ b/builtin/add.c
> > @@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >  		string_list_clear(&only_match_skip_worktree, 0);
> >  	}
> >
> > -	plug_bulk_checkin();
> > +	begin_odb_transaction();
> >
> >  	if (add_renormalize)
> >  		exit_status |= renormalize_tracked_files(&pathspec, flags);
> > @@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >
> >  	if (chmod_arg && pathspec.nr)
> >  		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
> > -	unplug_bulk_checkin();
> > +	end_odb_transaction();
>
> Aside from anything else we've (dis)agreed on, I found this part really
> odd when hacking on my RFC-on-top, i.e. originally I (wrongly) thought
> the plug_bulk_checkin() was something that originated with this series
> which adds the "bulk" mode.
>
> But no, on second inspection it's a thing Junio added a long time ago so
> that in this case we "stream to a pack" where we'd otherwise add N loose
> objects.
>
> Which (and I think Junio brought this up in an earlier round, but I
> didn't fully understand it at the time) makes this whole thing quite odd
> to me.
>
> So first, shouldn't we add this begin_odb_transaction() as a new thing?
> I.e. surely wanting to do that object target redirection within a given
> begin/end "scope" should be orthogonal to how fsync() happens within
> that "scope", though in this case that happens to correspond.
>
> And secondly, per the commit message and comment when it was added in
> 568508e7657 (bulk-checkin: replace fast-import based implementation,
> 2011-10-28): is it something we need *for that purpose* with the series
> to unpack-objects without malloc()ing the size of the blob[1]?
>

The original change seems to be about optimizing addition of successive
large blobs to the ODB when we know we have a large batch. It's a
batch-mode optimization for the ODB, similar to my patch series, just
targeting large blobs rather than small blobs/trees. It also has the
same property that the added data is "invisible" until the transaction
ends.

> And, if so, and orthogonal to that: if we know how to either stream N
> objects to a PACK (as fast-import does), *and* we now (or SOON) know how
> to stream loose objects without using size(blob) amounts of memory,
> doesn't the "optimize fsync()" case rather want to make use of the
> stream-to-pack approach?
>
> I.e. have you tried, for the cases where we create say 1k objects for
> "git stash", to stream those to a pack? How does that compare (both
> with/without the fsync changes)?
>
> I.e. I do worry (also per [2]) that while the whole "bulk fsync" is neat
> (and I think we can use it in either case, to defer object syncs until
> the "index" or "ref" sync, as my RFC does), we're adding a bunch of
> configuration and complexity for something that:
>
>  1. Ultimately isn't all that important, as for part of it we can
>     already mostly configure it away (i.e. "git-unpack-objects" vs.
>     writing a pack, cf. transfer.unpackLimit).
>
>  2. We don't have #1 for "add" and "update-index", but if we stream to
>     packs, is there any remaining benefit in practice?
>
> 1. https://lore.kernel.org/git/cover-v11-0.8-00000000000-20220319T001411Z-avarab@gmail.com/
> 2. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/

Stream to pack is a good idea. But I think we'd want a way to append to
the most recent pack, so that we don't explode the number of packs,
which seems to impose a linear cost on ODB operations, at least to load
up the indexes. I think this is orthogonal, and we can always change the
meaning of batch mode to use a pack mechanism when such a mechanism is
ready.

Thanks,
Neeraj
diff --git a/builtin/add.c b/builtin/add.c
index 3ffb86a4338..9bf37ceae8e 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 		string_list_clear(&only_match_skip_worktree, 0);
 	}
 
-	plug_bulk_checkin();
+	begin_odb_transaction();
 
 	if (add_renormalize)
 		exit_status |= renormalize_tracked_files(&pathspec, flags);
@@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 
 	if (chmod_arg && pathspec.nr)
 		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
-	unplug_bulk_checkin();
+	end_odb_transaction();
 
 finish:
 	if (write_locked_index(&the_index, &lock_file,
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6d6c37171c9..a16ae3c629d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -285,12 +285,12 @@ int index_bulk_checkin(struct object_id *oid,
 	return status;
 }
 
-void plug_bulk_checkin(void)
+void begin_odb_transaction(void)
 {
 	state.plugged = 1;
 }
 
-void unplug_bulk_checkin(void)
+void end_odb_transaction(void)
 {
 	state.plugged = 0;
 	if (state.f)
diff --git a/bulk-checkin.h b/bulk-checkin.h
index b26f3dc3b74..69a94422ac7 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -10,7 +10,17 @@ int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags);
 
-void plug_bulk_checkin(void);
-void unplug_bulk_checkin(void);
+/*
+ * Tell the object database to optimize for adding
+ * multiple objects. end_odb_transaction must be called
+ * to make new objects visible.
+ */
+void begin_odb_transaction(void);
+
+/*
+ * Tell the object database to make any objects from the
+ * current transaction visible.
+ */
+void end_odb_transaction(void);
 
 #endif