
[1/7] bulk-checkin: rename 'state' variable and separate 'plugged' boolean

Message ID a77d02df626ed6dff485e1342ff7affd6999ec44.1647379859.git.gitgitgadget@gmail.com
State Superseded
Series core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects

Commit Message

Neeraj Singh (WINDOWS-SFS) March 15, 2022, 9:30 p.m. UTC
From: Neeraj Singh <neerajsi@microsoft.com>

Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.

* Rename 'state' variable to 'bulk_checkin_state', since we will later
  be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
  find in the debugger, since the name is more unique.

* Move the 'plugged' data member of 'bulk_checkin_state' into a separate
  static variable. Doing this avoids resetting the variable in
  finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
  seem to unintentionally disable the plugging functionality the first
  time a new packfile must be created due to packfile size limits. While
  disabling the plugging state only results in suboptimal behavior for
  the current code, it would be fatal for the bulk-fsync functionality
  later in this patch series.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
---
 bulk-checkin.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)
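
For readers without bulk-checkin.c at hand, the problem described in the
second bullet can be shown with a condensed sketch of the pre-image (the
real finish_bulk_checkin() also writes the pack index and installs the
temporary packfile before clearing the state; only the parts relevant to
'plugged' are kept here):

#include <string.h>

/* Condensed pre-image: 'plugged' lives inside the per-packfile state. */
static struct bulk_checkin_state {
	unsigned plugged:1;	/* set by plug_bulk_checkin() */
	char *pack_tmp_name;
	/* ... other packfile-writing bookkeeping ... */
} state;

static void finish_bulk_checkin(struct bulk_checkin_state *state)
{
	/* ... finish and install the temporary packfile ... */

	/*
	 * Zeroing the whole struct also clears 'plugged'.  If a packfile
	 * has to be finished early because it hit the size limit, the
	 * caller is still logically "plugged", but every later object now
	 * ends up finishing a packfile of its own.
	 */
	memset(state, 0, sizeof(*state));
}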

Comments

Junio C Hamano March 16, 2022, 5:33 a.m. UTC | #1
"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.
>
> * Rename 'state' variable to 'bulk_checkin_state', since we will later
>   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
>   find in the debugger, since the name is more unique.
>
> * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
>   static variable. Doing this avoids resetting the variable in
>   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
>   seem to unintentionally disable the plugging functionality the first
>   time a new packfile must be created due to packfile size limits. While
>   disabling the plugging state only results in suboptimal behavior for
>   the current code, it would be fatal for the bulk-fsync functionality
>   later in this patch series.

Sorry, but I am confused.  The bulk-checkin infrastructure is there
so that we can send many little objects into a single packfile
instead of creating many little loose object files.  Everything we
throw at object-file.c::index_stream() will be concatenated into the
single packfile while we are "plugged" until we get "unplugged".

My understanding of what you are doing in this series is to still
create many little loose object files, but avoid the overhead of
having to fsync them individually.  And I am not sure how well the
original idea behind the bulk-checkin infrastructure, namely avoiding
the overhead of having to create many loose objects by creating a
single packfile (and presumably having to fsync at the end, but that
is just a single .pack file), meshes with your goal of still creating
many loose object files but synching them more efficiently.

Is it just that the new feature is piggybacking on the existing
bulk-checkin infrastructure, even though these two have nothing in
common?
Neeraj Singh March 16, 2022, 7:33 a.m. UTC | #2
On Tue, Mar 15, 2022 at 10:33 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.
> >
> > * Rename 'state' variable to 'bulk_checkin_state', since we will later
> >   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
> >   find in the debugger, since the name is more unique.
> >
> > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
> >   static variable. Doing this avoids resetting the variable in
> >   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
> >   seem to unintentionally disable the plugging functionality the first
> >   time a new packfile must be created due to packfile size limits. While
> >   disabling the plugging state only results in suboptimal behavior for
> >   the current code, it would be fatal for the bulk-fsync functionality
> >   later in this patch series.
>
> Sorry, but I am confused.  The bulk-checkin infrastructure is there
> so that we can send many little objects into a single packfile
> instead of creating many little loose object files.  Everything we
> throw at object-file.c::index_stream() will be concatenated into the
> single packfile while we are "plugged" until we get "unplugged".
>

I noticed that you invented bulk-checkin back in 2011, but I don't
think your description matches what the code actually does.
index_bulk_checkin is only called from index_stream, which is only
called from index_fd.  index_fd goes down the index_bulk_checkin path
for large files (512MB by default).  It looks like the effect of the
'plug/unplug' code is to allow multiple large blobs to go into a
single packfile rather than each one getting its own separate
packfile.
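
Paraphrasing the dispatch in object-file.c (a simplification, not the
verbatim code; the real index_fd() also deals with filters, pipes, and
non-regular files):

int index_fd(struct index_state *istate, struct object_id *oid,
	     int fd, struct stat *st, enum object_type type,
	     const char *path, unsigned flags)
{
	int ret;

	if (type == OBJ_BLOB && S_ISREG(st->st_mode) &&
	    st->st_size > big_file_threshold)
		/*
		 * Blobs over core.bigFileThreshold (512 MiB by default)
		 * are streamed into a packfile via index_bulk_checkin().
		 */
		ret = index_stream(oid, fd, xsize_t(st->st_size), type,
				   path, flags);
	else
		/* everything else is hashed in memory and written loose */
		ret = index_core(istate, oid, fd, xsize_t(st->st_size),
				 type, path, flags);

	close(fd);
	return ret;
}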

> My understanding of what you are doing in this series is to still
> create many little loose object files, but avoid the overhead of
> having to fsync them individually.  And I am not sure how well the
> original idea behind the bulk-checkin infrastructure, namely avoiding
> the overhead of having to create many loose objects by creating a
> single packfile (and presumably having to fsync at the end, but that
> is just a single .pack file), meshes with your goal of still creating
> many loose object files but synching them more efficiently.
>
> Is it just that the new feature is piggybacking on the existing
> bulk-checkin infrastructure, even though these two have nothing in
> common?
>

I think my new usage is congruent with the existing API, which seems
to be about combining multiple add operations into a large transaction,
where we can do some cleanup operations once we're finished. In the
preexisting code, the transaction is about adding a bunch of large objects
to a single pack file (while leaving small objects loose), and then completing
the packfile when the adds are finished.
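
For context, callers bracket a whole batch of adds with the plug/unplug
calls, along these lines (add_many() is a hypothetical helper for
illustration, not the actual builtin code):

static int add_many(struct index_state *istate, const char **paths, int nr)
{
	int i, err = 0;

	plug_bulk_checkin();
	for (i = 0; i < nr; i++)
		/* blobs over core.bigFileThreshold go to the shared temporary packfile */
		err |= !!add_file_to_index(istate, paths[i], 0);
	/* finish (and fsync) that packfile, if one was started */
	unplug_bulk_checkin();

	return err;
}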

---
On a side note, I've also been thinking about how we could use a
packfile approach as an alternative means to achieve faster addition
of many small objects. It's essentially what you stated above, where
we'd send our little objects into a pack file. But to avoid frequent
repacking overhead, we might want to reuse the 'latest' packfile
across multiple Git invocations by appending objects to it, with an
fsync on the file at the end.

We'd need sufficient padding between objects created by different Git
invocations to ensure that previously synced data doesn't get
disturbed by later operations. We'd need to rewrite the pack indexes
each time, but that's at least derived metadata, so it doesn't need
to be fsynced. To make the pack indexes more incrementally-updatable,
we might want to have the fanout table be checksummed, with
checksummed pointers to leaf blocks. If we detect corruption during
an index lookup, we could recreate the index from the packfile.

Essentially the above proposal is to move away from storing loose
objects in the filesystem and instead to index the data within Git
itself.

Thanks,
Neeraj
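
Purely to illustrate the index layout the side note above is getting at
(nothing like this exists in Git today; every name here is made up):

#include <stdint.h>

/* Hypothetical incrementally-updatable pack index. */
struct incr_idx_fanout_entry {
	uint32_t leaf_offset;	/* file offset of the leaf block for this byte */
	uint32_t leaf_crc;	/* checksum of that leaf block */
};

struct incr_idx_header {
	uint32_t fanout_crc;				/* checksum of the fanout table */
	struct incr_idx_fanout_entry fanout[256];	/* one entry per leading OID byte */
};

/*
 * On lookup, verify fanout_crc and the leaf_crc of the block being
 * followed; if either check fails, rebuild the index from the packfile,
 * which stays the only data that must be fsynced.
 */
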
Junio C Hamano March 16, 2022, 4:14 p.m. UTC | #3
Neeraj Singh <nksingh85@gmail.com> writes:

> I think my new usage is congruent with the existing API, which seems
> to be about combining multiple add operations into a large transaction,
> where we can do some cleanup operations once we're finished. In the
> preexisting code, the transaction is about adding a bunch of large objects
> to a single pack file (while leaving small objects loose), and then completing
> the packfile when the adds are finished.

OK, so it was part me, and part a suboptimal presentation, I guess
;-)

Let me rephrase the idea to see if I got it right this time.

The bulk-checkin API has two interesting entry points, "plug" that
signals that we are about to repeat possibly many operations to add
new objects to the object store, and "unplug" that signals that we
are done with such adding.  They are meant to serve as a hint for the
object layer to optimize its operation.

So far the only way the hint was used was that the logic that sends
an overly large object into a packfile (instead of storing it loose,
which leaves it subject to expensive repacking later) can shove more
than one such object into the same packfile.

This series invents another use of the "plug"-"unplug" hint.  By
knowing that many loose object files are created and when the series
of object creation ended, we can avoid having to fsync each and
every one of them on certain filesystems and achieve the same
robustness.  The new "batch" option to core.fsyncmethod triggers
this mechanism.

Did I get it right, more-or-less?

Thanks.
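
Concretely, what the later patches in this series add on top of the
unplug_bulk_checkin() hunk in the patch below looks roughly like this
(a sketch only; 'bulk_fsync_objdir' is the name the commit message
promises to add later, and do_batch_fsync() is an illustrative
placeholder):

void unplug_bulk_checkin(void)
{
	assert(bulk_checkin_plugged);
	bulk_checkin_plugged = 0;
	if (bulk_checkin_state.f)
		finish_bulk_checkin(&bulk_checkin_state);

	/*
	 * Batched fsync: loose objects written while plugged were only
	 * pushed out to the disk cache; one flush here, followed by
	 * moving them from the temporary object directory into place,
	 * makes them all durable at once instead of one fsync each.
	 */
	if (bulk_fsync_objdir)
		do_batch_fsync();	/* illustrative name */
}
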
Neeraj Singh March 16, 2022, 5:59 p.m. UTC | #4
On Wed, Mar 16, 2022 at 9:14 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> > I think my new usage is congruent with the existing API, which seems
> > to be about combining multiple add operations into a large transaction,
> > where we can do some cleanup operations once we're finished. In the
> > preexisting code, the transaction is about adding a bunch of large objects
> > to a single pack file (while leaving small objects loose), and then completing
> > the packfile when the adds are finished.
>
> OK, so it was part me, and part a suboptimal presentation, I guess
> ;-)
>
> Let me rephrase the idea to see if I got it right this time.
>
> The bulk-checkin API has two interesting entry points, "plug" that
> signals that we are about to repeat possibly many operations to add
> new objects to the object store, and "unplug" that signals that we
> are done with such adding.  They are meant to serve as a hint for the
> object layer to optimize its operation.
>
> So far the only way the hint was used was that the logic that sends
> an overly large object into a packfile (instead of storing it loose,
> which leaves it subject to expensive repacking later) can shove more
> than one such object into the same packfile.
>
> This series invents another use of the "plug"-"unplug" hint.  By
> knowing that many loose object files are created and when the series
> of object creation ended, we can avoid having to fsync each and
> every one of them on certain filesystems and achieve the same
> robustness.  The new "batch" option to core.fsyncmethod triggers
> this mechanism.
>
> Did I get it right, more-or-less?

Yes, that's my understanding as well.

Thanks,
Neeraj
Junio C Hamano March 16, 2022, 6:10 p.m. UTC | #5
Neeraj Singh <nksingh85@gmail.com> writes:

>> Did I get it right, more-or-less?
>
> Yes, that's my understanding as well.

I guess what I wrote would make useful material for the early part of
the log message to help future developers.

Thanks.
Neeraj Singh March 16, 2022, 7:50 p.m. UTC | #6
On Wed, Mar 16, 2022 at 11:10 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> >> Did I get it right, more-or-less?
> >
> > Yes, that's my understanding as well.
>
> I guess what I wrote would make useful material for the early part of
> the log message to help future developers.
>
> Thanks.

Will do.  I changed the commit message to explain the current
functionality of bulk-checkin and how it's similar to batched-fsync.

Patch

diff --git a/bulk-checkin.c b/bulk-checkin.c
index e988a388b65..93b1dc5138a 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -10,9 +10,9 @@ 
 #include "packfile.h"
 #include "object-store.h"
 
-static struct bulk_checkin_state {
-	unsigned plugged:1;
+static int bulk_checkin_plugged;
 
+static struct bulk_checkin_state {
 	char *pack_tmp_name;
 	struct hashfile *f;
 	off_t offset;
@@ -21,7 +21,7 @@  static struct bulk_checkin_state {
 	struct pack_idx_entry **written;
 	uint32_t alloc_written;
 	uint32_t nr_written;
-} state;
+} bulk_checkin_state;
 
 static void finish_tmp_packfile(struct strbuf *basename,
 				const char *pack_tmp_name,
@@ -278,21 +278,23 @@  int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags)
 {
-	int status = deflate_to_pack(&state, oid, fd, size, type,
+	int status = deflate_to_pack(&bulk_checkin_state, oid, fd, size, type,
 				     path, flags);
-	if (!state.plugged)
-		finish_bulk_checkin(&state);
+	if (!bulk_checkin_plugged)
+		finish_bulk_checkin(&bulk_checkin_state);
 	return status;
 }
 
 void plug_bulk_checkin(void)
 {
-	state.plugged = 1;
+	assert(!bulk_checkin_plugged);
+	bulk_checkin_plugged = 1;
 }
 
 void unplug_bulk_checkin(void)
 {
-	state.plugged = 0;
-	if (state.f)
-		finish_bulk_checkin(&state);
+	assert(bulk_checkin_plugged);
+	bulk_checkin_plugged = 0;
+	if (bulk_checkin_state.f)
+		finish_bulk_checkin(&bulk_checkin_state);
 }