[v4,4/6] update-index: use the bulk-checkin infrastructure

Message ID	f7f756f3932cdbca587de397598758c685bac29a.1632176111.git.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <f7f756f3932cdbca587de397598758c685bac29a.1632176111.git.gitgitgadget@gmail.com> In-Reply-To: <pull.1076.v4.git.git.1632176111.gitgitgadget@gmail.com> References: <pull.1076.v3.git.git.1631590725.gitgitgadget@gmail.com> <pull.1076.v4.git.git.1632176111.gitgitgadget@gmail.com> Date: Mon, 20 Sep 2021 22:15:09 +0000 Subject: [PATCH v4 4/6] update-index: use the bulk-checkin infrastructure Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Neeraj-Personal <nksingh85@gmail.com>, Johannes Schindelin <Johannes.Schindelin@gmx.de>, Jeff King <peff@peff.net>, Jeff Hostetler <jeffhost@microsoft.com>, Christoph Hellwig <hch@lst.de>, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@gmail.com>, "Randall S. Becker" <rsbecker@nexbridge.com>, Bagas Sanjaya <bagasdotme@gmail.com>, "Neeraj K. Singh" <neerajsi@microsoft.com>, Neeraj Singh <neerajsi@microsoft.com> Precedence: bulk From: Neeraj Singh <neerajsi@microsoft.com>
Series	Implement a batched fsync option for core.fsyncObjectFiles \| expand [v4,0/6] Implement a batched fsync option for core.fsyncObjectFiles [v4,1/6] bulk-checkin: rename 'state' variable and separate 'plugged' boolean [v4,2/6] core.fsyncobjectfiles: batched disk flushes [v4,3/6] core.fsyncobjectfiles: add windows support for batch mode [v4,4/6] update-index: use the bulk-checkin infrastructure [v4,5/6] core.fsyncobjectfiles: tests for batch mode [v4,6/6] core.fsyncobjectfiles: performance tests for add and stash

Message ID

f7f756f3932cdbca587de397598758c685bac29a.1632176111.git.gitgitgadget@gmail.com (mailing list archive)

State

Superseded

Headers

Message-Id: 
 <f7f756f3932cdbca587de397598758c685bac29a.1632176111.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.1076.v4.git.git.1632176111.gitgitgadget@gmail.com>
References: <pull.1076.v3.git.git.1631590725.gitgitgadget@gmail.com>
        <pull.1076.v4.git.git.1632176111.gitgitgadget@gmail.com>
Date: Mon, 20 Sep 2021 22:15:09 +0000
Subject: [PATCH v4 4/6] update-index: use the bulk-checkin infrastructure
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: Neeraj-Personal <nksingh85@gmail.com>,
 Johannes Schindelin <Johannes.Schindelin@gmx.de>, Jeff King <peff@peff.net>,
 Jeff Hostetler <jeffhost@microsoft.com>, Christoph Hellwig <hch@lst.de>,
	=?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason  <avarab@gmail.com>,
 "Randall S. Becker" <rsbecker@nexbridge.com>,
 Bagas Sanjaya <bagasdotme@gmail.com>,
 "Neeraj K. Singh" <neerajsi@microsoft.com>,
 Neeraj Singh <neerajsi@microsoft.com>
Precedence: bulk
From: Neeraj Singh <neerajsi@microsoft.com>

Series

Implement a batched fsync option for core.fsyncObjectFiles | expand

Commit Message

Neeraj Singh (WINDOWS-SFS) Sept. 20, 2021, 10:15 p.m. UTC

From: Neeraj Singh <neerajsi@microsoft.com>

The update-index functionality is used internally by 'git stash push' to
setup the internal stashed commit.

This change enables bulk-checkin for update-index infrastructure to
speed up adding new objects to the object database by leveraging the
pack functionality and the new bulk-fsync functionality. This mode
is enabled when passing paths to update-index via the --stdin flag,
as is done by 'git stash'.

There is some risk with this change, since under batch fsync, the object
files will not be available until the update-index is entirely complete.
This usage is unlikely, since any tool invoking update-index and
expecting to see objects would have to snoop the output of --verbose to
find out when update-index has actually processed a given path.
Additionally the index is locked for the duration of the update.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
---
 builtin/update-index.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Ævar Arnfjörð Bjarmason Sept. 21, 2021, 11:46 p.m. UTC | #1

On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> The update-index functionality is used internally by 'git stash push' to
> setup the internal stashed commit.
>
> This change enables bulk-checkin for update-index infrastructure to
> speed up adding new objects to the object database by leveraging the
> pack functionality and the new bulk-fsync functionality. This mode
> is enabled when passing paths to update-index via the --stdin flag,
> as is done by 'git stash'.
>
> There is some risk with this change, since under batch fsync, the object
> files will not be available until the update-index is entirely complete.
> This usage is unlikely, since any tool invoking update-index and
> expecting to see objects would have to snoop the output of --verbose to
> find out when update-index has actually processed a given path.
> Additionally the index is locked for the duration of the update.

Would you really need to sniff the verbose output? If I'm streaming data
to update-index now it looks like I could assume before that
update-index would have done the work if I managed to fflush() to it,
since it's processing a line at a time and doing the work in that
line-at-a-time loop.

I.e. you could print lines to it, and then do concurrent object lookups
knowing the data was written already...

I think this is probably fine, but that case seems way likelier than
someone sniffing back the verbose output, presumably for the "add" in
update_one(), but that's called in the getline_fn() loop...

All of this makes me wonder why this isn't using tmp-objdir.c, i.e. we
could have our cake and eat it too by writing the "real" objects, and
then just renaming them between directories instead. But perhaps the
answer has something to do with the metadata issues I raised.

And well, tmp-objdir.c isn't going to help someone in practice that's
relying on this "update-index --stdin" behavior, as they won't know
where we staged the temporary files...

Neeraj Singh Sept. 22, 2021, 1:27 a.m. UTC | #2

On Tue, Sep 21, 2021 at 4:53 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > The update-index functionality is used internally by 'git stash push' to
> > setup the internal stashed commit.
> >
> > This change enables bulk-checkin for update-index infrastructure to
> > speed up adding new objects to the object database by leveraging the
> > pack functionality and the new bulk-fsync functionality. This mode
> > is enabled when passing paths to update-index via the --stdin flag,
> > as is done by 'git stash'.
> >
> > There is some risk with this change, since under batch fsync, the object
> > files will not be available until the update-index is entirely complete.
> > This usage is unlikely, since any tool invoking update-index and
> > expecting to see objects would have to snoop the output of --verbose to
> > find out when update-index has actually processed a given path.
> > Additionally the index is locked for the duration of the update.
>
> Would you really need to sniff the verbose output? If I'm streaming data
> to update-index now it looks like I could assume before that
> update-index would have done the work if I managed to fflush() to it,
> since it's processing a line at a time and doing the work in that
> line-at-a-time loop.
>
> I.e. you could print lines to it, and then do concurrent object lookups
> knowing the data was written already...
>
> I think this is probably fine, but that case seems way likelier than
> someone sniffing back the verbose output, presumably for the "add" in
> update_one(), but that's called in the getline_fn() loop...

Does fflush really guarantee that the reader has picked up the input from
a pipe across all environments?  Even if a reader picks up the input, does
that mean that the reader is done processing it?

Do you think I really need to revise this comment? Maybe leave a terser,
'this usage is thought to be unlikely'?

>
> All of this makes me wonder why this isn't using tmp-objdir.c, i.e. we
> could have our cake and eat it too by writing the "real" objects, and
> then just renaming them between directories instead. But perhaps the
> answer has something to do with the metadata issues I raised.
>
> And well, tmp-objdir.c isn't going to help someone in practice that's
> relying on this "update-index --stdin" behavior, as they won't know
> where we staged the temporary files...
>

One motivation of the current design behind renaming the files is that
some networked filesystems don't seem to like cross-directory renames
much.  It also so happens that ReFS on Windows also prefers renames to
stay within the directory. Actually any filesystem would likely be
slightly faster,
since fewer objects are being modified (one dir versus two).

Neeraj Singh Sept. 23, 2021, 10:32 p.m. UTC | #3

On Tue, Sep 21, 2021 at 6:27 PM Neeraj Singh <nksingh85@gmail.com> wrote:
>
> On Tue, Sep 21, 2021 at 4:53 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
> >
> > All of this makes me wonder why this isn't using tmp-objdir.c, i.e. we
> > could have our cake and eat it too by writing the "real" objects, and
> > then just renaming them between directories instead. But perhaps the
> > answer has something to do with the metadata issues I raised.
> >
> > And well, tmp-objdir.c isn't going to help someone in practice that's
> > relying on this "update-index --stdin" behavior, as they won't know
> > where we staged the temporary files...
> >
>
> One motivation of the current design behind renaming the files is that
> some networked filesystems don't seem to like cross-directory renames
> much.  It also so happens that ReFS on Windows also prefers renames to
> stay within the directory. Actually any filesystem would likely be
> slightly faster,
> since fewer objects are being modified (one dir versus two).

Whelp, as part of v5 I tried to make unpack-objects.c use the batch fsync
mode and now I see a strong reason to take your tmp-objdir suggestion. As
part of OBJ_REF_DELTA unpacking, we need access to the object while
we're in the plugged state. I didn't notice this at first, but got
lucky that I tested
that case first and hit an error.

V5 will create a tmp-objdir and add a new interface to install it as the primary
objdir.

diff --git a/builtin/update-index.c b/builtin/update-index.c
index 187203e8bb5..b0689f2cdf6 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -5,6 +5,7 @@ 
  */
 #define USE_THE_INDEX_COMPATIBILITY_MACROS
 #include "cache.h"
+#include "bulk-checkin.h"
 #include "config.h"
 #include "lockfile.h"
 #include "quote.h"
@@ -1150,6 +1151,7 @@  int cmd_update_index(int argc, const char **argv, const char *prefix)
 		struct strbuf unquoted = STRBUF_INIT;
 
 		setup_work_tree();
+		plug_bulk_checkin();
 		while (getline_fn(&buf, stdin) != EOF) {
 			char *p;
 			if (!nul_term_line && buf.buf[0] == '"') {
@@ -1164,6 +1166,7 @@  int cmd_update_index(int argc, const char **argv, const char *prefix)
 				chmod_path(set_executable_bit, p);
 			free(p);
 		}
+		unplug_bulk_checkin(&lock_file);
 		strbuf_release(&unquoted);
 		strbuf_release(&buf);
 	}

[v4,4/6] update-index: use the bulk-checkin infrastructure

Commit Message

Comments

Patch