diff mbox series

[v7,1/9] object-file.c: do not rename in a temp odb

Message ID 6e65f68fd6d4d90b0a7bca2e2e57ace9ad749266.1632871971.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Implement a batched fsync option for core.fsyncObjectFiles | expand

Commit Message

Neeraj Singh (WINDOWS-SFS) Sept. 28, 2021, 11:32 p.m. UTC
From: Neeraj Singh <neerajsi@microsoft.com>

If a temporary ODB is active, as determined by GIT_QUARANTINE_PATH
being set, create object files with their final names. This avoids
an extra rename beyond what is needed to merge the temporary ODB in
tmp_objdir_migrate.

Creating an object file with the expected final name should be okay
since the git process writing to the temporary object store is the
only writer, and it only invokes write_loose_object/create_object_file
after checking that the object doesn't exist.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
---
 environment.c  |  4 ++++
 object-file.c  | 51 ++++++++++++++++++++++++++++++++++----------------
 object-store.h |  6 ++++++
 repository.c   |  2 ++
 repository.h   |  1 +
 5 files changed, 48 insertions(+), 16 deletions(-)

Comments

Jeff King Sept. 28, 2021, 11:55 p.m. UTC | #1
On Tue, Sep 28, 2021 at 11:32:43PM +0000, Neeraj Singh via GitGitGadget wrote:

> If a temporary ODB is active, as determined by GIT_QUARANTINE_PATH
> being set, create object files with their final names. This avoids
> an extra rename beyond what is needed to merge the temporary ODB in
> tmp_objdir_migrate.

What's our goal here? Is it the performance of avoiding the extra
rename()? Or do we benefit from the simplicity of avoiding it?

If the former, do we have measurements on how much this matters?

If the latter, what does the simplicity buy us? I thought maybe it would
make reasoning about fsync() easier, because we don't have to worry
about fsyncing the rename. But we'd eventually have to rename() into the
real object directory anyway.

The reason I want to push back is...

> Creating an object file with the expected final name should be okay
> since the git process writing to the temporary object store is the
> only writer, and it only invokes write_loose_object/create_object_file
> after checking that the object doesn't exist.

...this seems like a kind-of dangerous assumption. Most of the time,
yeah, I'd expect just a single process to be writing. But one of the
things that happens during the receive-pack quarantine is that we run
hooks, which can run any set of arbitrary Git commands, including
simultaneous readers and writers. It seems like we might be introducing
subtle races there.

-Peff
Neeraj Singh Sept. 29, 2021, 12:10 a.m. UTC | #2
On Tue, Sep 28, 2021 at 4:55 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Sep 28, 2021 at 11:32:43PM +0000, Neeraj Singh via GitGitGadget wrote:
>
> > If a temporary ODB is active, as determined by GIT_QUARANTINE_PATH
> > being set, create object files with their final names. This avoids
> > an extra rename beyond what is needed to merge the temporary ODB in
> > tmp_objdir_migrate.
>
> What's our goal here? Is it the performance of avoiding the extra
> rename()? Or do we benefit from the simplicity of avoiding it?
>
> If the former, do we have measurements on how much this matters?
>
> If the latter, what does the simplicity buy us? I thought maybe it would
> make reasoning about fsync() easier, because we don't have to worry
> about fsyncing the rename. But we'd eventually have to rename() into the
> real object directory anyway.
>
> The reason I want to push back is...
>
> > Creating an object file with the expected final name should be okay
> > since the git process writing to the temporary object store is the
> > only writer, and it only invokes write_loose_object/create_object_file
> > after checking that the object doesn't exist.
>
> ...this seems like a kind-of dangerous assumption. Most of the time,
> yeah, I'd expect just a single process to be writing. But one of the
> things that happens during the receive-pack quarantine is that we run
> hooks, which can run any set of arbitrary Git commands, including
> simultaneous readers and writers. It seems like we might be introducing
> subtle races there.
>
> -Peff

Yes, the main goal was to avoid an extra rename. I see your concern
and I guess we have no way of knowing if someone is really going to
get bitten by this or not. On the other hand, we do know in the case
of batch_fsync that only Git is running against the objdir at that
time.  I'll remove this change since it's not where the real perf
benefit is.

Thanks,
Neeraj
diff mbox series

Patch

diff --git a/environment.c b/environment.c
index b4ba4fa22db..30fca67e6d6 100644
--- a/environment.c
+++ b/environment.c
@@ -176,6 +176,10 @@  void setup_git_env(const char *git_dir)
 	args.graft_file = getenv_safe(&to_free, GRAFT_ENVIRONMENT);
 	args.index_file = getenv_safe(&to_free, INDEX_ENVIRONMENT);
 	args.alternate_db = getenv_safe(&to_free, ALTERNATE_DB_ENVIRONMENT);
+	if (getenv(GIT_QUARANTINE_ENVIRONMENT)) {
+		args.object_dir_is_temp = 1;
+	}
+
 	repo_set_gitdir(the_repository, git_dir, &args);
 	strvec_clear(&to_free);
 
diff --git a/object-file.c b/object-file.c
index be4f94ecf3b..49c53f801f7 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1826,12 +1826,17 @@  static void write_object_file_prepare(const struct git_hash_algo *algo,
 }
 
 /*
- * Move the just written object into its final resting place.
+ * Move the just written object into its final resting place,
+ * unless it is already there, as indicated by an empty string for
+ * tmpfile.
  */
 int finalize_object_file(const char *tmpfile, const char *filename)
 {
 	int ret = 0;
 
+	if (!*tmpfile)
+		goto out;
+
 	if (object_creation_mode == OBJECT_CREATION_USES_RENAMES)
 		goto try_rename;
 	else if (link(tmpfile, filename))
@@ -1904,21 +1909,37 @@  static inline int directory_size(const char *filename)
 }
 
 /*
- * This creates a temporary file in the same directory as the final
- * 'filename'
+ * This creates a loose object file for the specified object id.
+ * If we're working in a temporary object directory, the file is
+ * created with its final filename, otherwise it is created with
+ * a temporary name and renamed by finalize_object_file.
+ * If no rename is required, an empty string is returned in tmp.
  *
  * We want to avoid cross-directory filename renames, because those
  * can have problems on various filesystems (FAT, NFS, Coda).
  */
-static int create_tmpfile(struct strbuf *tmp, const char *filename)
+static int create_objfile(const struct object_id *oid, struct strbuf *tmp,
+			  struct strbuf *filename)
 {
-	int fd, dirlen = directory_size(filename);
+	int fd, dirlen, is_retrying = 0;
+	const char *object_name;
+	static const int object_mode = 0444;
 
+	loose_object_path(the_repository, filename, oid);
+	dirlen = directory_size(filename->buf);
+
+retry_create:
 	strbuf_reset(tmp);
-	strbuf_add(tmp, filename, dirlen);
-	strbuf_addstr(tmp, "tmp_obj_XXXXXX");
-	fd = git_mkstemp_mode(tmp->buf, 0444);
-	if (fd < 0 && dirlen && errno == ENOENT) {
+	if (!the_repository->objects->odb->is_temp) {
+		strbuf_add(tmp, filename->buf, dirlen);
+		object_name = "tmp_obj_XXXXXX";
+		strbuf_addstr(tmp, object_name);
+		fd = git_mkstemp_mode(tmp->buf, object_mode);
+	} else {
+		fd = open(filename->buf, O_CREAT | O_EXCL | O_RDWR, object_mode);
+	}
+
+	if (fd < 0 && dirlen && errno == ENOENT && !is_retrying) {
 		/*
 		 * Make sure the directory exists; note that the contents
 		 * of the buffer are undefined after mkstemp returns an
@@ -1926,15 +1947,15 @@  static int create_tmpfile(struct strbuf *tmp, const char *filename)
 		 * scratch.
 		 */
 		strbuf_reset(tmp);
-		strbuf_add(tmp, filename, dirlen - 1);
+		strbuf_add(tmp, filename->buf, dirlen - 1);
 		if (mkdir(tmp->buf, 0777) && errno != EEXIST)
 			return -1;
 		if (adjust_shared_perm(tmp->buf))
 			return -1;
 
 		/* Try again */
-		strbuf_addstr(tmp, "/tmp_obj_XXXXXX");
-		fd = git_mkstemp_mode(tmp->buf, 0444);
+		is_retrying = 1;
+		goto retry_create;
 	}
 	return fd;
 }
@@ -1951,14 +1972,12 @@  static int write_loose_object(const struct object_id *oid, char *hdr,
 	static struct strbuf tmp_file = STRBUF_INIT;
 	static struct strbuf filename = STRBUF_INIT;
 
-	loose_object_path(the_repository, &filename, oid);
-
-	fd = create_tmpfile(&tmp_file, filename.buf);
+	fd = create_objfile(oid, &tmp_file, &filename);
 	if (fd < 0) {
 		if (errno == EACCES)
 			return error(_("insufficient permission for adding an object to repository database %s"), get_object_directory());
 		else
-			return error_errno(_("unable to create temporary file"));
+			return error_errno(_("unable to create object file"));
 	}
 
 	/* Set it up */
diff --git a/object-store.h b/object-store.h
index c5130d8baea..551639f173d 100644
--- a/object-store.h
+++ b/object-store.h
@@ -27,6 +27,12 @@  struct object_directory {
 	uint32_t loose_objects_subdir_seen[8]; /* 256 bits */
 	struct oidtree *loose_objects_cache;
 
+	/*
+	 * This is a temporary object store, so there is no need to
+	 * create new objects via rename.
+	 */
+	int is_temp;
+
 	/*
 	 * Path to the alternative object store. If this is a relative path,
 	 * it is relative to the current working directory.
diff --git a/repository.c b/repository.c
index 710a3b4bf87..75966153b75 100644
--- a/repository.c
+++ b/repository.c
@@ -80,6 +80,8 @@  void repo_set_gitdir(struct repository *repo,
 	expand_base_dir(&repo->objects->odb->path, o->object_dir,
 			repo->commondir, "objects");
 
+	repo->objects->odb->is_temp = o->object_dir_is_temp;
+
 	free(repo->objects->alternate_db);
 	repo->objects->alternate_db = xstrdup_or_null(o->alternate_db);
 	expand_base_dir(&repo->graft_file, o->graft_file,
diff --git a/repository.h b/repository.h
index 3740c93bc0f..d3711367a6f 100644
--- a/repository.h
+++ b/repository.h
@@ -162,6 +162,7 @@  struct set_gitdir_args {
 	const char *graft_file;
 	const char *index_file;
 	const char *alternate_db;
+	int object_dir_is_temp;
 };
 
 void repo_set_gitdir(struct repository *repo, const char *root,