Message ID: 878r918ps3.fsf@gmail.froward.int.ebiederm.org
State: Superseded
Series: [v2] bulk-checkin: only support blobs in index_bulk_checkin

"Eric W. Biederman" <ebiederm@gmail.com> writes:

> As the code is written today index_bulk_checkin only accepts blobs.
> Remove the enum object_type parameter and rename index_bulk_checkin to
> index_blob_bulk_checkin, index_stream to index_blob_stream,
> deflate_to_pack to deflate_blob_to_pack, stream_to_pack to
> stream_blobk_to_pack, to make this explicit.
>
> Not supporting commits, tags, or trees has no downside as it is not
> currently supported now, and commits, tags, and trees being smaller by
> design do not have the problem that the problem that index_bulk_checkin
> was built to solve.

Exactly.  The streaming was primarily to help deal with huge blobs
that cannot be held in-core.  Of course other parts of the system
(like comparing them) would require holding them in-core, so some
things may not work for them, but at least it is a start to be able
to _hash_ them to store them in the object store and to give them
names.

> What is more this is very desiable from the context of the hash function
> transition.

A bit hard to parse; perhaps want a comma before "this"?

> For blob objects it is straight forward to compute multiple hash
> functions during index_bulk_checkin as the object header and content of
> a blob is the same no matter which hash function is being used to
> compute the oid of a blob.

OK.

> For commits, tress, and tags the object header and content that need to
> be hashed ard different for different hashes.  Even worse the object
> header can not be known until the size of the content that needs to be
> hashed is known.  The size of the content that needs to be hashed can
> not be known until a complete pass is made through all of the variable
> length entries of the original object.

"tress" -> "trees".  Also a comma after "worse".

> As far as I can tell this extra pass defeats most of the purpose of
> streaming, and it is much easier to implement with in memory buffers.

The purpose of streaming being the ability to hash and compute the
object name without having to hold the entirety of the object, I am
not sure the above is a good argument.  You can run multiple passes
by streaming the same data twice if you needed to, and however much
easier the implementation may become if you can assume that you can
hold everything in-core, what you cannot fit in-core would not fit
in-core, so ...

> So if it is needed to write commits, trees, and tags directly to pack
> files writing a separate function to do the would be needed.

But I am OK with this conclusion.  As the ways to compute the
fallback hashes for different types of objects are very different
(compared to a single-hash world where, as long as you come up with
a serialization, you have only a single way to hash and name the
object), we would end up having separate helper functions per target
type anyway, even if we kept a single entry point function like
index_stream().  The single entry point function would only be used
to dispatch to the type-specific ones, so renaming what we have
today and making it clear they are for "blobs" does make sense.
Junio C Hamano <gitster@pobox.com> writes:

> "Eric W. Biederman" <ebiederm@gmail.com> writes:
>
>> As far as I can tell this extra pass defeats most of the purpose of
>> streaming, and it is much easier to implement with in memory buffers.
>
> The purpose of streaming being the ability to hash and compute the
> object name without having to hold the entirety of the object, I am
> not sure the above is a good argument.  You can run multiple passes
> by streaming the same data twice if you needed to, and however much
> easier the implementation may become if you can assume that you can
> hold everything in-core, what you cannot fit in-core would not fit
> in-core, so ...

Yes, this wording needs to be clarified.

If streaming to handle objects that don't fit in memory is the
purpose, I agree there are slow multi-pass ways to deal with trees,
commits, and tags.

If writing directly to the pack is the purpose, using an in-core
buffer for trees, commits, and tags is better.

I will put the wording on the back burner and see what I come up
with.

>> So if it is needed to write commits, trees, and tags directly to pack
>> files writing a separate function to do the would be needed.
>
> But I am OK with this conclusion.  As the ways to compute the
> fallback hashes for different types of objects are very different
> (compared to a single-hash world where, as long as you come up with
> a serialization, you have only a single way to hash and name the
> object), we would end up having separate helper functions per target
> type anyway, even if we kept a single entry point function like
> index_stream().  The single entry point function would only be used
> to dispatch to the type-specific ones, so renaming what we have
> today and making it clear they are for "blobs" does make sense.

Good.  I am glad I am able to step back and successfully explain the
whys of things.

Eric
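(Illustration only, not part of the thread or the patch: the streaming point discussed above, that a blob can be hashed and given its object name without ever being held in-core, can be shown with a small standalone sketch. The loose-object header for a blob is just "blob <size>\0" and the size is known up front from fstat(), so the content can be fed to the hash in fixed-size chunks. OpenSSL's legacy SHA1_* interface stands in for git's own hash machinery purely for brevity, and hash_blob_streaming is a made-up helper name; build with -lcrypto. For a blob in a SHA-1 repository the printed digest should match what git hash-object prints for the same file.)

/*
 * Standalone sketch: hash a file as a git blob in fixed-size chunks,
 * never holding the whole file in-core.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <openssl/sha.h>

static int hash_blob_streaming(FILE *fp, size_t size,
			       unsigned char out[SHA_DIGEST_LENGTH])
{
	char hdr[32];
	unsigned char buf[16384];	/* same chunk size bulk-checkin uses */
	size_t n;
	SHA_CTX ctx;
	int hdrlen = snprintf(hdr, sizeof(hdr), "blob %zu", size) + 1; /* include the NUL */

	SHA1_Init(&ctx);
	SHA1_Update(&ctx, hdr, hdrlen);		/* header is known before any content is read */
	while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
		SHA1_Update(&ctx, buf, n);	/* one chunk at a time */
	if (ferror(fp))
		return -1;
	SHA1_Final(out, &ctx);
	return 0;
}

int main(int argc, char **argv)
{
	struct stat st;
	unsigned char oid[SHA_DIGEST_LENGTH];
	FILE *fp;
	int i;

	if (argc != 2 || stat(argv[1], &st) || !(fp = fopen(argv[1], "rb")))
		return 1;
	if (hash_blob_streaming(fp, (size_t)st.st_size, oid))
		return 1;
	for (i = 0; i < SHA_DIGEST_LENGTH; i++)
		printf("%02x", oid[i]);
	printf("\n");
	fclose(fp);
	return 0;
}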
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 73bff3a23d27..223562b4e748 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -155,10 +155,10 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
  * status before calling us just in case we ask it to call us again
  * with a new pack.
  */
-static int stream_to_pack(struct bulk_checkin_packfile *state,
-			  git_hash_ctx *ctx, off_t *already_hashed_to,
-			  int fd, size_t size, enum object_type type,
-			  const char *path, unsigned flags)
+static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
+			       git_hash_ctx *ctx, off_t *already_hashed_to,
+			       int fd, size_t size, const char *path,
+			       unsigned flags)
 {
 	git_zstream s;
 	unsigned char ibuf[16384];
@@ -170,7 +170,7 @@ static int stream_to_pack(struct bulk_checkin_packfile *state,
 
 	git_deflate_init(&s, pack_compression_level);
 
-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size);
+	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB, size);
 	s.next_out = obuf + hdrlen;
 	s.avail_out = sizeof(obuf) - hdrlen;
 
@@ -247,11 +247,10 @@ static void prepare_to_stream(struct bulk_checkin_packfile *state,
 		die_errno("unable to write pack header");
 }
 
-static int deflate_to_pack(struct bulk_checkin_packfile *state,
-			   struct object_id *result_oid,
-			   int fd, size_t size,
-			   enum object_type type, const char *path,
-			   unsigned flags)
+static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
+				struct object_id *result_oid,
+				int fd, size_t size,
+				const char *path, unsigned flags)
 {
 	off_t seekback, already_hashed_to;
 	git_hash_ctx ctx;
@@ -265,7 +264,7 @@ static int deflate_to_pack(struct bulk_checkin_packfile *state,
 		return error("cannot find the current offset");
 
 	header_len = format_object_header((char *)obuf, sizeof(obuf),
-					  type, size);
+					  OBJ_BLOB, size);
 
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
@@ -282,8 +281,8 @@ static int deflate_to_pack(struct bulk_checkin_packfile *state,
 			idx->offset = state->offset;
 			crc32_begin(state->f);
 		}
-		if (!stream_to_pack(state, &ctx, &already_hashed_to,
-				    fd, size, type, path, flags))
+		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
+					 fd, size, path, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
@@ -350,12 +349,12 @@ void fsync_loose_object_bulk_checkin(int fd, const char *filename)
 	}
 }
 
-int index_bulk_checkin(struct object_id *oid,
-		       int fd, size_t size, enum object_type type,
-		       const char *path, unsigned flags)
+int index_blob_bulk_checkin(struct object_id *oid,
+			    int fd, size_t size,
+			    const char *path, unsigned flags)
 {
-	int status = deflate_to_pack(&bulk_checkin_packfile, oid, fd, size, type,
-				     path, flags);
+	int status = deflate_blob_to_pack(&bulk_checkin_packfile, oid, fd, size,
+					  path, flags);
 	if (!odb_transaction_nesting)
 		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
 	return status;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index 48fe9a6e9171..aa7286a7b3e1 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -9,9 +9,9 @@ void prepare_loose_object_bulk_checkin(void);
 
 void fsync_loose_object_bulk_checkin(int fd, const char *filename);
 
-int index_bulk_checkin(struct object_id *oid,
-		       int fd, size_t size, enum object_type type,
-		       const char *path, unsigned flags);
+int index_blob_bulk_checkin(struct object_id *oid,
+			    int fd, size_t size,
+			    const char *path, unsigned flags);
 
 /*
  * Tell the object database to optimize for adding
diff --git a/object-file.c b/object-file.c
index 7dc0c4bfbba8..7c7afe579364 100644
--- a/object-file.c
+++ b/object-file.c
@@ -2446,11 +2446,11 @@ static int index_core(struct index_state *istate,
  * binary blobs, they generally do not want to get any conversion, and
  * callers should avoid this code path when filters are requested.
  */
-static int index_stream(struct object_id *oid, int fd, size_t size,
-			enum object_type type, const char *path,
-			unsigned flags)
+static int index_blob_stream(struct object_id *oid, int fd, size_t size,
+			     const char *path,
+			     unsigned flags)
 {
-	return index_bulk_checkin(oid, fd, size, type, path, flags);
+	return index_blob_bulk_checkin(oid, fd, size, path, flags);
 }
 
 int index_fd(struct index_state *istate, struct object_id *oid,
@@ -2472,8 +2472,8 @@ int index_fd(struct index_state *istate, struct object_id *oid,
 		ret = index_core(istate, oid, fd, xsize_t(st->st_size), type, path,
 				 flags);
 	else
-		ret = index_stream(oid, fd, xsize_t(st->st_size), type, path,
-				   flags);
+		ret = index_blob_stream(oid, fd, xsize_t(st->st_size), path,
+					flags);
 	close(fd);
 	return ret;
 }
As the code is written today, index_bulk_checkin only accepts blobs.
Remove the enum object_type parameter and rename index_bulk_checkin to
index_blob_bulk_checkin, index_stream to index_blob_stream,
deflate_to_pack to deflate_blob_to_pack, and stream_to_pack to
stream_blob_to_pack, to make this explicit.

Not supporting commits, tags, or trees has no downside, as they are not
currently supported, and commits, tags, and trees, being smaller by
design, do not have the problem that index_bulk_checkin was built to
solve.

What is more, this is very desirable in the context of the hash
function transition.  For blob objects it is straightforward to compute
multiple hash functions during index_bulk_checkin, as the object header
and content of a blob are the same no matter which hash function is
being used to compute the oid of the blob.

For commits, trees, and tags the object header and content that need to
be hashed are different for different hashes.  Even worse, the object
header cannot be known until the size of the content that needs to be
hashed is known.  The size of the content that needs to be hashed
cannot be known until a complete pass is made through all of the
variable-length entries of the original object.

As far as I can tell this extra pass defeats most of the purpose of
streaming, and it is much easier to implement with in-memory buffers.
So if commits, trees, and tags ever need to be written directly to pack
files, a separate function to do that would be needed.

So let's just simplify the code base for now, simplify the development
needed for the hash function transition, and only support blobs with
the existing bulk_checkin code.

Inspired-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 bulk-checkin.c | 35 +++++++++++++++++------------------
 bulk-checkin.h |  6 +++---
 object-file.c  | 12 ++++++------
 3 files changed, 26 insertions(+), 27 deletions(-)

This is just a v2 of the description that addresses Junio's
capitalization concern and hopefully makes the justification clear to
other people.

I am sending it now mostly because the original version did not land on
the mailing list for some reason, so I have switched which email
account I am using for now.
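(Illustration only, not part of the patch: a standalone sketch of the argument above. The object header that gets hashed is "<type> <content-size>\0". A blob's content is the same byte stream under SHA-1 and SHA-256, so its header, and everything fed to the hash, is identical for both. A tree entry ends with the raw oid bytes of what it points at, 20 bytes for SHA-1 and 32 for SHA-256, so a tree's content size, and therefore its header, cannot be known until every variable-length entry has been walked for the hash in question. The sample entries and the tree_content_size helper below are made up for the example, not git code.)

/*
 * Sketch: why the blob header is hash-agnostic but a tree header is
 * not.  A tree entry is "<mode> <name>\0" followed by the raw oid of
 * the entry, so the tree's content size (and hence its header)
 * depends on the oid length of the hash in use.
 */
#include <stdio.h>
#include <string.h>

struct entry {
	const char *mode;	/* e.g. "100644" for a file, "40000" for a subtree */
	const char *name;
};

/* hypothetical helper: content size of a tree with the given entries */
static size_t tree_content_size(const struct entry *e, size_t n, size_t oid_rawsz)
{
	size_t i, total = 0;
	for (i = 0; i < n; i++)
		total += strlen(e[i].mode) + 1 + strlen(e[i].name) + 1 + oid_rawsz;
	return total;
}

int main(void)
{
	const struct entry entries[] = {	/* made-up tree contents */
		{ "100644", "Makefile" },
		{ "100644", "bulk-checkin.c" },
		{ "40000",  "Documentation" },
	};
	size_t n = sizeof(entries) / sizeof(entries[0]);
	size_t blob_size = 123456;		/* made-up blob; any content, any hash */
	char hdr[64];

	snprintf(hdr, sizeof(hdr), "blob %zu", blob_size);
	printf("blob header, any hash:     \"%s\"\n", hdr);

	snprintf(hdr, sizeof(hdr), "tree %zu", tree_content_size(entries, n, 20));
	printf("tree header under SHA-1:   \"%s\"\n", hdr);

	snprintf(hdr, sizeof(hdr), "tree %zu", tree_content_size(entries, n, 32));
	printf("tree header under SHA-256: \"%s\"\n", hdr);
	return 0;
}

For these made-up entries the blob header is the same under either hash, while the tree headers come out as "tree 118" versus "tree 154"; producing the latter requires the extra pass over the entries that the commit message wants to keep out of the streaming path.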