[v2,02/17] chunk-format: create chunk format write API

Message ID	814512f216719d89f41822753d5c71df5e49385d.1611759716.git.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <814512f216719d89f41822753d5c71df5e49385d.1611759716.git.gitgitgadget@gmail.com> In-Reply-To: <pull.848.v2.git.1611759716.gitgitgadget@gmail.com> References: <pull.848.git.1611676886.gitgitgadget@gmail.com> <pull.848.v2.git.1611759716.gitgitgadget@gmail.com> Date: Wed, 27 Jan 2021 15:01:41 +0000 Subject: [PATCH v2 02/17] chunk-format: create chunk format write API Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, gitster@pobox.com, l.s.r@web.de, szeder.dev@gmail.com, Chris Torek <chris.torek@gmail.com>, Derrick Stolee <stolee@gmail.com>, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <dstolee@microsoft.com> Precedence: bulk From: Derrick Stolee <dstolee@microsoft.com>
Series	Refactor chunk-format into an API \| expand [v2,00/17] Refactor chunk-format into an API [v2,01/17] commit-graph: anonymize data in chunk_write_fn [v2,02/17] chunk-format: create chunk format write API [v2,03/17] commit-graph: use chunk-format write API [v2,04/17] midx: rename pack_info to write_midx_context [v2,05/17] midx: use context in write_midx_pack_names() [v2,06/17] midx: add entries to write_midx_context [v2,07/17] midx: add pack_perm to write_midx_context [v2,08/17] midx: add num_large_offsets to write_midx_context [v2,09/17] midx: return success/failure in chunk write methods [v2,10/17] midx: drop chunk progress during write [v2,11/17] midx: use chunk-format API in write_midx_internal() [v2,12/17] chunk-format: create read chunk API [v2,13/17] commit-graph: use chunk-format read API [v2,14/17] midx: use chunk-format read API [v2,15/17] midx: use 64-bit multiplication for chunk sizes [v2,16/17] chunk-format: restore duplicate chunk checks [v2,17/17] chunk-format: add technical docs

Message ID

814512f216719d89f41822753d5c71df5e49385d.1611759716.git.gitgitgadget@gmail.com (mailing list archive)

State

Superseded

Headers

Message-Id: 
 <814512f216719d89f41822753d5c71df5e49385d.1611759716.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.848.v2.git.1611759716.gitgitgadget@gmail.com>
References: <pull.848.git.1611676886.gitgitgadget@gmail.com>
        <pull.848.v2.git.1611759716.gitgitgadget@gmail.com>
Date: Wed, 27 Jan 2021 15:01:41 +0000
Subject: [PATCH v2 02/17] chunk-format: create chunk format write API
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: me@ttaylorr.com, gitster@pobox.com, l.s.r@web.de,
        szeder.dev@gmail.com, Chris Torek <chris.torek@gmail.com>,
        Derrick Stolee <stolee@gmail.com>,
        Derrick Stolee <derrickstolee@github.com>,
        Derrick Stolee <dstolee@microsoft.com>
Precedence: bulk
From: Derrick Stolee <dstolee@microsoft.com>

Series

Refactor chunk-format into an API | expand

Commit Message

Derrick Stolee Jan. 27, 2021, 3:01 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of combining the logic from the commit-graph and
multi-pack-index file formats, create a new chunk-format API. Use a
'struct chunkfile' pointer to keep track of data that has been
registered for writes. This struct is anonymous outside of
chunk-format.c to ensure no user attempts to interfere with the data.

The next change will use this API in commit-graph.c, but the general
approach is:

 1. initialize the chunkfile with init_chunkfile(f).
 2. add chunks in the intended writing order with add_chunk().
 3. write any header information to the hashfile f.
 4. write the chunkfile data using write_chunkfile().
 5. free the chunkfile struct using free_chunkfile().

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |  1 +
 chunk-format.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 20 +++++++++++
 3 files changed, 112 insertions(+)
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h

Comments

Junio C Hamano Feb. 4, 2021, 9:24 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +/*
> + * When writing a chunk-based file format, collect the chunks in
> + * an array of chunk_info structs. The size stores the _expected_
> + * amount of data that will be written by write_fn.
> + */
> +struct chunk_info {
> +	uint32_t id;
> +	uint64_t size;
> +	chunk_write_fn write_fn;
> +};
> +...
> +void add_chunk(struct chunkfile *cf,
> +	       uint64_t id,
> +	       chunk_write_fn fn,
> +	       size_t size)
> +{
> +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
> +
> +	cf->chunks[cf->chunks_nr].id = id;
> +	cf->chunks[cf->chunks_nr].write_fn = fn;
> +	cf->chunks[cf->chunks_nr].size = size;
> +	cf->chunks_nr++;
> +}

Somebody somewhere between the caller in the higher part of the
callchain (that has to work with platform native types) and the
on-disk format at the bottom of the callchain (that has to work
with fixed size data fields) needs to make sure that the size that
the higher level caller has fits on-disk data structure we define,
and the data we read from disk fits the in-core structure our caller
use on the reading side.

If there is a function at the one level closer to the disk than
"struct chunk_info" that takes a "struct chunk_info" and writes the
id and size to disk (and fills "struct chunk_info" from what is read
from the disk, on the reading side), it would be a good place to do
the size_t to uint64_t conversion.

It is OK to do the conversion-while-checking in add_chunk(), too.

But a silent type casting from size_t to uint64_t done silently by
assignment bothers me.  Also, I think you meant to make the incoming
ID uint32_t; am I missing something, or did nobody notice it in the
review of the previous round?

> +int write_chunkfile(struct chunkfile *cf, void *data)
> +{
> +	int i;
> +	size_t cur_offset = cf->f->offset + cf->f->total;

That ought to be off_t, as it is a seek position inside a file
(struct hashfile.total is already off_t).

Use csum-file.h::hashfile_total() instead, perhaps?  .offset member
is an implementation detail of the hashfile API (i.e. some leftover
bits are kept in in-core buffer, until we accumulate enough to make
it worth flushing to the disk), and by using the helper, this code
does not have to know about it.

> +	/* Add the table of contents to the current offset */
> +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;

Is that 12 == sizeof(chunk_info.id) + sizeof(chunk_info.size)?
If so, this makes sense.

> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		hashwrite_be32(cf->f, cf->chunks[i].id);
> +		hashwrite_be64(cf->f, cur_offset);
> +
> +		cur_offset += cf->chunks[i].size;
> +	}
> +
> +	/* Trailing entry marks the end of the chunks */
> +	hashwrite_be32(cf->f, 0);
> +	hashwrite_be64(cf->f, cur_offset);

OK.  This helper does not tell us anything about what comes in the
on-disk file before this point, but we write a table of contents
that says "chunk with this ID has this size, chunk with that ID has
that size, ...", concluded by something that looks like another
entry with chunk ID 0 that records the current offset as its size.

> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		uint64_t start_offset = cf->f->total + cf->f->offset;

Likewise about the type and use of hashfile_total().

> +		int result = cf->chunks[i].write_fn(cf->f, data);
> +
> +		if (result)
> +			return result;
> +
> +		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
> +			    cf->chunks[i].size, cf->chunks[i].id,
> +			    cf->f->total + cf->f->offset - start_offset);

I won't complain, as this apparently is sufficient to abstract out
the two existing chunked format files, but it somehow feels a bit
limiting that the one that calls add_chunk() is required to know
what the size of generated data would be, way before .write_fn() is
called to produce the actual data here.

> +	}
> +
> +	return 0;
> +}
> diff --git a/chunk-format.h b/chunk-format.h
> new file mode 100644
> index 00000000000..bfaed672813
> --- /dev/null
> +++ b/chunk-format.h
> @@ -0,0 +1,20 @@
> +#ifndef CHUNK_FORMAT_H
> +#define CHUNK_FORMAT_H
> +
> +#include "git-compat-util.h"
> +
> +struct hashfile;
> +struct chunkfile;
> +
> +struct chunkfile *init_chunkfile(struct hashfile *f);
> +void free_chunkfile(struct chunkfile *cf);
> +int get_num_chunks(struct chunkfile *cf);
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      void *data);

Write this on a single line.

> +void add_chunk(struct chunkfile *cf,
> +	       uint64_t id,
> +	       chunk_write_fn fn,
> +	       size_t size);

Shouldn't this match the order of members in chunk_info struct?

> +int write_chunkfile(struct chunkfile *cf, void *data);
> +
> +#endif

Junio C Hamano Feb. 4, 2021, 10:40 p.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

>> +void add_chunk(struct chunkfile *cf,
>> +	       uint64_t id,
>> +	       chunk_write_fn fn,
>> +	       size_t size);
>
> Shouldn't this match the order of members in chunk_info struct?

Nah.  Unless there are other reasons to touch numerous add_chunk()
calls that have already been written (like we find a need to add a
new parameter to the call), I do not think this matters too much.

Thanks.

Derrick Stolee Feb. 5, 2021, 11:37 a.m. UTC | #3

On 2/4/2021 4:24 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +/*
>> + * When writing a chunk-based file format, collect the chunks in
>> + * an array of chunk_info structs. The size stores the _expected_
>> + * amount of data that will be written by write_fn.
>> + */
>> +struct chunk_info {
>> +	uint32_t id;
>> +	uint64_t size;
>> +	chunk_write_fn write_fn;
>> +};
>> +...
>> +void add_chunk(struct chunkfile *cf,
>> +	       uint64_t id,
>> +	       chunk_write_fn fn,
>> +	       size_t size)
>> +{
>> +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
>> +
>> +	cf->chunks[cf->chunks_nr].id = id;
>> +	cf->chunks[cf->chunks_nr].write_fn = fn;
>> +	cf->chunks[cf->chunks_nr].size = size;
>> +	cf->chunks_nr++;
>> +}
> 
> Somebody somewhere between the caller in the higher part of the
> callchain (that has to work with platform native types) and the
> on-disk format at the bottom of the callchain (that has to work
> with fixed size data fields) needs to make sure that the size that
> the higher level caller has fits on-disk data structure we define,
> and the data we read from disk fits the in-core structure our caller
> use on the reading side.
> 
> If there is a function at the one level closer to the disk than
> "struct chunk_info" that takes a "struct chunk_info" and writes the
> id and size to disk (and fills "struct chunk_info" from what is read
> from the disk, on the reading side), it would be a good place to do
> the size_t to uint64_t conversion.

I'm fine with keeping the external interface focused on size_t
instead of uint64_t.

> It is OK to do the conversion-while-checking in add_chunk(), too.
> 
> But a silent type casting from size_t to uint64_t done silently by
> assignment bothers me. 

Does this bother you only because its part of the external interface?
If I understand correctly, uint64_t will always be at least as big
as size_t, so this doesn't need any protections for overflow or
anything. Is there something I should be doing before casting?

> Also, I think you meant to make the incoming
> ID uint32_t; am I missing something, or did nobody notice it in the
> review of the previous round?

Yes, this should be 32-bits. Will fix.
 
>> +int write_chunkfile(struct chunkfile *cf, void *data)
>> +{
>> +	int i;
>> +	size_t cur_offset = cf->f->offset + cf->f->total;
> 
> That ought to be off_t, as it is a seek position inside a file
> (struct hashfile.total is already off_t).

I can use off_t for the other offsets in this computation, but
cur_offset will be used in hashwrite_be64(), so maybe it is best
to use uint64_t here?

> Use csum-file.h::hashfile_total() instead, perhaps?  .offset member
> is an implementation detail of the hashfile API (i.e. some leftover
> bits are kept in in-core buffer, until we accumulate enough to make
> it worth flushing to the disk), and by using the helper, this code
> does not have to know about it.

Thanks! This is cleaner.

>> +	/* Add the table of contents to the current offset */
>> +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
> 
> Is that 12 == sizeof(chunk_info.id) + sizeof(chunk_info.size)?
> If so, this makes sense.

Yes.

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		hashwrite_be32(cf->f, cf->chunks[i].id);
>> +		hashwrite_be64(cf->f, cur_offset);
>> +
>> +		cur_offset += cf->chunks[i].size;
>> +	}
>> +
>> +	/* Trailing entry marks the end of the chunks */
>> +	hashwrite_be32(cf->f, 0);
>> +	hashwrite_be64(cf->f, cur_offset);
> 
> OK.  This helper does not tell us anything about what comes in the
> on-disk file before this point, but we write a table of contents
> that says "chunk with this ID has this size, chunk with that ID has
> that size, ...", concluded by something that looks like another
> entry with chunk ID 0 that records the current offset as its size.

Right. The table of contents gives us enough information to find
the start _and_ end of each chunk (and hence compute their size).

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		uint64_t start_offset = cf->f->total + cf->f->offset;
> 
> Likewise about the type and use of hashfile_total().

This one can definitely be off_t.

>> +		int result = cf->chunks[i].write_fn(cf->f, data);
>> +
>> +		if (result)
>> +			return result;
>> +
>> +		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
>> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
>> +			    cf->chunks[i].size, cf->chunks[i].id,
>> +			    cf->f->total + cf->f->offset - start_offset);
> 
> I won't complain, as this apparently is sufficient to abstract out
> the two existing chunked format files, but it somehow feels a bit
> limiting that the one that calls add_chunk() is required to know
> what the size of generated data would be, way before .write_fn() is
> called to produce the actual data here.

This was pointed out earlier, but it _is_ part of the existing users
of the format. The table of contents is written at the start of the
file instead of the end (such as in the .zip format).

The current chunk format API makes the same assumption (ToC comes
first) but could be adjusted later to let this part of the method
dynamically compute the chunk sizes and fill a ToC at the end. The
way to modify this API would be to add a 'flags' parameter.

So far, this has not been necessary, but might be in the future.

>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/chunk-format.h b/chunk-format.h
>> new file mode 100644
>> index 00000000000..bfaed672813
>> --- /dev/null
>> +++ b/chunk-format.h
>> @@ -0,0 +1,20 @@
>> +#ifndef CHUNK_FORMAT_H
>> +#define CHUNK_FORMAT_H
>> +
>> +#include "git-compat-util.h"
>> +
>> +struct hashfile;
>> +struct chunkfile;
>> +
>> +struct chunkfile *init_chunkfile(struct hashfile *f);
>> +void free_chunkfile(struct chunkfile *cf);
>> +int get_num_chunks(struct chunkfile *cf);
>> +typedef int (*chunk_write_fn)(struct hashfile *f,
>> +			      void *data);
> 
> Write this on a single line.

Will do.

Thanks,
-Stolee

Junio C Hamano Feb. 5, 2021, 7:25 p.m. UTC | #4

Derrick Stolee <stolee@gmail.com> writes:

>> But a silent type casting from size_t to uint64_t done silently by
>> assignment bothers me. 
>
> Does this bother you only because its part of the external interface?
> If I understand correctly, uint64_t will always be at least as big
> as size_t, so this doesn't need any protections for overflow or
> anything. Is there something I should be doing before casting?

I am OK to use uint64_t on the caller-facing side, as long as we
explicitly declare that uint64_t ought to be large enough for
everybody.  Struct members and variables that are closer to the
on-disk data need to be of sized type to avoid repeating the pain
caused by our early "unsigned long ought to be large enough for
everybody" attitude, but it is nicer to be working with more
abstract types in the layer higher up.  And if there is a risk of
truncation in either direction, we should be defensive.  That's all.

>>> +int write_chunkfile(struct chunkfile *cf, void *data)
>>> +{
>>> +	int i;
>>> +	size_t cur_offset = cf->f->offset + cf->f->total;
>> 
>> That ought to be off_t, as it is a seek position inside a file
>> (struct hashfile.total is already off_t).
>
> I can use off_t for the other offsets in this computation, but
> cur_offset will be used in hashwrite_be64(), so maybe it is best
> to use uint64_t here?

As I discovered in the later parts, I think off_t makes less sense
than size_t in the context of this design, so size_t is fine, as
long as we keep the "users of chunkfile API work on a mmapped region
of contiguous memory" (which I think is OK).  uint64_t is also fine,
as long as this one is an internal implementation detail; iow,
callers of the API do not have to be responsible for casting their
more abstract and platform neutral types down to these fixed-sized
types even if we choose to use uint64_t here.

> The current chunk format API makes the same assumption (ToC comes
> first) but could be adjusted later to let this part of the method
> dynamically compute the chunk sizes and fill a ToC at the end. The
> way to modify this API would be to add a 'flags' parameter.
>
> So far, this has not been necessary, but might be in the future.

Yup, and I am happy with the current design for now.  Thanks for
clarifying the thinking behind it.

diff --git a/Makefile b/Makefile
index 7b64106930a..50a7663841e 100644
--- a/Makefile
+++ b/Makefile
@@ -854,6 +854,7 @@  LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += chdir-notify.o
 LIB_OBJS += checkout.o
+LIB_OBJS += chunk-format.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
diff --git a/chunk-format.c b/chunk-format.c
new file mode 100644
index 00000000000..ab914c55856
--- /dev/null
+++ b/chunk-format.c
@@ -0,0 +1,91 @@ 
+#include "cache.h"
+#include "chunk-format.h"
+#include "csum-file.h"
+#define CHUNK_LOOKUP_WIDTH 12
+
+/*
+ * When writing a chunk-based file format, collect the chunks in
+ * an array of chunk_info structs. The size stores the _expected_
+ * amount of data that will be written by write_fn.
+ */
+struct chunk_info {
+	uint32_t id;
+	uint64_t size;
+	chunk_write_fn write_fn;
+};
+
+struct chunkfile {
+	struct hashfile *f;
+
+	struct chunk_info *chunks;
+	size_t chunks_nr;
+	size_t chunks_alloc;
+};
+
+struct chunkfile *init_chunkfile(struct hashfile *f)
+{
+	struct chunkfile *cf = xcalloc(1, sizeof(*cf));
+	cf->f = f;
+	return cf;
+}
+
+void free_chunkfile(struct chunkfile *cf)
+{
+	if (!cf)
+		return;
+	free(cf->chunks);
+	free(cf);
+}
+
+int get_num_chunks(struct chunkfile *cf)
+{
+	return cf->chunks_nr;
+}
+
+void add_chunk(struct chunkfile *cf,
+	       uint64_t id,
+	       chunk_write_fn fn,
+	       size_t size)
+{
+	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
+
+	cf->chunks[cf->chunks_nr].id = id;
+	cf->chunks[cf->chunks_nr].write_fn = fn;
+	cf->chunks[cf->chunks_nr].size = size;
+	cf->chunks_nr++;
+}
+
+int write_chunkfile(struct chunkfile *cf, void *data)
+{
+	int i;
+	size_t cur_offset = cf->f->offset + cf->f->total;
+
+	/* Add the table of contents to the current offset */
+	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		hashwrite_be32(cf->f, cf->chunks[i].id);
+		hashwrite_be64(cf->f, cur_offset);
+
+		cur_offset += cf->chunks[i].size;
+	}
+
+	/* Trailing entry marks the end of the chunks */
+	hashwrite_be32(cf->f, 0);
+	hashwrite_be64(cf->f, cur_offset);
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		uint64_t start_offset = cf->f->total + cf->f->offset;
+		int result = cf->chunks[i].write_fn(cf->f, data);
+
+		if (result)
+			return result;
+
+		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    cf->chunks[i].size, cf->chunks[i].id,
+			    cf->f->total + cf->f->offset - start_offset);
+	}
+
+	return 0;
+}
diff --git a/chunk-format.h b/chunk-format.h
new file mode 100644
index 00000000000..bfaed672813
--- /dev/null
+++ b/chunk-format.h
@@ -0,0 +1,20 @@ 
+#ifndef CHUNK_FORMAT_H
+#define CHUNK_FORMAT_H
+
+#include "git-compat-util.h"
+
+struct hashfile;
+struct chunkfile;
+
+struct chunkfile *init_chunkfile(struct hashfile *f);
+void free_chunkfile(struct chunkfile *cf);
+int get_num_chunks(struct chunkfile *cf);
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      void *data);
+void add_chunk(struct chunkfile *cf,
+	       uint64_t id,
+	       chunk_write_fn fn,
+	       size_t size);
+int write_chunkfile(struct chunkfile *cf, void *data);
+
+#endif

[v2,02/17] chunk-format: create chunk format write API

Commit Message

Comments

Patch