Message ID | 814512f216719d89f41822753d5c71df5e49385d.1611759716.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Refactor chunk-format into an API | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > +/* > + * When writing a chunk-based file format, collect the chunks in > + * an array of chunk_info structs. The size stores the _expected_ > + * amount of data that will be written by write_fn. > + */ > +struct chunk_info { > + uint32_t id; > + uint64_t size; > + chunk_write_fn write_fn; > +}; > +... > +void add_chunk(struct chunkfile *cf, > + uint64_t id, > + chunk_write_fn fn, > + size_t size) > +{ > + ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc); > + > + cf->chunks[cf->chunks_nr].id = id; > + cf->chunks[cf->chunks_nr].write_fn = fn; > + cf->chunks[cf->chunks_nr].size = size; > + cf->chunks_nr++; > +} Somebody somewhere between the caller in the higher part of the callchain (that has to work with platform native types) and the on-disk format at the bottom of the callchain (that has to work with fixed size data fields) needs to make sure that the size that the higher level caller has fits on-disk data structure we define, and the data we read from disk fits the in-core structure our caller use on the reading side. If there is a function at the one level closer to the disk than "struct chunk_info" that takes a "struct chunk_info" and writes the id and size to disk (and fills "struct chunk_info" from what is read from the disk, on the reading side), it would be a good place to do the size_t to uint64_t conversion. It is OK to do the conversion-while-checking in add_chunk(), too. But a silent type casting from size_t to uint64_t done silently by assignment bothers me. Also, I think you meant to make the incoming ID uint32_t; am I missing something, or did nobody notice it in the review of the previous round? > +int write_chunkfile(struct chunkfile *cf, void *data) > +{ > + int i; > + size_t cur_offset = cf->f->offset + cf->f->total; That ought to be off_t, as it is a seek position inside a file (struct hashfile.total is already off_t). Use csum-file.h::hashfile_total() instead, perhaps? .offset member is an implementation detail of the hashfile API (i.e. some leftover bits are kept in in-core buffer, until we accumulate enough to make it worth flushing to the disk), and by using the helper, this code does not have to know about it. > + /* Add the table of contents to the current offset */ > + cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH; Is that 12 == sizeof(chunk_info.id) + sizeof(chunk_info.size)? If so, this makes sense. > + for (i = 0; i < cf->chunks_nr; i++) { > + hashwrite_be32(cf->f, cf->chunks[i].id); > + hashwrite_be64(cf->f, cur_offset); > + > + cur_offset += cf->chunks[i].size; > + } > + > + /* Trailing entry marks the end of the chunks */ > + hashwrite_be32(cf->f, 0); > + hashwrite_be64(cf->f, cur_offset); OK. This helper does not tell us anything about what comes in the on-disk file before this point, but we write a table of contents that says "chunk with this ID has this size, chunk with that ID has that size, ...", concluded by something that looks like another entry with chunk ID 0 that records the current offset as its size. > + for (i = 0; i < cf->chunks_nr; i++) { > + uint64_t start_offset = cf->f->total + cf->f->offset; Likewise about the type and use of hashfile_total(). > + int result = cf->chunks[i].write_fn(cf->f, data); > + > + if (result) > + return result; > + > + if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size) > + BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead", > + cf->chunks[i].size, cf->chunks[i].id, > + cf->f->total + cf->f->offset - start_offset); I won't complain, as this apparently is sufficient to abstract out the two existing chunked format files, but it somehow feels a bit limiting that the one that calls add_chunk() is required to know what the size of generated data would be, way before .write_fn() is called to produce the actual data here. > + } > + > + return 0; > +} > diff --git a/chunk-format.h b/chunk-format.h > new file mode 100644 > index 00000000000..bfaed672813 > --- /dev/null > +++ b/chunk-format.h > @@ -0,0 +1,20 @@ > +#ifndef CHUNK_FORMAT_H > +#define CHUNK_FORMAT_H > + > +#include "git-compat-util.h" > + > +struct hashfile; > +struct chunkfile; > + > +struct chunkfile *init_chunkfile(struct hashfile *f); > +void free_chunkfile(struct chunkfile *cf); > +int get_num_chunks(struct chunkfile *cf); > +typedef int (*chunk_write_fn)(struct hashfile *f, > + void *data); Write this on a single line. > +void add_chunk(struct chunkfile *cf, > + uint64_t id, > + chunk_write_fn fn, > + size_t size); Shouldn't this match the order of members in chunk_info struct? > +int write_chunkfile(struct chunkfile *cf, void *data); > + > +#endif
Junio C Hamano <gitster@pobox.com> writes: >> +void add_chunk(struct chunkfile *cf, >> + uint64_t id, >> + chunk_write_fn fn, >> + size_t size); > > Shouldn't this match the order of members in chunk_info struct? Nah. Unless there are other reasons to touch numerous add_chunk() calls that have already been written (like we find a need to add a new parameter to the call), I do not think this matters too much. Thanks.
On 2/4/2021 4:24 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> +/* >> + * When writing a chunk-based file format, collect the chunks in >> + * an array of chunk_info structs. The size stores the _expected_ >> + * amount of data that will be written by write_fn. >> + */ >> +struct chunk_info { >> + uint32_t id; >> + uint64_t size; >> + chunk_write_fn write_fn; >> +}; >> +... >> +void add_chunk(struct chunkfile *cf, >> + uint64_t id, >> + chunk_write_fn fn, >> + size_t size) >> +{ >> + ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc); >> + >> + cf->chunks[cf->chunks_nr].id = id; >> + cf->chunks[cf->chunks_nr].write_fn = fn; >> + cf->chunks[cf->chunks_nr].size = size; >> + cf->chunks_nr++; >> +} > > Somebody somewhere between the caller in the higher part of the > callchain (that has to work with platform native types) and the > on-disk format at the bottom of the callchain (that has to work > with fixed size data fields) needs to make sure that the size that > the higher level caller has fits on-disk data structure we define, > and the data we read from disk fits the in-core structure our caller > use on the reading side. > > If there is a function at the one level closer to the disk than > "struct chunk_info" that takes a "struct chunk_info" and writes the > id and size to disk (and fills "struct chunk_info" from what is read > from the disk, on the reading side), it would be a good place to do > the size_t to uint64_t conversion. I'm fine with keeping the external interface focused on size_t instead of uint64_t. > It is OK to do the conversion-while-checking in add_chunk(), too. > > But a silent type casting from size_t to uint64_t done silently by > assignment bothers me. Does this bother you only because its part of the external interface? If I understand correctly, uint64_t will always be at least as big as size_t, so this doesn't need any protections for overflow or anything. Is there something I should be doing before casting? > Also, I think you meant to make the incoming > ID uint32_t; am I missing something, or did nobody notice it in the > review of the previous round? Yes, this should be 32-bits. Will fix. >> +int write_chunkfile(struct chunkfile *cf, void *data) >> +{ >> + int i; >> + size_t cur_offset = cf->f->offset + cf->f->total; > > That ought to be off_t, as it is a seek position inside a file > (struct hashfile.total is already off_t). I can use off_t for the other offsets in this computation, but cur_offset will be used in hashwrite_be64(), so maybe it is best to use uint64_t here? > Use csum-file.h::hashfile_total() instead, perhaps? .offset member > is an implementation detail of the hashfile API (i.e. some leftover > bits are kept in in-core buffer, until we accumulate enough to make > it worth flushing to the disk), and by using the helper, this code > does not have to know about it. Thanks! This is cleaner. >> + /* Add the table of contents to the current offset */ >> + cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH; > > Is that 12 == sizeof(chunk_info.id) + sizeof(chunk_info.size)? > If so, this makes sense. Yes. >> + for (i = 0; i < cf->chunks_nr; i++) { >> + hashwrite_be32(cf->f, cf->chunks[i].id); >> + hashwrite_be64(cf->f, cur_offset); >> + >> + cur_offset += cf->chunks[i].size; >> + } >> + >> + /* Trailing entry marks the end of the chunks */ >> + hashwrite_be32(cf->f, 0); >> + hashwrite_be64(cf->f, cur_offset); > > OK. This helper does not tell us anything about what comes in the > on-disk file before this point, but we write a table of contents > that says "chunk with this ID has this size, chunk with that ID has > that size, ...", concluded by something that looks like another > entry with chunk ID 0 that records the current offset as its size. Right. The table of contents gives us enough information to find the start _and_ end of each chunk (and hence compute their size). >> + for (i = 0; i < cf->chunks_nr; i++) { >> + uint64_t start_offset = cf->f->total + cf->f->offset; > > Likewise about the type and use of hashfile_total(). This one can definitely be off_t. >> + int result = cf->chunks[i].write_fn(cf->f, data); >> + >> + if (result) >> + return result; >> + >> + if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size) >> + BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead", >> + cf->chunks[i].size, cf->chunks[i].id, >> + cf->f->total + cf->f->offset - start_offset); > > I won't complain, as this apparently is sufficient to abstract out > the two existing chunked format files, but it somehow feels a bit > limiting that the one that calls add_chunk() is required to know > what the size of generated data would be, way before .write_fn() is > called to produce the actual data here. This was pointed out earlier, but it _is_ part of the existing users of the format. The table of contents is written at the start of the file instead of the end (such as in the .zip format). The current chunk format API makes the same assumption (ToC comes first) but could be adjusted later to let this part of the method dynamically compute the chunk sizes and fill a ToC at the end. The way to modify this API would be to add a 'flags' parameter. So far, this has not been necessary, but might be in the future. >> + } >> + >> + return 0; >> +} >> diff --git a/chunk-format.h b/chunk-format.h >> new file mode 100644 >> index 00000000000..bfaed672813 >> --- /dev/null >> +++ b/chunk-format.h >> @@ -0,0 +1,20 @@ >> +#ifndef CHUNK_FORMAT_H >> +#define CHUNK_FORMAT_H >> + >> +#include "git-compat-util.h" >> + >> +struct hashfile; >> +struct chunkfile; >> + >> +struct chunkfile *init_chunkfile(struct hashfile *f); >> +void free_chunkfile(struct chunkfile *cf); >> +int get_num_chunks(struct chunkfile *cf); >> +typedef int (*chunk_write_fn)(struct hashfile *f, >> + void *data); > > Write this on a single line. Will do. Thanks, -Stolee
Derrick Stolee <stolee@gmail.com> writes: >> But a silent type casting from size_t to uint64_t done silently by >> assignment bothers me. > > Does this bother you only because its part of the external interface? > If I understand correctly, uint64_t will always be at least as big > as size_t, so this doesn't need any protections for overflow or > anything. Is there something I should be doing before casting? I am OK to use uint64_t on the caller-facing side, as long as we explicitly declare that uint64_t ought to be large enough for everybody. Struct members and variables that are closer to the on-disk data need to be of sized type to avoid repeating the pain caused by our early "unsigned long ought to be large enough for everybody" attitude, but it is nicer to be working with more abstract types in the layer higher up. And if there is a risk of truncation in either direction, we should be defensive. That's all. >>> +int write_chunkfile(struct chunkfile *cf, void *data) >>> +{ >>> + int i; >>> + size_t cur_offset = cf->f->offset + cf->f->total; >> >> That ought to be off_t, as it is a seek position inside a file >> (struct hashfile.total is already off_t). > > I can use off_t for the other offsets in this computation, but > cur_offset will be used in hashwrite_be64(), so maybe it is best > to use uint64_t here? As I discovered in the later parts, I think off_t makes less sense than size_t in the context of this design, so size_t is fine, as long as we keep the "users of chunkfile API work on a mmapped region of contiguous memory" (which I think is OK). uint64_t is also fine, as long as this one is an internal implementation detail; iow, callers of the API do not have to be responsible for casting their more abstract and platform neutral types down to these fixed-sized types even if we choose to use uint64_t here. > The current chunk format API makes the same assumption (ToC comes > first) but could be adjusted later to let this part of the method > dynamically compute the chunk sizes and fill a ToC at the end. The > way to modify this API would be to add a 'flags' parameter. > > So far, this has not been necessary, but might be in the future. Yup, and I am happy with the current design for now. Thanks for clarifying the thinking behind it.
diff --git a/Makefile b/Makefile index 7b64106930a..50a7663841e 100644 --- a/Makefile +++ b/Makefile @@ -854,6 +854,7 @@ LIB_OBJS += bundle.o LIB_OBJS += cache-tree.o LIB_OBJS += chdir-notify.o LIB_OBJS += checkout.o +LIB_OBJS += chunk-format.o LIB_OBJS += color.o LIB_OBJS += column.o LIB_OBJS += combine-diff.o diff --git a/chunk-format.c b/chunk-format.c new file mode 100644 index 00000000000..ab914c55856 --- /dev/null +++ b/chunk-format.c @@ -0,0 +1,91 @@ +#include "cache.h" +#include "chunk-format.h" +#include "csum-file.h" +#define CHUNK_LOOKUP_WIDTH 12 + +/* + * When writing a chunk-based file format, collect the chunks in + * an array of chunk_info structs. The size stores the _expected_ + * amount of data that will be written by write_fn. + */ +struct chunk_info { + uint32_t id; + uint64_t size; + chunk_write_fn write_fn; +}; + +struct chunkfile { + struct hashfile *f; + + struct chunk_info *chunks; + size_t chunks_nr; + size_t chunks_alloc; +}; + +struct chunkfile *init_chunkfile(struct hashfile *f) +{ + struct chunkfile *cf = xcalloc(1, sizeof(*cf)); + cf->f = f; + return cf; +} + +void free_chunkfile(struct chunkfile *cf) +{ + if (!cf) + return; + free(cf->chunks); + free(cf); +} + +int get_num_chunks(struct chunkfile *cf) +{ + return cf->chunks_nr; +} + +void add_chunk(struct chunkfile *cf, + uint64_t id, + chunk_write_fn fn, + size_t size) +{ + ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc); + + cf->chunks[cf->chunks_nr].id = id; + cf->chunks[cf->chunks_nr].write_fn = fn; + cf->chunks[cf->chunks_nr].size = size; + cf->chunks_nr++; +} + +int write_chunkfile(struct chunkfile *cf, void *data) +{ + int i; + size_t cur_offset = cf->f->offset + cf->f->total; + + /* Add the table of contents to the current offset */ + cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH; + + for (i = 0; i < cf->chunks_nr; i++) { + hashwrite_be32(cf->f, cf->chunks[i].id); + hashwrite_be64(cf->f, cur_offset); + + cur_offset += cf->chunks[i].size; + } + + /* Trailing entry marks the end of the chunks */ + hashwrite_be32(cf->f, 0); + hashwrite_be64(cf->f, cur_offset); + + for (i = 0; i < cf->chunks_nr; i++) { + uint64_t start_offset = cf->f->total + cf->f->offset; + int result = cf->chunks[i].write_fn(cf->f, data); + + if (result) + return result; + + if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size) + BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead", + cf->chunks[i].size, cf->chunks[i].id, + cf->f->total + cf->f->offset - start_offset); + } + + return 0; +} diff --git a/chunk-format.h b/chunk-format.h new file mode 100644 index 00000000000..bfaed672813 --- /dev/null +++ b/chunk-format.h @@ -0,0 +1,20 @@ +#ifndef CHUNK_FORMAT_H +#define CHUNK_FORMAT_H + +#include "git-compat-util.h" + +struct hashfile; +struct chunkfile; + +struct chunkfile *init_chunkfile(struct hashfile *f); +void free_chunkfile(struct chunkfile *cf); +int get_num_chunks(struct chunkfile *cf); +typedef int (*chunk_write_fn)(struct hashfile *f, + void *data); +void add_chunk(struct chunkfile *cf, + uint64_t id, + chunk_write_fn fn, + size_t size); +int write_chunkfile(struct chunkfile *cf, void *data); + +#endif