Message ID | 83d292532a0fa3f3a0ad343421be4a99a03471d0.1611759716.git.gitgitgadget@gmail.com
---|---
State | Superseded
Series | Refactor chunk-format into an API
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <dstolee@microsoft.com> > > When calculating the sizes of certain chunks, we should use 64-bit > multiplication always. This allows us to properly predict the chunk > sizes without risk of overflow. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > midx.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) This one I find somewhat questionable for multiple reasons. * the fourth parameter of add_chunk() is of size_t, not uint64_t; shouldn't the multiplication be done in type size_t instead? * these mutiplications were introduced in "midx: use chunk-format API in write_midx_internal()"; that step should use the arithmetic with cast (if necessary) from the start, no? * There is "ctx.entries_nr * MIDX_CHUNKID_OFFSET_WIDTH" passed to add_chunk(), in the post-context of the first hunk. Shouldn't that be covered as well? I didn't grep for all uses of add_chunk(), but I wouldn't be surprised if this patch missed some of the calls that need the same treatment. 
> diff --git a/midx.c b/midx.c
> index e94dcd34b7f..a365dac6bbc 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -913,7 +913,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
>  		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
>  	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
> -		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
> +		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
>  	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
>  		  write_midx_object_offsets,
>  		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
> @@ -921,7 +921,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	if (ctx.large_offsets_needed)
>  		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
>  			  write_midx_large_offsets,
> -			  ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
> +			  (uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
>
>  	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
>  	write_chunkfile(cf, &ctx);
On Thu, Feb 4, 2021 at 4:00 PM Junio C Hamano <gitster@pobox.com> wrote:
> * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>   shouldn't the multiplication be done in type size_t instead?

There are (still) systems with 32-bit size_t (but 64-bit
off_t / file sizes), so ... probably not. Is size_t ever more than
64 bits these days?

Chris
On 2/4/2021 7:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> When calculating the sizes of certain chunks, we should use 64-bit
>> multiplication always. This allows us to properly predict the chunk
>> sizes without risk of overflow.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  midx.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> This one I find somewhat questionable for multiple reasons.
>
> * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>   shouldn't the multiplication be done in type size_t instead?

This is probably appropriate because we will truncate to size_t if
it is smaller than uint64_t.

> * these multiplications were introduced in "midx: use chunk-format
>   API in write_midx_internal()"; that step should use the
>   arithmetic with cast (if necessary) from the start, no?

I wanted to isolate these changes specifically so we could be
careful about the multiplications and not be distracted by them
when converting to the chunk-format API. The multiplications were
"moved" by that patch, not "introduced".

> * There is "ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH" passed to
>   add_chunk(), in the post-context of the first hunk. Shouldn't
>   that be covered as well? I didn't grep for all uses of
>   add_chunk(), but I wouldn't be surprised if this patch missed
>   some of the calls that need the same treatment.

And here is a great example of why it was good to call out these
multiplications in their own patch. I did a full inspection of all
multiplications in midx.c and found a few more instances of possible
overflow. Two are on the read side, but they require the object
lookup chunk to have size 4 GB or larger. This is not _that_ far off
from possibility! My multi-pack-index for the Windows repository is
currently ~1.6 GB (in total, including the other chunks).

Thanks,
-Stolee
Derrick Stolee <stolee@gmail.com> writes:

> On 2/4/2021 7:00 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> When calculating the sizes of certain chunks, we should use 64-bit
>>> multiplication always. This allows us to properly predict the chunk
>>> sizes without risk of overflow.
>>>
>>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>>> ---
>>>  midx.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> This one I find somewhat questionable for multiple reasons.
>>
>> * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>>   shouldn't the multiplication be done in type size_t instead?
>
> This is probably appropriate because we will truncate to size_t if
> it is smaller than uint64_t.

In other words, if size_t turns out to be too small, doing the
multiplication in uint64_t would not help at all, and the add_chunk()
API needs its parameter types updated [*].

    side note: I really wish that the language and the compiler
    helped us so that we didn't have to do this---after all, our
    function prototype says the result will be passed as a certain
    type, so it would be nice if the arithmetic to compute that
    result were automatically carried out in a way not to cause
    truncation.

>> * these multiplications were introduced in "midx: use chunk-format
>>   API in write_midx_internal()"; that step should use the
>>   arithmetic with cast (if necessary) from the start, no?
>
> I wanted to isolate these changes specifically so we could be
> careful about the multiplications and not be distracted by them
> when converting to the chunk-format API. The multiplications were
> "moved" by that patch, not "introduced".

Hmph, I somehow had an impression that they did not have a
truncation issue in the original context, but perhaps I was wrong.
OK.

> I did a full inspection of all multiplications in midx.c and
> found a few more instances of possible overflow. Two are on the
> read side, but they require the object lookup chunk to have size
> 4 GB or larger. This is not _that_ far off from possibility! My
> multi-pack-index for the Windows repository is currently ~1.6 GB
> (in total, including the other chunks).

Thanks.
Chris Torek <chris.torek@gmail.com> writes:

> On Thu, Feb 4, 2021 at 4:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>> * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>>   shouldn't the multiplication be done in type size_t instead?
>
> There are (still) systems with 32-bit size_t (but 64-bit
> off_t / file sizes), so ... probably not. Is size_t ever more than
> 64 bits these days?

Sorry, you lost me. I do not see how it would help to perform the
multiplication in uint64_t, when you suspect that size_t is too
small, if the final destination of the result of the multiplication
is a function argument of type size_t?
On Fri, Feb 5, 2021 at 12:41 PM Junio C Hamano <gitster@pobox.com> wrote:
> Chris Torek <chris.torek@gmail.com> writes:
> > There are (still) systems with 32-bit size_t (but 64-bit
> > off_t / file sizes), so ... probably not. Is size_t ever more than
> > 64 bits these days?
>
> Sorry, you lost me. I do not see how it would help to perform the
> multiplication in uint64_t, when you suspect that size_t is too
> small, if the final destination of the result of the multiplication
> is a function argument of type size_t?

No, you and Derrick Stolee are right, I wasn't looking out far
enough here (to the actual function).

(I was wondering, though, if there are systems where the valid range
for size_t could exceed that for off_t. Are there still systems
using 32-bit off_t? Sometimes I think there are too many abstracted
types running around here -- how do we know which sizes are big
enough? There is always uintmax_t, though, and for unsigned types,
((T)-1) gets you the maximum possible value.)

Chris
On Thu, Feb 04, 2021 at 04:00:19PM -0800, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > When calculating the sizes of certain chunks, we should use 64-bit
> > multiplication always. This allows us to properly predict the chunk
> > sizes without risk of overflow.
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > ---
> >  midx.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
>
> This one I find somewhat questionable for multiple reasons.
>
> * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>   shouldn't the multiplication be done in type size_t instead?
>
> * these multiplications were introduced in "midx: use chunk-format
>   API in write_midx_internal()";

No, that patch also removes lines like:

-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;

-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-		ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;

So those potentially problematic multiplications were already there
before this series, and in fact trace all the way back to the
initial midx patch series (commits 0d5b3a5ef7 (midx: write object
ids in a chunk, 2018-07-12) and 662148c435 (midx: write object
offsets, 2018-07-12)).

> that step should use the
> arithmetic with cast (if necessary) from the start, no?

As it fixes a long-standing issue, it should rather be a bugfix
patch at the beginning of the series.
SZEDER Gábor <szeder.dev@gmail.com> writes:

> No, that patch also removes lines like:
>
> -	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
>
> -	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
> -		ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;

OK. In other words, the above was replaced in the same patch with

    add_chunk(...., U32 * U32);

where the called function expects the result of the multiplication
as size_t in its function prototype.

It is a bit sad that U32*U32, to compute the argument that is to be
passed as U64, must be cast as (uint64_t)U32*U32 by the caller X-<.

The original that the above replaced, shown in your quote, is:

    U64 = U64 + U32 * U32;

I also wish that the fact that it is added to U64 were sufficient
not to require the RHS to be written as

    U64 + (uint64_t) U32 * U32

(in other words, that the original that was removed had been OK
without a cast). Sad.
diff --git a/midx.c b/midx.c
index e94dcd34b7f..a365dac6bbc 100644
--- a/midx.c
+++ b/midx.c
@@ -913,7 +913,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
 		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
 	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
-		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
+		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
 	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
 		  write_midx_object_offsets,
 		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
@@ -921,7 +921,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.large_offsets_needed)
 		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
 			  write_midx_large_offsets,
-			  ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
+			  (uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);

 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
 	write_chunkfile(cf, &ctx);