Message ID | 0e2fcee6894b7b16136ff09a69f199bea9f8c882.1730833507.git.me@ttaylorr.com (mailing list archive)
---|---
State | New
Series | t/helper/test-tool: implement 'sha1-unsafe' helper
Taylor Blau <me@ttaylorr.com> writes:

> -int cmd_hash_impl(int ac, const char **av, int algo)
> +int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
>  {
>  	git_hash_ctx ctx;
>  	unsigned char hash[GIT_MAX_HEXSZ];
> @@ -27,7 +27,10 @@ int cmd_hash_impl(int ac, const char **av, int algo)
>  		die("OOPS");
>  	}
> 
> -	algop->init_fn(&ctx);
> +	if (unsafe)
> +		algop->unsafe_init_fn(&ctx);
> +	else
> +		algop->init_fn(&ctx);

It may be just me, and it would not matter all that much within the
context of the project because this is merely a test helper, but
giving a pair of init/unsafe_init methods to algop looks unnerving.
It gives an impression that every design of hash algorithm must come
with normal and unsafe variant, which is not what you want to say.

I would have expected that there are different algorithm instances,
and one of them would be "unsafe SHA-1", among "normal SHA-1" and
"SHA-256" (as the last one would not even have an unsafe_init_fn
method), and the callers that want to measure the performance of
each algorithm would simply pick one of these instances and go
through the usual "init", "update", "final" flow, regardless of the
"unsafe"ness of the algorithm.

IOW, ...

> 	while (1) {
> 		ssize_t sz, this_sz;
> @@ -46,9 +49,15 @@ int cmd_hash_impl(int ac, const char **av, int algo)
> 		}
> 		if (this_sz == 0)
> 			break;
> -		algop->update_fn(&ctx, buffer, this_sz);
> +		if (unsafe)
> +			algop->unsafe_update_fn(&ctx, buffer, this_sz);
> +		else
> +			algop->update_fn(&ctx, buffer, this_sz);
> 	}
> -	algop->final_fn(hash, &ctx);
> +	if (unsafe)
> +		algop->unsafe_final_fn(hash, &ctx);
> +	else
> +		algop->final_fn(hash, &ctx);

... I didn't expect any of these "if (unsafe) .. else .." switches.
On 2024-11-06 at 01:38:48, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> 
> > -int cmd_hash_impl(int ac, const char **av, int algo)
> > +int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
> >  {
> >  	git_hash_ctx ctx;
> >  	unsigned char hash[GIT_MAX_HEXSZ];
> > @@ -27,7 +27,10 @@ int cmd_hash_impl(int ac, const char **av, int algo)
> >  		die("OOPS");
> >  	}
> > 
> > -	algop->init_fn(&ctx);
> > +	if (unsafe)
> > +		algop->unsafe_init_fn(&ctx);
> > +	else
> > +		algop->init_fn(&ctx);
> 
> It may be just me, and it would not matter all that much within the
> context of the project because this is merely a test helper, but
> giving a pair of init/unsafe_init methods to algop looks unnerving.
> It gives an impression that every design of hash algorithm must come
> with normal and unsafe variant, which is not what you want to say.
> 
> I would have expected that there are different algorithm instances,
> and one of them would be "unsafe SHA-1", among "normal SHA-1" and
> "SHA-256" (as the last one would not even have an unsafe_init_fn
> method), and the callers that want to measure the performance of
> each algorithm would simply pick one of these instances and go
> through the usual "init", "update", "final" flow, regardless of the
> "unsafe"ness of the algorithm.
> 
> IOW, ...
> 
> > 	while (1) {
> > 		ssize_t sz, this_sz;
> > @@ -46,9 +49,15 @@ int cmd_hash_impl(int ac, const char **av, int algo)
> > 		}
> > 		if (this_sz == 0)
> > 			break;
> > -		algop->update_fn(&ctx, buffer, this_sz);
> > +		if (unsafe)
> > +			algop->unsafe_update_fn(&ctx, buffer, this_sz);
> > +		else
> > +			algop->update_fn(&ctx, buffer, this_sz);
> > 	}
> > -	algop->final_fn(hash, &ctx);
> > +	if (unsafe)
> > +		algop->unsafe_final_fn(hash, &ctx);
> > +	else
> > +		algop->final_fn(hash, &ctx);
> 
> ... I didn't expect any of these "if (unsafe) .. else .." switches.

I don't think we can remove those, and here's why.  Certainly Taylor can
correct me if I'm off base, though.
In the normal case, our hash struct is a union, and different
implementations can have a different layout.  A SHA-1 implementation
will usually keep track of a 64-bit size, 5 32-bit words, and up to 64
bytes of data that might need to be processed.  Maybe SHA-1-DC, our safe
implementation, stores the size first, but our unsafe implementation is
OpenSSL and it stores the 5 hash words first, or whatever.

If we use the same update and final functions, we'll end up with
incorrect data because we'll have initialized the union contents with
data for one implementation but are trying to update or finalize it with
different data, which in the very best case will just produce broken
results, and might just cause a segfault (if one of the implementations
stores a pointer instead of storing the data in the union).

We could certainly adopt a different hash algorithm structure for safe
and unsafe code, but we have a lot of places that assume that there's
just one structure per algorithm.  It would require a substantial amount
of refactoring to remove those assumptions and deal with the fact that
we now have two SHA-1 implementations.  It would also be tricky to deal
with the fact that for SHA-1, we might use the safe or unsafe algorithm,
but for SHA-256, there's only one algorithm structure and we need to use
it for both.
On Thu, Nov 07, 2024 at 12:48:26AM +0000, brian m. carlson wrote:
> > I would have expected that there are different algorithm instances,
> > and one of them would be "unsafe SHA-1", among "normal SHA-1" and
> > "SHA-256" (as the last one would not even have an unsafe_init_fn
> > method), and the callers that want to measure the performance of
> > each algorithm would simply pick one of these instances and go
> > through the usual "init", "update", "final" flow, regardless of the
> > "unsafe"ness of the algorithm.
> [...]
> > ... I didn't expect any of these "if (unsafe) .. else .." switches.
> 
> I don't think we can remove those, and here's why.  Certainly Taylor can
> correct me if I'm off base, though.
> 
> In the normal case, our hash struct is a union, and different
> implementations can have a different layout.  A SHA-1 implementation
> will usually keep track of a 64-bit size, 5 32-bit words, and up to 64
> bytes of data that might need to be processed.  Maybe SHA-1-DC, our safe
> implementation, stores the size first, but our unsafe implementation
> is OpenSSL and it stores the 5 hash words first, or whatever.
> 
> If we use the same update and final functions, we'll end up with
> incorrect data because we'll have initialized the union contents with
> data for one implementation but are trying to update or finalize it with
> different data, which in the very best case will just produce broken
> results, and might just cause a segfault (if one of the implementations
> stores a pointer instead of storing the data in the union).

I don't think Junio is proposing mixing the functions on a single data
type. Which, as you note, would be a disaster.

I think the idea is for a separate git_hash_algo struct entirely, that
can be slotted in as a pointer (since git_hash_algo is already
essentially a virtual type).
That gets weird as you say here:

> We could certainly adopt a different hash algorithm structure for safe
> and unsafe code, but we have a lot of places that assume that there's
> just one structure per algorithm.  It would require a substantial amount
> of refactoring to remove those assumptions and deal with the fact that
> we now have two SHA-1 implementations.  It would also be tricky to deal
> with the fact that for SHA-1, we might use the safe or unsafe algorithm,
> but for SHA-256, there's only one algorithm structure and we need to use
> it for both.

both because we have two different algos for "sha1", but also because
they are not _really_ independent. If the_hash_algo is sha1, then
whichever implementation a given piece of code is using, it must still
be one of the sha1 variants.

So I think you wouldn't want to allocate an enum or a slot in the
hash_algos array, because it is not really an independent algorithm.
But I think it _could_ work as a separate struct that the caller derives
from the main hash algorithm. For example, imagine that the
git_hash_algo struct didn't have unsafe_init_fn, etc, but instead had:

  struct git_hash_algo *unsafe_implementation;

with a matching accessor like:

  struct git_hash_algo *unsafe_hash_algo(struct git_hash_algo *algo)
  {
	/* if we have a faster "unsafe" implementation, use that */
	if (algo->unsafe_implementation)
		return algo->unsafe_implementation;
	/* otherwise just use the default one */
	return algo;
  }

And then csum-file.c, rather than calling:

  the_hash_algo->unsafe_init_fn(...);
  ...
  the_hash_algo->unsafe_final_fn(...);

would do this:

  struct git_hash_algo *algo = unsafe_hash_algo(the_hash_algo);
  algo->init_fn(...);
  ...
  algo->final_fn(...);

And likewise this test helper would have a single conditional at the
start:

  if (unsafe)
	algo = unsafe_hash_algo(algo);

and the rest of the code would just use that.

All that said, I do not think it buys us anything for "real" code.
There the decision between safe/unsafe is in the context of how we are
using it, and not based on some conditional. So while the code in this
test helper is ugly, I don't think we'd ever write anything like that
for real. So it may not be worth the effort to refactor into a more
virtual object-oriented way.

-Peff
On 2024-11-07 at 01:39:15, Jeff King wrote:
> So I think you wouldn't want to allocate an enum or a slot in the
> hash_algos array, because it is not really an independent algorithm.
> But I think it _could_ work as a separate struct that the caller derives
> from the main hash algorithm. For example, imagine that the
> git_hash_algo struct didn't have unsafe_init_fn, etc, but instead had:
> 
>   struct git_hash_algo *unsafe_implementation;
> 
> with a matching accessor like:
> 
>   struct git_hash_algo *unsafe_hash_algo(struct git_hash_algo *algo)
>   {
> 	/* if we have a faster "unsafe" implementation, use that */
> 	if (algo->unsafe_implementation)
> 		return algo->unsafe_implementation;
> 	/* otherwise just use the default one */
> 	return algo;
>   }
> 
> And then csum-file.c, rather than calling:
> 
>   the_hash_algo->unsafe_init_fn(...);
>   ...
>   the_hash_algo->unsafe_final_fn(...);
> 
> would do this:
> 
>   struct git_hash_algo *algo = unsafe_hash_algo(the_hash_algo);
>   algo->init_fn(...);
>   ...
>   algo->final_fn(...);
> 
> And likewise this test helper would have a single conditional at the
> start:
> 
>   if (unsafe)
> 	algo = unsafe_hash_algo(algo);
> 
> and the rest of the code would just use that.

Ah, yes, I think that approach would be simpler and nicer to work with
than a separate entry in the `hash_algos` array.  We do, however, have
some places that assume that a `struct git_hash_algo *` points into the
`hash_algos` array (notably `hash_algo_by_ptr`), so we'd have to adjust
for that, move the function pointers out into their own struct which
we'd use for `unsafe_hash_algo`, or be careful never to call the
relevant functions on our special `git_hash_algo` struct.

> All that said, I do not think it buys us anything for "real" code. There
> the decision between safe/unsafe is in the context of how we are using
> it, and not based on some conditional. So while the code in this test
> helper is ugly, I don't think we'd ever write anything like that for
> real. So it may not be worth the effort to refactor into a more virtual
> object-oriented way.

Yeah, I don't have a strong opinion one way or the other.  I think, with
the limitation I mentioned above, it would probably require a decent
amount of refactoring if we took a different approach, and I'm fine with
going with Taylor's current approach unless he wants to do that
refactoring (in which case, great).
On Thu, Nov 07, 2024 at 01:49:35AM +0000, brian m. carlson wrote:
> Ah, yes, I think that approach would be simpler and nicer to work with
> than a separate entry in the `hash_algos` array.  We do, however, have
> some places that assume that a `struct git_hash_algo *` points into the
> `hash_algos` array (notably `hash_algo_by_ptr`), so we'd have to adjust
> for that, move the function pointers out into their own struct which
> we'd use for `unsafe_hash_algo`, or be careful never to call the
> relevant functions on our special `git_hash_algo` struct.

Yeah, I wondered if some code might be surprised by having a separate
hash algo. Another weird thing is that the sub-implementation algo
struct will have its own rawsz, hexsz, etc, even though those _must_ be
the same as its parent's.

If all of the virtual implementation functions were in a git_hash_impl
struct, then each git_hash_algo struct could have one embedded for the
main implementation (which in sha1's case would be a collision-detecting
one), and an optional pointer to another unsafe _impl struct. And then
you get more type-safety, because you can never confuse the _impl struct
for a parent git_hash_algo.

The downside is that every single caller, even if it doesn't care about
the unsafe variant, needs to refer to the_hash_algo->impl.init_fn(),
etc, due to the extra layer of indirection. Probably not worth it.

> Yeah, I don't have a strong opinion one way or the other.  I think, with
> the limitation I mentioned above, it would probably require a decent
> amount of refactoring if we took a different approach, and I'm fine with
> going with Taylor's current approach unless he wants to do that
> refactoring (in which case, great).

Me too. If we were fixing something ugly or error-prone that we expected
to come up in real code, it might be worth it. But it's probably not for
trying to accommodate a one-off test helper.

-Peff
Jeff King <peff@peff.net> writes:

> Me too. If we were fixing something ugly or error-prone that we expected
> to come up in real code, it might be worth it. But it's probably not for
> trying to accommodate a one-off test helper.

I 100% agree that it would not matter all that much within the context
of the project because this is merely a test helper.  It just looked odd
not to have sha1-unsafe as a first-class citizen next to sha1 and
sha256, with an entry for it in the list of enums and the list of hash
algos.  If there is a good reason not to add the "unsafe" variant as a
first-class citizen among algos, the approach the posted patch took is
fine.

Thanks.
On Thu, Nov 07, 2024 at 01:49:35AM +0000, brian m. carlson wrote:
> On 2024-11-07 at 01:39:15, Jeff King wrote:
> > So I think you wouldn't want to allocate an enum or a slot in the
> > hash_algos array, because it is not really an independent algorithm.
> > But I think it _could_ work as a separate struct that the caller derives
> > from the main hash algorithm. For example, imagine that the
> > git_hash_algo struct didn't have unsafe_init_fn, etc, but instead had:
> > 
> >   struct git_hash_algo *unsafe_implementation;
> > 
> > with a matching accessor like:
> > 
> >   struct git_hash_algo *unsafe_hash_algo(struct git_hash_algo *algo)
> >   {
> > 	/* if we have a faster "unsafe" implementation, use that */
> > 	if (algo->unsafe_implementation)
> > 		return algo->unsafe_implementation;
> > 	/* otherwise just use the default one */
> > 	return algo;
> >   }
> > 
> > And then csum-file.c, rather than calling:
> > 
> >   the_hash_algo->unsafe_init_fn(...);
> >   ...
> >   the_hash_algo->unsafe_final_fn(...);
> > 
> > would do this:
> > 
> >   struct git_hash_algo *algo = unsafe_hash_algo(the_hash_algo);
> >   algo->init_fn(...);
> >   ...
> >   algo->final_fn(...);
> > 
> > And likewise this test helper would have a single conditional at the
> > start:
> > 
> >   if (unsafe)
> > 	algo = unsafe_hash_algo(algo);
> > 
> > and the rest of the code would just use that.
> 
> Ah, yes, I think that approach would be simpler and nicer to work with
> than a separate entry in the `hash_algos` array.  We do, however, have
> some places that assume that a `struct git_hash_algo *` points into the
> `hash_algos` array (notably `hash_algo_by_ptr`), so we'd have to adjust
> for that, move the function pointers out into their own struct which
> we'd use for `unsafe_hash_algo`, or be careful never to call the
> relevant functions on our special `git_hash_algo` struct.

I agree that the approach suggested by Peff here is clean and would
button up the test-helper code nicely.
It may be worth doing in the future, but I agree that it doesn't seem
entirely in scope here.

Part of the reason that I did not initially add a separate sha1-unsafe
struct is that while it avoids this awkwardness, it creates new
awkwardness when in a non-SHA-1 repository, where you have to keep track
of whether you want to use the_unsafe_hash_algo or the SHA-256 hash
algo. In that instance, it's not a matter of computing the same result
two different ways, but rather using the wrong hash function entirely.

IOW, the hashfile API would have to realize that when in SHA-256 mode,
it should *not* use the separate (hypothetical) unsafe SHA-1 struct.

> > All that said, I do not think it buys us anything for "real" code. There
> > the decision between safe/unsafe is in the context of how we are using
> > it, and not based on some conditional. So while the code in this test
> > helper is ugly, I don't think we'd ever write anything like that for
> > real. So it may not be worth the effort to refactor into a more virtual
> > object-oriented way.
> 
> Yeah, I don't have a strong opinion one way or the other.  I think, with
> the limitation I mentioned above, it would probably require a decent
> amount of refactoring if we took a different approach, and I'm fine with
> going with Taylor's current approach unless he wants to do that
> refactoring (in which case, great).

I think it does buy you something for real code, which is that you don't
have to remember to consistently call the unsafe_ variants of all of the
various function pointers.

For instance, if you do:

    the_hash_algo->unsafe_init_fn(...);

early on, and then later on by mistake write:

    the_hash_algo->update_fn(...);

then your code is broken and will (as brian said) either in the best
case produce wrong results, or likely segfault.

So in that sense it is worth doing, as it avoids the caller having to
keep track of what "mode" they are using the_hash_algo in. But let's
take it up later, I think.

Thanks,
Taylor
On Thu, Nov 07, 2024 at 04:30:37PM -0500, Taylor Blau wrote:
> > > All that said, I do not think it buys us anything for "real" code. There
> > > the decision between safe/unsafe is in the context of how we are using
> > > it, and not based on some conditional. So while the code in this test
> > > helper is ugly, I don't think we'd ever write anything like that for
> > > real. So it may not be worth the effort to refactor into a more virtual
> > > object-oriented way.
> > 
> > Yeah, I don't have a strong opinion one way or the other.  I think, with
> > the limitation I mentioned above, it would probably require a decent
> > amount of refactoring if we took a different approach, and I'm fine with
> > going with Taylor's current approach unless he wants to do that
> > refactoring (in which case, great).
> 
> I think it does buy you something for real code, which is that you don't
> have to remember to consistently call the unsafe_ variants of all of the
> various function pointers.
> 
> For instance, if you do
> 
>     the_hash_algo->unsafe_init_fn(...);
> 
> early on, and then later on by mistake write:
> 
>     the_hash_algo->update_fn(...);
> 
> Then your code is broken and will (as brian said) either in the best
> case produce wrong results, or likely segfault.
> 
> So in that sense it is worth doing as it avoids the caller from having
> to keep track of what "mode" they are using the_hash_algo in. But let's
> take it up later, I think.

The idea seemed too fun to play with, so I tried my hand at implementing
it and was quite pleased with the result. The WIP-quality patches are
available in my tree under the wip/tb/unsafe-hash-algop[1] branch.
The result, at least as it applies to the test-tool modified here, is
quite pleasing IMHO:

--- 8< ---
Subject: [PATCH] t/helper/test-hash.c: use unsafe_hash_algo()

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index d0ee668df95..aa82638c621 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -9,6 +9,8 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 	int binary = 0;
 	char *buffer;
 	const struct git_hash_algo *algop = &hash_algos[algo];
+	if (unsafe)
+		algop = unsafe_hash_algo(algop);
 
 	if (ac == 2) {
 		if (!strcmp(av[1], "-b"))
@@ -27,10 +29,7 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 		die("OOPS");
 	}
 
-	if (unsafe)
-		algop->unsafe_init_fn(&ctx);
-	else
-		algop->init_fn(&ctx);
+	algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -49,15 +48,9 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 		}
 		if (this_sz == 0)
 			break;
-		if (unsafe)
-			algop->unsafe_update_fn(&ctx, buffer, this_sz);
-		else
-			algop->update_fn(&ctx, buffer, this_sz);
+		algop->update_fn(&ctx, buffer, this_sz);
 	}
-	if (unsafe)
-		algop->unsafe_final_fn(hash, &ctx);
-	else
-		algop->final_fn(hash, &ctx);
+	algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);

base-commit: d8c1fc78b57e02a140b5c363caaa14c2dc2bb274
prerequisite-patch-id: 5df87a6ba8a596bc7834ce8b5a656a3c4f1a869f
prerequisite-patch-id: c3f346e57f98b067eef8adf1e20c70ac23f41c36
prerequisite-patch-id: c5d1ec4f5e9796c5c9232442a3fc3be79379b8c4
prerequisite-patch-id: bdc2d6cbc23c467b24f184a889a5242e5c9fe774
--
2.47.0.238.g87615aecf12
--- >8 ---

To be clear: I think that we should still take this series as-is, and
I'll send the additional half dozen or so patches on top as a new
series dependent on this one.
Thanks,
Taylor

[1]: https://github.com/ttaylorr/git/compare/tb/sha1-unsafe-helper...ttaylorr:git:wip/tb/unsafe-hash-algop
On Thu, Nov 07, 2024 at 04:30:37PM -0500, Taylor Blau wrote:
> > Yeah, I don't have a strong opinion one way or the other.  I think, with
> > the limitation I mentioned above, it would probably require a decent
> > amount of refactoring if we took a different approach, and I'm fine with
> > going with Taylor's current approach unless he wants to do that
> > refactoring (in which case, great).
> 
> I think it does buy you something for real code, which is that you don't
> have to remember to consistently call the unsafe_ variants of all of the
> various function pointers.
> 
> For instance, if you do
> 
>     the_hash_algo->unsafe_init_fn(...);
> 
> early on, and then later on by mistake write:
> 
>     the_hash_algo->update_fn(...);
> 
> Then your code is broken and will (as brian said) either in the best
> case produce wrong results, or likely segfault.

Yes, true. I sort of assume that all of those calls are happening within
one function (or at least a suite of related functions), just because
there's an implicit context of "I am computing the hash for an object"
versus "I am computing a checksum".

And if we ever do move to splitting those further (to have crc32 or
whatever for the checksum), then having a git_hash_algo for that would
seem even weirder.

-Peff
diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index 45d829c908f..d0ee668df95 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -1,7 +1,7 @@
 #include "test-tool.h"
 #include "hex.h"
 
-int cmd_hash_impl(int ac, const char **av, int algo)
+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 {
 	git_hash_ctx ctx;
 	unsigned char hash[GIT_MAX_HEXSZ];
@@ -27,7 +27,10 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 		die("OOPS");
 	}
 
-	algop->init_fn(&ctx);
+	if (unsafe)
+		algop->unsafe_init_fn(&ctx);
+	else
+		algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -46,9 +49,15 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 		}
 		if (this_sz == 0)
 			break;
-		algop->update_fn(&ctx, buffer, this_sz);
+		if (unsafe)
+			algop->unsafe_update_fn(&ctx, buffer, this_sz);
+		else
+			algop->update_fn(&ctx, buffer, this_sz);
 	}
-	algop->final_fn(hash, &ctx);
+	if (unsafe)
+		algop->unsafe_final_fn(hash, &ctx);
+	else
+		algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
diff --git a/t/helper/test-sha1.c b/t/helper/test-sha1.c
index e60d000c039..1c1272cc1f9 100644
--- a/t/helper/test-sha1.c
+++ b/t/helper/test-sha1.c
@@ -3,7 +3,7 @@
 
 int cmd__sha1(int ac, const char **av)
 {
-	return cmd_hash_impl(ac, av, GIT_HASH_SHA1);
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 0);
 }
 
 int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
diff --git a/t/helper/test-sha256.c b/t/helper/test-sha256.c
index 2fb20438f3c..7fd0aa1fcd3 100644
--- a/t/helper/test-sha256.c
+++ b/t/helper/test-sha256.c
@@ -3,5 +3,5 @@
 
 int cmd__sha256(int ac, const char **av)
 {
-	return cmd_hash_impl(ac, av, GIT_HASH_SHA256);
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA256, 0);
 }
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..f3524d9a0f6 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -81,6 +81,6 @@ int cmd__windows_named_pipe(int argc, const char **argv);
 #endif
 int cmd__write_cache(int argc, const char **argv);
 
-int cmd_hash_impl(int ac, const char **av, int algo);
+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe);
 
 #endif
With the new "unsafe" SHA-1 build knob, it would be convenient to have a
test-tool that can exercise Git's unsafe SHA-1 wrappers for testing,
similar to 't/helper/test-tool sha1'.

Prepare for such a helper by altering the implementation of that
test-tool (in cmd_hash_impl(), which is generic and parameterized over
different hash functions) to conditionally run the unsafe variants of
the chosen hash function.

The following commit will add a new test-tool which makes use of this
new parameter.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c   | 17 +++++++++++++----
 t/helper/test-sha1.c   |  2 +-
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.h   |  2 +-
 4 files changed, 16 insertions(+), 7 deletions(-)