Message ID | f6ad92ffd01c442dacd3ac6aa448891028636636.1727158127.git.ps@pks.im (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | reftable: handle allocation errors | expand |
Patrick Steinhardt <ps@pks.im> writes: > void *reftable_calloc(size_t nelem, size_t elsize) > { > - size_t sz = st_mult(nelem, elsize); > - void *p = reftable_malloc(sz); > - memset(p, 0, sz); > + void *p; > + > + if (nelem && elsize > SIZE_MAX / nelem) > + return NULL; Now it is open coded, it strikes me that the check is a bit overly conservative. If we are trying to allocate slightly than half of SIZE_MAX by asking elsize==1 and nelem==(SIZE_MAX / 2 + 10), we'd say that (elsize * nelem) would not fit size_t and fail the allocation. For the purpose of this caller, it is not a practical issue, as it is likely that you'd not be able to obtain slightly more than half your address space out of a single allocation anyway. But it illustrates why open coding is not necessarily an excellent idea in the longer term, doesn't it? When unsigned_mult_overflows() is updated to avoid such a false positive, how would we remember that we need to update this copy we? > + p = reftable_malloc(nelem * elsize); > + if (!p) > + return NULL; > + > + memset(p, 0, nelem * elsize); > return p; > }
On Tue, Sep 24, 2024 at 09:59:24AM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > void *reftable_calloc(size_t nelem, size_t elsize) > > { > > - size_t sz = st_mult(nelem, elsize); > > - void *p = reftable_malloc(sz); > > - memset(p, 0, sz); > > + void *p; > > + > > + if (nelem && elsize > SIZE_MAX / nelem) > > + return NULL; > > Now it is open coded, it strikes me that the check is a bit overly > conservative. > > If we are trying to allocate slightly than half of SIZE_MAX by > asking elsize==1 and nelem==(SIZE_MAX / 2 + 10), we'd say that > (elsize * nelem) would not fit size_t and fail the allocation. > > For the purpose of this caller, it is not a practical issue, as it > is likely that you'd not be able to obtain slightly more than half > your address space out of a single allocation anyway. > > But it illustrates why open coding is not necessarily an excellent > idea in the longer term, doesn't it? When unsigned_mult_overflows() > is updated to avoid such a false positive, how would we remember > that we need to update this copy we? I agree in general, but with the reftable library I'm stuck between a rock and a hard place. My goal is to make it fully reusable by other projects without first having to do surgery on their side. While having something like `st_mult()` is simple enough to port over, the biggest problem I have is that over time we sneak in more and more code from the Git codebase. The result is death by a thousand cuts. Now if we had a single header that exposes a small set of functions without _anything_ else it could work alright. But I rather doubt that we really want to have a standalone header for `st_mult()` that we can include. But without such a standalone header it is simply too easy to start building on top of Git features by accident. So I'm instead aiming to go a different way and fully cut the reftable code loose from Git. So even if we e.g. eventually want `struct strbuf` return errors on failure, it would only address one part of my problem. A couple months ago I said that I'll try to write something like shims that wrap our types in reftable types such that other projects can provide implementations for such shims themselves. I tried to make that work, but the result was an unholy mess that really didn't make any sense whatsoever. Most of the features that I need from the Git codebase can be provided in a couple of lines of code (`struct strbuf` is roughly 50 lines for example), plus maybe a callback function that can be provided to wire things up on the user side (`register_tempfiles()` for example). So once I saw that those wrappers are harder to use and result in roughly the same lines of code I decided to scrap that approach and instead try to convert it fully. So yeah, overall we shouldn't open-code things like this. But I don't really see another way to do this for the reftable library. Patrick
Patrick Steinhardt <ps@pks.im> writes: >> But it illustrates why open coding is not necessarily an excellent >> idea in the longer term, doesn't it? When unsigned_mult_overflows() >> is updated to avoid such a false positive, how would we remember >> that we need to update this copy we? > > I agree in general, but with the reftable library I'm stuck between a > rock and a hard place. My goal is to make it fully reusable by other > projects without first having to do surgery on their side. While having > something like `st_mult()` is simple enough to port over, the biggest > problem I have is that over time we sneak in more and more code from the > Git codebase. The result is death by a thousand cuts. > Now if we had a single header that exposes a small set of functions > without _anything_ else it could work alright. But I rather doubt that > we really want to have a standalone header for `st_mult()` that we can > include. But without such a standalone header it is simply too easy to > start building on top of Git features by accident. > > So I'm instead aiming to go a different way and fully cut the reftable > code loose from Git. So even if we e.g. eventually want `struct strbuf` > return errors on failure, it would only address one part of my problem. The dependency to "strbuf" (just as an example) was added initially fairly early. Soon after 27f7ed2a (reftable: add LICENSE, 2021-10-07) added the reftable/ hierarchy, e303bf22 (reftable: (de)serialization for the polymorphic record type., 2021-10-07). I somehow had an impression that reftable "library" started without any Git dependency and then use of our helper functions seemed through from the shim layer, but it was pretty much part of the library from day one, it seems. > A couple months ago I said that I'll try to write something like shims > that wrap our types in reftable types such that other projects can > provide implementations for such shims themselves. I tried to make that > work, but the result was an unholy mess that really didn't make any > sense whatsoever. Most of the features that I need from the Git codebase > can be provided in a couple of lines of code (`struct strbuf` is roughly > 50 lines for example), plus maybe a callback function that can be > provided to wire things up on the user side (`register_tempfiles()` for > example). So once I saw that those wrappers are harder to use and result > in roughly the same lines of code I decided to scrap that approach and > instead try to convert it fully. > > So yeah, overall we shouldn't open-code things like this. But I don't > really see another way to do this for the reftable library. But isn't all of the above what Libification ought to be about? I was hoping that the reftable polishing would not have to be done by you alone, and you would recruit those who are into libification of other parts of Git codebase to help cleaning up these fringe (from the point of view of reftable) interfaces. What do the libification folks feel about this (folks involved in libgit-sys CC'ed; of course all others who are interested in the libification topic are welcome to comment)? Thanks.
On Thu, Sep 26, 2024 at 09:13:23AM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > >> But it illustrates why open coding is not necessarily an excellent > >> idea in the longer term, doesn't it? When unsigned_mult_overflows() > >> is updated to avoid such a false positive, how would we remember > >> that we need to update this copy we? > > > > I agree in general, but with the reftable library I'm stuck between a > > rock and a hard place. My goal is to make it fully reusable by other > > projects without first having to do surgery on their side. While having > > something like `st_mult()` is simple enough to port over, the biggest > > problem I have is that over time we sneak in more and more code from the > > Git codebase. The result is death by a thousand cuts. > > > Now if we had a single header that exposes a small set of functions > > without _anything_ else it could work alright. But I rather doubt that > > we really want to have a standalone header for `st_mult()` that we can > > include. But without such a standalone header it is simply too easy to > > start building on top of Git features by accident. > > > > So I'm instead aiming to go a different way and fully cut the reftable > > code loose from Git. So even if we e.g. eventually want `struct strbuf` > > return errors on failure, it would only address one part of my problem. > > The dependency to "strbuf" (just as an example) was added initially > fairly early. Soon after 27f7ed2a (reftable: add LICENSE, > 2021-10-07) added the reftable/ hierarchy, e303bf22 (reftable: > (de)serialization for the polymorphic record type., 2021-10-07). I > somehow had an impression that reftable "library" started without > any Git dependency and then use of our helper functions seemed > through from the shim layer, but it was pretty much part of the > library from day one, it seems. I think the history goes a bit different. Initially, the reftable library was developed completely outside of the Git tree in [1]. It did not have any external dependencies and didn't use any of the Git code. During the upstreaming process the discussion rather quickly swayed into the direction of making Git the canonical upstream instead of using a separate repository that hosts the code. See e.g. [2], which was still tracking a VERSION file to keep track of the upstream version that the code was pulled from. In [3] the discussion started about upstream requiring a CLA, which was an (understandable) non-starter. I cannot find the statement, but eventually we decided that canonical upstream should be Git, and the VERSION file was never committed. But with that decision, now that the reftable library was part of Git, the reviews also started to note that it really should use functions provided by Git. The initial version of the patch series didn't yet have any of that. From my point of view that was a big mistake, as the result is that Git is _not_ reusable by other projects in its current state. [1]: https://github.com/google/reftable/tree/master/c [2]: <2106ff286b1135f9428529d9fc392edc127e960c.1580134944.git.gitgitgadget@gmail.com> [3]: <20200207001612.GF6573@camp.crustytoothpaste.net> > > A couple months ago I said that I'll try to write something like shims > > that wrap our types in reftable types such that other projects can > > provide implementations for such shims themselves. I tried to make that > > work, but the result was an unholy mess that really didn't make any > > sense whatsoever. Most of the features that I need from the Git codebase > > can be provided in a couple of lines of code (`struct strbuf` is roughly > > 50 lines for example), plus maybe a callback function that can be > > provided to wire things up on the user side (`register_tempfiles()` for > > example). So once I saw that those wrappers are harder to use and result > > in roughly the same lines of code I decided to scrap that approach and > > instead try to convert it fully. > > > > So yeah, overall we shouldn't open-code things like this. But I don't > > really see another way to do this for the reftable library. > > But isn't all of the above what Libification ought to be about? I > was hoping that the reftable polishing would not have to be done by > you alone, and you would recruit those who are into libification of > other parts of Git codebase to help cleaning up these fringe (from > the point of view of reftable) interfaces. > > What do the libification folks feel about this (folks involved in > libgit-sys CC'ed; of course all others who are interested in the > libification topic are welcome to comment)? I think it's orthogonal to that. The libification effort doesn't really care about exposing low-level details like `st_mult()` as a library, or even the `strbuf` interface. It cares about making Git functionality available, which is on a much higher conceptual level. Furthermore, to the best of my knowledge the intent wasn't ever to provide the ability to take a couple of C files and have them be easy to build as part of another project. The goal is to have proper interfaces to our code, but whether the code itself has internal interdependencies is not really much of a concern (globals are a different story). My goal is different: I want to `cp -r` the "reftable/" directory into libgit2 or any other project that wants to implement a reftable backend without having to care about all the different dependencies that it may pull in or having to carefully massage all of the includes. Now if it was a single standalone file that one would have to copy in addition to that I'd be fine. But "strbuf.c' isn't that at all: it ropes in "gettext.h", "hex-ll.h", "string-list.h", "utf-8.h", "date.h". All of which is not needed at all, but the person who does the porting now has to carefully figure out what can and cannot be removed from both "strbuf.h" and "strbuf.c". So... it gets out of hand really fast. We would have to significantly change the way we develop the Git codebase if we wanted to make this a good experience. But enforcing very strict coding guidelines onto all of our codebase doesn't feel reasonable to me, which is why I am instead trying to reinstate the state where the reftable library was decoupled from the rest of the Git codebase. The difference here is roughly 100 to 200 lines of code, which I don't think is much of a maintenance burden by itself. In fact, we'll even start to share the maintenance burden with libgit2, because they would be able to use the exact same copy as we do and thus contribute back bugfixes and improvements for discoveries that their additional test coverage brings us. I hope that this all clarifies my point of view :) I've also Cc'd Han-Wen in case I've made any mistakes regarding the original intent. Patrick
On Fri, Sep 27, 2024 at 11:31 AM Patrick Steinhardt <ps@pks.im> wrote: > > The dependency to "strbuf" (just as an example) was added initially > > fairly early. Soon after 27f7ed2a (reftable: add LICENSE, > > 2021-10-07) added the reftable/ hierarchy, e303bf22 (reftable: > > (de)serialization for the polymorphic record type., 2021-10-07). I For historical understanding, these commits are meaningless because they were a bulk import. You should look at https://github.com/hanwen/reftable instead. > > somehow had an impression that reftable "library" started without > > any Git dependency and then use of our helper functions seemed > > through from the shim layer, but it was pretty much part of the > > library from day one, it seems. > > I think the history goes a bit different. Initially, the reftable > library was developed completely outside of the Git tree in [1]. It did > not have any external dependencies and didn't use any of the Git code. Correct. strbuf was originally called 'slice', because it was a direct translation of the []byte type in Go. I renamed it to avoid offending the sensibilities of Git reviewers. See here https://github.com/hanwen/reftable/commit/06d9728b4fedb9ed996ac1ae48343e3879114e1b . It was a change that made me a little sad, because the strbuf type only works with explicit initialization (STRBUF_INIT), which means that every struct that transitively includes a strbuf must now have a XXX_INIT macro. It just required 12 strbuf functions, but I suppose more have crept in over time. > The difference here is roughly 100 to 200 lines of code, which I don't > think is much of a maintenance burden by itself. In fact, we'll even I think Patrick's plan is sound. The amount of code to make reftable work both standalone and in Git is small. It just needs some discipline on both the Git and libgit2 side to not add additional dependencies.
Han-Wen Nienhuys <hanwenn@gmail.com> writes: > I think Patrick's plan is sound. The amount of code to make reftable > work both standalone and in Git is small. It just needs some > discipline on both the Git and libgit2 side to not add additional > dependencies. Thanks for a sanity check. Very much appreciated.
diff --git a/reftable/basics.c b/reftable/basics.c index 4adc98cf5de..3350bbffa23 100644 --- a/reftable/basics.c +++ b/reftable/basics.c @@ -37,9 +37,16 @@ void reftable_free(void *p) void *reftable_calloc(size_t nelem, size_t elsize) { - size_t sz = st_mult(nelem, elsize); - void *p = reftable_malloc(sz); - memset(p, 0, sz); + void *p; + + if (nelem && elsize > SIZE_MAX / nelem) + return NULL; + + p = reftable_malloc(nelem * elsize); + if (!p) + return NULL; + + memset(p, 0, nelem * elsize); return p; }
Handle allocation failures in `reftable_calloc()`. While at it, remove our use of `st_mult()` that would cause us to die on an overflow. From the caller's point of view there is not much of a difference between arguments that are too large to be multiplied and a request that is too big to handle by the allocator: in both cases the allocation cannot be fulfilled. And in neither of these cases do we want the reftable library to die. While we could use `unsigned_mult_overflows()` to handle the overflow gracefully, we instead open-code it to further our goal of converting the reftable codebase to become a standalone library that can be reused by external projects. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- reftable/basics.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-)