diff mbox series

[v2,04/22] reftable/basics: handle allocation failures in `reftable_calloc()`

Message ID f6ad92ffd01c442dacd3ac6aa448891028636636.1727158127.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series reftable: handle allocation errors | expand

Commit Message

Patrick Steinhardt Sept. 24, 2024, 6:32 a.m. UTC
Handle allocation failures in `reftable_calloc()`.

While at it, remove our use of `st_mult()` that would cause us to die on
an overflow. From the caller's point of view there is not much of a
difference between arguments that are too large to be multiplied and a
request that is too big to handle by the allocator: in both cases the
allocation cannot be fulfilled. And in neither of these cases do we want
the reftable library to die.

While we could use `unsigned_mult_overflows()` to handle the overflow
gracefully, we instead open-code it to further our goal of converting
the reftable codebase to become a standalone library that can be reused
by external projects.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Comments

Junio C Hamano Sept. 24, 2024, 4:59 p.m. UTC | #1
Patrick Steinhardt <ps@pks.im> writes:

>  void *reftable_calloc(size_t nelem, size_t elsize)
>  {
> -	size_t sz = st_mult(nelem, elsize);
> -	void *p = reftable_malloc(sz);
> -	memset(p, 0, sz);
> +	void *p;
> +
> +	if (nelem && elsize > SIZE_MAX / nelem)
> +		return NULL;

Now it is open coded, it strikes me that the check is a bit overly
conservative.

If we are trying to allocate slightly than half of SIZE_MAX by
asking elsize==1 and nelem==(SIZE_MAX / 2 + 10), we'd say that
(elsize * nelem) would not fit size_t and fail the allocation.

For the purpose of this caller, it is not a practical issue, as it
is likely that you'd not be able to obtain slightly more than half
your address space out of a single allocation anyway.

But it illustrates why open coding is not necessarily an excellent
idea in the longer term, doesn't it?  When unsigned_mult_overflows()
is updated to avoid such a false positive, how would we remember
that we need to update this copy we?

> +	p = reftable_malloc(nelem * elsize);
> +	if (!p)
> +		return NULL;
> +
> +	memset(p, 0, nelem * elsize);
>  	return p;
>  }
Patrick Steinhardt Sept. 26, 2024, 12:11 p.m. UTC | #2
On Tue, Sep 24, 2024 at 09:59:24AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> >  void *reftable_calloc(size_t nelem, size_t elsize)
> >  {
> > -	size_t sz = st_mult(nelem, elsize);
> > -	void *p = reftable_malloc(sz);
> > -	memset(p, 0, sz);
> > +	void *p;
> > +
> > +	if (nelem && elsize > SIZE_MAX / nelem)
> > +		return NULL;
> 
> Now it is open coded, it strikes me that the check is a bit overly
> conservative.
> 
> If we are trying to allocate slightly than half of SIZE_MAX by
> asking elsize==1 and nelem==(SIZE_MAX / 2 + 10), we'd say that
> (elsize * nelem) would not fit size_t and fail the allocation.
> 
> For the purpose of this caller, it is not a practical issue, as it
> is likely that you'd not be able to obtain slightly more than half
> your address space out of a single allocation anyway.
> 
> But it illustrates why open coding is not necessarily an excellent
> idea in the longer term, doesn't it?  When unsigned_mult_overflows()
> is updated to avoid such a false positive, how would we remember
> that we need to update this copy we?

I agree in general, but with the reftable library I'm stuck between a
rock and a hard place. My goal is to make it fully reusable by other
projects without first having to do surgery on their side. While having
something like `st_mult()` is simple enough to port over, the biggest
problem I have is that over time we sneak in more and more code from the
Git codebase. The result is death by a thousand cuts.

Now if we had a single header that exposes a small set of functions
without _anything_ else it could work alright. But I rather doubt that
we really want to have a standalone header for `st_mult()` that we can
include. But without such a standalone header it is simply too easy to
start building on top of Git features by accident.

So I'm instead aiming to go a different way and fully cut the reftable
code loose from Git. So even if we e.g. eventually want `struct strbuf`
return errors on failure, it would only address one part of my problem.

A couple months ago I said that I'll try to write something like shims
that wrap our types in reftable types such that other projects can
provide implementations for such shims themselves. I tried to make that
work, but the result was an unholy mess that really didn't make any
sense whatsoever. Most of the features that I need from the Git codebase
can be provided in a couple of lines of code (`struct strbuf` is roughly
50 lines for example), plus maybe a callback function that can be
provided to wire things up on the user side (`register_tempfiles()` for
example). So once I saw that those wrappers are harder to use and result
in roughly the same lines of code I decided to scrap that approach and
instead try to convert it fully.

So yeah, overall we shouldn't open-code things like this. But I don't
really see another way to do this for the reftable library.

Patrick
Junio C Hamano Sept. 26, 2024, 4:13 p.m. UTC | #3
Patrick Steinhardt <ps@pks.im> writes:

>> But it illustrates why open coding is not necessarily an excellent
>> idea in the longer term, doesn't it?  When unsigned_mult_overflows()
>> is updated to avoid such a false positive, how would we remember
>> that we need to update this copy we?
>
> I agree in general, but with the reftable library I'm stuck between a
> rock and a hard place. My goal is to make it fully reusable by other
> projects without first having to do surgery on their side. While having
> something like `st_mult()` is simple enough to port over, the biggest
> problem I have is that over time we sneak in more and more code from the
> Git codebase. The result is death by a thousand cuts.

> Now if we had a single header that exposes a small set of functions
> without _anything_ else it could work alright. But I rather doubt that
> we really want to have a standalone header for `st_mult()` that we can
> include. But without such a standalone header it is simply too easy to
> start building on top of Git features by accident.
>
> So I'm instead aiming to go a different way and fully cut the reftable
> code loose from Git. So even if we e.g. eventually want `struct strbuf`
> return errors on failure, it would only address one part of my problem.

The dependency to "strbuf" (just as an example) was added initially
fairly early.  Soon after 27f7ed2a (reftable: add LICENSE,
2021-10-07) added the reftable/ hierarchy, e303bf22 (reftable:
(de)serialization for the polymorphic record type., 2021-10-07).  I
somehow had an impression that reftable "library" started without
any Git dependency and then use of our helper functions seemed
through from the shim layer, but it was pretty much part of the
library from day one, it seems.

> A couple months ago I said that I'll try to write something like shims
> that wrap our types in reftable types such that other projects can
> provide implementations for such shims themselves. I tried to make that
> work, but the result was an unholy mess that really didn't make any
> sense whatsoever. Most of the features that I need from the Git codebase
> can be provided in a couple of lines of code (`struct strbuf` is roughly
> 50 lines for example), plus maybe a callback function that can be
> provided to wire things up on the user side (`register_tempfiles()` for
> example). So once I saw that those wrappers are harder to use and result
> in roughly the same lines of code I decided to scrap that approach and
> instead try to convert it fully.
>
> So yeah, overall we shouldn't open-code things like this. But I don't
> really see another way to do this for the reftable library.

But isn't all of the above what Libification ought to be about?  I
was hoping that the reftable polishing would not have to be done by
you alone, and you would recruit those who are into libification of
other parts of Git codebase to help cleaning up these fringe (from
the point of view of reftable) interfaces.

What do the libification folks feel about this (folks involved in
libgit-sys CC'ed; of course all others who are interested in the
libification topic are welcome to comment)?

Thanks.
Patrick Steinhardt Sept. 27, 2024, 5:28 a.m. UTC | #4
On Thu, Sep 26, 2024 at 09:13:23AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> >> But it illustrates why open coding is not necessarily an excellent
> >> idea in the longer term, doesn't it?  When unsigned_mult_overflows()
> >> is updated to avoid such a false positive, how would we remember
> >> that we need to update this copy we?
> >
> > I agree in general, but with the reftable library I'm stuck between a
> > rock and a hard place. My goal is to make it fully reusable by other
> > projects without first having to do surgery on their side. While having
> > something like `st_mult()` is simple enough to port over, the biggest
> > problem I have is that over time we sneak in more and more code from the
> > Git codebase. The result is death by a thousand cuts.
> 
> > Now if we had a single header that exposes a small set of functions
> > without _anything_ else it could work alright. But I rather doubt that
> > we really want to have a standalone header for `st_mult()` that we can
> > include. But without such a standalone header it is simply too easy to
> > start building on top of Git features by accident.
> >
> > So I'm instead aiming to go a different way and fully cut the reftable
> > code loose from Git. So even if we e.g. eventually want `struct strbuf`
> > return errors on failure, it would only address one part of my problem.
> 
> The dependency to "strbuf" (just as an example) was added initially
> fairly early.  Soon after 27f7ed2a (reftable: add LICENSE,
> 2021-10-07) added the reftable/ hierarchy, e303bf22 (reftable:
> (de)serialization for the polymorphic record type., 2021-10-07).  I
> somehow had an impression that reftable "library" started without
> any Git dependency and then use of our helper functions seemed
> through from the shim layer, but it was pretty much part of the
> library from day one, it seems.

I think the history goes a bit different. Initially, the reftable
library was developed completely outside of the Git tree in [1]. It did
not have any external dependencies and didn't use any of the Git code.

During the upstreaming process the discussion rather quickly swayed into
the direction of making Git the canonical upstream instead of using a
separate repository that hosts the code. See e.g. [2], which was still
tracking a VERSION file to keep track of the upstream version that the
code was pulled from. In [3] the discussion started about upstream
requiring a CLA, which was an (understandable) non-starter. I cannot
find the statement, but eventually we decided that canonical upstream
should be Git, and the VERSION file was never committed.

But with that decision, now that the reftable library was part of Git,
the reviews also started to note that it really should use functions
provided by Git. The initial version of the patch series didn't yet have
any of that. From my point of view that was a big mistake, as the result
is that Git is _not_ reusable by other projects in its current state.

[1]: https://github.com/google/reftable/tree/master/c
[2]: <2106ff286b1135f9428529d9fc392edc127e960c.1580134944.git.gitgitgadget@gmail.com>
[3]: <20200207001612.GF6573@camp.crustytoothpaste.net>

> > A couple months ago I said that I'll try to write something like shims
> > that wrap our types in reftable types such that other projects can
> > provide implementations for such shims themselves. I tried to make that
> > work, but the result was an unholy mess that really didn't make any
> > sense whatsoever. Most of the features that I need from the Git codebase
> > can be provided in a couple of lines of code (`struct strbuf` is roughly
> > 50 lines for example), plus maybe a callback function that can be
> > provided to wire things up on the user side (`register_tempfiles()` for
> > example). So once I saw that those wrappers are harder to use and result
> > in roughly the same lines of code I decided to scrap that approach and
> > instead try to convert it fully.
> >
> > So yeah, overall we shouldn't open-code things like this. But I don't
> > really see another way to do this for the reftable library.
> 
> But isn't all of the above what Libification ought to be about?  I
> was hoping that the reftable polishing would not have to be done by
> you alone, and you would recruit those who are into libification of
> other parts of Git codebase to help cleaning up these fringe (from
> the point of view of reftable) interfaces.
> 
> What do the libification folks feel about this (folks involved in
> libgit-sys CC'ed; of course all others who are interested in the
> libification topic are welcome to comment)?

I think it's orthogonal to that. The libification effort doesn't really
care about exposing low-level details like `st_mult()` as a library, or
even the `strbuf` interface. It cares about making Git functionality
available, which is on a much higher conceptual level.

Furthermore, to the best of my knowledge the intent wasn't ever to
provide the ability to take a couple of C files and have them be easy to
build as part of another project. The goal is to have proper interfaces
to our code, but whether the code itself has internal interdependencies
is not really much of a concern (globals are a different story).

My goal is different: I want to `cp -r` the "reftable/" directory into
libgit2 or any other project that wants to implement a reftable backend
without having to care about all the different dependencies that it may
pull in or having to carefully massage all of the includes.

Now if it was a single standalone file that one would have to copy in
addition to that I'd be fine. But "strbuf.c' isn't that at all: it ropes
in "gettext.h", "hex-ll.h", "string-list.h", "utf-8.h", "date.h". All of
which is not needed at all, but the person who does the porting now has
to carefully figure out what can and cannot be removed from both
"strbuf.h" and "strbuf.c".

So... it gets out of hand really fast. We would have to significantly
change the way we develop the Git codebase if we wanted to make this a
good experience. But enforcing very strict coding guidelines onto all of
our codebase doesn't feel reasonable to me, which is why I am instead
trying to reinstate the state where the reftable library was decoupled
from the rest of the Git codebase.

The difference here is roughly 100 to 200 lines of code, which I don't
think is much of a maintenance burden by itself. In fact, we'll even
start to share the maintenance burden with libgit2, because they would
be able to use the exact same copy as we do and thus contribute back
bugfixes and improvements for discoveries that their additional test
coverage brings us.

I hope that this all clarifies my point of view :) I've also Cc'd
Han-Wen in case I've made any mistakes regarding the original intent.

Patrick
Han-Wen Nienhuys Sept. 27, 2024, 12:21 p.m. UTC | #5
On Fri, Sep 27, 2024 at 11:31 AM Patrick Steinhardt <ps@pks.im> wrote:
> > The dependency to "strbuf" (just as an example) was added initially
> > fairly early.  Soon after 27f7ed2a (reftable: add LICENSE,
> > 2021-10-07) added the reftable/ hierarchy, e303bf22 (reftable:
> > (de)serialization for the polymorphic record type., 2021-10-07).  I

For historical understanding, these commits are meaningless because
they were a bulk import. You should look at
https://github.com/hanwen/reftable instead.

> > somehow had an impression that reftable "library" started without
> > any Git dependency and then use of our helper functions seemed
> > through from the shim layer, but it was pretty much part of the
> > library from day one, it seems.
>
> I think the history goes a bit different. Initially, the reftable
> library was developed completely outside of the Git tree in [1]. It did
> not have any external dependencies and didn't use any of the Git code.

Correct. strbuf was originally called 'slice', because it was a direct
translation of the []byte type in Go. I renamed it to avoid offending
the sensibilities of Git reviewers. See here
https://github.com/hanwen/reftable/commit/06d9728b4fedb9ed996ac1ae48343e3879114e1b
. It was a change that made me a little sad, because the strbuf type
only works with explicit initialization (STRBUF_INIT),
which means that every struct that transitively includes a strbuf must
now have a XXX_INIT macro. It just required 12 strbuf functions, but I
suppose more have crept in over time.

> The difference here is roughly 100 to 200 lines of code, which I don't
> think is much of a maintenance burden by itself. In fact, we'll even

I think Patrick's plan is sound. The amount of code to make reftable
work both standalone and in Git is small. It just needs some
discipline on both the Git and libgit2 side to not add additional
dependencies.
Junio C Hamano Sept. 27, 2024, 3:21 p.m. UTC | #6
Han-Wen Nienhuys <hanwenn@gmail.com> writes:

> I think Patrick's plan is sound. The amount of code to make reftable
> work both standalone and in Git is small. It just needs some
> discipline on both the Git and libgit2 side to not add additional
> dependencies.

Thanks for a sanity check.  Very much appreciated.
diff mbox series

Patch

diff --git a/reftable/basics.c b/reftable/basics.c
index 4adc98cf5de..3350bbffa23 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -37,9 +37,16 @@  void reftable_free(void *p)
 
 void *reftable_calloc(size_t nelem, size_t elsize)
 {
-	size_t sz = st_mult(nelem, elsize);
-	void *p = reftable_malloc(sz);
-	memset(p, 0, sz);
+	void *p;
+
+	if (nelem && elsize > SIZE_MAX / nelem)
+		return NULL;
+
+	p = reftable_malloc(nelem * elsize);
+	if (!p)
+		return NULL;
+
+	memset(p, 0, nelem * elsize);
 	return p;
 }