git-compat-util.h: introduce CALLOC(x)

Message ID	6694c52b38674859eb0390c7f62da1209a8d8ec3.1670266373.git.me@ttaylorr.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Mon, 5 Dec 2022 13:54:38 -0500 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org, git@vger.kernel.org Cc: Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com> Subject: [PATCH] git-compat-util.h: introduce CALLOC(x) Message-ID: <6694c52b38674859eb0390c7f62da1209a8d8ec3.1670266373.git.me@ttaylorr.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Precedence: bulk
Series	git-compat-util.h: introduce CALLOC(x) \| expand git-compat-util.h: introduce CALLOC(x)

Taylor Blau Dec. 5, 2022, 6:54 p.m. UTC

When zero-initializing a struct without the use of static initializers
like "{ 0 }", it is common to write one of:

    T *ptr = xcalloc(1, sizeof(T));
    T *ptr = xcalloc(1, sizeof(*ptr));

These correctly initialize "*ptr" to the all-zeros value, but have a
couple of drawbacks. Notably, both initializations are verbose, but the
former is a foot-gun. If "*ptr"'s type changes to something other than
"T", the programmer has to remember to update not only the type of the
variable, but the right-hand side of its initialization, too.

In git.git, it is sometimes common to write something like:

    T *ptr;
    CALLOC_ARRAY(ptr, 1);

...but that is confusing, since we're not initializing an array. Rather,
we're using the CALLOC_ARRAY() macro to pretend like we are. But since
the array length is 1, the effect is zero initializing a single element.

Introduce a new variant, CALLOC(x), which initializes "x" to the
all-zeros value, without exposing the confusing use of CALLOC_ARRAY(),
or the foot-guns available when using xcalloc() directly.

Write a Coccinelle patch which codifies these rules, but mark it as
"pending" since the resulting diff is too large to include in a single
patch:

    $ git apply .build/contrib/coccinelle/xcalloc.pending.cocci.patch
    $ git diff --shortstat
     89 files changed, 221 insertions(+), 178 deletions(-)

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
This is a follow-up on [1], where introducing CALLOC(x) as the preferred
alternative to CALLOC_ARRAY(x, 1) was first suggested.

The patch is straightforward, and I am pretty sure that the Coccinelle
rules are right, too ;-). But as a first-time Coccinelle user, any extra
scrutiny on those bits would be appreciated.

The main point of discussion I think is whether we should consider
adopting this rule. I am biased, of course, but I think that we should.

In any case, we should focus our efforts on polishing v2.39.0, but I
wanted to send this out to the list before I forgot about it.

[1]: https://lore.kernel.org/git/Y1MrXoobkVKngYL1@coredump.intra.peff.net/

 contrib/coccinelle/xcalloc.pending.cocci | 21 +++++++++++++++++++++
 git-compat-util.h                        |  8 ++++++++
 2 files changed, 29 insertions(+)
 create mode 100644 contrib/coccinelle/xcalloc.pending.cocci

--
2.38.0.16.g393fd4c6db

René Scharfe Dec. 5, 2022, 9:01 p.m. UTC | #1

Am 05.12.22 um 19:54 schrieb Taylor Blau:
> When zero-initializing a struct without the use of static initializers
> like "{ 0 }", it is common to write one of:
>
>     T *ptr = xcalloc(1, sizeof(T));
>     T *ptr = xcalloc(1, sizeof(*ptr));
>
> These correctly initialize "*ptr" to the all-zeros value, but have a
> couple of drawbacks. Notably, both initializations are verbose, but the
> former is a foot-gun. If "*ptr"'s type changes to something other than
> "T", the programmer has to remember to update not only the type of the
> variable, but the right-hand side of its initialization, too.
>
> In git.git, it is sometimes common to write something like:
>
>     T *ptr;
>     CALLOC_ARRAY(ptr, 1);
>
> ...but that is confusing, since we're not initializing an array. Rather,
> we're using the CALLOC_ARRAY() macro to pretend like we are. But since
> the array length is 1, the effect is zero initializing a single element.

An object and a single-element array of objects allocated on the heap are
indistinguishable.  In that sense there is no confusion -- we are indeed
allocating a single-element array.  But if the intent is to only get one
thing then having to fill in 1 in the bulk order form is a bit tedious,
especially since this is the most common kind of request.  A shortcut for
the frequent case makes sense.

> Introduce a new variant, CALLOC(x), which initializes "x" to the
> all-zeros value, without exposing the confusing use of CALLOC_ARRAY(),
> or the foot-guns available when using xcalloc() directly.

AFAIK the first "c" in "calloc" is for "continuous", not "zeroed".  A
single object is always contiguous, so the "C" in "CALLOC" is
tautologic if read in that way.  It fits our naming scheme for
ALLOC_ARRAY and CALLOC_ARRAY, though, so that might not be much of a
problem.

And there are lots of occurrences of xmalloc(sizeof(T)) and
xmalloc(sizeof(*ptr)) that would benefit from the automatic size
inference of ALLOC_ARRAY -- an ALLOC macro would complement
CALLOC nicely.

> Write a Coccinelle patch which codifies these rules, but mark it as
> "pending" since the resulting diff is too large to include in a single
> patch:
>
>     $ git apply .build/contrib/coccinelle/xcalloc.pending.cocci.patch
>     $ git diff --shortstat
>      89 files changed, 221 insertions(+), 178 deletions(-)
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> This is a follow-up on [1], where introducing CALLOC(x) as the preferred
> alternative to CALLOC_ARRAY(x, 1) was first suggested.
>
> The patch is straightforward, and I am pretty sure that the Coccinelle
> rules are right, too ;-). But as a first-time Coccinelle user, any extra
> scrutiny on those bits would be appreciated.
>
> The main point of discussion I think is whether we should consider
> adopting this rule. I am biased, of course, but I think that we should.
>
> In any case, we should focus our efforts on polishing v2.39.0, but I
> wanted to send this out to the list before I forgot about it.
>
> [1]: https://lore.kernel.org/git/Y1MrXoobkVKngYL1@coredump.intra.peff.net/
>
>  contrib/coccinelle/xcalloc.pending.cocci | 21 +++++++++++++++++++++
>  git-compat-util.h                        |  8 ++++++++
>  2 files changed, 29 insertions(+)
>  create mode 100644 contrib/coccinelle/xcalloc.pending.cocci
>
> diff --git a/contrib/coccinelle/xcalloc.pending.cocci b/contrib/coccinelle/xcalloc.pending.cocci
> new file mode 100644
> index 0000000000..83e4ca1a68
> --- /dev/null
> +++ b/contrib/coccinelle/xcalloc.pending.cocci
> @@ -0,0 +1,21 @@
> +@@
> +type T;
> +T *ptr;
> +@@
> +- ptr = xcalloc(1, \( sizeof(T) \| sizeof(*ptr) \) )
> ++ CALLOC(ptr)
> +
> +@@
> +type T;
> +identifier ptr;
> +@@
> +- T ptr = xcalloc(1, \( sizeof(T) \| sizeof(*ptr) \) );
> ++ T ptr;
> ++ CALLOC(ptr);

This rule would turn this code:

	struct foo *bar = xcalloc(1, sizeof(*bar));
	int i;

... into:

	struct foo *bar;
	CALLOC(bar);
	int i;

... which violates the coding guideline to not mix declarations and
statements (-Wdeclaration-after-statement).

> +
> +@@
> +type T;
> +T *ptr;
> +@@
> +- CALLOC_ARRAY(ptr, 1)
> ++ CALLOC(ptr)
> diff --git a/git-compat-util.h b/git-compat-util.h
> index a76d0526f7..827e5be89c 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -1107,6 +1107,14 @@ static inline void move_array(void *dst, const void *src, size_t n, size_t size)
>  		memmove(dst, src, st_mult(size, n));
>  }
>
> +/*
> + * Like CALLOC_ARRAY, but the argument is treated as a pointer to a
> + * single struct.
> + *
> + * Preferable over xcalloc(1, sizeof(...)), or CALLOC_ARRAY(..., 1).
> + */
> +#define CALLOC(x) (x) = CALLOC_ARRAY((x), 1)
> +
>  /*
>   * These functions help you allocate structs with flex arrays, and copy
>   * the data directly into the array. For example, if you had:
> --
> 2.38.0.16.g393fd4c6db

Taylor Blau Dec. 5, 2022, 10:36 p.m. UTC | #2

On Mon, Dec 05, 2022 at 10:01:11PM +0100, René Scharfe wrote:
> This rule would turn this code:
>
> 	struct foo *bar = xcalloc(1, sizeof(*bar));
> 	int i;
>
> ... into:
>
> 	struct foo *bar;
> 	CALLOC(bar);
> 	int i;
>
> ... which violates the coding guideline to not mix declarations and
> statements (-Wdeclaration-after-statement).

Yeah, I was wondering about this myself when I wrote this part of the
Coccinelle patch.

Is there an intelligent way to tell it to put the first statement after
all declarations? I couldn't find anything after a quick scan of the
documentation nor our own patches.

Thanks,
Taylor

Ævar Arnfjörð Bjarmason Dec. 5, 2022, 11:12 p.m. UTC | #3

On Mon, Dec 05 2022, Taylor Blau wrote:

> On Mon, Dec 05, 2022 at 10:01:11PM +0100, René Scharfe wrote:
>> This rule would turn this code:
>>
>> 	struct foo *bar = xcalloc(1, sizeof(*bar));
>> 	int i;
>>
>> ... into:
>>
>> 	struct foo *bar;
>> 	CALLOC(bar);
>> 	int i;
>>
>> ... which violates the coding guideline to not mix declarations and
>> statements (-Wdeclaration-after-statement).
>
> Yeah, I was wondering about this myself when I wrote this part of the
> Coccinelle patch.
>
> Is there an intelligent way to tell it to put the first statement after
> all declarations? I couldn't find anything after a quick scan of the
> documentation nor our own patches.

We can get that working, but I think it's good to establish some ground
rules first:

First, when you add *.pending.cocci rules they shouldn't be pseudocode,
but things that are too large to apply right now. I think my recent
041df69edd3 (Merge branch 'ab/fewer-the-index-macros', 2022-11-28) is a
good example.

In this case if I apply the proposed patch and do:

	make contrib/coccinelle/xcalloc.pending.cocci.patch &&
	patch -p1 <contrib/coccinelle/xcalloc.pending.cocci.patch &&
	make

We have a deluge of syntax errors, not just the warnings René
notes. This needs to be fixed, the *.pending.cocci must actually work on
our code.

(We should probably modify the "static-analysis" to actually apply the
result of these *.pending patches and compile with those, but until then
we can manually test them).

Second, we have test support for rules now, see
contrib/coccinelle/tests/. You just need to create a
contrib/coccinelle/tests/xcalloc.pending.c, and have the expected
post-image in contrib/coccinelle/tests/xcalloc.pending.res. Please add
one of those. We don't have them for existing rules, but it really helps
to assert & review new rules.

The various edge cases that your current *.cocci doesn't compile on
etc. should go into that test file.

Third, the resulting patch is currently ~2k lines. Can we really not
make it non-pending with some whitelisting/gblacklisting. E.g. see this
out-of-tree patch for an example of opting out certain functions:
https://github.com/avar/git/commit/bedd0323e9b

I think that's preferable to having some *.pending.cocci we may or may
not get to. I think you can also probably write a working rule for this
to incrementally target certain subsets of these, and we could just
mass-convert some stuff already (if it doesn't conflict with in-flight
etc.).

Fourth, I must say I'm kind of negative on the proposed change. I.e. the
foot-gun is definitely worth solving, but why not just create a
coccinelle rule to narrowly address that? In that case we can presumably
start with it non-pending, as we don't have that many of them.

On the notion that we need to special-case:

	- CALLOC_ARRAY(ptr, 1)
	+ CALLOC(ptr)

Because an array of one is "not an array" I don't really buy it. The
calloc() interface itself exposes that, so I don't see why we'd consider
those separate on the API level for these wrapper macros.

I.e. I think the point of the macro should be to provide the appropiate
sizeof() for you, not to be opinionated that nmemb=1 should be
special-cased.I think it's probably not worth the churn to make that
transformation.

But if we *are* doing that then surely we should provide the full set of
functions. I.e. ALLOC() and ALLOC_ARRAY(), CALLOC() and CALLOC_ARRAY(),
and REALLOC() and REALLOC_ARRAY()?

Junio C Hamano Dec. 5, 2022, 11:57 p.m. UTC | #4

Taylor Blau <me@ttaylorr.com> writes:

> In git.git, it is sometimes common to write something like:
>
>     T *ptr;
>     CALLOC_ARRAY(ptr, 1);
>
> ...but that is confusing, since we're not initializing an array.

Given that "man calloc" tells us that calloc takes two parameters,

    void *calloc(size_t nmemb, size_t size);

I personally find CALLOC() that takes only a single parameter and is
capable only to allocate a single element array very much confusing.

It _might_ be arguable that the order of the parameters CALLOC_ARRAY
takes should have been reversed in that the number of elements in
the array should come first just like in calloc(), while the pointer
that is used to infer the size of an array element should come next,
but that is water under the bridge.

Taylor Blau Dec. 6, 2022, 12:29 a.m. UTC | #5

On Tue, Dec 06, 2022 at 08:57:21AM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > In git.git, it is sometimes common to write something like:
> >
> >     T *ptr;
> >     CALLOC_ARRAY(ptr, 1);
> >
> > ...but that is confusing, since we're not initializing an array.
>
> Given that "man calloc" tells us that calloc takes two parameters,
>
>     void *calloc(size_t nmemb, size_t size);
>
> I personally find CALLOC() that takes only a single parameter and is
> capable only to allocate a single element array very much confusing.

Hmm. I have always considered "calloc" a mental shorthand for "zero
initialize some bytes on the heap". It seemed like you were in favor of
such a change in:

    https://lore.kernel.org/git/xmqq8rl8ivlb.fsf@gitster.g/

...but it's entirely possible that I misread your message, in which case
I would not be sad if you dropped this patch on the floor since I don't
feel that strongly about it.

I'd be fine to call it CALLOC_ONE() or something, but I'm not sure at
that point if it's significantly better to write "CALLOC_ONE(x)" versus
"CALLOC_ARRAY(foo, 1)"

> It _might_ be arguable that the order of the parameters CALLOC_ARRAY
> takes should have been reversed in that the number of elements in
> the array should come first just like in calloc(), while the pointer
> that is used to infer the size of an array element should come next,
> but that is water under the bridge.

Yes, I agree that that would be better. But it would be frustrating to
make a tree-wide change of that magnitude at this point. So I agree it's
water under the bridge ;-).

Thanks,
Taylor

Jeff King Dec. 6, 2022, 1:21 a.m. UTC | #6

On Mon, Dec 05, 2022 at 07:29:22PM -0500, Taylor Blau wrote:

> On Tue, Dec 06, 2022 at 08:57:21AM +0900, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > In git.git, it is sometimes common to write something like:
> > >
> > >     T *ptr;
> > >     CALLOC_ARRAY(ptr, 1);
> > >
> > > ...but that is confusing, since we're not initializing an array.
> >
> > Given that "man calloc" tells us that calloc takes two parameters,
> >
> >     void *calloc(size_t nmemb, size_t size);
> >
> > I personally find CALLOC() that takes only a single parameter and is
> > capable only to allocate a single element array very much confusing.
> 
> Hmm. I have always considered "calloc" a mental shorthand for "zero
> initialize some bytes on the heap". It seemed like you were in favor of
> such a change in:
> 
>     https://lore.kernel.org/git/xmqq8rl8ivlb.fsf@gitster.g/
> 
> ...but it's entirely possible that I misread your message, in which case
> I would not be sad if you dropped this patch on the floor since I don't
> feel that strongly about it.

I think the weird thing about calloc is that it does _two_ things
compared to malloc: zero-ing, and doing an implicit nmemb*size
multiplication.

So you can think of "allocate this one element and zero it" as "calloc,
but don't multiply" or as "malloc, but zero". Naming it CALLOC() is
thinking of it as the former. If we think of it as the latter it could
perhaps be MALLOCZ() or something. I don't know if that name is too
subtle or not. We have xmemdupz(), which is basically the same thing;
it's only a zero-terminator, but that is because we are writing non-zero
bytes in the rest of it. Mostly I'd worry that it is easy to glance past
the "Z".

-Peff

Junio C Hamano Dec. 6, 2022, 1:35 a.m. UTC | #7

Jeff King <peff@peff.net> writes:

> So you can think of "allocate this one element and zero it" as "calloc,
> but don't multiply" or as "malloc, but zero". Naming it CALLOC() is
> thinking of it as the former. If we think of it as the latter it could
> perhaps be MALLOCZ() or something. I don't know if that name is too
> subtle or not. We have xmemdupz(), which is basically the same thing;
> it's only a zero-terminator, but that is because we are writing non-zero
> bytes in the rest of it. Mostly I'd worry that it is easy to glance past
> the "Z".

I think the name for the former would be CALLOC_ONE(), as I would
rephrase it as "calloc, but just one element".  I agree MALLOCZ()
would be fine for the other interpretation, and I do not have much
problem as much problem with the name as calling it CALLOC().

Thanks.

Jeff King Dec. 6, 2022, 1:43 a.m. UTC | #8

On Mon, Dec 05, 2022 at 05:36:25PM -0500, Taylor Blau wrote:

> On Mon, Dec 05, 2022 at 10:01:11PM +0100, René Scharfe wrote:
> > This rule would turn this code:
> >
> > 	struct foo *bar = xcalloc(1, sizeof(*bar));
> > 	int i;
> >
> > ... into:
> >
> > 	struct foo *bar;
> > 	CALLOC(bar);
> > 	int i;
> >
> > ... which violates the coding guideline to not mix declarations and
> > statements (-Wdeclaration-after-statement).
> 
> Yeah, I was wondering about this myself when I wrote this part of the
> Coccinelle patch.
> 
> Is there an intelligent way to tell it to put the first statement after
> all declarations? I couldn't find anything after a quick scan of the
> documentation nor our own patches.

It feels like generating the code as above is not the end of the world.
The most valuable thing that coccinelle is doing here is _finding_ the
location, and telling you "it's supposed to be like this". It is great
when the "this" post-image is perfect and doesn't need further tweaking.

But if the compiler then reminds you "hey, you need to go a bit further
manually", that doesn't seem so bad. In other words, I would be happy to
follow that work flow if I introduced a bare xcalloc(). My only worry is
that somebody less experienced with the project (or with C) would get
confused.

-Peff

Jeff King Dec. 6, 2022, 1:47 a.m. UTC | #9

On Tue, Dec 06, 2022 at 12:12:32AM +0100, Ævar Arnfjörð Bjarmason wrote:

> But if we *are* doing that then surely we should provide the full set of
> functions. I.e. ALLOC() and ALLOC_ARRAY(), CALLOC() and CALLOC_ARRAY(),
> and REALLOC() and REALLOC_ARRAY()?

FWIW, I would be happy to see all of those (minus REALLOC(), as there is
not really any point in growing or shrinking something with a fixed
size).

The biggest argument against them that I can see is that:

  struct foo *x = malloc(sizeof(*x));

is idiomatic C that newcomers to the project will easily understand,
and:

  struct foo *x;
  ALLOC(x);

is not. But it feels like we already crossed that bridge with
ALLOC_ARRAY(), etc.

-Peff

Ævar Arnfjörð Bjarmason Dec. 6, 2022, 1:58 a.m. UTC | #10

On Mon, Dec 05 2022, Jeff King wrote:

> On Tue, Dec 06, 2022 at 12:12:32AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> But if we *are* doing that then surely we should provide the full set of
>> functions. I.e. ALLOC() and ALLOC_ARRAY(), CALLOC() and CALLOC_ARRAY(),
>> and REALLOC() and REALLOC_ARRAY()?
>
> FWIW, I would be happy to see all of those (minus REALLOC(), as there is
> not really any point in growing or shrinking something with a fixed
> size).
>
> The biggest argument against them that I can see is that:
>
>   struct foo *x = malloc(sizeof(*x));
>
> is idiomatic C that newcomers to the project will easily understand,
> and:
>
>   struct foo *x;
>   ALLOC(x);
>
> is not. But it feels like we already crossed that bridge with
> ALLOC_ARRAY(), etc.

This is probably too ugly to exist, but if we are going to have more
variants maybe one that would allow use within declarations would be
better, e.g.:

	
	diff --git a/attr.c b/attr.c
	index 42ad6de8c7c..43ade426e57 100644
	--- a/attr.c
	+++ b/attr.c
	@@ -666,11 +666,10 @@ static void handle_attr_line(struct attr_stack *res,
	 
	 static struct attr_stack *read_attr_from_array(const char **list)
	 {
	-	struct attr_stack *res;
	+	struct attr_stack *INIT_CALLOC_ARRAY(res, 1);
	 	const char *line;
	 	int lineno = 0;
	 
	-	CALLOC_ARRAY(res, 1);
	 	while ((line = *(list++)) != NULL)
	 		handle_attr_line(res, line, "[builtin]", ++lineno,
	 				 READ_ATTR_MACRO_OK);
	diff --git a/git-compat-util.h b/git-compat-util.h
	index a76d0526f79..932d0907f3f 100644
	--- a/git-compat-util.h
	+++ b/git-compat-util.h
	@@ -1089,6 +1089,7 @@ int xstrncmpz(const char *s, const char *t, size_t len);
	 
	 #define ALLOC_ARRAY(x, alloc) (x) = xmalloc(st_mult(sizeof(*(x)), (alloc)))
	 #define CALLOC_ARRAY(x, alloc) (x) = xcalloc((alloc), sizeof(*(x)))
	+#define INIT_CALLOC_ARRAY(x, alloc) x = xcalloc((alloc), sizeof(*(x)))
	 #define REALLOC_ARRAY(x, alloc) (x) = xrealloc((x), st_mult(sizeof(*(x)), (alloc)))
	 
	 #define COPY_ARRAY(dst, src, n) copy_array((dst), (src), (n), sizeof(*(dst)) + \

I suspect (but haven't checked) that it might get us into the same
parsing trouble as your UNUSED(name) macro.

Or maybe:
	
	diff --git a/attr.c b/attr.c
	index 42ad6de8c7c..c3cb5c98bbf 100644
	--- a/attr.c
	+++ b/attr.c
	@@ -669 +669 @@ static struct attr_stack *read_attr_from_array(const char **list)
	-	struct attr_stack *res;
	+	INIT_CALLOC_ARRAY(struct attr_stack *, res, 1);
	@@ -673 +672,0 @@ static struct attr_stack *read_attr_from_array(const char **list)
	-	CALLOC_ARRAY(res, 1);
	diff --git a/git-compat-util.h b/git-compat-util.h
	index a76d0526f79..fd9af571dc9 100644
	--- a/git-compat-util.h
	+++ b/git-compat-util.h
	@@ -1091,0 +1092 @@ int xstrncmpz(const char *s, const char *t, size_t len);
	+#define INIT_CALLOC_ARRAY(d, x, alloc) d x = xcalloc((alloc), sizeof(*(x)))
	
I think it's also worth considering just having the "normal C' versions,
but we could transform anything that doesn't look lik ea narrow
whitelist of patterns into an error with coccinelle.

I.e. if we're paranoid about "v" v.s. "*var" in the "sizeof" we could
also check & tranform that with coccinelle...

Ævar Arnfjörð Bjarmason Dec. 6, 2022, 2:12 a.m. UTC | #11

On Mon, Dec 05 2022, Taylor Blau wrote:

> On Tue, Dec 06, 2022 at 08:57:21AM +0900, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>>
>> > In git.git, it is sometimes common to write something like:
>> >
>> >     T *ptr;
>> >     CALLOC_ARRAY(ptr, 1);
>> >
>> > ...but that is confusing, since we're not initializing an array.
>>
>> Given that "man calloc" tells us that calloc takes two parameters,
>>
>>     void *calloc(size_t nmemb, size_t size);
>>
>> I personally find CALLOC() that takes only a single parameter and is
>> capable only to allocate a single element array very much confusing.
>
> Hmm. I have always considered "calloc" a mental shorthand for "zero
> initialize some bytes on the heap". It seemed like you were in favor of
> such a change in:
>
>     https://lore.kernel.org/git/xmqq8rl8ivlb.fsf@gitster.g/
>
> ...but it's entirely possible that I misread your message, in which case
> I would not be sad if you dropped this patch on the floor since I don't
> feel that strongly about it.
>
> I'd be fine to call it CALLOC_ONE() or something, but I'm not sure at
> that point if it's significantly better to write "CALLOC_ONE(x)" versus
> "CALLOC_ARRAY(foo, 1)"
>
>> It _might_ be arguable that the order of the parameters CALLOC_ARRAY
>> takes should have been reversed in that the number of elements in
>> the array should come first just like in calloc(), while the pointer
>> that is used to infer the size of an array element should come next,
>> but that is water under the bridge.
>
> Yes, I agree that that would be better. But it would be frustrating to
> make a tree-wide change of that magnitude at this point. So I agree it's
> water under the bridge ;-).

I'm not saying you *should*, but now that we use C99 macros we *could*
also make the "1" an optional argument. I.e. these would be the same
thing:

	CALLOC(x)
	CALLOC(x, 1)

And you could also do:

	CALLOC(x, 123)

Whether that makes the interface even nastier is another matter.

That can be done by dispatching to an underlying function, and defining
the macro as:

	#define CALLOC(...) mycallocfn_1(__VA_ARGS__, NULL)

I.e. for the above you'd get either "x, NULL, NULL", "x, 1, NULL", or
"x, 123, NULL". The function could then manually check the arity.

Maybe I've just been corrupted by reading the P99 library :)

Taylor Blau Dec. 7, 2022, 2:29 a.m. UTC | #12

On Mon, Dec 05, 2022 at 08:43:50PM -0500, Jeff King wrote:
> On Mon, Dec 05, 2022 at 05:36:25PM -0500, Taylor Blau wrote:
>
> > On Mon, Dec 05, 2022 at 10:01:11PM +0100, René Scharfe wrote:
> > > This rule would turn this code:
> > >
> > > 	struct foo *bar = xcalloc(1, sizeof(*bar));
> > > 	int i;
> > >
> > > ... into:
> > >
> > > 	struct foo *bar;
> > > 	CALLOC(bar);
> > > 	int i;
> > >
> > > ... which violates the coding guideline to not mix declarations and
> > > statements (-Wdeclaration-after-statement).
> >
> > Yeah, I was wondering about this myself when I wrote this part of the
> > Coccinelle patch.
> >
> > Is there an intelligent way to tell it to put the first statement after
> > all declarations? I couldn't find anything after a quick scan of the
> > documentation nor our own patches.
>
> It feels like generating the code as above is not the end of the world.
> The most valuable thing that coccinelle is doing here is _finding_ the
> location, and telling you "it's supposed to be like this". It is great
> when the "this" post-image is perfect and doesn't need further tweaking.

I have to agree. If Coccinelle can generate the right output; great. But
if it can't, the amount of additional work to reorganize an already
generated and mostly correct *.patch from the tool seems minimal by
comparison.

> But if the compiler then reminds you "hey, you need to go a bit further
> manually", that doesn't seem so bad. In other words, I would be happy to
> follow that work flow if I introduced a bare xcalloc(). My only worry is
> that somebody less experienced with the project (or with C) would get
> confused.

Agreed.

Thanks,
Taylor

Taylor Blau Dec. 7, 2022, 2:34 a.m. UTC | #13

On Tue, Dec 06, 2022 at 12:12:32AM +0100, Ævar Arnfjörð Bjarmason wrote:
> First, when you add *.pending.cocci rules they shouldn't be pseudocode,
> but things that are too large to apply right now. I think my recent
> 041df69edd3 (Merge branch 'ab/fewer-the-index-macros', 2022-11-28) is a
> good example.

Agreed, but I do tend to consider this patch too-large to apply right
now.

Whether or not 2k lines of diff is review-able or not (and FWIW, I think
that your recent Coccinelle-generated patches were quite easy to review
pretty quickly), I think there's a bigger question of whether or not we
want to make such a sweeping, tree-wide change. And assuming the answer
to that is "yes", there's is also a question of timing. Proposing it
towards the middle or end of a release cycle seems in bad taste.

But I think splitting this discussion up into, "should we introduce
something like CALLOC() into our style conventions?" and "do we want to
apply this everywhere?" is worth doing. The former doesn't seem to take
a ton of time away from polishing the release candidates, and the latter
can be done only after the former has settled.

So the decision to make this a *.pending.cocci rule was definitely
intentional in this case.

> Second, we have test support for rules now, see
> contrib/coccinelle/tests/. You just need to create a
> contrib/coccinelle/tests/xcalloc.pending.c, and have the expected
> post-image in contrib/coccinelle/tests/xcalloc.pending.res. Please add
> one of those. We don't have them for existing rules, but it really helps
> to assert & review new rules.
>
> The various edge cases that your current *.cocci doesn't compile on
> etc. should go into that test file.

Thanks for mentioning.

> Third, the resulting patch is currently ~2k lines. Can we really not
> make it non-pending with some whitelisting/gblacklisting. E.g. see this
> out-of-tree patch for an example of opting out certain functions:
> https://github.com/avar/git/commit/bedd0323e9b

See above.

> Fourth, I must say I'm kind of negative on the proposed change. I.e. the
> foot-gun is definitely worth solving, but why not just create a
> coccinelle rule to narrowly address that? In that case we can presumably
> start with it non-pending, as we don't have that many of them.
>
> On the notion that we need to special-case:
>
> 	- CALLOC_ARRAY(ptr, 1)
> 	+ CALLOC(ptr)
>
> Because an array of one is "not an array" I don't really buy it. The
> calloc() interface itself exposes that, so I don't see why we'd consider
> those separate on the API level for these wrapper macros.

I think the point is that it's just weird. Yes, an array of just a
single element on the heap is indistinguishable from asking malloc() to
give me the same bytes and then memset() them myself, just as it's
indistinguishable from calloc()'ing the right number of bytes for a
single structure to begin with.

But whether or not the two are indistinguishable doesn't mean that it
makes intuitive sense, and that is the goal of CALLOC().

Thanks,
Taylor

Taylor Blau Dec. 7, 2022, 2:36 a.m. UTC | #14

On Mon, Dec 05, 2022 at 08:47:10PM -0500, Jeff King wrote:
> On Tue, Dec 06, 2022 at 12:12:32AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> > But if we *are* doing that then surely we should provide the full set of
> > functions. I.e. ALLOC() and ALLOC_ARRAY(), CALLOC() and CALLOC_ARRAY(),
> > and REALLOC() and REALLOC_ARRAY()?
>
> FWIW, I would be happy to see all of those (minus REALLOC(), as there is
> not really any point in growing or shrinking something with a fixed
> size).

Same.

> The biggest argument against them that I can see is that:
>
>   struct foo *x = malloc(sizeof(*x));
>
> is idiomatic C that newcomers to the project will easily understand,
> and:
>
>   struct foo *x;
>   ALLOC(x);
>
> is not. But it feels like we already crossed that bridge with
> ALLOC_ARRAY(), etc.

Yeah, I agree that it might be off-putting to have that convention
pervade throughout the tree. On the other hand, though, we can clearly
document that in CodingGuidelines, and make Coccinelle rules that
indicate our preferences.

And indeed, ALLOC_ARRAY() and friends is a useful test-case for gauging
how something like this might go. I don't think we necessarily need to
do all of that in the same set of patches (though I'm not opposed to
doing so, either), but I wouldn't mind getting there in the end, one way
or another.

Thanks,
Taylor

Taylor Blau Dec. 7, 2022, 2:38 a.m. UTC | #15

On Tue, Dec 06, 2022 at 10:35:54AM +0900, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > So you can think of "allocate this one element and zero it" as "calloc,
> > but don't multiply" or as "malloc, but zero". Naming it CALLOC() is
> > thinking of it as the former. If we think of it as the latter it could
> > perhaps be MALLOCZ() or something. I don't know if that name is too
> > subtle or not. We have xmemdupz(), which is basically the same thing;
> > it's only a zero-terminator, but that is because we are writing non-zero
> > bytes in the rest of it. Mostly I'd worry that it is easy to glance past
> > the "Z".
>
> I think the name for the former would be CALLOC_ONE(), as I would
> rephrase it as "calloc, but just one element".  I agree MALLOCZ()
> would be fine for the other interpretation, and I do not have much
> problem as much problem with the name as calling it CALLOC().

Between CALLOC_ONE() and MALLOCZ(), I prefer the latter for brevity. But
between that and CALLOC(), I prefer the latter, since "CALLOC()" reminds
me of the zero-initialization of calloc()-proper, and the "Z" in
"MALLOCZ()" feels easy-ish to miss.

Thanks,
Taylor

Ævar Arnfjörð Bjarmason Dec. 7, 2022, 3:17 a.m. UTC | #16

On Tue, Dec 06 2022, Taylor Blau wrote:

> On Tue, Dec 06, 2022 at 12:12:32AM +0100, Ævar Arnfjörð Bjarmason wrote:
>> First, when you add *.pending.cocci rules they shouldn't be pseudocode,
>> but things that are too large to apply right now. I think my recent
>> 041df69edd3 (Merge branch 'ab/fewer-the-index-macros', 2022-11-28) is a
>> good example.
>
> Agreed, but I do tend to consider this patch too-large to apply right
> now.

I think it's OK if a *pending* patch is 10k lines or whatever, as long
as the result of applying it compiles. I.e. if it doesn't compile the
semantic patch is broken, and doesn't really help to clarify what the
proposed change is.

Whether the change itself is too large to be worth it or whatever is
another matter, but we can always consider it advisory.

E.g. consider a change to remove braces from if/else per our coding
style, we don't have such a *.cocci now, but if we did it would be a
very large patch. It might still be useful in *.pending.cocci, e.g. to
run new code through it.

> Whether or not 2k lines of diff is review-able or not (and FWIW, I think
> that your recent Coccinelle-generated patches were quite easy to review
> pretty quickly), I think there's a bigger question of whether or not we
> want to make such a sweeping, tree-wide change. And assuming the answer
> to that is "yes", there's is also a question of timing. Proposing it
> towards the middle or end of a release cycle seems in bad taste.
>
> But I think splitting this discussion up into, "should we introduce
> something like CALLOC() into our style conventions?" and "do we want to
> apply this everywhere?" is worth doing. The former doesn't seem to take
> a ton of time away from polishing the release candidates, and the latter
> can be done only after the former has settled.
>
> So the decision to make this a *.pending.cocci rule was definitely
> intentional in this case.

Yes. A sweeping change like this should definitely start there (although
see below).

> [...]
>> Third, the resulting patch is currently ~2k lines. Can we really not
>> make it non-pending with some whitelisting/gblacklisting. E.g. see this
>> out-of-tree patch for an example of opting out certain functions:
>> https://github.com/avar/git/commit/bedd0323e9b
>
> See above.

What I'm demonstrating with that commit is that you can make a
non-pending patch by blacklisting the functions, filenames etc. that are
known to be outstanding.

The advantage of doing that being that you'd apply it for any new code
outside those scopes.

So re the "see above" it's an explicit end-run around the "do we want to
apply this everywhere?".

>> Fourth, I must say I'm kind of negative on the proposed change. I.e. the
>> foot-gun is definitely worth solving, but why not just create a
>> coccinelle rule to narrowly address that? In that case we can presumably
>> start with it non-pending, as we don't have that many of them.
>>
>> On the notion that we need to special-case:
>>
>> 	- CALLOC_ARRAY(ptr, 1)
>> 	+ CALLOC(ptr)
>>
>> Because an array of one is "not an array" I don't really buy it. The
>> calloc() interface itself exposes that, so I don't see why we'd consider
>> those separate on the API level for these wrapper macros.
>
> I think the point is that it's just weird. Yes, an array of just a
> single element on the heap is indistinguishable from asking malloc() to
> give me the same bytes and then memset() them myself, just as it's
> indistinguishable from calloc()'ing the right number of bytes for a
> single structure to begin with.
>
> But whether or not the two are indistinguishable doesn't mean that it
> makes intuitive sense, and that is the goal of CALLOC().

I understand that's the goal, I just don't get how it's intuitive.

First we had REALLOC_ARRAY(), which is the same as a reallocarray(3)
where the "size" is computed for you based on the "ptr", which we know
to be a variable name in our case. Later we had ALLOC_ARRAY() and
CALLOC_ARRAY(), which aim to provide the same sort of interface for
malloc() and calloc().

The reallocarray() function comes from OpenBSD, where it's always
supported nmemb = 1:
https://man.openbsd.org/OpenBSD-5.9/malloc.3#reallocarray~2

It has always been redundant to use it for nmemb=1, but as arrays in C
don't have containers (unlike some higher-level lanugages) it would be
odd to disallow it, and force the user to do:

	mem = n == 1 ? realloc(mem, sz) : reallocarray(mem, n, sz);

Instead of:

	mem = reallocarray(mem, n, sz);

I.e. you don't need the overlfow check for n * sz if n == 1, but
treating it as a special-case is a hassle.

Ævar Arnfjörð Bjarmason Dec. 7, 2022, 3:51 a.m. UTC | #17

On Tue, Dec 06 2022, Taylor Blau wrote:

> On Mon, Dec 05, 2022 at 08:43:50PM -0500, Jeff King wrote:
>> On Mon, Dec 05, 2022 at 05:36:25PM -0500, Taylor Blau wrote:
>>
>> > On Mon, Dec 05, 2022 at 10:01:11PM +0100, René Scharfe wrote:
>> > > This rule would turn this code:
>> > >
>> > > 	struct foo *bar = xcalloc(1, sizeof(*bar));
>> > > 	int i;
>> > >
>> > > ... into:
>> > >
>> > > 	struct foo *bar;
>> > > 	CALLOC(bar);
>> > > 	int i;
>> > >
>> > > ... which violates the coding guideline to not mix declarations and
>> > > statements (-Wdeclaration-after-statement).
>> >
>> > Yeah, I was wondering about this myself when I wrote this part of the
>> > Coccinelle patch.
>> >
>> > Is there an intelligent way to tell it to put the first statement after
>> > all declarations? I couldn't find anything after a quick scan of the
>> > documentation nor our own patches.
>>
>> It feels like generating the code as above is not the end of the world.
>> The most valuable thing that coccinelle is doing here is _finding_ the
>> location, and telling you "it's supposed to be like this". It is great
>> when the "this" post-image is perfect and doesn't need further tweaking.
>
> I have to agree. If Coccinelle can generate the right output; great. But
> if it can't, the amount of additional work to reorganize an already
> generated and mostly correct *.patch from the tool seems minimal by
> comparison.

It can, but you need to write your semantic patch to match your
intent. If you write e.g.:

	- int x;
        + int y;
        + foo();

That means "add the int y and foo() line right after that "int x" line
you removed.

Whereas what you want in this case is closer to:

 - match the "int x" line
 - remove or amend it
 - skip past all subsequent declarations
 - skip past all code that isn't referring to the "x variable?
 - insert the "CALLOC_ARRAY" (or whatever) before that first "x" use.

I don't know offhand how to match this, but presumably it's some mixture
of the wildcard syntax ("...", "<... ...>" etc.) and matching a
"statement", or maybe marking the "int x" with the "@pos" syntax, and
referring to that position again.

I usually just browse through the coccinelle.git for *.cocci examples
and/or read the PDF (*not* the manual page, which discusses almost none
of the syntax) documentation.

See:

	git grep -F '...' -- contrib/coccinelle 

For some in-tree use of this, the unused.cocci I added recently is
probably the closest equivalent to what you'll want.

Jeff King Dec. 7, 2022, 6:02 a.m. UTC | #18

On Tue, Dec 06, 2022 at 02:58:53AM +0100, Ævar Arnfjörð Bjarmason wrote:

> This is probably too ugly to exist, but if we are going to have more
> variants maybe one that would allow use within declarations would be
> better, e.g.:
> 
> 	
> 	diff --git a/attr.c b/attr.c
> 	index 42ad6de8c7c..43ade426e57 100644
> 	--- a/attr.c
> 	+++ b/attr.c
> 	@@ -666,11 +666,10 @@ static void handle_attr_line(struct attr_stack *res,
> 	 
> 	 static struct attr_stack *read_attr_from_array(const char **list)
> 	 {
> 	-	struct attr_stack *res;
> 	+	struct attr_stack *INIT_CALLOC_ARRAY(res, 1);

Yeah. While clever, I would agree that is probably too ugly to exist.

-Peff

Jeff King Dec. 7, 2022, 6:06 a.m. UTC | #19

On Tue, Dec 06, 2022 at 03:12:08AM +0100, Ævar Arnfjörð Bjarmason wrote:

> I'm not saying you *should*, but now that we use C99 macros we *could*
> also make the "1" an optional argument. I.e. these would be the same
> thing:
> 
> 	CALLOC(x)
> 	CALLOC(x, 1)
> 
> And you could also do:
> 
> 	CALLOC(x, 123)
> 
> Whether that makes the interface even nastier is another matter.

Like the INIT proposal you mentioned, I tend to think this is clever but
not really a good idea in the long run. I don't mind magic if it's
giving us a big benefit, but I don't think that's the case here. :)

-Peff

Jeff King Dec. 7, 2022, 6:08 a.m. UTC | #20

On Tue, Dec 06, 2022 at 09:38:02PM -0500, Taylor Blau wrote:

> On Tue, Dec 06, 2022 at 10:35:54AM +0900, Junio C Hamano wrote:
> > Jeff King <peff@peff.net> writes:
> >
> > > So you can think of "allocate this one element and zero it" as "calloc,
> > > but don't multiply" or as "malloc, but zero". Naming it CALLOC() is
> > > thinking of it as the former. If we think of it as the latter it could
> > > perhaps be MALLOCZ() or something. I don't know if that name is too
> > > subtle or not. We have xmemdupz(), which is basically the same thing;
> > > it's only a zero-terminator, but that is because we are writing non-zero
> > > bytes in the rest of it. Mostly I'd worry that it is easy to glance past
> > > the "Z".
> >
> > I think the name for the former would be CALLOC_ONE(), as I would
> > rephrase it as "calloc, but just one element".  I agree MALLOCZ()
> > would be fine for the other interpretation, and I do not have much
> > problem as much problem with the name as calling it CALLOC().
> 
> Between CALLOC_ONE() and MALLOCZ(), I prefer the latter for brevity. But
> between that and CALLOC(), I prefer the latter, since "CALLOC()" reminds
> me of the zero-initialization of calloc()-proper, and the "Z" in
> "MALLOCZ()" feels easy-ish to miss.

FWIW, I agree with your preferences here. But if people think CALLOC()
is misleading or badly named, I'm OK with MALLOCZ(). I guess maybe it is
ALLOCZ() to go with ALLOC_ARRAY(), and a potential ALLOC() if we choose
to have one.

-Peff

git-compat-util.h: introduce CALLOC(x)

Commit Message

Comments

Patch