[1/4] add-patch: Fix type missmatch rom msvc

Message ID	20241223110407.3308-2-soekkle@freenet.de (mailing list archive)
State	Superseded
Headers	show Received: from mout0.freenet.de (mout0.freenet.de [195.4.92.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A11A419E98B for <git@vger.kernel.org>; Mon, 23 Dec 2024 11:09:46 +0000 (UTC) From: =?utf-8?q?S=C3=B6ren_Krecker?= <soekkle@freenet.de> To: git@vger.kernel.org Cc: =?utf-8?q?S=C3=B6ren_Krecker?= <soekkle@freenet.de> Subject: [PATCH 1/4] add-patch: Fix type missmatch rom msvc Date: Mon, 23 Dec 2024 12:04:04 +0100 Message-Id: <20241223110407.3308-2-soekkle@freenet.de> In-Reply-To: <20241223110407.3308-1-soekkle@freenet.de> References: <20241223110407.3308-1-soekkle@freenet.de> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	Fixes typemissmatch warinigs from msvc \| expand [0/4] Fixes typemissmatch warinigs from msvc [1/4] add-patch: Fix type missmatch rom msvc [2/4] date.c: Fix type missmatch warings from msvc [3/4] apply.c : Fix type missmatch warings from msvc [4/4] commit.c: Fix type missmatch warings from msvc

Sören Krecker Dec. 23, 2024, 11:04 a.m. UTC

Fix some compiler warings from msvw in add-patch.c for value truncation
form 64 bit to 32 bit integers.Change unsigned long to size_t for
correct variable size on linux and windows

Signed-off-by: Sören Krecker <soekkle@freenet.de>
---
 add-patch.c | 44 +++++++++++++++++++++++++-------------------
 gettext.h   |  2 +-
 2 files changed, 26 insertions(+), 20 deletions(-)

Junio C Hamano Dec. 26, 2024, 9:33 p.m. UTC | #1

Sören Krecker <soekkle@freenet.de> writes:

> Fix some compiler warings from msvw in add-patch.c for value truncation
> form 64 bit to 32 bit integers.Change unsigned long to size_t for
> correct variable size on linux and windows
>
> Signed-off-by: Sören Krecker <soekkle@freenet.de>
> ---
>  add-patch.c | 44 +++++++++++++++++++++++++-------------------
>  gettext.h   |  2 +-
>  2 files changed, 26 insertions(+), 20 deletions(-)

>  struct hunk_header {
> -	unsigned long old_offset, old_count, new_offset, new_count;
> +	size_t old_offset, old_count, new_offset, new_count;

These are not "size"s in the traditional sense of what size_t is
(i.e. the number of bytes in a region of memory), but are more or
less proportional to that in that they count in number of lines.

If ulong is sufficient to count number of lines in an incoming
patch, then turning size_t may be excessive---are we sure that we
are not unnecessarily using wider-than-necessary size_t in some
places to hold these values for which ulong is sufficient, causing
compilers to emit unnecessary warning?

IOW, if we have variables of unsigned integer of various sizes, we
_could_ rewrite all of them to use uintmax_t and there won't be
truncation-upon-assignment warnings from a compiler, but such a
rewrite can be pointless.  We'd need to find where we stuff these
values, which are originally ulong, to size_t, and see if the use of
size_t is really sensible.

> @@ -321,7 +321,7 @@ static void setup_child_process(struct add_p_state *s,
>  }
>  
>  static int parse_range(const char **p,
> -		       unsigned long *offset, unsigned long *count)
> +		       size_t *offset, size_t *count)
>  {
>  	char *pend;
>  
> @@ -672,8 +672,8 @@ static void render_hunk(struct add_p_state *s, struct hunk *hunk,
>  		 */
>  		const char *p;
>  		size_t len;
> -		unsigned long old_offset = header->old_offset;
> -		unsigned long new_offset = header->new_offset;
> +		size_t old_offset = header->old_offset;
> +		size_t new_offset = header->new_offset;
>  
>  		if (!colored) {
>  			p = s->plain.buf + header->extra_start;
> @@ -699,12 +699,14 @@ static void render_hunk(struct add_p_state *s, struct hunk *hunk,
>  		else
>  			new_offset += delta;
>  
> -		strbuf_addf(out, "@@ -%lu", old_offset);
> +		strbuf_addf(out, "@@ -%" PRIuMAX, (uintmax_t)old_offset);
>  		if (header->old_count != 1)
> -			strbuf_addf(out, ",%lu", header->old_count);
> -		strbuf_addf(out, " +%lu", new_offset);
> +			strbuf_addf(out, ",%" PRIuMAX,
> +				    (uintmax_t)header->old_count);
> +		strbuf_addf(out, " +%" PRIuMAX, (uintmax_t)new_offset);
>  		if (header->new_count != 1)
> -			strbuf_addf(out, ",%lu", header->new_count);
> +			strbuf_addf(out, ",%" PRIuMAX,
> +				    (uintmax_t)header->new_count);
>  		strbuf_addstr(out, " @@");
>  
>  		if (len)

So far if we left the types of *_offset and count all ulong, I do
not see anything that causes trunation-upon-assignment.

> @@ -1065,11 +1067,13 @@ static int split_hunk(struct add_p_state *s, struct file_diff *file_diff,
>  
>  	/* last hunk simply gets the rest */
>  	if (header->old_offset != remaining.old_offset)
> -		BUG("miscounted old_offset: %lu != %lu",
> -		    header->old_offset, remaining.old_offset);
> +		BUG("miscounted old_offset: %"PRIuMAX" != %"PRIuMAX,
> +		    (uintmax_t)header->old_offset,
> +		    (uintmax_t)remaining.old_offset);
>  	if (header->new_offset != remaining.new_offset)
> -		BUG("miscounted new_offset: %lu != %lu",
> -		    header->new_offset, remaining.new_offset);
> +		BUG("miscounted new_offset: %"PRIuMAX" != %"PRIuMAX,
> +		    (uintmax_t)header->new_offset,
> +		    (uintmax_t)remaining.new_offset);
>  	header->old_count = remaining.old_count;
>  	header->new_count = remaining.new_count;
>  	hunk->end = end;

This hunk unfortunately may be needed because this function somehow
decided to use size_t for things like "hunk_index" and "end" when it
was added, even though the surrounding functions it interacts with
all used ulong.

> @@ -1353,9 +1357,10 @@ static void summarize_hunk(struct add_p_state *s, struct hunk *hunk,
>  	struct strbuf *plain = &s->plain;
>  	size_t len = out->len, i;
>  
> -	strbuf_addf(out, " -%lu,%lu +%lu,%lu ",
> -		    header->old_offset, header->old_count,
> -		    header->new_offset, header->new_count);
> +	strbuf_addf(out,
> +		    " -%"PRIuMAX",%"PRIuMAX" +%"PRIuMAX",%"PRIuMAX" ",
> +		    (uintmax_t)header->old_offset, (uintmax_t)header->old_count,
> +		    (uintmax_t)header->new_offset, (uintmax_t)header->new_count);
>  	if (out->len - len < SUMMARY_HEADER_WIDTH)
>  		strbuf_addchars(out, ' ',
>  				SUMMARY_HEADER_WIDTH + len - out->len);

Again, if we left the types of *_offset and count all ulong, I do
not see anything that causes trunation-upon-assignment.

> @@ -1624,10 +1629,11 @@ static int patch_update_file(struct add_p_state *s,
>  			else if (0 < response && response <= file_diff->hunk_nr)
>  				hunk_index = response - 1;
>  			else
> -				err(s, Q_("Sorry, only %d hunk available.",
> -					  "Sorry, only %d hunks available.",
> -					  file_diff->hunk_nr),
> -				    (int)file_diff->hunk_nr);
> +				err(s,
> +				    Q_("Sorry, only %"PRIuMAX" hunk available.",
> +				       "Sorry, only %"PRIuMAX" hunks available.",
> +				       (uintmax_t)file_diff->hunk_nr),
> +				    (uintmax_t)file_diff->hunk_nr);

Again, this hunk may be needed, as the "hunk_nr" uses size_t, which
probably is overly wide.

So, I do not mind too much to adjust the code around hunk_nr,
hunk_alloc and other things that are already size_t (but before
doing so, we probably should see if it makes more sense to use ulong
for these members instead of size_t), hbut I am not sure if it is a
sensible move to change old_offset, count, etc. that count in number
of lines and use ulong (not bytes) to use size_t instead.

size_t _might_ be wider than some other forms of unsigned integers,
but it is not necessarily the widest, so "because on msvc ulong is
merely 32-bit and I want wider integer like everybody else!" is not
a good excuse for such a change (if it were, we'd all coding with
nothing but uintmax_t).

So I dunno.

> diff --git a/gettext.h b/gettext.h
> index 484cafa562..d36f5a7ade 100644
> --- a/gettext.h
> +++ b/gettext.h
> @@ -53,7 +53,7 @@ static inline FORMAT_PRESERVING(1) const char *_(const char *msgid)
>  }
>  
>  static inline FORMAT_PRESERVING(1) FORMAT_PRESERVING(2)
> -const char *Q_(const char *msgid, const char *plu, unsigned long n)
> +const char *Q_(const char *msgid, const char *plu, size_t n)
>  {
>  	if (!git_gettext_enabled)
>  		return n == 1 ? msgid : plu;

Patrick Steinhardt Dec. 27, 2024, 10:16 a.m. UTC | #2

On Thu, Dec 26, 2024 at 01:33:12PM -0800, Junio C Hamano wrote:
> Sören Krecker <soekkle@freenet.de> writes:
> > @@ -1624,10 +1629,11 @@ static int patch_update_file(struct add_p_state *s,
> >  			else if (0 < response && response <= file_diff->hunk_nr)
> >  				hunk_index = response - 1;
> >  			else
> > -				err(s, Q_("Sorry, only %d hunk available.",
> > -					  "Sorry, only %d hunks available.",
> > -					  file_diff->hunk_nr),
> > -				    (int)file_diff->hunk_nr);
> > +				err(s,
> > +				    Q_("Sorry, only %"PRIuMAX" hunk available.",
> > +				       "Sorry, only %"PRIuMAX" hunks available.",
> > +				       (uintmax_t)file_diff->hunk_nr),
> > +				    (uintmax_t)file_diff->hunk_nr);
> 
> Again, this hunk may be needed, as the "hunk_nr" uses size_t, which
> probably is overly wide.
> 
> So, I do not mind too much to adjust the code around hunk_nr,
> hunk_alloc and other things that are already size_t (but before
> doing so, we probably should see if it makes more sense to use ulong
> for these members instead of size_t), hbut I am not sure if it is a
> sensible move to change old_offset, count, etc. that count in number
> of lines and use ulong (not bytes) to use size_t instead.
> 
> size_t _might_ be wider than some other forms of unsigned integers,
> but it is not necessarily the widest, so "because on msvc ulong is
> merely 32-bit and I want wider integer like everybody else!" is not
> a good excuse for such a change (if it were, we'd all coding with
> nothing but uintmax_t).
> 
> So I dunno.

In practice, the number of bytes and number of lines _can_ be the same
when the file consists of newlines, only. That is of course unlikely to
be the case in any sane input, but may be the case when processing input
that was specifically crafted to trigger such an edge case. So using a
type that can store the maximum size of a theoretically possible object
feels like a sensible safeguard to me and requires us to worry less
about such weird edge cases.

I also doubt that widening the type to `size_t` would have a meaningful
impact on performance, so I don't see a strong reason not to go there.
I tend to think that using `size_t` in such size-like-fields should be
our default unless there is a good reason not to pick it.

Patrick

Phillip Wood Dec. 27, 2024, 10:38 a.m. UTC | #3

On 26/12/2024 21:33, Junio C Hamano wrote:
> Sören Krecker <soekkle@freenet.de> writes:
> 
>> Fix some compiler warings from msvw in add-patch.c for value truncation
>> form 64 bit to 32 bit integers.Change unsigned long to size_t for
>> correct variable size on linux and windows
>>
>> Signed-off-by: Sören Krecker <soekkle@freenet.de>
>> ---
>>   add-patch.c | 44 +++++++++++++++++++++++++-------------------
>>   gettext.h   |  2 +-
>>   2 files changed, 26 insertions(+), 20 deletions(-)
> 
> 
> 
>>   struct hunk_header {
>> -	unsigned long old_offset, old_count, new_offset, new_count;
>> +	size_t old_offset, old_count, new_offset, new_count;
> 
> These are not "size"s in the traditional sense of what size_t is
> (i.e. the number of bytes in a region of memory), but are more or
> less proportional to that in that they count in number of lines.
> 
> If ulong is sufficient to count number of lines in an incoming
> patch, then turning size_t may be excessive---are we sure that we
> are not unnecessarily using wider-than-necessary size_t in some
> places to hold these values for which ulong is sufficient, causing
> compilers to emit unnecessary warning?

That's my thought too - I think something like the diff below should
fix the warnings by using more appropriate types in expressions
involving the hunk header offset and count. Our internal diff
implementation will not generate diffs for blobs greater than ~1GB
and I don't think "git apply" can handle diff headers that contain
numbers greater that ULONG_MAX so switching to size_t here seems
unnecessary.

Best Wishes

Phillip

---- >8 ----

diff --git a/add-patch.c b/add-patch.c
index 557903310de..2c439b83665 100644
--- a/add-patch.c
+++ b/add-patch.c
@@ -253,7 +253,7 @@ struct hunk_header {
  
  struct hunk {
  	size_t start, end, colored_start, colored_end, splittable_into;
-	ssize_t delta;
+	long delta;
  	enum { UNDECIDED_HUNK = 0, SKIP_HUNK, USE_HUNK } use;
  	struct hunk_header header;
  };
@@ -760,7 +760,8 @@ static void render_diff_header(struct add_p_state *s,
  static int merge_hunks(struct add_p_state *s, struct file_diff *file_diff,
  		       size_t *hunk_index, int use_all, struct hunk *merged)
  {
-	size_t i = *hunk_index, delta;
+	size_t i = *hunk_index;
+	long delta;
  	struct hunk *hunk = file_diff->hunk + i;
  	/* `header` corresponds to the merged hunk */
  	struct hunk_header *header = &merged->header, *next;
@@ -890,7 +891,7 @@ static void reassemble_patch(struct add_p_state *s,
  {
  	struct hunk *hunk;
  	size_t save_len = s->plain.len, i;
-	ssize_t delta = 0;
+	long delta = 0;
  
  	render_diff_header(s, file_diff, 0, out);
  
@@ -926,7 +927,8 @@ static int split_hunk(struct add_p_state *s, struct file_diff *file_diff,
  	int colored = !!s->colored.len, first = 1;
  	struct hunk *hunk = file_diff->hunk + hunk_index;
  	size_t splittable_into;
-	size_t end, colored_end, current, colored_current = 0, context_line_count;
+	size_t end, colored_end, current, colored_current = 0;
+	unsigned long context_line_count;
  	struct hunk_header remaining, *header;
  	char marker, ch;
  
@@ -1175,8 +1177,8 @@ static int edit_hunk_manually(struct add_p_state *s, struct hunk *hunk)
  	return 1;
  }
  
-static ssize_t recount_edited_hunk(struct add_p_state *s, struct hunk *hunk,
-				   size_t orig_old_count, size_t orig_new_count)
+static long recount_edited_hunk(struct add_p_state *s, struct hunk *hunk,
+				 unsigned long orig_old_count, unsigned long orig_new_count)
  {
  	struct hunk_header *header = &hunk->header;
  	size_t i;
@@ -1626,7 +1628,7 @@ static int patch_update_file(struct add_p_state *s,
  			else
  				err(s, Q_("Sorry, only %d hunk available.",
  					  "Sorry, only %d hunks available.",
-					  file_diff->hunk_nr),
+					  (int)file_diff->hunk_nr),
  				    (int)file_diff->hunk_nr);
  		} else if (s->answer.buf[0] == '/') {
  			regex_t regex;

Junio C Hamano Dec. 27, 2024, 2:31 p.m. UTC | #4

Phillip Wood <phillip.wood123@gmail.com> writes:

>>>   struct hunk_header {
>>> -	unsigned long old_offset, old_count, new_offset, new_count;
>>> +	size_t old_offset, old_count, new_offset, new_count;
>> These are not "size"s in the traditional sense of what size_t is
>> (i.e. the number of bytes in a region of memory), but are more or
>> less proportional to that in that they count in number of lines.
>> If ulong is sufficient to count number of lines in an incoming
>> patch, then turning size_t may be excessive---are we sure that we
>> are not unnecessarily using wider-than-necessary size_t in some
>> places to hold these values for which ulong is sufficient, causing
>> compilers to emit unnecessary warning?
>
> That's my thought too - I think something like the diff below should
> fix the warnings by using more appropriate types in expressions
> involving the hunk header offset and count. Our internal diff
> implementation will not generate diffs for blobs greater than ~1GB
> and I don't think "git apply" can handle diff headers that contain
> numbers greater that ULONG_MAX so switching to size_t here seems
> unnecessary.

Yes, exactly.

Of course, when filling old_offset and friends by parsing an input
line like this:

    @@ -253,7 +253,7 @@ struct hunk_header {

it would be a bug if we did not check if "253" overflows the type of
old_offset, etc.  And I would very much welcome patches to fix such
a careless input validation routine.  But replacing ulong with size_t
would not make such a problem go away.

Now, I would be a bit more sympathetic if the patch were to use
integers of exact sizes, in the name of "let's make sure that
regardless of the platforms we handle patches up to the same limit".
But size_t is not a type that is appropriate for that (and of course
ulong is not, either---but the original did not aim for such a uniform
limit to begin with).

> @@ -1626,7 +1628,7 @@ static int patch_update_file(struct add_p_state *s,
>  			else
>  				err(s, Q_("Sorry, only %d hunk available.",
>  					  "Sorry, only %d hunks available.",
> -					  file_diff->hunk_nr),
> +					  (int)file_diff->hunk_nr),
>  				    (int)file_diff->hunk_nr);
>  		} else if (s->answer.buf[0] == '/') {
>  			regex_t regex;

I skimmed your "how about going this way" illustration patch and
found all the hunks reasonable, but this one I am not sure.  Is
there a reason why hunk_nr has to be of type size_t?  

When queuing a hunk (and performing an operation that changes the
number of hunks, like splitting an existing one), the code should be
careful not to make too many hunks to overflow "int" (if that is the
more natural type to count them---and "int" being the most natural
integer type for the platform, I tend to think it should be fine),
again, that applies equally if the type of hunk_nr is "size_t".

Thanks.

Sören Krecker Dec. 27, 2024, 4:35 p.m. UTC | #5

Hi everyone,

If I understand your comments correctly, it would be preferably to 
switch to a data type like uint32_t or uint64_t so that the behavior is 
consisted on all platforms?
Also add a test if the input overflows the data type.

Best regards,

Sören Krecker

Junio C Hamano writes:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>>>>    struct hunk_header {
>>>> -	unsigned long old_offset, old_count, new_offset, new_count;
>>>> +	size_t old_offset, old_count, new_offset, new_count;
>>> These are not "size"s in the traditional sense of what size_t is
>>> (i.e. the number of bytes in a region of memory), but are more or
>>> less proportional to that in that they count in number of lines.
>>> If ulong is sufficient to count number of lines in an incoming
>>> patch, then turning size_t may be excessive---are we sure that we
>>> are not unnecessarily using wider-than-necessary size_t in some
>>> places to hold these values for which ulong is sufficient, causing
>>> compilers to emit unnecessary warning?
>>
>> That's my thought too - I think something like the diff below should
>> fix the warnings by using more appropriate types in expressions
>> involving the hunk header offset and count. Our internal diff
>> implementation will not generate diffs for blobs greater than ~1GB
>> and I don't think "git apply" can handle diff headers that contain
>> numbers greater that ULONG_MAX so switching to size_t here seems
>> unnecessary.
> 
> Yes, exactly.
> 
> Of course, when filling old_offset and friends by parsing an input
> line like this:
> 
>      @@ -253,7 +253,7 @@ struct hunk_header {
> 
> it would be a bug if we did not check if "253" overflows the type of
> old_offset, etc.  And I would very much welcome patches to fix such
> a careless input validation routine.  But replacing ulong with size_t
> would not make such a problem go away.
> 
> Now, I would be a bit more sympathetic if the patch were to use
> integers of exact sizes, in the name of "let's make sure that
> regardless of the platforms we handle patches up to the same limit".
> But size_t is not a type that is appropriate for that (and of course
> ulong is not, either---but the original did not aim for such a uniform
> limit to begin with).
> 
>> @@ -1626,7 +1628,7 @@ static int patch_update_file(struct add_p_state *s,
>>   			else
>>   				err(s, Q_("Sorry, only %d hunk available.",
>>   					  "Sorry, only %d hunks available.",
>> -					  file_diff->hunk_nr),
>> +					  (int)file_diff->hunk_nr),
>>   				    (int)file_diff->hunk_nr);
>>   		} else if (s->answer.buf[0] == '/') {
>>   			regex_t regex;
> 
> I skimmed your "how about going this way" illustration patch and
> found all the hunks reasonable, but this one I am not sure.  Is
> there a reason why hunk_nr has to be of type size_t?
> 
> When queuing a hunk (and performing an operation that changes the
> number of hunks, like splitting an existing one), the code should be
> careful not to make too many hunks to overflow "int" (if that is the
> more natural type to count them---and "int" being the most natural
> integer type for the platform, I tend to think it should be fine),
> again, that applies equally if the type of hunk_nr is "size_t".
> 
> Thanks.

Junio C Hamano Dec. 27, 2024, 4:42 p.m. UTC | #6

Sören Krecker <soekkle@freenet.de> writes:

> If I understand your comments correctly, it would be preferably to
> switch to a data type like uint32_t or uint64_t so that the behavior
> is consisted on all platforms?

I personally wouldn't prefer that.

I'd rather stick to some "natural" platform type like ulong.  I see
no strong need to say "we must behave identically on all platforms"
in this area.  It is preferrable to have every platform use the most
natural type on it, and make sure that we validate input that is too
large to fit on each platform correctly (i.e. it is OK to diagnose
"too big a line number" and die on 32-bit platform with much smaller
line number than on 64-bit platform).

Thanks.

Phillip Wood Dec. 28, 2024, 4:04 p.m. UTC | #7

On 27/12/2024 14:31, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> @@ -1626,7 +1628,7 @@ static int patch_update_file(struct add_p_state *s,
>>   			else
>>   				err(s, Q_("Sorry, only %d hunk available.",
>>   					  "Sorry, only %d hunks available.",
>> -					  file_diff->hunk_nr),
>> +					  (int)file_diff->hunk_nr),
>>   				    (int)file_diff->hunk_nr);
>>   		} else if (s->answer.buf[0] == '/') {
>>   			regex_t regex;
> 
> I skimmed your "how about going this way" illustration patch and
> found all the hunks reasonable, but this one I am not sure.  Is
> there a reason why hunk_nr has to be of type size_t?

We certainly don't need to be able to hold that many hunks but changing 
it to a narrower type generates a truncation warning in ALLOC_GROW_BY() 
that macro declares a local size_t variable to hold the new element 
count and then assigns that to hunk_nr

> When queuing a hunk (and performing an operation that changes the
> number of hunks, like splitting an existing one), the code should be
> careful not to make too many hunks to overflow "int" (if that is the
> more natural type to count them---and "int" being the most natural
> integer type for the platform, I tend to think it should be fine),

Yes it's hard to see anyone wanting to use "git add -p" on INT_MAX hunks

> again, that applies equally if the type of hunk_nr is "size_t".

If we cast to "unsigned long" rather than "int" here then we'd be sure 
that there was no overflow as we only support files with up to ULONG_MAX 
lines so there cannot be more than that number of hunks. "unsigned long" 
would also match the prototype of ngettext().

Best Wishes

Phillip

[1/4] add-patch: Fix type missmatch rom msvc

Commit Message

Comments

Patch