diff mbox series

[v2,06/13] dir.c: always copy input to add_pattern()

Message ID 20240604101322.GF1304593@coredump.intra.peff.net (mailing list archive)
State New, archived
Headers show
Series leak fixes for sparse-checkout code | expand

Commit Message

Jeff King June 4, 2024, 10:13 a.m. UTC
The add_pattern() function has a subtle and undocumented gotcha: the
pattern string you pass in must remain valid as long as the pattern_list
is in use (and nor do we take ownership of it). This is easy to get
wrong, causing either subtle bugs (because you free or reuse the string
buffer) or leaks (because you copy the string, but don't track ownership
separately).

All of this "pattern" code was originally the "exclude" mechanism. So
this _usually_ works OK because you add entries in one of two ways:

  1. From the command-line (e.g., "--exclude"), in which case we're
     pointing to an argv entry which remains valid for the lifetime of
     the program.

  2. From a file (e.g., ".gitignore"), in which case we read the whole
     file into a buffer, attach it to the pattern_list's "filebuf"
     entry, then parse the buffer in-place (adding NULs). The strings
     point into the filebuf, which is cleaned up when the whole
     pattern_list goes away.

But other code, like sparse-checkout, reads individual lines from stdin
and passes them one by one to add_pattern(), leaking each. We could fix
this by refactoring it to take in the whole buffer at once, like (2)
above, and stuff it in "filebuf". But given how subtle the interface is,
let's just fix it to always copy the string.

That seems at first like we'd be wasting extra memory, but we can
mitigate that:

  a. The path_pattern struct already uses a FLEXPTR, since we sometimes
     make a copy (when we see "foo/", we strip off the trailing slash,
     requiring a modifiable copy of the string).

     Since we'll now always embed the string inside the struct, we can
     switch to the regular FLEX_ARRAY pattern, saving us 8 bytes of
     pointer. So patterns with a trailing slash and ones under 8 bytes
     actually get smaller.

  b. Now that we don't need the original string to hang around, we can
     get rid of the "filebuf" mechanism entirely, and just free the file
     contents after parsing. Since files are the sources we'd expect to
     have the largest pattern sets, we should mostly break even on
     stuffing the same data into the individual structs.

This patch just adjusts the add_pattern() interface; it doesn't fix any
leaky callers yet.

Signed-off-by: Jeff King <peff@peff.net>
---
 dir.c | 15 +++++----------
 dir.h |  3 ++-
 2 files changed, 7 insertions(+), 11 deletions(-)

Comments

René Scharfe June 5, 2024, 8:53 a.m. UTC | #1
Am 04.06.24 um 12:13 schrieb Jeff King:
>   b. Now that we don't need the original string to hang around, we can
>      get rid of the "filebuf" mechanism entirely

Right.  This patch does remove its use from dir.c, but leaves it in dir.h:

> diff --git a/dir.h b/dir.h
> index b9e8e96128..c8ff308fae 100644
> --- a/dir.h
> +++ b/dir.h
> @@ -62,7 +62,6 @@ struct path_pattern {
>  	 */
>  	struct pattern_list *pl;
>
> -	const char *pattern;
>  	int patternlen;
>  	int nowildcardlen;
>  	const char *base;
> @@ -74,6 +73,8 @@ struct path_pattern {
>  	 * and from -1 decrementing for patterns from CLI args.
>  	 */
>  	int srcpos;
> +
> +	char pattern[FLEX_ARRAY];
>  };
>
>  /* used for hashmaps for cone patterns */

It can be dropped from struct pattern_list now, for completeness' sake.

René
Jeff King June 5, 2024, 9:18 a.m. UTC | #2
On Wed, Jun 05, 2024 at 10:53:53AM +0200, René Scharfe wrote:

> Am 04.06.24 um 12:13 schrieb Jeff King:
> >   b. Now that we don't need the original string to hang around, we can
> >      get rid of the "filebuf" mechanism entirely
> 
> Right.  This patch does remove its use from dir.c, but leaves it in dir.h:

Hmph. I'm not sure how I managed that. I definitely removed it from the
struct definition in order to find all of the sites that mention it, but
somehow that didn't end up in the final patch. Thanks for noticing.

It looks like the topic hasn't hit next yet. Rather than send a v3 with
this tiny change, Junio, do you mind squashing this into patch 6 (it's
d465adca6d in your tree)?

---
diff --git a/dir.h b/dir.h
index c8ff308fae..1398a53fb4 100644
--- a/dir.h
+++ b/dir.h
@@ -95,9 +95,6 @@ struct pattern_list {
 	int nr;
 	int alloc;
 
-	/* remember pointer to exclude file contents so we can free() */
-	char *filebuf;
-
 	/* origin of list, e.g. path to filename, or descriptive string */
 	const char *src;
 

-Peff
Junio C Hamano June 5, 2024, 4:52 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> It looks like the topic hasn't hit next yet. Rather than send a v3 with
> this tiny change, Junio, do you mind squashing this into patch 6 (it's
> d465adca6d in your tree)?

Done.  Thanks, both, for being careful.
diff mbox series

Patch

diff --git a/dir.c b/dir.c
index d812d521b0..8308d167c8 100644
--- a/dir.c
+++ b/dir.c
@@ -925,12 +925,7 @@  void add_pattern(const char *string, const char *base,
 	int nowildcardlen;
 
 	parse_path_pattern(&string, &patternlen, &flags, &nowildcardlen);
-	if (flags & PATTERN_FLAG_MUSTBEDIR) {
-		FLEXPTR_ALLOC_MEM(pattern, pattern, string, patternlen);
-	} else {
-		pattern = xmalloc(sizeof(*pattern));
-		pattern->pattern = string;
-	}
+	FLEX_ALLOC_MEM(pattern, pattern, string, patternlen);
 	pattern->patternlen = patternlen;
 	pattern->nowildcardlen = nowildcardlen;
 	pattern->base = base;
@@ -972,7 +967,6 @@  void clear_pattern_list(struct pattern_list *pl)
 	for (i = 0; i < pl->nr; i++)
 		free(pl->patterns[i]);
 	free(pl->patterns);
-	free(pl->filebuf);
 	clear_pattern_entry_hashmap(&pl->recursive_hashmap);
 	clear_pattern_entry_hashmap(&pl->parent_hashmap);
 
@@ -1166,23 +1160,23 @@  static int add_patterns(const char *fname, const char *base, int baselen,
 	}
 
 	add_patterns_from_buffer(buf, size, base, baselen, pl);
+	free(buf);
 	return 0;
 }
 
 static int add_patterns_from_buffer(char *buf, size_t size,
 				    const char *base, int baselen,
 				    struct pattern_list *pl)
 {
+	char *orig = buf;
 	int i, lineno = 1;
 	char *entry;
 
 	hashmap_init(&pl->recursive_hashmap, pl_hashmap_cmp, NULL, 0);
 	hashmap_init(&pl->parent_hashmap, pl_hashmap_cmp, NULL, 0);
 
-	pl->filebuf = buf;
-
 	if (skip_utf8_bom(&buf, size))
-		size -= buf - pl->filebuf;
+		size -= buf - orig;
 
 	entry = buf;
 
@@ -1222,6 +1216,7 @@  int add_patterns_from_blob_to_list(
 		return r;
 
 	add_patterns_from_buffer(buf, size, base, baselen, pl);
+	free(buf);
 	return 0;
 }
 
diff --git a/dir.h b/dir.h
index b9e8e96128..c8ff308fae 100644
--- a/dir.h
+++ b/dir.h
@@ -62,7 +62,6 @@  struct path_pattern {
 	 */
 	struct pattern_list *pl;
 
-	const char *pattern;
 	int patternlen;
 	int nowildcardlen;
 	const char *base;
@@ -74,6 +73,8 @@  struct path_pattern {
 	 * and from -1 decrementing for patterns from CLI args.
 	 */
 	int srcpos;
+
+	char pattern[FLEX_ARRAY];
 };
 
 /* used for hashmaps for cone patterns */