diff mbox series

[v2,05/10] strmap: new utility functions

Message ID 5c7507f55b09e24c0bfe87cc3df06213cfd1235b.1602549650.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Add struct strmap and associated utility functions | expand

Commit Message

Elijah Newren Oct. 13, 2020, 12:40 a.m. UTC
From: Elijah Newren <newren@gmail.com>

Add strmap as a new struct and associated utility functions,
specifically for hashmaps that map strings to some value.  The API is
taken directly from Peff's proposal at
https://lore.kernel.org/git/20180906191203.GA26184@sigill.intra.peff.net/

A couple of items of note:

  * Similar to string-list, I have a strdup_strings setting.  However,
    unlike string-list, strmap_init() does not take a parameter for this
    setting and instead automatically sets it to 1; callers who want to
    control this detail need to instead call strmap_ocd_init().

  * I do not have a STRMAP_INIT macro.  I could possibly add one, but
      #define STRMAP_INIT { { NULL, cmp_str_entry, NULL, 0, 0, 0, 0, 0 }, 1 }
    feels a bit unwieldy and possibly error-prone in terms of future
    expansion of the hashmap struct.  The fact that cmp_str_entry needs to
    be in there prevents us from passing all zeros for the hashmap, and makes
    me worry that STRMAP_INIT would just be more trouble than it is worth.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Makefile |   1 +
 strmap.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 strmap.h |  57 +++++++++++++++++++++++++++++++
 3 files changed, 160 insertions(+)
 create mode 100644 strmap.c
 create mode 100644 strmap.h

Comments

Jeff King Oct. 30, 2020, 2:12 p.m. UTC | #1
On Tue, Oct 13, 2020 at 12:40:45AM +0000, Elijah Newren via GitGitGadget wrote:

> Add strmap as a new struct and associated utility functions,
> specifically for hashmaps that map strings to some value.  The API is
> taken directly from Peff's proposal at
> https://lore.kernel.org/git/20180906191203.GA26184@sigill.intra.peff.net/

This looks overall sane to me. I mentioned elsewhere that we could be
using FLEXPTR_ALLOC to save an extra allocation. I think it's easy and
worth doing here, as the logic would be completely contained within
strmap_put():

  if (strdup_strings)
	FLEXPTR_ALLOC_STR(entry, key, str);
  else {
	entry = xmalloc(sizeof(*entry));
	entry->key = str;
  }

And free_entries() then doesn't even have to care about strdup_strings.

> A couple of items of note:
> 
>   * Similar to string-list, I have a strdup_strings setting.  However,
>     unlike string-list, strmap_init() does not take a parameter for this
>     setting and instead automatically sets it to 1; callers who want to
>     control this detail need to instead call strmap_ocd_init().

That seems reasonable. It could just be a parameter, but I like that you
push people in the direction of doing the simple and safe thing, rather
than having them wonder whether they ought to set strdup_strings or not.

>   * I do not have a STRMAP_INIT macro.  I could possibly add one, but
>       #define STRMAP_INIT { { NULL, cmp_str_entry, NULL, 0, 0, 0, 0, 0 }, 1 }
>     feels a bit unwieldy and possibly error-prone in terms of future
>     expansion of the hashmap struct.  The fact that cmp_str_entry needs to
>     be in there prevents us from passing all zeros for the hashmap, and makes
>     me worry that STRMAP_INIT would just be more trouble than it is worth.

You can actually omit everything after cmp_str_entry, and those fields
would all get zero-initialized. But we also allow C99 designed
initializers these days. Coupled with the HASHMAP_INIT() I mentioned in
the earlier email, you'd have:

  #define STRMAP_INIT { \
		.map = HASHMAP_INIT(cmp_strmap_entry, NULL), \
		.strdup_strings = 1, \
	  }

which seems pretty maintainable.

> +static int cmp_strmap_entry(const void *hashmap_cmp_fn_data,
> +			    const struct hashmap_entry *entry1,
> +			    const struct hashmap_entry *entry2,
> +			    const void *keydata)
> +{
> +	const struct strmap_entry *e1, *e2;
> +
> +	e1 = container_of(entry1, const struct strmap_entry, ent);
> +	e2 = container_of(entry2, const struct strmap_entry, ent);
> +	return strcmp(e1->key, e2->key);
> +}

I expected to use keydata here, but it's pretty easy to make a fake
strmap_entry because of the use of the "key" pointer. So that makes
sense.

> +static void strmap_free_entries_(struct strmap *map, int free_util)

You use the term "value" for the mapped-to value in this iteration. So
perhaps free_values here (and in other functions) would be a better
name?

> +	/*
> +	 * We need to iterate over the hashmap entries and free
> +	 * e->key and e->value ourselves; hashmap has no API to
> +	 * take care of that for us.  Since we're already iterating over
> +	 * the hashmap, though, might as well free e too and avoid the need
> +	 * to make some call into the hashmap API to do that.
> +	 */
> +	hashmap_for_each_entry(&map->map, &iter, e, ent) {
> +		if (free_util)
> +			free(e->value);
> +		if (map->strdup_strings)
> +			free((char*)e->key);
> +		free(e);
> +	}
> +}

Yep, makes sense.

> +void strmap_clear(struct strmap *map, int free_util)
> +{
> +	strmap_free_entries_(map, free_util);
> +	hashmap_free(&map->map);
> +}

This made me wonder about a partial_clear(), but it looks like that
comes later.

> +void *strmap_put(struct strmap *map, const char *str, void *data)
> +{
> +	struct strmap_entry *entry = find_strmap_entry(map, str);
> +	void *old = NULL;
> +
> +	if (entry) {
> +		old = entry->value;
> +		entry->value = data;

Here's a weird hypothetical. If strdup_strings is not set and I do:

  const char *one = xstrdup("foo");
  const char *two = xstrdup("foo");

  hashmap_put(map, one, x);
  hashmap_put(map, two, y);

it's clear that the value should be pointing to "y" afterwards (and you
return "x" so the caller can free it or whatever, good).

But which key should the entry be pointing to? The old one or the new
one? I'm trying and failing to think of a case where it would matter.
Certainly I could add a free() to the toy above where it would, but it
feels like a real caller would have to have pretty convoluted memory
lifetime semantics for it to make a difference.

So I'm not really asking for a particular behavior, but just bringing it
up in case you can think of something relevant.

> +	} else {
> +		/*
> +		 * We won't modify entry->key so it really should be const.
> +		 */
> +		const char *key = str;

The "should be" here confused me. It _is_ const. I'd probably just
delete the comment entirely, but perhaps:

  /*
   * We'll store a const pointer. For non-duplicated strings, they belong
   * to the caller and we received them as const in the first place. For
   * our duplicated ones, they do point to memory we own, but they're
   * still conceptually constant within the lifetime of an entry.
   */

Though it might make more sense in the struct definition, not here.

> +void *strmap_get(struct strmap *map, const char *str)
> +{
> +	struct strmap_entry *entry = find_strmap_entry(map, str);
> +	return entry ? entry->value : NULL;
> +}

Just noting that the caller can't tell the difference between "no such
entry" and "the entry is storing NULL". I think the simplicity offered
by this interface makes it worth having (and being the primary one). If
some caller really needs to tell the difference between the two, we can
add another function later.

Obviously they could use strmap_contains(), but that would mean two hash
lookups.

> +/*
> + * Same as strmap_init, but for those who want to control the memory management
> + * carefully instead of using the default of strdup_strings=1.
> + * (OCD = Obsessive Compulsive Disorder, a joke that those who use this function
> + * are obsessing over minor details.)
> + */
> +void strmap_ocd_init(struct strmap *map,
> +		     int strdup_strings);

I'm not personally bothered by this name, but I wonder if some people
may be (because they have or know somebody who actually has OCD).

Perhaps strmap_init_with_options() would be a better name? It likewise
would extend well if we want to add other non-default options later.

-Peff
Elijah Newren Oct. 30, 2020, 4:26 p.m. UTC | #2
On Fri, Oct 30, 2020 at 7:12 AM Jeff King <peff@peff.net> wrote:
>
> On Tue, Oct 13, 2020 at 12:40:45AM +0000, Elijah Newren via GitGitGadget wrote:
>
> > Add strmap as a new struct and associated utility functions,
> > specifically for hashmaps that map strings to some value.  The API is
> > taken directly from Peff's proposal at
> > https://lore.kernel.org/git/20180906191203.GA26184@sigill.intra.peff.net/
>
> This looks overall sane to me. I mentioned elsewhere that we could be
> using FLEXPTR_ALLOC to save an extra allocation. I think it's easy and
> worth doing here, as the logic would be completely contained within
> strmap_put():
>
>   if (strdup_strings)
>         FLEXPTR_ALLOC_STR(entry, key, str);
>   else {
>         entry = xmalloc(sizeof(*entry));
>         entry->key = str;
>   }
>
> And free_entries() then doesn't even have to care about strdup_strings.

Yeah, as you noted in your review of 10/10 this idea wouldn't play
well with the later mem_pool changes.

> > A couple of items of note:
> >
> >   * Similar to string-list, I have a strdup_strings setting.  However,
> >     unlike string-list, strmap_init() does not take a parameter for this
> >     setting and instead automatically sets it to 1; callers who want to
> >     control this detail need to instead call strmap_ocd_init().
>
> That seems reasonable. It could just be a parameter, but I like that you
> push people in the direction of doing the simple and safe thing, rather
> than having them wonder whether they ought to set strdup_strings or not.

Well, in my first round of the series where I did make it a parameter
you balked pretty loudly.  ;-)

> >   * I do not have a STRMAP_INIT macro.  I could possibly add one, but
> >       #define STRMAP_INIT { { NULL, cmp_str_entry, NULL, 0, 0, 0, 0, 0 }, 1 }
> >     feels a bit unwieldy and possibly error-prone in terms of future
> >     expansion of the hashmap struct.  The fact that cmp_str_entry needs to
> >     be in there prevents us from passing all zeros for the hashmap, and makes
> >     me worry that STRMAP_INIT would just be more trouble than it is worth.
>
> You can actually omit everything after cmp_str_entry, and those fields
> would all get zero-initialized. But we also allow C99 designed
> initializers these days. Coupled with the HASHMAP_INIT() I mentioned in
> the earlier email, you'd have:
>
>   #define STRMAP_INIT { \
>                 .map = HASHMAP_INIT(cmp_strmap_entry, NULL), \
>                 .strdup_strings = 1, \
>           }
>
> which seems pretty maintainable.

Makes sense; will add.

> > +static int cmp_strmap_entry(const void *hashmap_cmp_fn_data,
> > +                         const struct hashmap_entry *entry1,
> > +                         const struct hashmap_entry *entry2,
> > +                         const void *keydata)
> > +{
> > +     const struct strmap_entry *e1, *e2;
> > +
> > +     e1 = container_of(entry1, const struct strmap_entry, ent);
> > +     e2 = container_of(entry2, const struct strmap_entry, ent);
> > +     return strcmp(e1->key, e2->key);
> > +}
>
> I expected to use keydata here, but it's pretty easy to make a fake
> strmap_entry because of the use of the "key" pointer. So that makes
> sense.
>
> > +static void strmap_free_entries_(struct strmap *map, int free_util)
>
> You use the term "value" for the mapped-to value in this iteration. So
> perhaps free_values here (and in other functions) would be a better
> name?

Oops, yes, definitely.

> > +     /*
> > +      * We need to iterate over the hashmap entries and free
> > +      * e->key and e->value ourselves; hashmap has no API to
> > +      * take care of that for us.  Since we're already iterating over
> > +      * the hashmap, though, might as well free e too and avoid the need
> > +      * to make some call into the hashmap API to do that.
> > +      */
> > +     hashmap_for_each_entry(&map->map, &iter, e, ent) {
> > +             if (free_util)
> > +                     free(e->value);
> > +             if (map->strdup_strings)
> > +                     free((char*)e->key);
> > +             free(e);
> > +     }
> > +}
>
> Yep, makes sense.
>
> > +void strmap_clear(struct strmap *map, int free_util)
> > +{
> > +     strmap_free_entries_(map, free_util);
> > +     hashmap_free(&map->map);
> > +}
>
> This made me wonder about a partial_clear(), but it looks like that
> comes later.
>
> > +void *strmap_put(struct strmap *map, const char *str, void *data)
> > +{
> > +     struct strmap_entry *entry = find_strmap_entry(map, str);
> > +     void *old = NULL;
> > +
> > +     if (entry) {
> > +             old = entry->value;
> > +             entry->value = data;
>
> Here's a weird hypothetical. If strdup_strings is not set and I do:
>
>   const char *one = xstrdup("foo");
>   const char *two = xstrdup("foo");
>
>   hashmap_put(map, one, x);
>   hashmap_put(map, two, y);
>
> it's clear that the value should be pointing to "y" afterwards (and you
> return "x" so the caller can free it or whatever, good).
>
> But which key should the entry be pointing to? The old one or the new
> one? I'm trying and failing to think of a case where it would matter.
> Certainly I could add a free() to the toy above where it would, but it
> feels like a real caller would have to have pretty convoluted memory
> lifetime semantics for it to make a difference.
>
> So I'm not really asking for a particular behavior, but just bringing it
> up in case you can think of something relevant.

I'll keep mulling it over, but I likewise can't currently think of a
case where it'd matter.

>
> > +     } else {
> > +             /*
> > +              * We won't modify entry->key so it really should be const.
> > +              */
> > +             const char *key = str;
>
> The "should be" here confused me. It _is_ const. I'd probably just
> delete the comment entirely, but perhaps:
>
>   /*
>    * We'll store a const pointer. For non-duplicated strings, they belong
>    * to the caller and we received them as const in the first place. For
>    * our duplicated ones, they do point to memory we own, but they're
>    * still conceptually constant within the lifetime of an entry.
>    */
>
> Though it might make more sense in the struct definition, not here.

Either I was (mistakenly) worried about "I'm going to allocate and
copy, but during the copy it isn't actually const", or this is a
leftover artifact from some of the other iterations I tried.  Anyway,
I think this comment isn't useful; I'll just strike it.

> > +void *strmap_get(struct strmap *map, const char *str)
> > +{
> > +     struct strmap_entry *entry = find_strmap_entry(map, str);
> > +     return entry ? entry->value : NULL;
> > +}
>
> Just noting that the caller can't tell the difference between "no such
> entry" and "the entry is storing NULL". I think the simplicity offered
> by this interface makes it worth having (and being the primary one). If
> some caller really needs to tell the difference between the two, we can
> add another function later.
>
> Obviously they could use strmap_contains(), but that would mean two hash
> lookups.

Yep, addressed later by strmap_get_entry() in another patch, as you
noticed in your later review.

> > +/*
> > + * Same as strmap_init, but for those who want to control the memory management
> > + * carefully instead of using the default of strdup_strings=1.
> > + * (OCD = Obsessive Compulsive Disorder, a joke that those who use this function
> > + * are obsessing over minor details.)
> > + */
> > +void strmap_ocd_init(struct strmap *map,
> > +                  int strdup_strings);
>
> I'm not personally bothered by this name, but I wonder if some people
> may be (because they have or know somebody who actually has OCD).
>
> Perhaps strmap_init_with_options() would be a better name? It likewise
> would extend well if we want to add other non-default options later.

Doh!  That's going to push a bunch of lines past 80 characters.  Sigh...

It's probably a better name though; I'll change it.
diff mbox series

Patch

diff --git a/Makefile b/Makefile
index 95571ee3fc..777a34c01c 100644
--- a/Makefile
+++ b/Makefile
@@ -1000,6 +1000,7 @@  LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
 LIB_OBJS += streaming.o
 LIB_OBJS += string-list.o
+LIB_OBJS += strmap.o
 LIB_OBJS += strvec.o
 LIB_OBJS += sub-process.o
 LIB_OBJS += submodule-config.o
diff --git a/strmap.c b/strmap.c
new file mode 100644
index 0000000000..4b48d64274
--- /dev/null
+++ b/strmap.c
@@ -0,0 +1,102 @@ 
+#include "git-compat-util.h"
+#include "strmap.h"
+
+static int cmp_strmap_entry(const void *hashmap_cmp_fn_data,
+			    const struct hashmap_entry *entry1,
+			    const struct hashmap_entry *entry2,
+			    const void *keydata)
+{
+	const struct strmap_entry *e1, *e2;
+
+	e1 = container_of(entry1, const struct strmap_entry, ent);
+	e2 = container_of(entry2, const struct strmap_entry, ent);
+	return strcmp(e1->key, e2->key);
+}
+
+static struct strmap_entry *find_strmap_entry(struct strmap *map,
+					      const char *str)
+{
+	struct strmap_entry entry;
+	hashmap_entry_init(&entry.ent, strhash(str));
+	entry.key = str;
+	return hashmap_get_entry(&map->map, &entry, ent, NULL);
+}
+
+void strmap_init(struct strmap *map)
+{
+	strmap_ocd_init(map, 1);
+}
+
+void strmap_ocd_init(struct strmap *map,
+		     int strdup_strings)
+{
+	hashmap_init(&map->map, cmp_strmap_entry, NULL, 0);
+	map->strdup_strings = strdup_strings;
+}
+
+static void strmap_free_entries_(struct strmap *map, int free_util)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *e;
+
+	if (!map)
+		return;
+
+	/*
+	 * We need to iterate over the hashmap entries and free
+	 * e->key and e->value ourselves; hashmap has no API to
+	 * take care of that for us.  Since we're already iterating over
+	 * the hashmap, though, might as well free e too and avoid the need
+	 * to make some call into the hashmap API to do that.
+	 */
+	hashmap_for_each_entry(&map->map, &iter, e, ent) {
+		if (free_util)
+			free(e->value);
+		if (map->strdup_strings)
+			free((char*)e->key);
+		free(e);
+	}
+}
+
+void strmap_clear(struct strmap *map, int free_util)
+{
+	strmap_free_entries_(map, free_util);
+	hashmap_free(&map->map);
+}
+
+void *strmap_put(struct strmap *map, const char *str, void *data)
+{
+	struct strmap_entry *entry = find_strmap_entry(map, str);
+	void *old = NULL;
+
+	if (entry) {
+		old = entry->value;
+		entry->value = data;
+	} else {
+		/*
+		 * We won't modify entry->key so it really should be const.
+		 */
+		const char *key = str;
+
+		entry = xmalloc(sizeof(*entry));
+		hashmap_entry_init(&entry->ent, strhash(str));
+
+		if (map->strdup_strings)
+			key = xstrdup(str);
+		entry->key = key;
+		entry->value = data;
+		hashmap_add(&map->map, &entry->ent);
+	}
+	return old;
+}
+
+void *strmap_get(struct strmap *map, const char *str)
+{
+	struct strmap_entry *entry = find_strmap_entry(map, str);
+	return entry ? entry->value : NULL;
+}
+
+int strmap_contains(struct strmap *map, const char *str)
+{
+	return find_strmap_entry(map, str) != NULL;
+}
diff --git a/strmap.h b/strmap.h
new file mode 100644
index 0000000000..493d19cbc0
--- /dev/null
+++ b/strmap.h
@@ -0,0 +1,57 @@ 
+#ifndef STRMAP_H
+#define STRMAP_H
+
+#include "hashmap.h"
+
+struct strmap {
+	struct hashmap map;
+	unsigned int strdup_strings:1;
+};
+
+struct strmap_entry {
+	struct hashmap_entry ent;
+	const char *key;
+	void *value;
+};
+
+/*
+ * Initialize the members of the strmap.  Any keys added to the strmap will
+ * be strdup'ed with their memory managed by the strmap.
+ */
+void strmap_init(struct strmap *map);
+
+/*
+ * Same as strmap_init, but for those who want to control the memory management
+ * carefully instead of using the default of strdup_strings=1.
+ * (OCD = Obsessive Compulsive Disorder, a joke that those who use this function
+ * are obsessing over minor details.)
+ */
+void strmap_ocd_init(struct strmap *map,
+		     int strdup_strings);
+
+/*
+ * Remove all entries from the map, releasing any allocated resources.
+ */
+void strmap_clear(struct strmap *map, int free_values);
+
+/*
+ * Insert "str" into the map, pointing to "data".
+ *
+ * If an entry for "str" already exists, its data pointer is overwritten, and
+ * the original data pointer returned. Otherwise, returns NULL.
+ */
+void *strmap_put(struct strmap *map, const char *str, void *data);
+
+/*
+ * Return the data pointer mapped by "str", or NULL if the entry does not
+ * exist.
+ */
+void *strmap_get(struct strmap *map, const char *str);
+
+/*
+ * Return non-zero iff "str" is present in the map. This differs from
+ * strmap_get() in that it can distinguish entries with a NULL data pointer.
+ */
+int strmap_contains(struct strmap *map, const char *str);
+
+#endif /* STRMAP_H */