diff mbox series

[2/5] strmap: new utility functions

Message ID a86fd5fdcc47baf85046eb07257f4db9f9498084.1598035949.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Add struct strmap and associated utility functions | expand

Commit Message

Philippe Blain via GitGitGadget Aug. 21, 2020, 6:52 p.m. UTC
From: Elijah Newren <newren@gmail.com>

Add strmap as a new struct and associated utility functions,
specifically for hashmaps that map strings to some value.  The API is
taken directly from Peff's proposal at
https://lore.kernel.org/git/20180906191203.GA26184@sigill.intra.peff.net/

Peff only included the header, not the implementation, so it isn't clear what
the structure was he was going to use for the hash entries.  Instead of having
my str_entry struct have three subfields (the hashmap_entry, the string, and
the void* value), I made it only have two -- the hashmap_entry and a
string_list_item, for two reasons:

  1) a strmap is often the data structure we want where string_list has
     been used in the past.  Using the same building block for
     individual entries in both makes it easier to adopt and reuse
     parts of the string_list API in strmap.

  2) In some cases, after doing lots of other work, I want to be able
     to iterate over the items in my strmap in sorted order.  hashmap
     obviously doesn't support that, but I wanted to be able to export
     the strmap to a string_list easily and then use its functions.
     (Note: I do not need the data structure to both be sorted and have
     efficient lookup at all times.  If I did, I might use a B-tree
     instead, as suggested by brian in response to Peff in the thread
     noted above.  In my case, most strmaps will never need sorting, but
     in one special case at the very end of a bunch of other work I want
     to iterate over the items in sorted order without doing any more
     lookups afterward.)

Also, I removed the STRMAP_INIT macro, since it cannot be used to
correctly initialize a strmap; the underlying hashmap needs a call to
hashmap_init() to allocate the hash table first.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Makefile |  1 +
 strmap.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 strmap.h | 47 ++++++++++++++++++++++++++++++++
 3 files changed, 129 insertions(+)
 create mode 100644 strmap.c
 create mode 100644 strmap.h

Comments

Jeff King Aug. 21, 2020, 7:48 p.m. UTC | #1
On Fri, Aug 21, 2020 at 06:52:26PM +0000, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
> 
> Add strmap as a new struct and associated utility functions,
> specifically for hashmaps that map strings to some value.  The API is
> taken directly from Peff's proposal at
> https://lore.kernel.org/git/20180906191203.GA26184@sigill.intra.peff.net/

Uh oh. You are encouraging me in the belief that I can send half-baked
ideas to the list and somebody will come along and implement them for
me. ;)

> Peff only included the header, not the implementation, so it isn't clear what
> the structure was he was going to use for the hash entries.  Instead of having
> my str_entry struct have three subfields (the hashmap_entry, the string, and
> the void* value), I made it only have two -- the hashmap_entry and a
> string_list_item, for two reasons:

I'd probably have done:

  struct strmap_entry {
	struct hashmap_entry ent;
	void *value;
	char key[FLEX_ALLOC];
  };

That saves 8 bytes (plus malloc overhead)per item, plus avoids an extra
pointer-chase for each item we consider when looking up.

>   1) a strmap is often the data structure we want where string_list has
>      been used in the past.  Using the same building block for
>      individual entries in both makes it easier to adopt and reuse
>      parts of the string_list API in strmap.

I can see that there might be some value in being able to interchange
the items for code that's expecting a string_list_item. But I have to
wonder if the potential for confusion is worth it. I.e., should that
code really be expecting a raw string pointer (possibly with a separate
void pointer, or even better an actual typed pointer).

I'll keep an eye out as I read the rest of the series for code which
uses this.

>   2) In some cases, after doing lots of other work, I want to be able
>      to iterate over the items in my strmap in sorted order.  hashmap
>      obviously doesn't support that, but I wanted to be able to export
>      the strmap to a string_list easily and then use its functions.
>      (Note: I do not need the data structure to both be sorted and have
>      efficient lookup at all times.  If I did, I might use a B-tree
>      instead, as suggested by brian in response to Peff in the thread
>      noted above.  In my case, most strmaps will never need sorting, but
>      in one special case at the very end of a bunch of other work I want
>      to iterate over the items in sorted order without doing any more
>      lookups afterward.)

Hmm. Likewise, I'll keep an eye open for how this works in practice. I
do suspect that a B-tree might be a better solution here, but
implementing it is non-trivial, and most callers don't care about this
property.

If the interim solution is to just dump it to a string_list and sort
that, that's really not that bad, assuming it just happens once after
we've added everything. I'm not sure there's that big a benefit to using
string_list_item internally, since presumably that conversion needs to
write a whole new array of string_list_items anyway.

> Also, I removed the STRMAP_INIT macro, since it cannot be used to
> correctly initialize a strmap; the underlying hashmap needs a call to
> hashmap_init() to allocate the hash table first.

Since access to the underlying hashmap happens through strmap functions,
they could lazily initialize it. That's how oidmap works.

> diff --git a/strmap.c b/strmap.c
> new file mode 100644
> index 0000000000..1c9fdb3b1e
> --- /dev/null
> +++ b/strmap.c
> @@ -0,0 +1,81 @@
> +#include "git-compat-util.h"
> +#include "strmap.h"
> +
> +static int cmp_str_entry(const void *hashmap_cmp_fn_data,
> +			 const struct hashmap_entry *entry1,
> +			 const struct hashmap_entry *entry2,
> +			 const void *keydata)
> +{
> +	const struct str_entry *e1, *e2;
> +
> +	e1 = container_of(entry1, const struct str_entry, ent);
> +	e2 = container_of(entry2, const struct str_entry, ent);
> +	return strcmp(e1->item.string, e2->item.string);
> +}

If you do go the FLEX_ALLOC route, obviously lookups won't want to
allocate a str_entry struct for the lookup key. You'd use keydata there
(and prefer it over looking at entry2 at all). See remotes_hash_cmp()
for an example.

> +static struct str_entry *find_str_entry(struct strmap *map,
> +					const char *str)
> +{
> +	struct str_entry entry;
> +	hashmap_entry_init(&entry.ent, strhash(str));
> +	entry.item.string = (char *)str;
> +	return hashmap_get_entry(&map->map, &entry, ent, NULL);
> +}

Casting away constness here is awkward. It could likewise benefit from
using keydata, so you wouldn't need to create a temporary
string_list_item (which is where the non-constness comes from).

> +void strmap_clear(struct strmap *map, int free_util)
> +{
> +	struct hashmap_iter iter;
> +	struct str_entry *e;
> +
> +	if (!map)
> +		return;

In a lazy-init world, this becomes:

  if (!map || !map->map.table)

Of course it would be better still if the hashmap code learned to do the
lazy-init stuff itself.

> +	hashmap_for_each_entry(&map->map, &iter, e, ent /* member name */) {
> +		free(e->item.string);
> +		if (free_util)
> +			free(e->item.util);
> +	}
> +	hashmap_free_entries(&map->map, struct str_entry, ent);

With a flex-alloc struct, you can avoid the extra string free. But I
guess you still wouldn't avoid the loop if you want to support
free_entries().

I wonder if it would make the API simpler if the struct knew whether it
owned the void pointer values or not. Then you'd do:

  struct strmap foo = { .free_values = 1 };
  ...
  strmap_put(&foo, "key", value);
  ...
  strmap_clear(&foo);

and wouldn't have to remember to do the right thing at clear-time. It is
a little less flexible (e.g., if you transfer ownership after a certain
point in the code), but I wonder if any callers actually need that (and
they could always set the free_values flag then).

> +/*
> + * Insert "str" into the map, pointing to "data". A copy of "str" is made, so
> + * it does not need to persist after the this function is called.
> + *
> + * If an entry for "str" already exists, its data pointer is overwritten, and
> + * the original data pointer returned. Otherwise, returns NULL.
> + */
> +void *strmap_put(struct strmap *map, const char *str, void *data)

Minor, but IMHO we should avoid copying the docstrings to the
implementation, since it gives two places that people have to remember
to update if the API changes.

> +void *strmap_put(struct strmap *map, const char *str, void *data)
> +{
> +	struct str_entry *entry = find_str_entry(map, str);
> +	void *old = NULL;

In a lazy-init world, this is:

  if (!map->map.table) {
	strmap_init(map);
	entry = NULL;
  } else {
        entry = find_str_entry(map, str);
  }

(or just call find_str_entry() in both cases and let it realize there's
nothing to find).

> +	if (entry) {
> +		old = entry->item.util;
> +		entry->item.util = data;
> +	} else {
> +		entry = xmalloc(sizeof(*entry));
> +		hashmap_entry_init(&entry->ent, strhash(str));
> +		entry->item.string = strdup(str);
> +		entry->item.util = data;
> +		hashmap_add(&map->map, &entry->ent);
> +	}

And in a flex-alloc world, this second block is:

  FLEX_ALLOC_STR(entry, key, str);
  hashmap_entry_init(&entry->ent, strhash(str));
  entry->value = data;
  hashmap_add(&map->map, &entry->ent);

> +void *strmap_get(struct strmap *map, const char *str)
> +{
> +	struct str_entry *entry = find_str_entry(map, str);
> +	return entry ? entry->item.util : NULL;
> +}

In a lazy world, this is:

  if (!map->map.table)
          return NULL;

> +int strmap_contains(struct strmap *map, const char *str)
> +{
> +	return find_str_entry(map, str) != NULL;
> +}

And likewise:

  if (!map->map.table)
          return NULL;

It might actually be easier to stick that in find_str_entry().

The rest of it all looked good to me.

-Peff
diff mbox series

Patch

diff --git a/Makefile b/Makefile
index 65f8cfb236..0da15a9ee5 100644
--- a/Makefile
+++ b/Makefile
@@ -988,6 +988,7 @@  LIB_OBJS += strbuf.o
 LIB_OBJS += strvec.o
 LIB_OBJS += streaming.o
 LIB_OBJS += string-list.o
+LIB_OBJS += strmap.o
 LIB_OBJS += sub-process.o
 LIB_OBJS += submodule-config.o
 LIB_OBJS += submodule.o
diff --git a/strmap.c b/strmap.c
new file mode 100644
index 0000000000..1c9fdb3b1e
--- /dev/null
+++ b/strmap.c
@@ -0,0 +1,81 @@ 
+#include "git-compat-util.h"
+#include "strmap.h"
+
+static int cmp_str_entry(const void *hashmap_cmp_fn_data,
+			 const struct hashmap_entry *entry1,
+			 const struct hashmap_entry *entry2,
+			 const void *keydata)
+{
+	const struct str_entry *e1, *e2;
+
+	e1 = container_of(entry1, const struct str_entry, ent);
+	e2 = container_of(entry2, const struct str_entry, ent);
+	return strcmp(e1->item.string, e2->item.string);
+}
+
+static struct str_entry *find_str_entry(struct strmap *map,
+					const char *str)
+{
+	struct str_entry entry;
+	hashmap_entry_init(&entry.ent, strhash(str));
+	entry.item.string = (char *)str;
+	return hashmap_get_entry(&map->map, &entry, ent, NULL);
+}
+
+void strmap_init(struct strmap *map)
+{
+	hashmap_init(&map->map, cmp_str_entry, NULL, 0);
+}
+
+void strmap_clear(struct strmap *map, int free_util)
+{
+	struct hashmap_iter iter;
+	struct str_entry *e;
+
+	if (!map)
+		return;
+
+	hashmap_for_each_entry(&map->map, &iter, e, ent /* member name */) {
+		free(e->item.string);
+		if (free_util)
+			free(e->item.util);
+	}
+	hashmap_free_entries(&map->map, struct str_entry, ent);
+	strmap_init(map);
+}
+
+/*
+ * Insert "str" into the map, pointing to "data". A copy of "str" is made, so
+ * it does not need to persist after the this function is called.
+ *
+ * If an entry for "str" already exists, its data pointer is overwritten, and
+ * the original data pointer returned. Otherwise, returns NULL.
+ */
+void *strmap_put(struct strmap *map, const char *str, void *data)
+{
+	struct str_entry *entry = find_str_entry(map, str);
+	void *old = NULL;
+
+	if (entry) {
+		old = entry->item.util;
+		entry->item.util = data;
+	} else {
+		entry = xmalloc(sizeof(*entry));
+		hashmap_entry_init(&entry->ent, strhash(str));
+		entry->item.string = strdup(str);
+		entry->item.util = data;
+		hashmap_add(&map->map, &entry->ent);
+	}
+	return old;
+}
+
+void *strmap_get(struct strmap *map, const char *str)
+{
+	struct str_entry *entry = find_str_entry(map, str);
+	return entry ? entry->item.util : NULL;
+}
+
+int strmap_contains(struct strmap *map, const char *str)
+{
+	return find_str_entry(map, str) != NULL;
+}
diff --git a/strmap.h b/strmap.h
new file mode 100644
index 0000000000..eb5807f6fa
--- /dev/null
+++ b/strmap.h
@@ -0,0 +1,47 @@ 
+#ifndef STRMAP_H
+#define STRMAP_H
+
+#include "hashmap.h"
+#include "string-list.h"
+
+struct strmap {
+	struct hashmap map;
+};
+
+struct str_entry {
+	struct hashmap_entry ent;
+	struct string_list_item item;
+};
+
+/*
+ * Initialize an empty strmap
+ */
+void strmap_init(struct strmap *map);
+
+/*
+ * Remove all entries from the map, releasing any allocated resources.
+ */
+void strmap_clear(struct strmap *map, int free_values);
+
+/*
+ * Insert "str" into the map, pointing to "data". A copy of "str" is made, so
+ * it does not need to persist after the this function is called.
+ *
+ * If an entry for "str" already exists, its data pointer is overwritten, and
+ * the original data pointer returned. Otherwise, returns NULL.
+ */
+void *strmap_put(struct strmap *map, const char *str, void *data);
+
+/*
+ * Return the data pointer mapped by "str", or NULL if the entry does not
+ * exist.
+ */
+void *strmap_get(struct strmap *map, const char *str);
+
+/*
+ * Return non-zero iff "str" is present in the map. This differs from
+ * strmap_get() in that it can distinguish entries with a NULL data pointer.
+ */
+int strmap_contains(struct strmap *map, const char *str);
+
+#endif /* STRMAP_H */