diff mbox series

[RFC,2/3] Make ce_compare_gitlink thread-safe

Message ID 20240305012112.1598053-4-atneya@google.com (mailing list archive)
State New
Headers show
Series Parallel submodule status | expand

Commit Message

Atneya Nair March 5, 2024, 1:21 a.m. UTC
To enable parallel update of the read cache for submodules,
ce_compare_gitlink must be thread safe (for different objects).

Remove string interning in do_config_from (called from
repo_submodule_init) and add locking around accessing the ref_store_map.

Signed-off-by: Atneya Nair <atneya@google.com>
---

Notes:
    Chasing down thread unsafe code was done using tsan.
    
    Open questions:
     - Is there any additional thread-unsafety that was missed?
     - Is it safe to strdup in do_config_from (do the filenames need static
       lifetime)? What is the performance implication (it seems small)?
     - Is the locking around ref_store_map appropriate and/or can it be made
       more granular?

 config.c | 3 ++-
 config.h | 2 +-
 refs.c   | 9 +++++++++
 3 files changed, 12 insertions(+), 2 deletions(-)

Comments

Junio C Hamano March 5, 2024, 5:08 p.m. UTC | #1
Atneya Nair <atneya@google.com> writes:

> To enable parallel update of the read cache for submodules,
> ce_compare_gitlink must be thread safe (for different objects).
>
> Remove string interning in do_config_from (called from
> repo_submodule_init) and add locking around accessing the ref_store_map.

This step does two independent things, even though they may have
dependencies, i.e., for one to be a solution for the problem it is
tackling, the other may have to be there already.  E.g., even after
calls to ce_compare_gitlink() get serialized via a mutex, it may for
some reason not work without giving each kvi.filename its own copy
[*], and if that is the case, you may need to have the "stop
interning" step in a single patch with its own justification, and
then "have mutex around ref_store calls" patch has to come after it.

    Side note: I do not know if that is the case myself.  I didn't
    write this commit, you did.  The above is just a sample to
    illustrate the expected level of depth to explain your thinking
    in the log message.

Or if these two things must happen at the same time, please explain
in the proposed log message why they have to happen in the same
commit.  The two paragraphs you wrote there don't explain that, so I
am assuming that it is not the case.

The use of strintern() comes originally from 3df8fd62 (add line
number and file name info to `config_set`, 2014-08-07) by Tanay
Abhra <tanayabh@gmail.com>, and survived a handful of changes

    809d8680 (config: pass ctx with config files, 2023-06-28)
    a798a56c (config.c: plumb the_reader through callbacks, 2023-03-28)
    c97f3ed2 (config.c: plumb config_source through static fns, 2023-03-28)

all of which were done by Glen Choo <glencbz@gmail.com>, so they may
know how safe the change on the config side would be (I still do
not understand why you'd want to do this in the first place, though,
especially if you are protecting the callsites with mutex).

I also think Emily's (who you already have on the "CC:" line) group
wants to libify the config machinery and suspect they may still be
making changes to the code, so you may want to coordinate with them
to avoid duplicated work and overlapping changes.

> Signed-off-by: Atneya Nair <atneya@google.com>
> ---
>
> Notes:
>     Chasing down thread unsafe code was done using tsan.

Very nice to know.

>  config.c | 3 ++-
>  config.h | 2 +-
>  refs.c   | 9 +++++++++
>  3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/config.c b/config.c
> index 3cfeb3d8bd..d7f73d8745 100644
> --- a/config.c
> +++ b/config.c
> @@ -1017,7 +1017,7 @@ static void kvi_from_source(struct config_source *cs,
>  			    enum config_scope scope,
>  			    struct key_value_info *out)
>  {
> -	out->filename = strintern(cs->name);
> +	out->filename = strdup(cs->name);
>  	out->origin_type = cs->origin_type;
>  	out->linenr = cs->linenr;
>  	out->scope = scope;
> @@ -1857,6 +1857,7 @@ static int do_config_from(struct config_source *top, config_fn_t fn,
>  
>  	strbuf_release(&top->value);
>  	strbuf_release(&top->var);
> +	free(kvi.filename);
>  
>  	return ret;
>  }
> diff --git a/config.h b/config.h
> index 5dba984f77..b78f1b6667 100644
> --- a/config.h
> +++ b/config.h
> @@ -118,7 +118,7 @@ struct config_options {
>  
>  /* Config source metadata for a given config key-value pair */
>  struct key_value_info {
> -	const char *filename;
> +	char *filename;
>  	int linenr;
>  	enum config_origin_type origin_type;
>  	enum config_scope scope;
> diff --git a/refs.c b/refs.c
> index c633abf284..cce8a31b22 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -2126,6 +2126,9 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
>  	size_t len;
>  	struct repository *subrepo;
>  
> +	// TODO is this locking tolerable, and/or can we get any finer
> +	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> +
>  	if (!submodule)
>  		return NULL;
>  
> @@ -2139,7 +2142,9 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
>  		/* We need to strip off one or more trailing slashes */
>  		submodule = to_free = xmemdupz(submodule, len);
>  
> +	pthread_mutex_lock(&lock);
>  	refs = lookup_ref_store_map(&submodule_ref_stores, submodule);
> +	pthread_mutex_unlock(&lock);
>  	if (refs)
>  		goto done;
>  
> @@ -2162,10 +2167,14 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
>  		free(subrepo);
>  		goto done;
>  	}
> +
> +	pthread_mutex_lock(&lock);
> +	// TODO maybe lock this separately
>  	refs = ref_store_init(subrepo, submodule_sb.buf,
>  			      REF_STORE_READ | REF_STORE_ODB);
>  	register_ref_store_map(&submodule_ref_stores, "submodule",
>  			       refs, submodule);
> +	pthread_mutex_unlock(&lock);
>  
>  done:
>  	strbuf_release(&submodule_sb);
Junio C Hamano March 5, 2024, 6:53 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> The use of strintern() comes originally from 3df8fd62 ...
> ..., so they may
> know how safe the change on the config side would be (I still do
> not understand why you'd want to do this in the first place, though,
> especially if you are protecting the callsites with mutex).

The risks of turning code that uses strintern() to use strdup() are

 * you will leak the allocated string unless you explicitly free the
   string you now own.

 * you may consume too much memory if you are creating too many
   copies of the same string (e.g. if you need filename for each
   line in a file in an application, the memory consumption can
   become 1000-fold).

 * the code may be taking advantage of the fact that two such
   strings can be compared for (in)equality simply by comparing
   their addresses, which you would need to adjust to use !strcmp()
   and the like.

I just checked to make sure that the last one is not the case for
our codebase, and you said you didn't see the second one is the case,
so the change may be a safe one to make.

One more thing.  Do not use strdup() without checking its return
value for failure.  It would be an easy fix to use xstrdup() instead.

Thanks.

>> diff --git a/config.c b/config.c
>> index 3cfeb3d8bd..d7f73d8745 100644
>> --- a/config.c
>> +++ b/config.c
>> @@ -1017,7 +1017,7 @@ static void kvi_from_source(struct config_source *cs,
>>  			    enum config_scope scope,
>>  			    struct key_value_info *out)
>>  {
>> -	out->filename = strintern(cs->name);
>> +	out->filename = strdup(cs->name);
>>  	out->origin_type = cs->origin_type;
>>  	out->linenr = cs->linenr;
>>  	out->scope = scope;
>> @@ -1857,6 +1857,7 @@ static int do_config_from(struct config_source *top, config_fn_t fn,
>>  
>>  	strbuf_release(&top->value);
>>  	strbuf_release(&top->var);
>> +	free(kvi.filename);
>>  
>>  	return ret;
>>  }
Jeff King March 6, 2024, 1:23 a.m. UTC | #3
On Tue, Mar 05, 2024 at 10:53:02AM -0800, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > The use of strintern() comes originally from 3df8fd62 ...
> > ..., so they may
> > know how safe the change on the config side would be (I still do
> > not understand why you'd want to do this in the first place, though,
> > especially if you are protecting the callsites with mutex).
> 
> The risks of turning code that uses strintern() to use strdup() are
> 
>  * you will leak the allocated string unless you explicitly free the
>    string you now own.
> 
>  * you may consume too much memory if you are creating too many
>    copies of the same string (e.g. if you need filename for each
>    line in a file in an application, the memory consumption can
>    become 1000-fold).
> 
>  * the code may be taking advantage of the fact that two such
>    strings can be compared for (in)equality simply by comparing
>    their addresses, which you would need to adjust to use !strcmp()
>    and the like.

There is one more, I think: if you _do_ free the allocated string to
avoid the leak you mention, then some other code which was relying on
the lifetime of that string to be effectively infinite will now have a
user-after-free.

And I think that may be the case here. The "kvi" struct itself is local
to do_config_from(). But when we load the caching configset, we do so in
configset_add_value() which makes a shallow copy of the kvi struct. And
then that shallow copy may live on and be accessed with further calls to
git_config().

So doing just this:

diff --git a/config.c b/config.c
index 3cfeb3d8bd..2f6c83ffe7 100644
--- a/config.c
+++ b/config.c
@@ -1017,7 +1017,7 @@ static void kvi_from_source(struct config_source *cs,
 			    enum config_scope scope,
 			    struct key_value_info *out)
 {
-	out->filename = strintern(cs->name);
+	out->filename = xstrdup(cs->name);
 	out->origin_type = cs->origin_type;
 	out->linenr = cs->linenr;
 	out->scope = scope;
@@ -1855,6 +1855,7 @@ static int do_config_from(struct config_source *top, config_fn_t fn,
 
 	ret = git_parse_source(top, fn, &kvi, data, opts);
 
+	free((char *)kvi.filename);
 	strbuf_release(&top->value);
 	strbuf_release(&top->var);
 

will cause t4013.199 (among others) to fail when built with
SANITIZE=address, as it detects the user-after-free. I think you'd need
this on top:

diff --git a/config.c b/config.c
index 2f6c83ffe7..9854ca002d 100644
--- a/config.c
+++ b/config.c
@@ -2262,6 +2262,7 @@ static int configset_add_value(const struct key_value_info *kvi_p,
 	l_item->value_index = e->value_list.nr - 1;
 
 	*kv_info = *kvi_p;
+	kv_info->filename = xstrdup_or_null(kvi_p->filename); /* deep copy! */
 	si->util = kv_info;
 
 	return 0;

though probably an actual kvi_copy() function would be less horrible.

A few other things to note, looking at this code:

  - isn't kvi->path in the same boat? We do not duplicate it at all, so
    it seems like the shallow copy made in the configset could cause a
    user-after-free.

  - the "fix" I showed above hits your point 2: now we are making a lot
    more copies of that string. I will note that we're already making a
    lot of copies of the kvi struct in the first place, so unless you
    have really long pathnames, it probably isn't a big difference.

    But it possibly could make sense to have the configset own a single
    duplicate string, and then let the kvi structs it holds point to
    that string. But IMHO all of this should be details of the configset
    code, and the main config-iteration code should not have to worry
    about this at all. I.e., I think kvi_from_source() should not be
    duplicating anything in the first place.

-Peff
Junio C Hamano March 6, 2024, 1:58 a.m. UTC | #4
Jeff King <peff@peff.net> writes:

> There is one more, I think: if you _do_ free the allocated string to
> avoid the leak you mention, then some other code which was relying on
> the lifetime of that string to be effectively infinite will now have a
> user-after-free.

Ah, yes, you're right.  I completely forgot about that shallow copy.

> A few other things to note, looking at this code:
>
>   - isn't kvi->path in the same boat? We do not duplicate it at all, so
>     it seems like the shallow copy made in the configset could cause a
>     user-after-free.
>
>   - the "fix" I showed above hits your point 2: now we are making a lot
>     more copies of that string. I will note that we're already making a
>     lot of copies of the kvi struct in the first place, so unless you
>     have really long pathnames, it probably isn't a big difference.
>
>     But it possibly could make sense to have the configset own a single
>     duplicate string, and then let the kvi structs it holds point to
>     that string. But IMHO all of this should be details of the configset
>     code, and the main config-iteration code should not have to worry
>     about this at all. I.e., I think kvi_from_source() should not be
>     duplicating anything in the first place.

Thanks for a detailed write-up.
Atneya Nair March 12, 2024, 10:13 p.m. UTC | #5
On Tue, Mar 5, 2024 at 5:23 PM Jeff King <peff@peff.net> wrote:

> And I think that may be the case here. The "kvi" struct itself is local
> to do_config_from(). But when we load the caching configset, we do so in
> configset_add_value() which makes a shallow copy of the kvi struct. And
> then that shallow copy may live on and be accessed with further calls to
> git_config().

I missed the shallow copy of kvi in do_config_from. Unfortunately, we also
pass this back into a callback function, and although it is often not used,
changing the code to be more precise about string ownership here seems
non-trivial.

>
> though probably an actual kvi_copy() function would be less horrible.
>
> A few other things to note, looking at this code:
>
>   - isn't kvi->path in the same boat? We do not duplicate it at all, so
>     it seems like the shallow copy made in the configset could cause a
>     user-after-free.

The situation with path seems to indicate that indeed ownership is correctly
handled up the call-stack (i.e. the config_source owns the file/path strings),
and these strings are valid for the duration of the shallow copies through the
call-graph. But, the fact that filename in particular is interned may indicate
that the behavior of leaking the string to static lifetime is used by
a caller at
some point.

>     But it possibly could make sense to have the configset own a single
>     duplicate string, and then let the kvi structs it holds point to
>     that string. But IMHO all of this should be details of the configset
>     code, and the main config-iteration code should not have to worry
>     about this at all. I.e., I think kvi_from_source() should not be
>     duplicating anything in the first place.
>
> -Peff

Based on this discussion, I think the cleanest solution is a lock on
the strintern hashmap. I originally wanted to avoid adding locking, since most
call-paths are single threaded, and I thought we were only using the
string locally.
However, copying the string is more expensive and potentially intrusive to many
call-sites, and it is much more difficult to reason about the correctness of the
change, so I think adding a lock is preferable. I'll switch to that in
v2 along with the
more descriptive commit message.
Atneya Nair March 12, 2024, 10:15 p.m. UTC | #6
On Tue, Mar 5, 2024 at 9:08 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Atneya Nair <atneya@google.com> writes:
>
> > To enable parallel update of the read cache for submodules,
> > ce_compare_gitlink must be thread safe (for different objects).
> >
> > Remove string interning in do_config_from (called from
> > repo_submodule_init) and add locking around accessing the ref_store_map.
>
> This step does two independent things, even though they may have
> dependencies, i.e., for one to be a solution for the problem it is
> tackling, the other may have to be there already.  E.g., even after
> calls to ce_compare_gitlink() get serialized via a mutex, it may for
> some reason not work without giving each kvi.filename its own copy
> [*], and if that is the case, you may need to have the "stop
> interning" step in a single patch with its own justification, and
> then "have mutex around ref_store calls" patch has to come after it.
>

What is the appropriate way to split an intermediate patch into two? Simply
adding the additional commit as 3 of 4 would cause a mismatch between patch-set
versions? Would that cause any issues?
Junio C Hamano March 13, 2024, 5:51 p.m. UTC | #7
Atneya Nair <atneya@google.com> writes:

> What is the appropriate way to split an intermediate patch into two? Simply
> adding the additional commit as 3 of 4 would cause a mismatch between patch-set
> versions? Would that cause any issues?

If you have three patch series originally and the second one is too
big, then you'd end up with four patch series whose second and third
corresponds to the second patch of the original series, and the last
patch of the updated series that is numbered 4 will correspond to
the third one from the original.  That is perfectly normal and you
can see the correspondence with help from tools like range-diff.
diff mbox series

Patch

diff --git a/config.c b/config.c
index 3cfeb3d8bd..d7f73d8745 100644
--- a/config.c
+++ b/config.c
@@ -1017,7 +1017,7 @@  static void kvi_from_source(struct config_source *cs,
 			    enum config_scope scope,
 			    struct key_value_info *out)
 {
-	out->filename = strintern(cs->name);
+	out->filename = strdup(cs->name);
 	out->origin_type = cs->origin_type;
 	out->linenr = cs->linenr;
 	out->scope = scope;
@@ -1857,6 +1857,7 @@  static int do_config_from(struct config_source *top, config_fn_t fn,
 
 	strbuf_release(&top->value);
 	strbuf_release(&top->var);
+	free(kvi.filename);
 
 	return ret;
 }
diff --git a/config.h b/config.h
index 5dba984f77..b78f1b6667 100644
--- a/config.h
+++ b/config.h
@@ -118,7 +118,7 @@  struct config_options {
 
 /* Config source metadata for a given config key-value pair */
 struct key_value_info {
-	const char *filename;
+	char *filename;
 	int linenr;
 	enum config_origin_type origin_type;
 	enum config_scope scope;
diff --git a/refs.c b/refs.c
index c633abf284..cce8a31b22 100644
--- a/refs.c
+++ b/refs.c
@@ -2126,6 +2126,9 @@  struct ref_store *get_submodule_ref_store(const char *submodule)
 	size_t len;
 	struct repository *subrepo;
 
+	// TODO is this locking tolerable, and/or can we get any finer
+	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
 	if (!submodule)
 		return NULL;
 
@@ -2139,7 +2142,9 @@  struct ref_store *get_submodule_ref_store(const char *submodule)
 		/* We need to strip off one or more trailing slashes */
 		submodule = to_free = xmemdupz(submodule, len);
 
+	pthread_mutex_lock(&lock);
 	refs = lookup_ref_store_map(&submodule_ref_stores, submodule);
+	pthread_mutex_unlock(&lock);
 	if (refs)
 		goto done;
 
@@ -2162,10 +2167,14 @@  struct ref_store *get_submodule_ref_store(const char *submodule)
 		free(subrepo);
 		goto done;
 	}
+
+	pthread_mutex_lock(&lock);
+	// TODO maybe lock this separately
 	refs = ref_store_init(subrepo, submodule_sb.buf,
 			      REF_STORE_READ | REF_STORE_ODB);
 	register_ref_store_map(&submodule_ref_stores, "submodule",
 			       refs, submodule);
+	pthread_mutex_unlock(&lock);
 
 done:
 	strbuf_release(&submodule_sb);