diff mbox series

revision: trace topo-walk statistics

Message ID pull.828.git.1609302714183.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 90b666da601a5ccf9bbc3f2282fb166bf3fdade7
Headers show
Series revision: trace topo-walk statistics | expand

Commit Message

Derrick Stolee Dec. 30, 2020, 4:31 a.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

We trace statistics about the effectiveness of changed-path Bloom
filters since 42e50e78 (revision.c: add trace2 stats around Bloom
filter usage, 2020-04-06). Add similar tracing for the topo-walk
algorithm that uses generation numbers to limit the walk size.

This information can help investigate and describe benefits to
heuristics and other changes.

The information that is printed is in JSON format and can be formatted
nicely to present as follows:

    {
	"count_explort_walked":2603,
	"count_indegree_walked":2603,
	"count_topo_walked":473
    }

Each of these values count the number of commits are visited by each of
the three "stages" of the topo-walk as detailed in b4542418 (revision.c:
generation-based topo-order algorithm, 2018-11-01).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    revision: trace topo-walk statistics
    
    Here is a patch I thought useful when investigating the performance
    implications of Abhishek's series [1]
    
    [1]
    https://lore.kernel.org/git/pull.676.v5.git.1609154168.gitgitgadget@gmail.com/

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-828%2Fderrickstolee%2Ftrace2-topo-walk-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-828/derrickstolee/trace2-topo-walk-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/828

 revision.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)


base-commit: 71ca53e8125e36efbda17293c50027d31681a41f

Comments

Junio C Hamano Jan. 7, 2021, 1:37 a.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/revision.c b/revision.c
> index 9dff845bed6..1bb590ece78 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3308,6 +3308,26 @@ struct topo_walk_info {
>  	struct author_date_slab author_date;
>  };
>  
> +static int topo_walk_atexit_registered;
> +static unsigned int count_explore_walked;
> +static unsigned int count_indegree_walked;
> +static unsigned int count_topo_walked;

The revision walk machinery is designed to be callable more than
once during the lifetime of a process.  These make readers wonder
if they should be defined in "struct rev_info" to allow stats
collected per traversal.
Derrick Stolee Jan. 7, 2021, 2:04 a.m. UTC | #2
On 1/6/2021 8:37 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> diff --git a/revision.c b/revision.c
>> index 9dff845bed6..1bb590ece78 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -3308,6 +3308,26 @@ struct topo_walk_info {
>>  	struct author_date_slab author_date;
>>  };
>>  
>> +static int topo_walk_atexit_registered;
>> +static unsigned int count_explore_walked;
>> +static unsigned int count_indegree_walked;
>> +static unsigned int count_topo_walked;
> 
> The revision walk machinery is designed to be callable more than
> once during the lifetime of a process.  These make readers wonder
> if they should be defined in "struct rev_info" to allow stats
> collected per traversal.

That's possible, but the use of an at-exit method means we only
report one set of statistics and the 'struct rev_info' might
be defunct.

It does limit how useful the statistics can be when there are
multiple 'struct rev_info's in use, but we also cannot trust
that the rev_infos are being cleaned up properly at the end
of the process to trigger the stats logging.

Of course, maybe that _is_ something we could guarantee, or
rather _should_ guarantee by patching any leaks. Seems like
a lot of work when these aggregate statistics will be
effective enough. But maybe I'm judging the potential work
too harshly?

Thanks,
-Stolee
Junio C Hamano Jan. 7, 2021, 2:29 a.m. UTC | #3
Derrick Stolee <stolee@gmail.com> writes:

> On 1/6/2021 8:37 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> diff --git a/revision.c b/revision.c
>>> index 9dff845bed6..1bb590ece78 100644
>>> --- a/revision.c
>>> +++ b/revision.c
>>> @@ -3308,6 +3308,26 @@ struct topo_walk_info {
>>>  	struct author_date_slab author_date;
>>>  };
>>>  
>>> +static int topo_walk_atexit_registered;
>>> +static unsigned int count_explore_walked;
>>> +static unsigned int count_indegree_walked;
>>> +static unsigned int count_topo_walked;
>> 
>> The revision walk machinery is designed to be callable more than
>> once during the lifetime of a process.  These make readers wonder
>> if they should be defined in "struct rev_info" to allow stats
>> collected per traversal.
>
> That's possible, but the use of an at-exit method means we only
> report one set of statistics and the 'struct rev_info' might
> be defunct.

Ah, sorry for the noise.  Even after making multiple traversal we
want to report the aggregate.

> It does limit how useful the statistics can be when there are
> multiple 'struct rev_info's in use, but we also cannot trust
> that the rev_infos are being cleaned up properly at the end
> of the process to trigger the stats logging.
>
> Of course, maybe that _is_ something we could guarantee, or
> rather _should_ guarantee by patching any leaks. Seems like
> a lot of work when these aggregate statistics will be
> effective enough. But maybe I'm judging the potential work
> too harshly?

But different subsystems would have different "per-invocation"
structure (e.g. "diff" uses "struct diff_options") and some may not
even have an appropriate structure to hang these stats on.  We may
want to design a more general mechanism that can be used to annotate
the subsystems uniformly.  While that could be a worthy longer term
goal, it certainly does not have to be part of this single-patch
topic, I would think.
Derrick Stolee Jan. 7, 2021, 11:08 a.m. UTC | #4
On 1/6/2021 9:29 PM, Junio C Hamano wrote:
> But different subsystems would have different "per-invocation"
> structure (e.g. "diff" uses "struct diff_options") and some may not
> even have an appropriate structure to hang these stats on.  We may
> want to design a more general mechanism that can be used to annotate
> the subsystems uniformly.  While that could be a worthy longer term
> goal, it certainly does not have to be part of this single-patch
> topic, I would think.
 
I will give this further thought.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/revision.c b/revision.c
index 9dff845bed6..1bb590ece78 100644
--- a/revision.c
+++ b/revision.c
@@ -3308,6 +3308,26 @@  struct topo_walk_info {
 	struct author_date_slab author_date;
 };
 
+static int topo_walk_atexit_registered;
+static unsigned int count_explore_walked;
+static unsigned int count_indegree_walked;
+static unsigned int count_topo_walked;
+
+static void trace2_topo_walk_statistics_atexit(void)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "count_explore_walked", count_explore_walked);
+	jw_object_intmax(&jw, "count_indegree_walked", count_indegree_walked);
+	jw_object_intmax(&jw, "count_topo_walked", count_topo_walked);
+	jw_end(&jw);
+
+	trace2_data_json("topo_walk", the_repository, "statistics", &jw);
+
+	jw_release(&jw);
+}
+
 static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
 {
 	if (c->object.flags & flag)
@@ -3329,6 +3349,8 @@  static void explore_walk_step(struct rev_info *revs)
 	if (repo_parse_commit_gently(revs->repo, c, 1) < 0)
 		return;
 
+	count_explore_walked++;
+
 	if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
 		record_author_date(&info->author_date, c);
 
@@ -3367,6 +3389,8 @@  static void indegree_walk_step(struct rev_info *revs)
 	if (repo_parse_commit_gently(revs->repo, c, 1) < 0)
 		return;
 
+	count_indegree_walked++;
+
 	explore_to_depth(revs, commit_graph_generation(c));
 
 	for (p = c->parents; p; p = p->next) {
@@ -3476,6 +3500,11 @@  static void init_topo_walk(struct rev_info *revs)
 	 */
 	if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
 		prio_queue_reverse(&info->topo_queue);
+
+	if (trace2_is_enabled() && !topo_walk_atexit_registered) {
+		atexit(trace2_topo_walk_statistics_atexit);
+		topo_walk_atexit_registered = 1;
+	}
 }
 
 static struct commit *next_topo_commit(struct rev_info *revs)
@@ -3502,6 +3531,8 @@  static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 			    oid_to_hex(&commit->object.oid));
 	}
 
+	count_topo_walked++;
+
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;