diff mbox series

[v4] setlocalversion: work around "git describe" performance

Message ID 20241122150037.1085800-1-linux@rasmusvillemoes.dk (mailing list archive)
State New
Headers show
Series [v4] setlocalversion: work around "git describe" performance | expand

Commit Message

Rasmus Villemoes Nov. 22, 2024, 3 p.m. UTC
Contrary to expectations, passing a single candidate tag to "git
describe" is slower than not passing any --match options.

  $ time git describe --debug
  ...
  traversed 10619 commits
  ...
  v6.12-rc5-63-g0fc810ae3ae1

  real    0m0.169s

  $ time git describe --match=v6.12-rc5 --debug
  ...
  traversed 1310024 commits
  v6.12-rc5-63-g0fc810ae3ae1

  real    0m1.281s

In fact, the --debug output shows that git traverses all or most of
history. For some repositories and/or git versions, those 1.3s are
actually 10-15 seconds.

This has been acknowledged as a performance bug in git [1], and a fix
is on its way [2]. However, no solution is yet in git.git, and even
when one lands, it will take quite a while before it finds its way to
a release and for $random_kernel_developer to pick that up.

So rewrite the logic to use plumbing commands. For each of the
candidate values of $tag, we ask: (1) is $tag even an annotated
tag? (2) Is it eligible to describe HEAD, i.e. an ancestor of
HEAD? (3) If so, how many commits are in $tag..HEAD?

I have tested that this produces the same output as the current script
for ~700 random commits between v6.9..v6.10. For those 700 commits,
and in my git repo, the 'make -s kernelrelease' command is on average
~4 times faster with this patch applied (geometric mean of ratios).

For the commit mentioned in Josh's original report [3], the
time-consuming part of setlocalversion goes from

$ time git describe --match=v6.12-rc5 c1e939a21eb1
v6.12-rc5-44-gc1e939a21eb1

real    0m1.210s

to

$ time git rev-list --count --left-right v6.12-rc5..c1e939a21eb1
0       44

real    0m0.037s

[1] https://lore.kernel.org/git/20241101113910.GA2301440@coredump.intra.peff.net/
[2] https://lore.kernel.org/git/20241106192236.GC880133@coredump.intra.peff.net/
[3] https://lore.kernel.org/lkml/309549cafdcfe50c4fceac3263220cc3d8b109b2.1730337435.git.jpoimboe@kernel.org/

Reported-by: Sean Christopherson <seanjc@google.com>
Closes: https://lore.kernel.org/lkml/ZPtlxmdIJXOe0sEy@google.com/
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Closes: https://lore.kernel.org/lkml/309549cafdcfe50c4fceac3263220cc3d8b109b2.1730337435.git.jpoimboe@kernel.org/
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
v4:

- Switch the logic to make use of the return values from try_tag,
  instead of asking whether $count has been set.

- Update a comment.

- Drop T-b tag from Josh as I think this changes the flow sufficiently
  from the version he tested. I have repeated my tests, with the same
  functional and performance result as indicated in the commit log.

v3: https://lore.kernel.org/lkml/20241118110154.3711777-1-linux@rasmusvillemoes.dk/

 scripts/setlocalversion | 59 +++++++++++++++++++++++++++--------------
 1 file changed, 39 insertions(+), 20 deletions(-)

Comments

Josh Poimboeuf Nov. 22, 2024, 5:08 p.m. UTC | #1
On Fri, Nov 22, 2024 at 04:00:37PM +0100, Rasmus Villemoes wrote:
> v4:
> 
> - Switch the logic to make use of the return values from try_tag,
>   instead of asking whether $count has been set.
> 
> - Update a comment.
> 
> - Drop T-b tag from Josh as I think this changes the flow sufficiently
>   from the version he tested. I have repeated my tests, with the same
>   functional and performance result as indicated in the commit log.

Tested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Masahiro Yamada Nov. 23, 2024, 8:06 a.m. UTC | #2
On Sat, Nov 23, 2024 at 12:01 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> Contrary to expectations, passing a single candidate tag to "git
> describe" is slower than not passing any --match options.
>
>   $ time git describe --debug
>   ...
>   traversed 10619 commits
>   ...
>   v6.12-rc5-63-g0fc810ae3ae1
>
>   real    0m0.169s
>
>   $ time git describe --match=v6.12-rc5 --debug
>   ...
>   traversed 1310024 commits
>   v6.12-rc5-63-g0fc810ae3ae1
>
>   real    0m1.281s
>
> In fact, the --debug output shows that git traverses all or most of
> history. For some repositories and/or git versions, those 1.3s are
> actually 10-15 seconds.
>
> This has been acknowledged as a performance bug in git [1], and a fix
> is on its way [2]. However, no solution is yet in git.git, and even
> when one lands, it will take quite a while before it finds its way to
> a release and for $random_kernel_developer to pick that up.
>
> So rewrite the logic to use plumbing commands. For each of the
> candidate values of $tag, we ask: (1) is $tag even an annotated
> tag? (2) Is it eligible to describe HEAD, i.e. an ancestor of
> HEAD? (3) If so, how many commits are in $tag..HEAD?
>
> I have tested that this produces the same output as the current script
> for ~700 random commits between v6.9..v6.10. For those 700 commits,
> and in my git repo, the 'make -s kernelrelease' command is on average
> ~4 times faster with this patch applied (geometric mean of ratios).
>
> For the commit mentioned in Josh's original report [3], the
> time-consuming part of setlocalversion goes from
>
> $ time git describe --match=v6.12-rc5 c1e939a21eb1
> v6.12-rc5-44-gc1e939a21eb1
>
> real    0m1.210s
>
> to
>
> $ time git rev-list --count --left-right v6.12-rc5..c1e939a21eb1
> 0       44
>
> real    0m0.037s
>
> [1] https://lore.kernel.org/git/20241101113910.GA2301440@coredump.intra.peff.net/
> [2] https://lore.kernel.org/git/20241106192236.GC880133@coredump.intra.peff.net/
> [3] https://lore.kernel.org/lkml/309549cafdcfe50c4fceac3263220cc3d8b109b2.1730337435.git.jpoimboe@kernel.org/
>
> Reported-by: Sean Christopherson <seanjc@google.com>
> Closes: https://lore.kernel.org/lkml/ZPtlxmdIJXOe0sEy@google.com/
> Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Closes: https://lore.kernel.org/lkml/309549cafdcfe50c4fceac3263220cc3d8b109b2.1730337435.git.jpoimboe@kernel.org/
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> ---
> v4:
>
> - Switch the logic to make use of the return values from try_tag,
>   instead of asking whether $count has been set.


No, please do not do this.

As I replied in v3, my plan is to set -e, because otherwise
the shell script is fragile.

With this version, -e will not work in try_tag()
because it is used in the if condition.


> +try_tag() {
> +       tag="$1"
> +
> +       # Is $tag an annotated tag?
> +       [ "$(git cat-file -t "$tag" 2> /dev/null)" = tag ] || return 1
> +
> +       # Is it an ancestor of HEAD, and if so, how many commits are in $tag..HEAD?
> +       # shellcheck disable=SC2046 # word splitting is the point here
> +       set -- $(git rev-list --count --left-right "$tag"...HEAD 2> /dev/null)
> +
> +       # $1 is 0 if and only if $tag is an ancestor of HEAD. Use
> +       # string comparison, because $1 is empty if the 'git rev-list'
> +       # command somehow failed.
> +       [ "$1" = 0 ] || return 1
> +
> +       # $2 is the number of commits in the range $tag..HEAD, possibly 0.
> +       count="$2"

Redundant double-quotes.



--
Best Regards
Masahiro Yamada
Rasmus Villemoes Nov. 23, 2024, 10:12 p.m. UTC | #3
On Sat, Nov 23 2024, Masahiro Yamada <masahiroy@kernel.org> wrote:

> On Sat, Nov 23, 2024 at 12:01 AM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:

>> v4:
>>
>> - Switch the logic to make use of the return values from try_tag,
>>   instead of asking whether $count has been set.
>
>
> No, please do not do this.
>
> As I replied in v3, my plan is to set -e, because otherwise
> the shell script is fragile.
>
> With this version, -e will not work in try_tag()
> because it is used in the if condition.

I'm confused. Previously you said that "either style is fine with
me". Now you've invented some necessity to add "set -e", which of course
yes, is then suppressed inside try_tag. But there is not a single
statement within that that would cause "set -e" to exit anyway: The only
one that is not a simple assignment or is itself a test is the "set --
$()", and git rev-list failing there does not cause set -e to trigger.

Aside from that, I'm skeptical to introducing set -e anyway, it's simply
too hard to reason about what it will actually
do. http://mywiki.wooledge.org/BashFAQ/105 . But you're the maintainer.

>> +try_tag() {
>> +       tag="$1"
>> +
>> +       # Is $tag an annotated tag?
>> +       [ "$(git cat-file -t "$tag" 2> /dev/null)" = tag ] || return 1
>> +
>> +       # Is it an ancestor of HEAD, and if so, how many commits are in $tag..HEAD?
>> +       # shellcheck disable=SC2046 # word splitting is the point here
>> +       set -- $(git rev-list --count --left-right "$tag"...HEAD 2> /dev/null)
>> +
>> +       # $1 is 0 if and only if $tag is an ancestor of HEAD. Use
>> +       # string comparison, because $1 is empty if the 'git rev-list'
>> +       # command somehow failed.
>> +       [ "$1" = 0 ] || return 1
>> +
>> +       # $2 is the number of commits in the range $tag..HEAD, possibly 0.
>> +       count="$2"
>
> Redundant double-quotes.

Perhaps, but sorry, I'm not code-golfing, and trying to remember when
quotes can be elided when variables are referenced is simply not
something I spend my very limited brain capacity on.

Feel free to make any adjustments you want and commit that, or drop
this, I'm not sending a v5 as that seems to be a waste of everybody's
time.

Rasmus
Masahiro Yamada Nov. 24, 2024, 2:10 a.m. UTC | #4
On Sun, Nov 24, 2024 at 7:12 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> On Sat, Nov 23 2024, Masahiro Yamada <masahiroy@kernel.org> wrote:
>
> > On Sat, Nov 23, 2024 at 12:01 AM Rasmus Villemoes
> > <linux@rasmusvillemoes.dk> wrote:
>
> >> v4:
> >>
> >> - Switch the logic to make use of the return values from try_tag,
> >>   instead of asking whether $count has been set.
> >
> >
> > No, please do not do this.
> >
> > As I replied in v3, my plan is to set -e, because otherwise
> > the shell script is fragile.
> >
> > With this version, -e will not work in try_tag()
> > because it is used in the if condition.
>
> I'm confused. Previously you said that "either style is fine with
> me".

That was my comment in v2.

At that time, I was not aware that the -e option was missing
in this script.

Sorry, I changed my mind.

In v3, I commented what I like this script to look like
when turning on the -e option.
Then, you came back with a different approach.


> Now you've invented some necessity to add "set -e", which of course
> yes, is then suppressed inside try_tag. But there is not a single
> statement within that that would cause "set -e" to exit anyway: The only
> one that is not a simple assignment or is itself a test is the "set --
> $()", and git rev-list failing there does not cause set -e to trigger.

This is correct, but think about the resiliency when someone
adds more code to try_tag() in the future.


> Aside from that, I'm skeptical to introducing set -e anyway, it's simply
> too hard to reason about what it will actually
> do. http://mywiki.wooledge.org/BashFAQ/105 . But you're the maintainer.



First, set -e is almost always useful because it is not realistic
to add '|| exit 1' for every command.

For example, see commit bc53d3d777f81385c1bb08b07bd1c06450ecc2c1
how the -e catches an error.

Second, the link you reference is exaggerating.

For example, the article mentioned the quirks in the Bash mode.
This argument is not applicable in our case because
Bash runs in the POSIX mode when it is invoked as 'sh'.
Since this script is invoked by #!/bin/sh,
'set -e' follows the POSIX behavior whether /bin/sh is a
symlink to bash or dash.
The -e option is propagated to subshell executions.
(It would not if the shebang had been #!/bin/bash,
but Bash still provides "set -o posix" to cater to this case).

In the referred link, I do not find a good reason to
avoid using the -e option.

When a function returns a value (yes/no question just like
try_tag()), you need to be careful about the possible
third case.

1) yes
2) no
3) error


I hope the following provides even more clarification.

Let's say, is_ancestor_tag() checks if the given tag is
an ancestor or not.



[Bad Code]

set -e

if is_ancestor_tag "${tag}"; then
        # Any error in is_ancestor_tag() is ignored.
        # "Yes, ... " may be printed even if  an error occurs.
        echo "Yes, ${tag} is an ancestor"
else
        echo "No, ${tag} is not an ancestor"
fi


[Good Code 1]

set -e

ret=$(is_ancestor_tag "${tag}")
# If any error occurs in is_ancestor_tag()
# the script is terminated here.

if [ "${ret}" = yes ]; then
        echo "Yes, ${tag} is an ancestor"
else
        echo "No, ${tag} is not an ancestor"
fi


[Good Code 2]

set -e

# is_ancestor_tag() sets 'ret' in the function body
is_ancestor_tag "${tag}"
if [ "${ret}" = yes ]; then
        echo "Yes, ${tag} is an ancestor"
else
        echo "No, ${tag} is not an ancestor"
fi


V3 is [Good Code 2], as the return value, 'count' is assigned
within the try_tag() function, and the caller checks it.








> >> +try_tag() {
> >> +       tag="$1"
> >> +
> >> +       # Is $tag an annotated tag?
> >> +       [ "$(git cat-file -t "$tag" 2> /dev/null)" = tag ] || return 1
> >> +
> >> +       # Is it an ancestor of HEAD, and if so, how many commits are in $tag..HEAD?
> >> +       # shellcheck disable=SC2046 # word splitting is the point here
> >> +       set -- $(git rev-list --count --left-right "$tag"...HEAD 2> /dev/null)
> >> +
> >> +       # $1 is 0 if and only if $tag is an ancestor of HEAD. Use
> >> +       # string comparison, because $1 is empty if the 'git rev-list'
> >> +       # command somehow failed.
> >> +       [ "$1" = 0 ] || return 1
> >> +
> >> +       # $2 is the number of commits in the range $tag..HEAD, possibly 0.
> >> +       count="$2"
> >
> > Redundant double-quotes.
>
> Perhaps, but sorry, I'm not code-golfing, and trying to remember when
> quotes can be elided when variables are referenced is simply not
> something I spend my very limited brain capacity on.
>
> Feel free to make any adjustments you want and commit that, or drop
> this, I'm not sending a v5 as that seems to be a waste of everybody's
> time.
>
> Rasmus



--
Best Regards
Masahiro Yamada
diff mbox series

Patch

diff --git a/scripts/setlocalversion b/scripts/setlocalversion
index 38b96c6797f4..cde45d92cc0b 100755
--- a/scripts/setlocalversion
+++ b/scripts/setlocalversion
@@ -30,6 +30,27 @@  if test $# -gt 0 -o ! -d "$srctree"; then
 	usage
 fi
 
+try_tag() {
+	tag="$1"
+
+	# Is $tag an annotated tag?
+	[ "$(git cat-file -t "$tag" 2> /dev/null)" = tag ] || return 1
+
+	# Is it an ancestor of HEAD, and if so, how many commits are in $tag..HEAD?
+	# shellcheck disable=SC2046 # word splitting is the point here
+	set -- $(git rev-list --count --left-right "$tag"...HEAD 2> /dev/null)
+
+	# $1 is 0 if and only if $tag is an ancestor of HEAD. Use
+	# string comparison, because $1 is empty if the 'git rev-list'
+	# command somehow failed.
+	[ "$1" = 0 ] || return 1
+
+	# $2 is the number of commits in the range $tag..HEAD, possibly 0.
+	count="$2"
+
+	return 0
+}
+
 scm_version()
 {
 	local short=false
@@ -64,30 +85,27 @@  scm_version()
 	# If a localversion* file exists, and the corresponding
 	# annotated tag exists and is an ancestor of HEAD, use
 	# it. This is the case in linux-next.
-	tag=${file_localversion#-}
-	desc=
-	if [ -n "${tag}" ]; then
-		desc=$(git describe --match=$tag 2>/dev/null)
-	fi
-
+	if [ -n "${file_localversion#-}" ] && try_tag "${file_localversion#-}" ; then
+		:
 	# Otherwise, if a localversion* file exists, and the tag
 	# obtained by appending it to the tag derived from
 	# KERNELVERSION exists and is an ancestor of HEAD, use
 	# it. This is e.g. the case in linux-rt.
-	if [ -z "${desc}" ] && [ -n "${file_localversion}" ]; then
-		tag="${version_tag}${file_localversion}"
-		desc=$(git describe --match=$tag 2>/dev/null)
-	fi
-
+	elif [ -n "${file_localversion}" ] && try_tag "${version_tag}${file_localversion}" ; then
+		:
 	# Otherwise, default to the annotated tag derived from KERNELVERSION.
-	if [ -z "${desc}" ]; then
-		tag="${version_tag}"
-		desc=$(git describe --match=$tag 2>/dev/null)
+	elif try_tag "${version_tag}" ; then
+		:
+	else
+		count=
 	fi
 
-	# If we are at the tagged commit, we ignore it because the version is
-	# well-defined.
-	if [ "${tag}" != "${desc}" ]; then
+	# If we are at the tagged commit, we ignore it because the
+	# version is well-defined. If none of the attempted tags exist
+	# or were usable, $count is empty, so there is no count to
+	# pretty-print, but we can and should still append the -g plus
+	# the abbreviated sha1.
+	if [ "${count}" != 0 ]; then
 
 		# If only the short version is requested, don't bother
 		# running further git commands
@@ -95,14 +113,15 @@  scm_version()
 			echo "+"
 			return
 		fi
+
 		# If we are past the tagged commit, we pretty print it.
 		# (like 6.1.0-14595-g292a089d78d3)
-		if [ -n "${desc}" ]; then
-			echo "${desc}" | awk -F- '{printf("-%05d", $(NF-1))}'
+		if [ -n "${count}" ]; then
+			printf "%s%05d" "-" "${count}"
 		fi
 
 		# Add -g and exactly 12 hex chars.
-		printf '%s%s' -g "$(echo $head | cut -c1-12)"
+		printf '%s%.12s' -g "$head"
 	fi
 
 	if ${no_dirty}; then