[v3,4/4] Documentation: add lint-fsck-msgids

Message ID	20221025224224.2352979-5-gitster@pobox.com (mailing list archive)
State	Accepted
Commit	3882a0d3ad92ace6559c950921d9afb4cdcc7ea0
Headers	show Return-Path: <git-owner@kernel.org> Sender: Junio C Hamano <jch2355@gmail.com> From: Junio C Hamano <gitster@pobox.com> To: git@vger.kernel.org Cc: John Cai <johncai86@gmail.com> Subject: [PATCH v3 4/4] Documentation: add lint-fsck-msgids Date: Tue, 25 Oct 2022 15:42:24 -0700 Message-Id: <20221025224224.2352979-5-gitster@pobox.com> In-Reply-To: <20221025224224.2352979-1-gitster@pobox.com> References: <pull.1369.v2.git.git.1666667864.gitgitgadget@gmail.com> <20221025224224.2352979-1-gitster@pobox.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	document fsck error message ids \| expand [v3,0/4] document fsck error message ids [v3,1/4] fsck: remove the unused BAD_TAG_OBJECT [v3,2/4] fsck: remove the unused MISSING_TREE_OBJECT [v3,3/4] fsck: document msg-id [v3,4/4] Documentation: add lint-fsck-msgids

Junio C Hamano Oct. 25, 2022, 10:42 p.m. UTC

During the initial development of the fsck-msgids.txt feature, it
has become apparent that it is very much error prone to make sure
the description in the documentation file are sorted and correctly
match what is in the fsck.h header file.

Add a quick-and-dirty Perl script and doc-lint target to sanity
check that the fsck-msgids.txt is consistent with the error type
list in the fsck.h header file.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/Makefile              | 11 +++++
 Documentation/lint-fsck-msgids.perl | 70 +++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)
 create mode 100755 Documentation/lint-fsck-msgids.perl

Ævar Arnfjörð Bjarmason Oct. 26, 2022, 2:43 a.m. UTC | #1

On Tue, Oct 25 2022, Junio C Hamano wrote:

> During the initial development of the fsck-msgids.txt feature, it
> has become apparent that it is very much error prone to make sure
> the description in the documentation file are sorted and correctly
> match what is in the fsck.h header file.

I have local fixes for the same issues in the list of advice in our
docs, some of it's missing, wrong, out of date etc.

I tried to quickly adapt the generation script I had for that, which
works nicely, and by line count much shorter than the lint :)

Having to exhaustively list every *.c file that uses fsck.h is a bit of
a bother, but we have the same with the other generated *.h's, so it
shouldn't be too bad.

And this way, if we get it wrong we get a compile error:
	
	$ git -P diff; make
	diff --git a/Documentation/fsck-msgids.txt b/Documentation/fsck-msgids.txt
	index 7af76ff99a5..f0b4308a6e7 100644
	--- a/Documentation/fsck-msgids.txt
	+++ b/Documentation/fsck-msgids.txt
	@@ -1,6 +1,3 @@
	-`badDate`::
	-       (ERROR) Invalid date format in an author/committer line.
	-
	 `badDateOverflow`::
	        (ERROR) Invalid date value in an author/committer line.
	 
	    CC fsck.o
	fsck.c:31:9: error: excess elements in array initializer [-Werror]
	   31 |         { NULL, NULL, NULL, -1 }
	      |         ^
	fsck.c:31:9: note: (near initialization for ‘msg_id_info’)
	fsck.c: In function ‘fsck_ident’:
	fsck.c:810:51: error: ‘FSCK_MSG_BAD_DATE’ undeclared (first use in this function); did you mean ‘FSCK_MSG_BAD_NAME’?
	  810 |                 return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date");
	      |                                                   ^~~~~~~~~~~~~~~~~
	      |                                                   FSCK_MSG_BAD_NAME
	fsck.c:810:51: note: each undeclared identifier is reported only once for each function it appears in
	cc1: all warnings being treated as errors
	make: *** [Makefile:2617: fsck.o] Error 1

So, if you're interested:

 Makefile               | 12 ++++++++++++
 fsck.h                 |  8 +-------
 generate-fsckmsgids.sh | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/Makefile b/Makefile
index 85f03c6aed1..21bbc3dfb9f 100644
--- a/Makefile
+++ b/Makefile
@@ -860,6 +860,7 @@ REFTABLE_TEST_LIB = reftable/libreftable_test.a
 GENERATED_H += command-list.h
 GENERATED_H += config-list.h
 GENERATED_H += hook-list.h
+GENERATED_H += fsck-msgids.h
 
 .PHONY: generated-hdrs
 generated-hdrs: $(GENERATED_H)
@@ -2289,6 +2290,14 @@ git$X: git.o GIT-LDFLAGS $(BUILTIN_OBJS) $(GITLIBS)
 	$(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) \
 		$(filter %.o,$^) $(LIBS)
 
+# Unfortunately our dependency management of generated headers used
+# from within other headers suck, so we'll need to list every user of
+# fsck.h here, but not too bad...
+FSCK_MSGIDS_H_BUILTINS = fsck index-pack mktag receive-pack unpack-objects
+$(foreach f,$(FSCK_MSGIDS_H_BUILTINS:%=builtin/%),$f.sp $f.s $f.o): fsck-msgids.h
+FSCK_MSGIDS_H_LIBS = fetch-pack fsck
+$(foreach f,$(FSCK_MSGIDS_H_LIBS),$f.sp $f.s $f.o): fsck-msgids.h
+
 help.sp help.s help.o: command-list.h
 builtin/bugreport.sp builtin/bugreport.s builtin/bugreport.o: hook-list.h
 
@@ -2333,6 +2342,9 @@ command-list.h: $(wildcard Documentation/git*.txt)
 hook-list.h: generate-hooklist.sh Documentation/githooks.txt
 	$(QUIET_GEN)$(SHELL_PATH) ./generate-hooklist.sh >$@
 
+fsck-msgids.h: generate-fsckmsgids.sh Documentation/fsck-msgids.txt
+	$(QUIET_GEN)$(SHELL_PATH) ./generate-fsckmsgids.sh >$@
+
 SCRIPT_DEFINES = $(SHELL_PATH_SQ):$(DIFF_SQ):\
 	$(localedir_SQ):$(USE_GETTEXT_SCHEME):$(SANE_TOOL_PATH_SQ):\
 	$(gitwebdir_SQ):$(PERL_PATH_SQ):$(PAGER_ENV):\
diff --git a/fsck.h b/fsck.h
index 6fbce68ad67..1a9d68482ea 100644
--- a/fsck.h
+++ b/fsck.h
@@ -2,6 +2,7 @@
 #define GIT_FSCK_H
 
 #include "oidset.h"
+#include "fsck-msgids.h"
 
 enum fsck_msg_type {
 	/* for internal use only */
@@ -79,13 +80,6 @@ enum fsck_msg_type {
 	/* ignored (elevated when requested) */ \
 	FUNC(EXTRA_HEADER_ENTRY, IGNORE)
 
-#define MSG_ID(id, msg_type) FSCK_MSG_##id,
-enum fsck_msg_id {
-	FOREACH_FSCK_MSG_ID(MSG_ID)
-	FSCK_MSG_MAX
-};
-#undef MSG_ID
-
 struct fsck_options;
 struct object;
 
diff --git a/generate-fsckmsgids.sh b/generate-fsckmsgids.sh
new file mode 100755
index 00000000000..bbf236159aa
--- /dev/null
+++ b/generate-fsckmsgids.sh
@@ -0,0 +1,42 @@
+#!/bin/sh
+
+HT='	'
+
+fsck_list () {
+	sed -n \
+		-e "/::$/ {
+			s/::\$//;
+			s/^\`//;
+			s/\`$//;
+			p;
+		}" \
+	    <Documentation/fsck-msgids.txt
+}
+
+txt2label () {
+	sed \
+		-e 's/\([^_]\)\([[:upper:]]\)/\1_\2/g' \
+		-e 's/^/FSCK_MSG_/' |
+		tr '[:lower:]' '[:upper:]'
+}
+
+fsck_labels () {
+	fsck_list |
+	txt2label
+}
+
+listify () {
+	sed \
+		-e "2,\$s/^/$HT/" \
+		-e 's/$/,/'
+}
+
+cat <<EOF
+/* Automatically generated by generate-fsck-msgids.sh */
+
+enum fsck_msg_id {
+	/* Auto-generated from Documentation/fsck-msgids.txt */
+	$(fsck_labels | listify)
+	FSCK_MSG_MAX
+};
+EOF

Jeff King Oct. 26, 2022, 5:34 a.m. UTC | #2

On Wed, Oct 26, 2022 at 04:43:32AM +0200, Ævar Arnfjörð Bjarmason wrote:

> 
> On Tue, Oct 25 2022, Junio C Hamano wrote:
> 
> > During the initial development of the fsck-msgids.txt feature, it
> > has become apparent that it is very much error prone to make sure
> > the description in the documentation file are sorted and correctly
> > match what is in the fsck.h header file.
> 
> I have local fixes for the same issues in the list of advice in our
> docs, some of it's missing, wrong, out of date etc.
> 
> I tried to quickly adapt the generation script I had for that, which
> works nicely, and by line count much shorter than the lint :)

Yeah, my instinct here was to generate rather than lint. If you make a
mistake and the linter hits you over the head, that is better than
quietly letting your mistake go. But better still is making it
impossible to make in the first place.

The downside is added complexity to the build, but it doesn't seem too
bad in this case.

(I had a similar thought after getting hit on the head by the recent
t0450-txt-doc-vs-help.sh).

> Having to exhaustively list every *.c file that uses fsck.h is a bit of
> a bother, but we have the same with the other generated *.h's, so it
> shouldn't be too bad.

It feels like this could be made much shorter by having a separate
fsck-msgs.h and not including it from fsck.h. Only fsck.c and mktag.c
need the actual list. It would probably have to stop being an "enum",
though.

Another alternative is to generate the doc from the code, rather than
the other way around.

> +# Unfortunately our dependency management of generated headers used
> +# from within other headers suck, so we'll need to list every user of
> +# fsck.h here, but not too bad...
> +FSCK_MSGIDS_H_BUILTINS = fsck index-pack mktag receive-pack unpack-objects
> +$(foreach f,$(FSCK_MSGIDS_H_BUILTINS:%=builtin/%),$f.sp $f.s $f.o): fsck-msgids.h
> +FSCK_MSGIDS_H_LIBS = fetch-pack fsck
> +$(foreach f,$(FSCK_MSGIDS_H_LIBS),$f.sp $f.s $f.o): fsck-msgids.h

I don't understand the "used within other headers" part here. Computed
dependencies will get this right. It's only the initial build (before we
have any computed dependencies) that needs to know that the C files in
question depend on the generated file. But that is true whether they do
so directly or indirectly.

-Peff

Junio C Hamano Oct. 26, 2022, 6:46 a.m. UTC | #3

Jeff King <peff@peff.net> writes:

> Another alternative is to generate the doc from the code, rather than
> the other way around.

Yup, that indeed was what I envisioned in the "Possible future
directions" comment in the cover letter.  It might not even be bad
to do

    #define FOREACH_FSCK_MSG_ID(FUNC) \
	/* fatal errors */ \
	FUNC(NUL_IN_HEADER, FATAL, N_("NUL byte exists in the object header.")) \
	FUNC(UNTERMINATED_HEADER, FATAL, N_("Missing end-of-line in the object header.")) \
	/* errors */ \
	FUNC(BAD_DATE, ERROR, N_("Invalid date format in an author/committer line.")) \
	...

as some of the users of the macro may have use of the translatable
messages themselves.

Ævar Arnfjörð Bjarmason Oct. 26, 2022, 11:35 a.m. UTC | #4

On Wed, Oct 26 2022, Jeff King wrote:

> On Wed, Oct 26, 2022 at 04:43:32AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> 
>> On Tue, Oct 25 2022, Junio C Hamano wrote:
>> 
>> > During the initial development of the fsck-msgids.txt feature, it
>> > has become apparent that it is very much error prone to make sure
>> > the description in the documentation file are sorted and correctly
>> > match what is in the fsck.h header file.
>> 
>> I have local fixes for the same issues in the list of advice in our
>> docs, some of it's missing, wrong, out of date etc.
>> 
>> I tried to quickly adapt the generation script I had for that, which
>> works nicely, and by line count much shorter than the lint :)
>
> Yeah, my instinct here was to generate rather than lint. If you make a
> mistake and the linter hits you over the head, that is better than
> quietly letting your mistake go. But better still is making it
> impossible to make in the first place.
>
> The downside is added complexity to the build, but it doesn't seem too
> bad in this case.

Yeah, it's not, I have local patches to generate advice-{type,config}.h,
and builtin.h. This is a quick POC to do it for fsck-msgids.h.

I see I forgot the .gitignore entry, so it's a rough POC :)

> (I had a similar thought after getting hit on the head by the recent
> t0450-txt-doc-vs-help.sh).

Sorry about that. FWIW I've wanted to assert that for a while, and to do
it by e.g. having the doc *.txt blurbs generated from running "$buildin
-h" during the build.

But when I suggested that I gathered That Junio wasn't up for that
approach, it does have its downsides thorugh. E.g. to build the docs
you'd now have to compile C code, and e.g. that git-scm.com publisher
written in Ruby would have to compile the code to generate its docs.

Or we could do it the other way around, and scape the *.txt for the *.c
generation, but then we need to run a new script for building
builtin/*.o. Also possible, and I think eventually both are better than
what t0450-txt-doc-vs-help.sh's doing now, but that was a less invasive
change than both...

>> Having to exhaustively list every *.c file that uses fsck.h is a bit of
>> a bother, but we have the same with the other generated *.h's, so it
>> shouldn't be too bad.
>
> It feels like this could be made much shorter by having a separate
> fsck-msgs.h and not including it from fsck.h. Only fsck.c and mktag.c
> need the actual list. It would probably have to stop being an "enum",
> though.

Yes, that would make it shorter, but C doesn't have forward decls of
enums, so we'd need to make it "int", ....

> Another alternative is to generate the doc from the code, rather than
> the other way around.

*nod* :)

>> +# Unfortunately our dependency management of generated headers used
>> +# from within other headers suck, so we'll need to list every user of
>> +# fsck.h here, but not too bad...
>> +FSCK_MSGIDS_H_BUILTINS = fsck index-pack mktag receive-pack unpack-objects
>> +$(foreach f,$(FSCK_MSGIDS_H_BUILTINS:%=builtin/%),$f.sp $f.s $f.o): fsck-msgids.h
>> +FSCK_MSGIDS_H_LIBS = fetch-pack fsck
>> +$(foreach f,$(FSCK_MSGIDS_H_LIBS),$f.sp $f.s $f.o): fsck-msgids.h
>
> I don't understand the "used within other headers" part here. Computed
> dependencies will get this right. It's only the initial build (before we
> have any computed dependencies) that needs to know that the C files in
> question depend on the generated file. But that is true whether they do
> so directly or indirectly.

I forgot the full story, but luckily I wrote it down in the WIP commits
:) FWIW if you want to scour that it's mainly:

	https://github.com/avar/git/commit/a00f1cb9ea5 # add advice-type.h
	https://github.com/avar/git/commit/9e080923a11 # generate 'struct advice

Also, before generating builtin.h I've got e.g. this:

	https://github.com/avar/git/commit/5a5360d0134 # just check with 'make' that any file is sorted

To actualy generate it, very WIP:

        http://github.com/avar/git/commit/cf1d02fa6b2

Anyway, in partial summary, why (and sorry this got so long):

 * Yes, once we build e.g. fsck.o we have the full dependency tree,
   yay....
 * ...but only for gcc & clang, but we support other compilers.
 * ...for other compilers (or gcc & clang without the dep detection
   enabled) what should this do:

	make file-that-does-not-use-generated-header.o

   It sucks a bit to have e.g. every *.o depend on command-list.h, when
   we only use it in 1-2 places, ditto the other headers.

   The approach I took to this was to manually list headers that had
   "small uses" (1-10), but say that all of *.h dependend on running the
   *.sh to generate some "widely used" headers.

   E.g. if we generate builtin.h we'll either need that, or add
   builtin/%.o as a dependency, with a manual listing of the handful of
   uses outside of that subdirectory.

 * .... or just not care about any of that, i.e. to have all of our *.sh
   run on the "first build" no matter what, which certainly simplifies
   things, but once you have e.g. 5-10 generated headers doing e.g.:

	make grep.o

   and having it build command-list.h or whatever is a bit noisy, but
   it's a trade-off of manually maintaining deps v.s. not.

 * Our "COMPUTED_HEADER_DEPENDENCIES" only happens when you build the
   *.o, but we have *.sp and *.s targets too. I have some local changes
   *to generalize that, so we e.g. get proper dependencies for building
   **.s files.

 * We also have e.g. "make hdr-check", which works just fine now, but
   it's becausue we have no *.h file that uses another generated *.h
   file.

   Actually, I suspect the posted WIP change is broken in that regard
   (but doing this for real on my topic branch works), i.e. we need to
   start caring about deps for those auxiliary targets.

 * I may be wrong (I'm just looking at some old "here be dragons" note
   of mine), but I think there's also an edge case where the dependency
   graph of .depend.d/* is correct *once you've always had it*, but if a
   file now e.g. depends on foo.h, and I make foo.h include a new
   foo-generated.h things go boom.

   That issue (it if even exists, sorry, I'm blanking on the details)
   was solvable by just doing a "make clean", and building again,
   i.e. once we had re-built once we were OK, it was only a problem with
   an older checkout with an existing build pulling new sources.

   Still, I wanted to not make that case annoying either if I added a
   generated header...

Jeff King Oct. 28, 2022, 1:23 a.m. UTC | #5

On Wed, Oct 26, 2022 at 01:35:43PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > (I had a similar thought after getting hit on the head by the recent
> > t0450-txt-doc-vs-help.sh).
> 
> Sorry about that. FWIW I've wanted to assert that for a while, and to do
> it by e.g. having the doc *.txt blurbs generated from running "$buildin
> -h" during the build.
> 
> But when I suggested that I gathered That Junio wasn't up for that
> approach, it does have its downsides thorugh. E.g. to build the docs
> you'd now have to compile C code, and e.g. that git-scm.com publisher
> written in Ruby would have to compile the code to generate its docs.

Yes, it would definitely break the git-scm.com importer. It might be
possible to make it work by actually running "make man" (or maybe even
some partial targets) in a local repo. The nightly update job pulls all
of the data using GitHub's API, but it does run in a heroku dyno that
has git and gcc. Doing a shallow clone of the tag to build is probably
more expensive, but the cost of an actual build isn't that important.
99% of the scheduled job runs are noops because there's no new version
of Git to build manpages for; as long as those remain cheap-ish, we're
OK.

I also think in the long run that the site should move to a system that
builds off a local repo, which we can trigger manually or via GitHub
Actions. But that doesn't exist yet, and I don't think anyone's actively
working on it.

I also think it would be nice if the git-scm.com system relied more on
"make man USE_ASCIIDOCTOR=1", possibly by hooking into
asciidoctor-extensions.rb or similar, rather than munging files in an
ad-hoc manner. But that's also a big change that nobody is actively
working on.

All of which is to say that yes, having docs depend on C code will cause
work on the site, but it may be work that takes us in the right
direction.

> Or we could do it the other way around, and scape the *.txt for the *.c
> generation, but then we need to run a new script for building
> builtin/*.o. Also possible, and I think eventually both are better than
> what t0450-txt-doc-vs-help.sh's doing now, but that was a less invasive
> change than both...

I think this particular case is tricky in that direction, because it's a
big set of dependencies that aren't necessarily one-to-one. E.g.,
builtin/log.c needs to depend on git-log.txt, but also on git-show.txt
and git-format-patch.txt.

> >> +# Unfortunately our dependency management of generated headers used
> >> +# from within other headers suck, so we'll need to list every user of
> >> +# fsck.h here, but not too bad...
> >> +FSCK_MSGIDS_H_BUILTINS = fsck index-pack mktag receive-pack unpack-objects
> >> +$(foreach f,$(FSCK_MSGIDS_H_BUILTINS:%=builtin/%),$f.sp $f.s $f.o): fsck-msgids.h
> >> +FSCK_MSGIDS_H_LIBS = fetch-pack fsck
> >> +$(foreach f,$(FSCK_MSGIDS_H_LIBS),$f.sp $f.s $f.o): fsck-msgids.h
> >
> > I don't understand the "used within other headers" part here. Computed
> > dependencies will get this right. It's only the initial build (before we
> > have any computed dependencies) that needs to know that the C files in
> > question depend on the generated file. But that is true whether they do
> > so directly or indirectly.
> 
> I forgot the full story, but luckily I wrote it down in the WIP commits
> :) FWIW if you want to scour that it's mainly:
> 
> 	https://github.com/avar/git/commit/a00f1cb9ea5 # add advice-type.h
> 	https://github.com/avar/git/commit/9e080923a11 # generate 'struct advice

That seems like the same problem to me. It's just that if something is
in cache.h, then it's needed by basically everything, and so the
dependency list is big.

> Anyway, in partial summary, why (and sorry this got so long):
> 
>  * Yes, once we build e.g. fsck.o we have the full dependency tree,
>    yay....
>  * ...but only for gcc & clang, but we support other compilers.
>  * ...for other compilers (or gcc & clang without the dep detection
>    enabled) what should this do:
> 
> 	make file-that-does-not-use-generated-header.o
> 
>    It sucks a bit to have e.g. every *.o depend on command-list.h, when
>    we only use it in 1-2 places, ditto the other headers.

But that is already true of non-generated headers. If your system
doesn't support computed deps, then all objects depend on all headers.
Yes, that sucks for you. But almost nobody actively develops on such a
system, and people building Git to use rarely notice (because they are
doing an initial build, or jumping versions, both of which generally
require recompilation anyway).

That all comes from b8ba629264 (Makefile: fold MISC_H into LIB_H,
2012-06-20).

>  * We also have e.g. "make hdr-check", which works just fine now, but
>    it's becausue we have no *.h file that uses another generated *.h
>    file.

I'm not that surprised. Probably it should depend on $(GENERATED_H), and
possibly even be checking those files too.

>  * I may be wrong (I'm just looking at some old "here be dragons" note
>    of mine), but I think there's also an edge case where the dependency
>    graph of .depend.d/* is correct *once you've always had it*, but if a
>    file now e.g. depends on foo.h, and I make foo.h include a new
>    foo-generated.h things go boom.

That should work because the timestamp on foo.h will have been updated,
causing anything that includes it to rebuild (and thus updating its
computed dependencies to include foo-generated.h).

-Peff

Ævar Arnfjörð Bjarmason Oct. 28, 2022, 2:04 a.m. UTC | #6

On Thu, Oct 27 2022, Jeff King wrote:

> On Wed, Oct 26, 2022 at 01:35:43PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > (I had a similar thought after getting hit on the head by the recent
>> > t0450-txt-doc-vs-help.sh).
>> 
>> Sorry about that. FWIW I've wanted to assert that for a while, and to do
>> it by e.g. having the doc *.txt blurbs generated from running "$buildin
>> -h" during the build.
>> 
>> But when I suggested that I gathered That Junio wasn't up for that
>> approach, it does have its downsides thorugh. E.g. to build the docs
>> you'd now have to compile C code, and e.g. that git-scm.com publisher
>> written in Ruby would have to compile the code to generate its docs.
>
> Yes, it would definitely break the git-scm.com importer. It might be
> possible to make it work by actually running "make man" (or maybe even
> some partial targets) in a local repo. The nightly update job pulls all
> of the data using GitHub's API, but it does run in a heroku dyno that
> has git and gcc. Doing a shallow clone of the tag to build is probably
> more expensive, but the cost of an actual build isn't that important.
> 99% of the scheduled job runs are noops because there's no new version
> of Git to build manpages for; as long as those remain cheap-ish, we're
> OK.
>
> I also think in the long run that the site should move to a system that
> builds off a local repo, which we can trigger manually or via GitHub
> Actions. But that doesn't exist yet, and I don't think anyone's actively
> working on it.
>
> I also think it would be nice if the git-scm.com system relied more on
> "make man USE_ASCIIDOCTOR=1", possibly by hooking into
> asciidoctor-extensions.rb or similar, rather than munging files in an
> ad-hoc manner. But that's also a big change that nobody is actively
> working on.
>
> All of which is to say that yes, having docs depend on C code will cause
> work on the site, but it may be work that takes us in the right
> direction.

Good to know.

>> Or we could do it the other way around, and scape the *.txt for the *.c
>> generation, but then we need to run a new script for building
>> builtin/*.o. Also possible, and I think eventually both are better than
>> what t0450-txt-doc-vs-help.sh's doing now, but that was a less invasive
>> change than both...
>
> I think this particular case is tricky in that direction, because it's a
> big set of dependencies that aren't necessarily one-to-one. E.g.,
> builtin/log.c needs to depend on git-log.txt, but also on git-show.txt
> and git-format-patch.txt.

I was thinking of just a generated usage-strings.c or whatever. I.e. one
file with every usage string in it. Then you don't need to hunt down
which thing goes in which file. We'll just have a variable named
"builtin_format_patch_usage". Include it in "builtin.h" and it doesn't
matter if that's in builtin/log.c or builtin/format-patch.c.

It does mean you need to rebuild the C code for other built-ins every
time one of the SYNOPSIS sections changes, but doesn't happen too often.

>> >> +# Unfortunately our dependency management of generated headers used
>> >> +# from within other headers suck, so we'll need to list every user of
>> >> +# fsck.h here, but not too bad...
>> >> +FSCK_MSGIDS_H_BUILTINS = fsck index-pack mktag receive-pack unpack-objects
>> >> +$(foreach f,$(FSCK_MSGIDS_H_BUILTINS:%=builtin/%),$f.sp $f.s $f.o): fsck-msgids.h
>> >> +FSCK_MSGIDS_H_LIBS = fetch-pack fsck
>> >> +$(foreach f,$(FSCK_MSGIDS_H_LIBS),$f.sp $f.s $f.o): fsck-msgids.h
>> >
>> > I don't understand the "used within other headers" part here. Computed
>> > dependencies will get this right. It's only the initial build (before we
>> > have any computed dependencies) that needs to know that the C files in
>> > question depend on the generated file. But that is true whether they do
>> > so directly or indirectly.
>> 
>> I forgot the full story, but luckily I wrote it down in the WIP commits
>> :) FWIW if you want to scour that it's mainly:
>> 
>> 	https://github.com/avar/git/commit/a00f1cb9ea5 # add advice-type.h
>> 	https://github.com/avar/git/commit/9e080923a11 # generate 'struct advice
>
> That seems like the same problem to me. It's just that if something is
> in cache.h, then it's needed by basically everything, and so the
> dependency list is big.

I think I either did or was planning to take it out of cache.h as part
of that, we put way too much in cache.h.

Even advice.c depends on cache.h for its advice.h *sigh*.

Trying it just now putting advice.h in builtin.h instead leaves 10
top-level files not-compiling (and they just need the header).

I think it's fine to include common utilties in our "included by
everything" headers, but if we've got ~500 *.c files and something is
only needed by ~20 of them (as in this case) we should probably just
include it in those 20.

>> Anyway, in partial summary, why (and sorry this got so long):
>> 
>>  * Yes, once we build e.g. fsck.o we have the full dependency tree,
>>    yay....
>>  * ...but only for gcc & clang, but we support other compilers.
>>  * ...for other compilers (or gcc & clang without the dep detection
>>    enabled) what should this do:

I didn't summarize this well enough, I should have said that whether you
have computed deps or not that.....

>> 
>> 	make file-that-does-not-use-generated-header.o
>> 
>>    It sucks a bit to have e.g. every *.o depend on command-list.h, when
>>    we only use it in 1-2 places, ditto the other headers.
> But that is already true of non-generated headers. If your system
> doesn't support computed deps, then all objects depend on all headers.

... this does not build e.g. command-list.h:

	make clean
	make grep.o

But this does:

	make clean
	make help.o

Because we've manually declared that.

[Re-arranging a bit]

> That all comes from b8ba629264 (Makefile: fold MISC_H into LIB_H,
> 2012-06-20).

Yes, but you get that for free whether you have computed deps or
not. I.e. your compiler is about to build the code anyway.

But it wasn't about to run a bunch of shellscripts to generate headers
it doesn't actually need.

> Yes, that sucks for you. But almost nobody actively develops on such a
> system, and people building Git to use rarely notice (because they are
> doing an initial build, or jumping versions, both of which generally
> require recompilation anyway).

I guess I'm "almost nobody" :) Not because I don't use computed deps,
but because I tend to e.g. do large interactive rebases with:

	git rebase -i 'make grep.o'

Or some other subset of our built assets/objects.

On that front no one thing is the end of the world, i.e. sure, if we run
some shellscript to make a needed-header.h that takes 10ms we can live
with it.

But it adds up, which is one reason I've been slowly trickling in
patches to optimize various common parts of the Makefile & those
scripts.

I think before I started that the fixed cost even to have just make
decide it had nothing to do was often 500ms-1s.

I think that's down to 1/4 or so of that on "master", and I've got some
local patches to get it consistently down to <50ms.

It adds up, especially when combined with ccache & the like. I.e. if i'm
testing 10 commits I happen to have cached a "rebase -i" goes from
noticeably slow (~10s) to not being so slow as to be distracting.

Anyway, all of which is to say that that's one thing I was aiming for:
If I was going to submit patches for new generated headers to find some
good trade-off between complexity and over-declaring dependencies.

>>  * We also have e.g. "make hdr-check", which works just fine now, but
>>    it's becausue we have no *.h file that uses another generated *.h
>>    file.
>
> I'm not that surprised. Probably it should depend on $(GENERATED_H), and
> possibly even be checking those files too.
>
>>  * I may be wrong (I'm just looking at some old "here be dragons" note
>>    of mine), but I think there's also an edge case where the dependency
>>    graph of .depend.d/* is correct *once you've always had it*, but if a
>>    file now e.g. depends on foo.h, and I make foo.h include a new
>>    foo-generated.h things go boom.
>
> That should work because the timestamp on foo.h will have been updated,
> causing anything that includes it to rebuild (and thus updating its
> computed dependencies to include foo-generated.h).

Yes, in general it should work, maybe there's some Heisenbug there. I
can't recall or reproduce it, sorry.

But that's a good thing, maybe there's no issue :)

John Cai Oct. 28, 2022, 3:02 a.m. UTC | #7

Hi Ævar,

On 26 Oct 2022, at 7:35, Ævar Arnfjörð Bjarmason wrote:

> On Wed, Oct 26 2022, Jeff King wrote:
>
>> On Wed, Oct 26, 2022 at 04:43:32AM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>>>
>>> On Tue, Oct 25 2022, Junio C Hamano wrote:
>>>
>>>> During the initial development of the fsck-msgids.txt feature, it
>>>> has become apparent that it is very much error prone to make sure
>>>> the description in the documentation file are sorted and correctly
>>>> match what is in the fsck.h header file.
>>>
>>> I have local fixes for the same issues in the list of advice in our
>>> docs, some of it's missing, wrong, out of date etc.
>>>
>>> I tried to quickly adapt the generation script I had for that, which
>>> works nicely, and by line count much shorter than the lint :)
>>
>> Yeah, my instinct here was to generate rather than lint. If you make a
>> mistake and the linter hits you over the head, that is better than
>> quietly letting your mistake go. But better still is making it
>> impossible to make in the first place.
>>
>> The downside is added complexity to the build, but it doesn't seem too
>> bad in this case.
>
> Yeah, it's not, I have local patches to generate advice-{type,config}.h,
> and builtin.h. This is a quick POC to do it for fsck-msgids.h.
>
> I see I forgot the .gitignore entry, so it's a rough POC :)
>
>> (I had a similar thought after getting hit on the head by the recent
>> t0450-txt-doc-vs-help.sh).
>
> Sorry about that. FWIW I've wanted to assert that for a while, and to do
> it by e.g. having the doc *.txt blurbs generated from running "$buildin
> -h" during the build.

If we wanted to go this route of generating the docs from the code (which sounds
like a better long term solution), how would this work? Would we print out the
list of message ids in builtin/fsck.c and write it to
Documentation/fsck-msgids.txt ?

>
> But when I suggested that I gathered That Junio wasn't up for that
> approach, it does have its downsides thorugh. E.g. to build the docs
> you'd now have to compile C code, and e.g. that git-scm.com publisher
> written in Ruby would have to compile the code to generate its docs.
>
> Or we could do it the other way around, and scape the *.txt for the *.c
> generation, but then we need to run a new script for building
> builtin/*.o. Also possible, and I think eventually both are better than
> what t0450-txt-doc-vs-help.sh's doing now, but that was a less invasive
> change than both...
>
>>> Having to exhaustively list every *.c file that uses fsck.h is a bit of
>>> a bother, but we have the same with the other generated *.h's, so it
>>> shouldn't be too bad.
>>
>> It feels like this could be made much shorter by having a separate
>> fsck-msgs.h and not including it from fsck.h. Only fsck.c and mktag.c
>> need the actual list. It would probably have to stop being an "enum",
>> though.
>
> Yes, that would make it shorter, but C doesn't have forward decls of
> enums, so we'd need to make it "int", ....
>
>> Another alternative is to generate the doc from the code, rather than
>> the other way around.
>
> *nod* :)
>
>>> +# Unfortunately our dependency management of generated headers used
>>> +# from within other headers suck, so we'll need to list every user of
>>> +# fsck.h here, but not too bad...
>>> +FSCK_MSGIDS_H_BUILTINS = fsck index-pack mktag receive-pack unpack-objects
>>> +$(foreach f,$(FSCK_MSGIDS_H_BUILTINS:%=builtin/%),$f.sp $f.s $f.o): fsck-msgids.h
>>> +FSCK_MSGIDS_H_LIBS = fetch-pack fsck
>>> +$(foreach f,$(FSCK_MSGIDS_H_LIBS),$f.sp $f.s $f.o): fsck-msgids.h
>>
>> I don't understand the "used within other headers" part here. Computed
>> dependencies will get this right. It's only the initial build (before we
>> have any computed dependencies) that needs to know that the C files in
>> question depend on the generated file. But that is true whether they do
>> so directly or indirectly.
>
> I forgot the full story, but luckily I wrote it down in the WIP commits
> :) FWIW if you want to scour that it's mainly:
>
> 	https://github.com/avar/git/commit/a00f1cb9ea5 # add advice-type.h
> 	https://github.com/avar/git/commit/9e080923a11 # generate 'struct advice
>
> Also, before generating builtin.h I've got e.g. this:
>
> 	https://github.com/avar/git/commit/5a5360d0134 # just check with 'make' that any file is sorted
>
> To actualy generate it, very WIP:
>
>         http://github.com/avar/git/commit/cf1d02fa6b2
>
> Anyway, in partial summary, why (and sorry this got so long):
>
>  * Yes, once we build e.g. fsck.o we have the full dependency tree,
>    yay....
>  * ...but only for gcc & clang, but we support other compilers.
>  * ...for other compilers (or gcc & clang without the dep detection
>    enabled) what should this do:
>
> 	make file-that-does-not-use-generated-header.o
>
>    It sucks a bit to have e.g. every *.o depend on command-list.h, when
>    we only use it in 1-2 places, ditto the other headers.
>
>    The approach I took to this was to manually list headers that had
>    "small uses" (1-10), but say that all of *.h dependend on running the
>    *.sh to generate some "widely used" headers.
>
>    E.g. if we generate builtin.h we'll either need that, or add
>    builtin/%.o as a dependency, with a manual listing of the handful of
>    uses outside of that subdirectory.
>
>  * .... or just not care about any of that, i.e. to have all of our *.sh
>    run on the "first build" no matter what, which certainly simplifies
>    things, but once you have e.g. 5-10 generated headers doing e.g.:
>
> 	make grep.o
>
>    and having it build command-list.h or whatever is a bit noisy, but
>    it's a trade-off of manually maintaining deps v.s. not.
>
>  * Our "COMPUTED_HEADER_DEPENDENCIES" only happens when you build the
>    *.o, but we have *.sp and *.s targets too. I have some local changes
>    *to generalize that, so we e.g. get proper dependencies for building
>    **.s files.
>
>  * We also have e.g. "make hdr-check", which works just fine now, but
>    it's becausue we have no *.h file that uses another generated *.h
>    file.
>
>    Actually, I suspect the posted WIP change is broken in that regard
>    (but doing this for real on my topic branch works), i.e. we need to
>    start caring about deps for those auxiliary targets.
>
>  * I may be wrong (I'm just looking at some old "here be dragons" note
>    of mine), but I think there's also an edge case where the dependency
>    graph of .depend.d/* is correct *once you've always had it*, but if a
>    file now e.g. depends on foo.h, and I make foo.h include a new
>    foo-generated.h things go boom.
>
>    That issue (it if even exists, sorry, I'm blanking on the details)
>    was solvable by just doing a "make clean", and building again,
>    i.e. once we had re-built once we were OK, it was only a problem with
>    an older checkout with an existing build pulling new sources.
>
>    Still, I wanted to not make that case annoying either if I added a
>    generated header...

Ævar Arnfjörð Bjarmason Oct. 28, 2022, 3:11 a.m. UTC | #8

On Thu, Oct 27 2022, John Cai wrote:

> Hi Ævar,
>
> On 26 Oct 2022, at 7:35, Ævar Arnfjörð Bjarmason wrote:
>
>> On Wed, Oct 26 2022, Jeff King wrote:
>>
>>> On Wed, Oct 26, 2022 at 04:43:32AM +0200, Ævar Arnfjörð Bjarmason wrote:
>>>
>>>>
>>>> On Tue, Oct 25 2022, Junio C Hamano wrote:
>>>>
>>>>> During the initial development of the fsck-msgids.txt feature, it
>>>>> has become apparent that it is very much error prone to make sure
>>>>> the description in the documentation file are sorted and correctly
>>>>> match what is in the fsck.h header file.
>>>>
>>>> I have local fixes for the same issues in the list of advice in our
>>>> docs, some of it's missing, wrong, out of date etc.
>>>>
>>>> I tried to quickly adapt the generation script I had for that, which
>>>> works nicely, and by line count much shorter than the lint :)
>>>
>>> Yeah, my instinct here was to generate rather than lint. If you make a
>>> mistake and the linter hits you over the head, that is better than
>>> quietly letting your mistake go. But better still is making it
>>> impossible to make in the first place.
>>>
>>> The downside is added complexity to the build, but it doesn't seem too
>>> bad in this case.
>>
>> Yeah, it's not, I have local patches to generate advice-{type,config}.h,
>> and builtin.h. This is a quick POC to do it for fsck-msgids.h.
>>
>> I see I forgot the .gitignore entry, so it's a rough POC :)
>>
>>> (I had a similar thought after getting hit on the head by the recent
>>> t0450-txt-doc-vs-help.sh).
>>
>> Sorry about that. FWIW I've wanted to assert that for a while, and to do
>> it by e.g. having the doc *.txt blurbs generated from running "$buildin
>> -h" during the build.
>
> If we wanted to go this route of generating the docs from the code (which sounds
> like a better long term solution), how would this work? Would we print out the
> list of message ids in builtin/fsck.c and write it to
> Documentation/fsck-msgids.txt ?

First, for the purposes of this thread I think Jeff and I are far off
into the weeds here :)

I think nothing needs to change in how this topic's doing things, we're
just takling about the longer term.

But if we go for that: I think in this case & most I can think of
generating the code from the docs is better (as that rough POC I had
showed), because:

 - You just need a shellscript to scrape the docs to make a *.c or *.h,
   whereas you'd need a C compiler to make the docs if it's the other
   way around. But more importantly:

 - The docs are way easier to scrape with some sed/awk/grep/whatever
   few-liner than to scrape C code for generating docs. E.g. see
   config-list.h.

 - Scraping the C code sucks so much that we'd probably make some
   dedicated interface for it, e.g. what we have for "git <cmd>
   --git-completion-helper".

   In that case it's worth it, but for other things we'n need to make
   the interface & maintain it (even if it's some test helper just for
   the build).

But mainly it helps to have a use-case, replacing the linter script with
e.g. the *.sh I demo'd might be a marginal improvement. But e.g. "git
help -c" uses one of those generated files (config-list.h), and actually
does something useful ...

Is there a good use-case for the fsck data like that? I'd think that
we'd want to make sure the docs are in sync with the code, as in we're
not adding new warnings/errors etc. without documenting them. But beyond
that maybe not much, and people would just run "git help fsck" to get
the list of variables..

Jeff King Oct. 28, 2022, 5:32 a.m. UTC | #9

On Fri, Oct 28, 2022 at 04:04:12AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I think this particular case is tricky in that direction, because it's a
> > big set of dependencies that aren't necessarily one-to-one. E.g.,
> > builtin/log.c needs to depend on git-log.txt, but also on git-show.txt
> > and git-format-patch.txt.
> 
> I was thinking of just a generated usage-strings.c or whatever. I.e. one
> file with every usage string in it. Then you don't need to hunt down
> which thing goes in which file. We'll just have a variable named
> "builtin_format_patch_usage". Include it in "builtin.h" and it doesn't
> matter if that's in builtin/log.c or builtin/format-patch.c.

Yes, though you have the opposite problem, then: what are the source
files that can produce that usage-strings.c? If your answer is
"Documentation/git-*.txt", then that is a recipe for flaky dependencies.

We already have one such for config-list.h. Try this:

  # introduce a new file
  printf 'foo.bar::\n\ta fake config var\n' \
    >Documentation/config/foo.txt

  # it shows up in the output, as expected
  make
  ./git help --config-for-completion | grep foo

  # now drop it
  rm Documentation/config/foo.txt

  # oops; make won't rebuild anything, and it's still there
  make
  ./git help --config-for-completion | grep foo

The bug is that config-list.h depends on a glob. So we notice when a
file changes, but not when one goes away. And this isn't just
hypothetical. Files come and go as you "git checkout" around history (or
bisect). I don't remember the details, but I'm pretty sure I've gotten
false positive test failures out of this before (maybe a topic branch
with a bogus entry that broke t0012 or t9902, and then moving back to
master doesn't fix it?).

So I'd prefer to avoid introducing more flakiness if we can. You might
be able to piggy back on command-list.txt in this case (that's what
makes command-list.h not flaky).

> It does mean you need to rebuild the C code for other built-ins every
> time one of the SYNOPSIS sections changes, but doesn't happen too often.

If it's a c/h combo that only makes the variable names public (not the
contents of the strings they point to), then it would only trigger a
rebuild when a command is added or removed.

> I think I either did or was planning to take it out of cache.h as part
> of that, we put way too much in cache.h.
> 
> Even advice.c depends on cache.h for its advice.h *sigh*.
> 
> Trying it just now putting advice.h in builtin.h instead leaves 10
> top-level files not-compiling (and they just need the header).
> 
> I think it's fine to include common utilties in our "included by
> everything" headers, but if we've got ~500 *.c files and something is
> only needed by ~20 of them (as in this case) we should probably just
> include it in those 20.

Oh, definitely, we should be shrinking cache.h, and not adding more to
it. Especially not generated stuff.

> >> 	make file-that-does-not-use-generated-header.o
> >> 
> >>    It sucks a bit to have e.g. every *.o depend on command-list.h, when
> >>    we only use it in 1-2 places, ditto the other headers.
> > But that is already true of non-generated headers. If your system
> > doesn't support computed deps, then all objects depend on all headers.
> 
> ... this does not build e.g. command-list.h:
> 
> 	make clean
> 	make grep.o
> 
> But this does:
> 
> 	make clean
> 	make help.o
> 
> Because we've manually declared that.

Right, but...does grep.o actually need command-list.h? If it doesn't
(and that seems to be the case), then all is working as intended. If
grep included some other header that included command-list.h, then yeah,
that would be a bug. But that is true whether grep.c includes it
directly or not. Any Makefile dependency needs to take into account
recursive includes.

> > Yes, that sucks for you. But almost nobody actively develops on such a
> > system, and people building Git to use rarely notice (because they are
> > doing an initial build, or jumping versions, both of which generally
> > require recompilation anyway).
> 
> I guess I'm "almost nobody" :) Not because I don't use computed deps,
> but because I tend to e.g. do large interactive rebases with:
> 
> 	git rebase -i 'make grep.o'
> 
> Or some other subset of our built assets/objects.

I do that all the time, too. But with computed deps, it works (by which
I mean it rebuilds only stuff needed by grep.o).

I'm not even sure what we're talking about anymore. If you are saying
that no, we don't want to just say "everything depends on foo.h that is
generated", I agree. That is wrong to do, and we should specify the
minimal dependencies where appropriate (and take care to keep that set
as small as is feasible using small C interfaces).

If you're saying "for people without computed dependencies, everything
will want to rebuild shell scripts", then I don't care. We decided long
ago that maintaining a manual list of header dependencies was not worth
doing, and people with sub-par compilers will have to suffer.

In the email you're replying to, I was trying to express the second one.
But it sounds like you thought I was trying to argue against the first
one.

-Peff

Junio C Hamano Oct. 28, 2022, 5:32 a.m. UTC | #10

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> First, for the purposes of this thread I think Jeff and I are far off
> into the weeds here :)

It is good to clearly separate where we would want to draw the line
for this round, to get the already-worked-on-and-immediately-available
improvement going, while we envision a future direction for the longer
term.

> But if we go for that: I think in this case & most I can think of
> generating the code from the docs is better (as that rough POC I had
> showed), because:
>
>  - You just need a shellscript to scrape the docs to make a *.c or *.h,
>    whereas you'd need a C compiler to make the docs if it's the other
>    way around. But more importantly:
>
>  - The docs are way easier to scrape with some sed/awk/grep/whatever
>    few-liner than to scrape C code for generating docs. E.g. see
>    config-list.h.

Scraping docs is easier because we do not have to choose from
multiple choices that are all reasonable ;-).  Either way, the
source material needs some discipline to keep it scrapable (e.g. to
keep the doc scrapable, you'd probably keep each entry a single
line, or a fixed format like "<token>::" followed by "^I(<token>) "
followed by description.  Nothing forbids us from giving developers
a similar rule to keep each entry in FOR_EACH_MSG_ID() macro easier
to scrape, so it is about the same difficulty going either way.

But if you choose to make the code the source of truth, you'd have
to see if it makes more sense to "compile and run" instead of
scraping.  That's another thing to consider and choose from, which
makes it harder ;-)

> But mainly it helps to have a use-case, replacing the linter script with
> e.g. the *.sh I demo'd might be a marginal improvement. But e.g. "git
> help -c" uses one of those generated files (config-list.h), and actually
> does something useful ...

Yes, I've shown how N_("explanation of the error") may fit into the
existing scheme in a separate message upthread.  If we go from code
to doc, it would be a reasonable starting point.

Whichever way the aout-generation goes, we'd complicate the build
dependency graph, which is a downside.  Another is that third-party
consumers of docs now need to generate some docs from the source,
which may be additional burden for them.

> Is there a good use-case for the fsck data like that? I'd think that
> we'd want to make sure the docs are in sync with the code, as in we're
> not adding new warnings/errors etc. without documenting them. But beyond
> that maybe not much, and people would just run "git help fsck" to get
> the list of variables..

"git help fsck-error-codes" that does not have a pre-generated
documentation (instead we'd just dump the N_("explanation") to the
output) is certainly tempting.  I am not sure if it would fly well.
When was the last time you saw a manpage that says "run this command
and view its output for the most up-to-date list" ;-)?

Jeff King Oct. 28, 2022, 5:35 a.m. UTC | #11

On Fri, Oct 28, 2022 at 05:11:07AM +0200, Ævar Arnfjörð Bjarmason wrote:

>  - The docs are way easier to scrape with some sed/awk/grep/whatever
>    few-liner than to scrape C code for generating docs. E.g. see
>    config-list.h.
> 
>  - Scraping the C code sucks so much that we'd probably make some
>    dedicated interface for it, e.g. what we have for "git <cmd>
>    --git-completion-helper".

In the general case, scraping C code is awful. But if you are looking
for one particular thing, it is not unreasonable to say something like
"when we write a usage string, we write it like:

  const char *git_cmdname_usage[] = "...";

> But mainly it helps to have a use-case, replacing the linter script with
> e.g. the *.sh I demo'd might be a marginal improvement. But e.g. "git
> help -c" uses one of those generated files (config-list.h), and actually
> does something useful ...

Yeah. In cases like config-list.h, it really is helpful for it to be
something we can query at run-time (the other option would be stuffing
them into git-completion.bash at build-time, but there are some
downsides there).

-Peff

Jeff King Oct. 28, 2022, 5:37 a.m. UTC | #12

On Thu, Oct 27, 2022 at 10:32:30PM -0700, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
> > First, for the purposes of this thread I think Jeff and I are far off
> > into the weeds here :)
> 
> It is good to clearly separate where we would want to draw the line
> for this round, to get the already-worked-on-and-immediately-available
> improvement going, while we envision a future direction for the longer
> term.

Yeah, to be clear, I'm OK with your linting script in the near term. I
didn't look at it _too_ carefully, but if the linter passes, then I
think the proof is in the pudding, and we can consider the topic done
for now.

-Peff

Ævar Arnfjörð Bjarmason Oct. 28, 2022, 10:41 a.m. UTC | #13

On Fri, Oct 28 2022, Jeff King wrote:

> On Fri, Oct 28, 2022 at 04:04:12AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > I think this particular case is tricky in that direction, because it's a
>> > big set of dependencies that aren't necessarily one-to-one. E.g.,
>> > builtin/log.c needs to depend on git-log.txt, but also on git-show.txt
>> > and git-format-patch.txt.
>> 
>> I was thinking of just a generated usage-strings.c or whatever. I.e. one
>> file with every usage string in it. Then you don't need to hunt down
>> which thing goes in which file. We'll just have a variable named
>> "builtin_format_patch_usage". Include it in "builtin.h" and it doesn't
>> matter if that's in builtin/log.c or builtin/format-patch.c.
>
> Yes, though you have the opposite problem, then: what are the source
> files that can produce that usage-strings.c? If your answer is
> "Documentation/git-*.txt", then that is a recipe for flaky dependencies.

First, you're not telling me anything new here :) I've looked at
literally all of these special-cases recently for t0450*.

> We already have one such for config-list.h. Try this:
>
>   # introduce a new file
>   printf 'foo.bar::\n\ta fake config var\n' \
>     >Documentation/config/foo.txt
>
>   # it shows up in the output, as expected
>   make
>   ./git help --config-for-completion | grep foo
>
>   # now drop it
>   rm Documentation/config/foo.txt
>
>   # oops; make won't rebuild anything, and it's still there
>   make
>   ./git help --config-for-completion | grep foo

Yes, much of the dependency graph in Documentation/Makefile is simply
broken, and that's not the worst offender.

Try adding a new category to command-list.txt, then remove it, and watch
your entire build fail until you 'git clean -dxf' (because
Documentation/Makefile will only remove the generated cmds-*.txt it
knows about, but another part includes things based on a wildcard.

> The bug is that config-list.h depends on a glob. So we notice when a
> file changes, but not when one goes away. And this isn't just
> hypothetical. Files come and go as you "git checkout" around history (or
> bisect). I don't remember the details, but I'm pretty sure I've gotten
> false positive test failures out of this before (maybe a topic branch
> with a bogus entry that broke t0012 or t9902, and then moving back to
> master doesn't fix it?).
>
> So I'd prefer to avoid introducing more flakiness if we can. You might
> be able to piggy back on command-list.txt in this case (that's what
> makes command-list.h not flaky).

I'm in violent agreement with you that this problem sucks, but it's also
trivial to solve: Just don't create such crappy dependency graphs in our
Makefiles.

So, in this case (and keep in mind this is off the cuff, and I'm not
advocating for doing this now) is:

 1. We already have an exhaustive list of built-ins in the top-level
    Makefile
 2. Turn those into corresponding Documentation/git-*.txt dependencies,
    e.g. builtin/rebase.o to Documentation/git-rebase.txt.
 3. Some won't exist on the other side, we have no
    builtin/cherry-pick.o, but a Documentation/git-cherry-pick.txt etc. I
    know, keep reading... :)
 4. Slurp up the *.txt, parse it (this code's mostly in t0450 already),
    spew out a generated *.c or *.h.

Critically this is based on an exhaustive known list at this point, if
we add/remove built-ins we *will* re-generate it. It's not using a
wildcard dependency.

At this point we have e.g. variables like:

 - git_builtin_bundle_usage
 - git_builtin_log_usage

But because we're not smart enough and/or it doesn't correspond to our
*.txt structure we don't have e.g.:

 - git_builtin_bundle_create_usage (a sub-command)
 - git_builtin_fsck_objects_usage

But that's fine.

We can just handle remaining special-cases by not removing the cases we
didn't handle from the C sources, we'll still need to maintain those on
both sides, and have t0450 hopefully catch them drifting.

I.e. a subsequent change to the C code would be something like:

	diff --git a/builtin/bundle.c b/builtin/bundle.c
	index e80efce3a42..557e00be5a2 100644
	--- a/builtin/bundle.c
	+++ b/builtin/bundle.c
	@@ -11,14 +11,7 @@
	  * bundle supporting "fetch", "pull", and "ls-remote".
	  */

	-static const char * const builtin_bundle_usage[] = {
	-  N_("git bundle create [<options>] <file> <git-rev-list args>"),
	-  N_("git bundle verify [<options>] <file>"),
	-  N_("git bundle list-heads <file> [<refname>...]"),
	-  N_("git bundle unbundle <file> [<refname>...]"),
	-  NULL
	-};
	-
	+/* too dumb for these still... */
	 static const char * const builtin_bundle_create_usage[] = {
	   N_("git bundle create [<options>] <file> <git-rev-list args>"),
	   NULL

We didn't handle everything, but that's OK. Perfect shouldn't be the
enemy of the good.

We have on the order of 150 built-ins, by far the majority of them are
relativtely easy to correlate to their docs & scrape them out.

The goal here would be to mostly get rid of the maintenance burden of
maintaining these in two places, if we still need to do that in 10-30
cases instead of 150-200 we've solved most of the problem.

>> It does mean you need to rebuild the C code for other built-ins every
>> time one of the SYNOPSIS sections changes, but doesn't happen too often.
>
> If it's a c/h combo that only makes the variable names public (not the
> contents of the strings they point to), then it would only trigger a
> rebuild when a command is added or removed.

Yeah, and if we didn't like that we can make it a file per variable
(which I think sucks), but a full re-build every time we add a built-in
isn't too bad. And we *way* worse now, every time some utility function
is added to a few common headers we basically do a re-build.

>> I think I either did or was planning to take it out of cache.h as part
>> of that, we put way too much in cache.h.
>> 
>> Even advice.c depends on cache.h for its advice.h *sigh*.
>> 
>> Trying it just now putting advice.h in builtin.h instead leaves 10
>> top-level files not-compiling (and they just need the header).
>> 
>> I think it's fine to include common utilties in our "included by
>> everything" headers, but if we've got ~500 *.c files and something is
>> only needed by ~20 of them (as in this case) we should probably just
>> include it in those 20.
>
> Oh, definitely, we should be shrinking cache.h, and not adding more to
> it. Especially not generated stuff.

*more violent agreement noises* :)

FWIW I have some patches to make headway on that stuff, which I've kept
unsubmitted because I thought nobody was interested...

I.e. a *very large* part of cache.h and friends is stuff that maybe 3-5
files here and there need, not "nearly everything", as one might assume.

We should still have e.g. gettext.h and the like in there,
probably.

Although for those we should probably split them off into builtin.h and
e.g. a new libs.h, so that we don't get translations added in
t/helper/*.c and the like, so maybe ...

>> >> 	make file-that-does-not-use-generated-header.o
>> >> 
>> >>    It sucks a bit to have e.g. every *.o depend on command-list.h, when
>> >>    we only use it in 1-2 places, ditto the other headers.
>> > But that is already true of non-generated headers. If your system
>> > doesn't support computed deps, then all objects depend on all headers.
>> 
>> ... this does not build e.g. command-list.h:
>> 
>> 	make clean
>> 	make grep.o
>> 
>> But this does:
>> 
>> 	make clean
>> 	make help.o
>> 
>> Because we've manually declared that.
>
> Right, but...does grep.o actually need command-list.h? If it doesn't
> (and that seems to be the case), then all is working as intended. If
> grep included some other header that included command-list.h, then yeah,
> that would be a bug. But that is true whether grep.c includes it
> directly or not. Any Makefile dependency needs to take into account
> recursive includes.

Yes, it doesn't need it, and it shouldn't build it. I'm saying that this
is the reason we've needed manually maintained dependencies of these
generated headers.

I.e. if you take the easy way out and say that we have no generated deps
yet, let's build all of it you would build that stuff with grep.o for no
reason.

And yes, I'm aware that that's what we do right now. I'm saying that
it's probably OK because we've had 3-4 of these headers, but people
might mind if it's more, or a lot more...

>> > Yes, that sucks for you. But almost nobody actively develops on such a
>> > system, and people building Git to use rarely notice (because they are
>> > doing an initial build, or jumping versions, both of which generally
>> > require recompilation anyway).
>> 
>> I guess I'm "almost nobody" :) Not because I don't use computed deps,
>> but because I tend to e.g. do large interactive rebases with:
>> 
>> 	git rebase -i 'make grep.o'
>> 
>> Or some other subset of our built assets/objects.
>
> I do that all the time, too. But with computed deps, it works (by which
> I mean it rebuilds only stuff needed by grep.o).
>
> I'm not even sure what we're talking about anymore. If you are saying
> that no, we don't want to just say "everything depends on foo.h that is
> generated", I agree. That is wrong to do, and we should specify the
> minimal dependencies where appropriate (and take care to keep that set
> as small as is feasible using small C interfaces).
>
> If you're saying "for people without computed dependencies, everything
> will want to rebuild shell scripts", then I don't care. We decided long
> ago that maintaining a manual list of header dependencies was not worth
> doing, and people with sub-par compilers will have to suffer.
>
> In the email you're replying to, I was trying to express the second one.
> But it sounds like you thought I was trying to argue against the first
> one.

I'm saying there's a large chasm between using generated dependencies,
and using them all the time.

Maybe my dev setup is just different, but I do most of my development in
a primary development tree where I've generally built things already, so
all the incremental compilation we support works out nicely.

But I also do things like spin up different workrtees for different
revisions, build some subset, and destroy them afterwards. Those don't
benefit from computed dependencies, so it's nice that we're not
over-building things.

Anyway:

All I was trying to get across with the upthread was that I had further
patches in this area that might be useful, but that part of the reason I
hadn't submitted them was because I didn't want to introduce suckage
with them.

We don't make a lot of use of generated code, and the uses we have now
are really isolated (one header used by one *.c file, mostly).

If we make that a one<->many we should be careful, and there's some
dragons lurking e.g. for generated *.h's used by other *.h's (all our
use is by *.c's at this point).

[v3,4/4] Documentation: add lint-fsck-msgids

Commit Message

Comments

Patch