Message ID | 20190607010708.46654-1-emilyshaffer@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | documentation: add tutorial for revision walking |
On Thu, Jun 6, 2019 at 9:08 PM Emily Shaffer <emilyshaffer@google.com> wrote: > [...] > The tutorial covers a basic overview of the structs involved during > revision walk, setting up a basic commit walk, setting up a basic > all-object walk, and adding some configuration changes to both walk > types. It intentionally does not cover how to create new commands or > search for options from the command line or gitconfigs. > [...] > Signed-off-by: Emily Shaffer <emilyshaffer@google.com> > --- > diff --git a/Documentation/.gitignore b/Documentation/.gitignore > @@ -12,6 +12,7 @@ cmds-*.txt > SubmittingPatches.txt > +MyFirstRevWalk.txt The new file itself is named Documentation/MyFirstRevWalk.txt, so why add it to .gitignore? > diff --git a/Documentation/MyFirstRevWalk.txt b/Documentation/MyFirstRevWalk.txt > @@ -0,0 +1,826 @@ > +== What's a Revision Walk? > + > +The revision walk is a key concept in Git - this is the process that underpins > +operations like `git log`, `git blame`, and `git reflog`. Beginning at HEAD, the > +list of objects is found by walking parent relationships between objects. The > +revision walk can also be usedto determine whether or not a given object is s/usedto/used to/ > +reachable from the current HEAD pointer. > + > +We'll put our fiddling into a new command. For fun, let's name it `git walken`. > +Open up a new file `builtin/walken.c` and set up the command handler: > + > +---- > +/* > + * "git walken" > + * > + * Part of the "My First Revision Walk" tutorial. > + */ > + > +#include <stdio.h> > +#include "builtin.h" Git source files must always include cache.h or git-compat-util.h (or, for builtins, builtin.h) as the very first header since those headers take care of differences which might crop up as problems with system headers on various platforms. System headers are included after Git headers. So, stdio.h should be included after builtin.h. 
In this case, however, stdio.h will get pulled in by git-compat-util.h anyhow, so you need not include it here. > +Add usage text and `-h` handling, in order to pass the test suite: > + > +---- > +static const char * const walken_usage[] = { > + N_("git walken"), > + NULL, > +} Unless you plan on referencing this from functions other than cmd_walken(), it need not be global. > +int cmd_walken(int argc, const char **argv, const char *prefix) > +{ > + struct option options[] = { > + OPT_END() > + }; > + > + argc = parse_options(argc, argv, prefix, options, walken_usage, 0); > + > + ... Perhaps comment out the "..." or remove it altogether to avoid having the compiler barf when the below instructions tell the reader to build the command. > +} > + > +Also add the relevant line in builtin.h near `cmd_whatchanged()`: s/builtin.h/`&`/ > +Build and test out your command, without forgetting to ensure the `DEVELOPER` > +flag is set: > + > +---- > +echo DEVELOPER=1 >config.mak This will blast existing content of 'config.mak' which could be dangerous. It might be better to suggest >> instead. > +`name` is the SHA-1 of the object - a 40-digit hex string you may be familiar > +with from using Git to organize your source in the past. Check the tutorial > +mentioned above towards the top for a discussion of where the SHA-1 can come > +from. With all the recent work to move away from SHA-1 and to support other hash functions, perhaps just call this "object ID" rather than SHA-1, and drop mention of it being exactly 40 digits. Instead, perhaps say something like "...is the hexadecimal representation of the object ID...". > +== Basic Commit Walk > + > +First, let's see if we can replicate the output of `git log --oneline`. We'll > +refer back to the implementation frequently to discover norms when performing > +a revision walk of our own. > + > +We'll need all the commits, in order, which preceded our current commit. We will > +also need to know the name and subject. 
This paragraph confused me. I read it as these being prerequisites I would somehow have to provide in order to write the code. Perhaps it can be rephrased to state that this is what the code will be doing. Maybe: "To do this, we will find all the commits, in order, which precede the current commit, and extract from them the name and subject [of the commit message]" or something. > +=== Setting Up > + > +Preparing for your revision walk has some distinct stages. > + > +1. Perform default setup for this mode, and others which may be invoked. > +2. Check configuration files for relevant settings. > +3. Set up the rev_info struct. > +4. Tweak the initialized rev_info to suit the current walk. > +5. Prepare the rev_info for the walk. s/rev_info/`&`/ in the above three lines. > +==== Default Setups > + > +Before you begin to examine user configuration for your revision walk, it's > +common practice for you to initialize to default any switches that your command > +may have, as well as ask any other components you may invoke to initialize as > +well. `git log` does this in `init_log_defaults()`; in that case, one global > +`decoration_style` is initialized, as well as the grep and diff-UI components. > + > +For our purposes, within `git walken`, for the first example we do we don't "we do we don't"? > +intend to invoke anything, and we don't have any configuration to do. However, "invoke anything" is pretty nebulous, as is the earlier "components you may invoke". A newcomer is unlikely to know what this means, so perhaps it needs an example (even if just a short parenthetical comment). > +we may want to add some later, so for now, we can add an empty placeholder. > +Create a new function in `builtin/walken.c`: > + > +---- > +static void init_walken_defaults(void) > +{ > + /* We don't actually need the same components `git log` does; leave this > + * empty for now. > + */ > +} /* * Git multi-line comments * are formatted like this. 
*/ > +Add a new function to `builtin/walken.c`: > + > +---- > +static int git_walken_config(const char *var, const char *value, void *cb) > +{ > + /* For now, let's not bother with anything. */ > + return git_default_config(var, value, cb); > +} Comment is somewhat confusing. Perhaps say instead "We don't currently have custom configuration, so fall back to git_default_config()" or something. > +==== Setting Up `rev_info` > + > +Now that we've gathered external configuration and options, it's time to > +initialize the `rev_info` object which we will use to perform the walk. This is > +typically done by calling `repo_init_revisions()` with the repository you intend > +to target, as well as the prefix and your `rev_info` struct. Maybe: s/the prefix/the `&` argument of `cmd_walken`/ > +Add the `struct rev_info` and the `repo_init_revisions()` call: > +---- > +int cmd_walken(int argc, const char **argv, const char *prefix) > +{ > + /* This can go wherever you like in your declarations.*/ > + struct rev_info rev; > + ... A less verbose way to indicate the same without using a /* comment */: ... struct rev_info rev; ... > + /* This should go after the git_config() call. */ > + repo_init_revisions(the_repository, &rev, prefix); > +} > +---- > +static void final_rev_info_setup(struct rev_info *rev) > +{ > + /* We want to mimick the appearance of `git log --oneline`, so let's > + * force oneline format. */ s/mimick/mimic/ /* * Multi-line * comment. */ > +==== Preparing `rev_info` For the Walk > + > +Now that `rev` is all initialized and configured, we've got one more setup step > +before we get rolling. We can do this in a helper, which will both prepare the > +`rev_info` for the walk, and perform the walk itself. Let's start the helper > +with the call to `prepare_revision_walk()`. > + > +---- > +static int walken_commit_walk(struct rev_info *rev) > +{ > + /* prepare_revision_walk() gets the final steps ready for a revision > + * walk. We check the return value for errors. 
*/

Not at all sure what this comment is trying to say. Also, the second sentence adds no value to what the code itself already says clearly by actually checking the return value.

> +	if (prepare_revision_walk(rev))
> +		die(_("revision walk setup failed"));
> +}
> +==== Performing the Walk!
> +
> +Finally! We are ready to begin the walk itself. Now we can see that `rev_info`
> +can also be used as an iterator; we move to the next item in the walk by using
> +`get_revision()` repeatedly. Add the listed variable declarations at the top and
> +the walk loop below the `prepare_revision_walk()` call within your
> +`walken_commit_walk()`:
> +
> +----
> +static int walken_commit_walk(struct rev_info *rev)
> +{
> +	struct commit *commit;
> +	struct strbuf prettybuf;
> +	strbuf_init(&prettybuf, 0);

More idiomatic:

	struct strbuf prettybuf = STRBUF_INIT;

> +	while ((commit = get_revision(rev)) != NULL) {
> +		if (commit == NULL)
> +			continue;

Idiomatic Git code doesn't mention NULL explicitly in conditionals, so:

	while ((commit = get_revision(rev))) {
		if (!commit)
			continue;

> +		strbuf_reset(&prettybuf);
> +		pp_commit_easy(CMIT_FMT_ONELINE, commit, &prettybuf);

Earlier, you talked about calling get_commit_format("oneline",...) to get "oneline" output, so what is the purpose of CMIT_FMT_ONELINE here? The text should explain more clearly what these two different "oneline"-related bits mean.

> +		printf(_("%s\n"), prettybuf.buf);

There is nothing here to localize, so drop _(...):

	printf("%s\n", prettybuf.buf);

or perhaps just:

	puts(prettybuf.buf);

> +	}
> +
> +	return 0;
> +}

What does the return value signify?

> +=== Adding a Filter
> +
> +Next, we can modify the `grep_filter`. This is done with convenience functions
> +found in `grep.h`. For fun, we're filtering to only commits from folks using a
> +gmail.com email address - a not-very-precise guess at who may be working on Git

Perhaps?
s/gmail.com/`&`/ > +=== Changing the Order > + > +Let's see what happens when we run with `REV_SORT_BY_COMMIT_DATE` as opposed to > +`REV_SORT_BY_AUTHOR_DATE`. Add the following: > + > +static void final_rev_info_setup(int argc, const char **argv, > + const char *prefix, struct rev_info *rev) > +{ > + ... > + > + rev->topo_order = 1; > + rev->sort_order = REV_SORT_BY_COMMIT_DATE; The assignment to rev->sort_order is obvious enough, but the rev->topo_order assignment is quite mysterious to someone coming to this tutorial to learn about revision walking, thus some commentary explaining 'topo_order' would be a good idea. > +Finally, compare the two. This is a little less helpful without object names or > +dates, but hopefully we get the idea. > + > +---- > +$ diff -u commit-date.txt author-date.txt > +---- > + > +This display is an indicator for the latency between publishing a commit for > +review the first time, and getting it actually merged into master. Perhaps: s/master/`&`/ Even as a long-time contributor to the project, I had to pause over this statement for several seconds before figuring out what it was talking about. Without a long-winded explanation of how topics progress from submission through 'pu' through 'next' through 'master' and finally into a release, the above statement is likely to be mystifying to a newcomer. Perhaps it should be dropped. > +Let's try one more reordering of commits. `rev_info` exposes a `reverse` flag. > +However, it needs to be applied after `add_head_to_pending()` is called. Find This leaves the reader hanging, wondering why 'reverse' needs to be assigned after add_head_to_pending(). > +== Basic Object Walk > + > +static void walken_show_commit(struct commit *cmt, void *buf) > +{ > + commit_count++; > +} > +---- > + > +Since we have the `struct commit` object, we can look at all the same parts that > +we looked at in our earlier commit-only walk. 
For the sake of this tutorial, > +though, we'll just increment the commit counter and move on. This leaves the reader wondering what 'buf' is and what it's used for. Presumably this is the 'show_data' context mentioned earlier? If so, perhaps name this 'ctxt' or 'context' or something and, because this is a tutorial trying to teach revision walking, say a quick word about how it might be used. > +static void walken_show_object(struct object *obj, const char *str, void *buf) > +{ > + switch (obj->type) { > + [...] > + case OBJ_COMMIT: > + printf(_("Unexpectedly encountered a commit in " > + "walken_show_object!\n")); > + commit_count++; > + break; > + default: > + printf(_("Unexpected object type %s!\n"), > + type_name(obj->type)); > + break; > + } > +} Modern practice in this project is to start error messages with lowercase and to not punctuate the end (no need for "!"). Also, same complaint about the mysterious 'str' argument to the callback as for 'buf' mentioned above. > +To help assure us that we aren't double-counting commits, we'll include some > +complaining if a commit object is routed through our non-commit callback; we'll > +also complain if we see an invalid object type. Are these two error cases "impossible" conditions or can they actually arise in practice? If the former, use die() instead and drop use of _(...) so as to avoid confusing the reader into thinking that the behavior is indeterminate. > +Our main object walk implementation is substantially different from our commit > +walk implementation, so let's make a new function to perform the object walk. We > +can perform setup which is applicable to all objects here, too, to keep separate > +from setup which is applicable to commit-only walks. > + > +---- > +static int walken_object_walk(struct rev_info *rev) > +{ > +} > +---- This skeleton function definition is populated immediately below, so it's not clear why it needs to be shown here. 
> +We'll start by enabling all types of objects in the `struct rev_info`, and
> +asking to have our trees and blobs shown in commit order. We'll also exclude
> +promisors as the walk becomes more complicated with those types of objects. When
> +our settings are ready, we'll perform the normal revision walk setup and
> +initialize our tracking variables.
> +
> +----
> +static int walken_object_walk(struct rev_info *rev)
> +{
> +	rev->tree_objects = 1;
> +	rev->blob_objects = 1;
> +	rev->tag_objects = 1;
> +	rev->tree_blobs_in_commit_order = 1;
> +	rev->exclude_promisor_objects = 1;
> +	[...]
> +----
> +
> +Unless you cloned or fetched your repository earlier with a filter,
> +`exclude_promisor_objects` is unlikely to make a difference, but we'll turn it
> +on just to make sure our lives are simple. We'll also turn on
> +`tree_blobs_in_commit_order`, which means that we will walk a commit's tree and
> +everything it points to immediately after we find each commit, as opposed to
> +waiting for the end and walking through all trees after the commit history has
> +been discovered.

This paragraph is repeating much of the information in the paragraph just above the code snippet. One or the other should be dropped or thinned to avoid the duplication.

> +Let's start by calling just the unfiltered walk and reporting our counts.
> +Complete your implementation of `walken_object_walk()`:
> +
> +----
> +	traverse_commit_list(rev, walken_show_commit, walken_show_object, NULL);
> +
> +	printf(_("Object walk completed. Found %d commits, %d blobs, %d tags, "
> +		"and %d trees.\n"), commit_count, blob_count, tag_count,
> +		tree_count);

Or make the output more useful by having it be machine-parseable (and not localized):

	printf("commits %d\nblobs %d\ntags %d\ntrees %d\n",
		commit_count, blob_count, tag_count, tree_count);

> +	return 0;
> +}

What does the return value signify?

> +Now we can try to run our command!
It should take noticeably longer than the > +commit walk, but an examination of the output will give you an idea why - for > +example: > + > +---- > +Object walk completed. Found 55733 commits, 100274 blobs, 0 tags, and 104210 trees. > +---- > + > +This makes sense. We have more trees than commits because the Git project has > +lots of subdirectories which can change, plus at least one tree per commit. We > +have no tags because we started on a commit (`HEAD`) and while tags can point to > +commits, commits can't point to tags. > + > +NOTE: You will have different counts when you run this yourself! The number of > +objects grows along with the Git project. Not sure if this NOTE is useful; after all, you introduced the output by saying "for example". > +=== Adding a Filter > + > +There are a handful of filters that we can apply to the object walk laid out in > +`Documentation/rev-list-options.txt`. These filters are typically useful for > +operations such as creating packfiles or performing a partial or shallow clone. > +They are defined in `list-objects-filter-options.h`. For the purposes of this > +tutorial we will use the "tree:1" filter, which causes the walk to omit all > +trees and blobs which are not directly referenced by commits reachable from the > +commit in `pending` when the walk begins. (In our case, that means we omit trees > +and blobs not directly referenced by HEAD or HEAD's history.) Need some explanation of what 'pending' is, as it's just mysterious as written. > +First, we'll need to `#include "list-objects-filter-options.h`". Then, we can > +set up the `struct list_objects_filter_options` and `struct oidset` at the top > +of `walken_object_walk()`: > + > +---- > +static int walken_object_walk(struct rev_info *rev) > +{ > + struct list_objects_filter_options filter_options = {}; > + struct oidset omitted; > + oidset_init(&omitted, 0); > + ... 
This 'omitted' is so far removed from the description of the 'omitted' argument to traverse_commit_list_filtered() way earlier in the tutorial that a reader is likely to have forgotten what it's about (indeed, I did). Some explanation, even if superficial, is likely warranted here or at least mention that it is explained in more detail below (as I discovered). > +After we run `traverse_commit_list_filtered()` we would also be able to examine > +`omitted`, which is a linked-list of all objects we did not include in our walk. > +Since all omitted objects are included, the performance of > +`traverse_commit_list_filtered()` with a non-null `omitted` arument is equitable s/arument/argument/ > +with the performance of `traverse_commit_list()`; so for our purposes, we leave > +it null. It's easy to provide one and iterate over it, though - check `oidset.h` > +for the declaration of the accessor methods for `oidset`. I'm confused. What are we leaving NULL here? > +=== Changing the Order > + > +Finally, let's demonstrate that you can also reorder walks of all objects, not > +just walks of commits. First, we'll make our handlers chattier - modify > +`walken_show_commit()` and `walken_show_object` to print the object as they go: s/walken_show_object/&()/ > +static void walken_show_commit(struct commit *cmt, void *buf) > +{ > + printf(_("commit: %s\n"), oid_to_hex(&cmt->object.oid)); > + commit_count++; > +} Is there a bunch of trailing whitespace on these lines of the code sample (and in some lines below)? > +static void walken_show_object(struct object *obj, const char *str, void *buf) > +{ > + printf(_("%s: %s\n"), type_name(obj->type), oid_to_hex(&obj->oid)); Localizing "%s: %s\n" via _(...) probably doesn't add value, which implies that you might not want to be localizing "commit" above either. > +(Try to leave the counter increment logic in place in `walken_show_object()`.) 
> + > +With only that change, run again (but save yourself some scrollback): > + > +---- > +$ ./bin-wrappers/git walken | head -n 10 > +---- > + > +Take a look at the top commit with `git show` and the OID you printed; it should > +be the same as the output of `git show HEAD`. I think this is the first use of "OID", which might be mysterious and confusing to a newcomer. Earlier, you used SHA-1 and I suggested "object ID" instead. Perhaps use the same here, or define OID earlier in the document in place of SHA-1. > +Next, let's change a setting on our `struct rev_info` within > +`walken_object_walk()`. Find where you're changing the other settings on `rev`, > +such as `rev->tree_objects` and `rev->tree_blobs_in_commit_order`, and add > +another setting at the bottom: Instead of nebulous "another setting", mentioning 'reverse' explicitly would make this clearer. > + rev->tree_objects = 1; > + rev->blob_objects = 1; > + rev->tag_objects = 1; > + rev->tree_blobs_in_commit_order = 1; > + rev->exclude_promisor_objects = 1; > + rev->reverse = 1;
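Two points raised in this review can be illustrated outside of Git: the implicit `NULL` test in the walk loop, and what the `void *` argument passed to the show callbacks is for. The following is only a sketch under stated assumptions. It walks a toy single-parent history, and every name prefixed `toy_` is hypothetical and not part of Git's API; the real walk uses `struct rev_info`, `get_revision()`, and `traverse_commit_list()` as quoted from the patch. Because it is standalone, it includes a system header directly, which code inside Git would not do.

```c
#include <stddef.h>

/* Hypothetical stand-in for Git's struct commit: one parent only. */
struct toy_commit {
	const char *subject;
	struct toy_commit *parent;
};

/* Hypothetical stand-in for struct rev_info: just a cursor. */
struct toy_walk {
	struct toy_commit *next;
};

/* State the caller wants the callback to update (instead of globals). */
struct toy_counts {
	int commits;
};

typedef void (*toy_show_fn)(struct toy_commit *commit, void *ctx);

/* Return the next commit in the walk, or NULL once the walk is done. */
static struct toy_commit *toy_get_revision(struct toy_walk *walk)
{
	struct toy_commit *commit = walk->next;

	if (commit)
		walk->next = commit->parent;
	return commit;
}

/* The context pointer carries the caller's counters into the callback. */
void toy_show_commit(struct toy_commit *commit, void *ctx)
{
	struct toy_counts *counts = ctx;

	(void)commit; /* a real callback would inspect the commit here */
	counts->commits++;
}

/* Walk from tip towards the root; returns how many commits were shown. */
int toy_walk_commits(struct toy_commit *tip, toy_show_fn show, void *ctx)
{
	struct toy_walk walk = { tip };
	struct toy_commit *commit;
	int visited = 0;

	/* No explicit comparison against NULL, and no re-check in the body. */
	while ((commit = toy_get_revision(&walk))) {
		show(commit, ctx);
		visited++;
	}
	return visited;
}
```

Calling `toy_walk_commits(&tip, toy_show_commit, &counts)` on a three-commit chain leaves `counts.commits` at 3. Passing the counters through the context pointer is one possible answer to the question of what `buf` is for; whether the tutorial should adopt it is left to the patch author.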
Emily Shaffer <emilyshaffer@google.com> writes:

> I'll also be mailing an RFC patchset In-Reply-To this message; the RFC
> patchset should not be merged to Git, as I intend to host it in my own
> mirror as an example. I hosted a similar example for the
> MyFirstContribution tutorial; it's visible at
> https://github.com/nasamuffin/git/tree/psuh. There might be a better
> place to host these so I don't "own" them but I'm not sure what it is;
> keeping them as a live branch somewhere struck me as an okay way to keep
> them from getting stale.

Yes, writing the initial version is one thing, but keeping it alive is more work and more important. As the underlying API changes over time, it will become necessary to update the sample implementation. But for a newbie who wants to learn by building "walken" on top of the then-current codebase and API, it would not be so helpful to show "these 7 patches were for an older codebase, and the tip 2 are incremental updates to adjust to the newer API". So the maintenance of these sample patches may need a different paradigm than the norm for our main codebase, which values incremental polishing.
Emily Shaffer <emilyshaffer@google.com> writes:

> +My First Revision Walk
> +======================
> +
> +== What's a Revision Walk?
> +
> +The revision walk is a key concept in Git - this is the process that underpins
> +operations like `git log`, `git blame`, and `git reflog`. Beginning at HEAD, the
> +list of objects is found by walking parent relationships between objects. The
> +revision walk can also be usedto determine whether or not a given object is
> +reachable from the current HEAD pointer.

s/usedto/used to/;

> +We'll put our fiddling into a new command. For fun, let's name it `git walken`.
> +Open up a new file `builtin/walken.c` and set up the command handler:
> +
> +----
> +/*
> + * "git walken"
> + *
> + * Part of the "My First Revision Walk" tutorial.
> + */
> +
> +#include <stdio.h>

Bad idea. In the generic part of the codebase, system headers are supposed to be supplied by including git-compat-util.h (or cache.h or builtin.h, which are common header files that begin by including it and are allowed by CodingGuidelines to be used as such).

> +#include "builtin.h"
> +
> +int cmd_walken(int argc, const char **argv, const char *prefix)
> +{
> +	printf(_("cmd_walken incoming...\n"));
> +	return 0;
> +}
> +----

I wonder if it makes sense to use trace instead of printf, as our reader has already seen the psuh example for doing the above.

> +Add usage text and `-h` handling, in order to pass the test suite:

It is not wrong per se, and it is indeed a very good practice to make sure that our subcommands consistently give usage text and short usage. Encouraging them early is a good idea. But "in order to pass the test suite" invites "eh, the test suite does not pass without usage and -h? why?". Either drop the mention of "the test suite", or perhaps say something like

    Add usage text and `-h` handling, like all the subcommands should
    consistently do (our test suite will notice and complain if you
    fail to do so).

i.e. the real purpose is consistency and usability; the test suite is merely an enforcement mechanism.

> +----
> +{ "walken", cmd_walken, RUN_SETUP },
> +----
> +
> +Add it to the `Makefile` near the line for `builtin\worktree.o`:

Backslash intended?
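To make the "consistency" point concrete, here is a minimal standalone sketch of what `-h` handling amounts to: a NULL-terminated usage array and an early exit that prints it. This is an assumption-laden illustration, not Git's implementation. `toy_cmd_walken` and `toy_usage` are hypothetical names, the output format is invented, and in the patch itself this job is delegated to `parse_options()`. The usage array is kept local to the function, in line with the remark elsewhere in this thread that it need not be global. As a standalone program it includes system headers directly; inside Git those would come from `builtin.h`.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for a builtin's entry point. Returns 1 when
 * usage was printed because of `-h`, and 0 when the command would go
 * on to do its real work.
 */
int toy_cmd_walken(int argc, const char **argv, FILE *out)
{
	/* Local, since nothing outside this function refers to it. */
	static const char * const toy_usage[] = {
		"git walken",
		NULL,
	};
	const char * const *line;

	if (argc == 2 && !strcmp(argv[1], "-h")) {
		/* Walk the array until the NULL terminator. */
		for (line = toy_usage; *line; line++)
			fprintf(out, "usage: %s\n", *line);
		return 1;
	}
	return 0;
}
```

A caller would pass its own `argc`/`argv` and a stream such as `stdout`; a return value of 1 means the usage text was shown and no further work should be done.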
Eric Sunshine <sunshine@sunshineco.com> writes:

>> +/*
>> + * "git walken"
>> + *
>> + * Part of the "My First Revision Walk" tutorial.
>> + */
>> +
>> +#include <stdio.h>
>> +#include "builtin.h"
>
> Git source files must always include cache.h or git-compat-util.h (or,
> for builtins, builtin.h) as the very first header since those headers
> take care of differences which might crop up as problems with system
> headers on various platforms. System headers are included after Git
> headers. So, stdio.h should be included after builtin.h. In this case,

Actually the idea is that platform agnostic part of the codebase should not have to include _any_ system header themselves; instead, including git-compat-util.h should take care of the system header files *including* the funky ordering requirements some platforms may have. So, we'd want to go stronger than "should be included after"; it shouldn't have to be included or our git-compat-util.h is wrong.

I've started reading the patch myself, but it seems that you've already done a lot more thorough read-thru than I would have done, so thank you very much for that.
On Mon, Jun 10, 2019 at 5:27 PM Junio C Hamano <gitster@pobox.com> wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
> >> +#include <stdio.h>
> >> +#include "builtin.h"
> >
> > Git source files must always include cache.h or git-compat-util.h (or,
> > for builtins, builtin.h) as the very first header since those headers
> > take care of differences which might crop up as problems with system
> > headers on various platforms. System headers are included after Git
> > headers. So, stdio.h should be included after builtin.h. In this case,
>
> Actually the idea is that platform agnostic part of the codebase
> should not have to include _any_ system header themselves; instead,
> including git-compat-util.h should take care of the system header
> files *including* the funky ordering requirements some platforms may
> have. So, we'd want to go stronger than "should be included after";
> it shouldn't have to be included or our git-compat-util.h is wrong.

Thanks for clarifying that.
On Fri, Jun 07, 2019 at 02:21:07AM -0400, Eric Sunshine wrote: > On Thu, Jun 6, 2019 at 9:08 PM Emily Shaffer <emilyshaffer@google.com> wrote: > > [...] > > The tutorial covers a basic overview of the structs involved during > > revision walk, setting up a basic commit walk, setting up a basic > > all-object walk, and adding some configuration changes to both walk > > types. It intentionally does not cover how to create new commands or > > search for options from the command line or gitconfigs. > > [...] > > Signed-off-by: Emily Shaffer <emilyshaffer@google.com> > > --- > > diff --git a/Documentation/.gitignore b/Documentation/.gitignore > > @@ -12,6 +12,7 @@ cmds-*.txt > > SubmittingPatches.txt > > +MyFirstRevWalk.txt > > The new file itself is named Documentation/MyFirstRevWalk.txt, so why > add it to .gitignore? Yep, fixed. Holdover from an initial attempt which named the file MyFirstRevWalk (no extension), which was then corrected for the earlier tutorial I sent. Thanks. > > > diff --git a/Documentation/MyFirstRevWalk.txt b/Documentation/MyFirstRevWalk.txt > > @@ -0,0 +1,826 @@ > > +== What's a Revision Walk? > > + > > +The revision walk is a key concept in Git - this is the process that underpins > > +operations like `git log`, `git blame`, and `git reflog`. Beginning at HEAD, the > > +list of objects is found by walking parent relationships between objects. The > > +revision walk can also be usedto determine whether or not a given object is > > s/usedto/used to/ Done. > > > +reachable from the current HEAD pointer. > > + > > +We'll put our fiddling into a new command. For fun, let's name it `git walken`. > > +Open up a new file `builtin/walken.c` and set up the command handler: > > + > > +---- > > +/* > > + * "git walken" > > + * > > + * Part of the "My First Revision Walk" tutorial. 
> > + */ > > + > > +#include <stdio.h> > > +#include "builtin.h" > > Git source files must always include cache.h or git-compat-util.h (or, > for builtins, builtin.h) as the very first header since those headers > take care of differences which might crop up as problems with system > headers on various platforms. System headers are included after Git > headers. So, stdio.h should be included after builtin.h. In this case, > however, stdio.h will get pulled in by git-compat-util.h anyhow, so > you need not include it here. Done. > > > +Add usage text and `-h` handling, in order to pass the test suite: > > + > > +---- > > +static const char * const walken_usage[] = { > > + N_("git walken"), > > + NULL, > > +} > > Unless you plan on referencing this from functions other than > cmd_walken(), it need not be global. Done; bad C++ habits sneaking in. :) > > > +int cmd_walken(int argc, const char **argv, const char *prefix) > > +{ > > + struct option options[] = { > > + OPT_END() > > + }; > > + > > + argc = parse_options(argc, argv, prefix, options, walken_usage, 0); > > + > > + ... > > Perhaps comment out the "..." or remove it altogether to avoid having > the compiler barf when the below instructions tell the reader to build > the command. Hmm. That part I'm not so sure about. I like to use the "..." to indicate where the code in the snippet should be added around the other code already in the file - which I suppose it does just as clearly if it's commented - but I also hope folks are not simply copy-pasting blindly from the tutorial. It seems like including uncommented "..." in code tutorials is pretty common. I don't think I have a good reason to push back on this except that I think "/* ... */" is ugly :) I'll go through and replace "..." with some actual hints about what's supposed to go there; for example, here I'll replace with "/* print and return */". > > > +} > > + > > +Also add the relevant line in builtin.h near `cmd_whatchanged()`: > > s/builtin.h/`&`/ Done. 
> > > +Build and test out your command, without forgetting to ensure the `DEVELOPER` > > +flag is set: > > + > > +---- > > +echo DEVELOPER=1 >config.mak > > This will blast existing content of 'config.mak' which could be > dangerous. It might be better to suggest >> instead. Done. > > > +`name` is the SHA-1 of the object - a 40-digit hex string you may be familiar > > +with from using Git to organize your source in the past. Check the tutorial > > +mentioned above towards the top for a discussion of where the SHA-1 can come > > +from. > > With all the recent work to move away from SHA-1 and to support other > hash functions, perhaps just call this "object ID" rather than SHA-1, > and drop mention of it being exactly 40 digits. Instead, perhaps say > something like "...is the hexadecimal representation of the object > ID...". Good point. Will do. > > > +== Basic Commit Walk > > + > > +First, let's see if we can replicate the output of `git log --oneline`. We'll > > +refer back to the implementation frequently to discover norms when performing > > +a revision walk of our own. > > + > > +We'll need all the commits, in order, which preceded our current commit. We will > > +also need to know the name and subject. > > This paragraph confused me. I read it as these being prerequisites I > would somehow have to provide in order to write the code. Perhaps it > can be rephrased to state that this is what the code will be doing. > Maybe: "To do this, we will find all the commits, in order, which > precede the current commit, and extract from them the name and subject > [of the commit message]" or something. Yeah, good point. Thanks - this is the kind of thing that sounds logical when you write it but not when you read it later :) > > > +=== Setting Up > > + > > +Preparing for your revision walk has some distinct stages. > > + > > +1. Perform default setup for this mode, and others which may be invoked. > > +2. Check configuration files for relevant settings. > > +3. 
Set up the rev_info struct. > > +4. Tweak the initialized rev_info to suit the current walk. > > +5. Prepare the rev_info for the walk. > > s/rev_info/`&`/ in the above three lines. Done. > > > +==== Default Setups > > + > > +Before you begin to examine user configuration for your revision walk, it's > > +common practice for you to initialize to default any switches that your command > > +may have, as well as ask any other components you may invoke to initialize as > > +well. `git log` does this in `init_log_defaults()`; in that case, one global > > +`decoration_style` is initialized, as well as the grep and diff-UI components. > > + > > +For our purposes, within `git walken`, for the first example we do we don't > > "we do we don't"? > > > +intend to invoke anything, and we don't have any configuration to do. However, > > "invoke anything" is pretty nebulous, as is the earlier "components > you may invoke". A newcomer is unlikely to know what this means, so > perhaps it needs an example (even if just a short parenthetical > comment). I have tried to reword this; I hope this is a little clearer. Before you begin to examine user configuration for your revision walk, it's common practice for you to initialize to default any switches that your command may have, as well as ask any other components you may invoke to initialize as well (for example, how `git log` also uses the `grep` and `diff` components). `git log` does this in `init_log_defaults()`; in that case, one global `decoration_style` is initialized, as well as the grep and diff-UI components. For our purposes, within `git walken`, for the first example we don't intend to use any other components within Git, and we don't have any configuration to do. However, we may want to add some later, so for now, we can add an empty placeholder. Create a new function in `builtin/walken.c`: > > > +we may want to add some later, so for now, we can add an empty placeholder. 
> > +Create a new function in `builtin/walken.c`: > > + > > +---- > > +static void init_walken_defaults(void) > > +{ > > + /* We don't actually need the same components `git log` does; leave this > > + * empty for now. > > + */ > > +} > > /* > * Git multi-line comments > * are formatted like this. > */ Done; I'll look through the rest of the samples for it too. > > > +Add a new function to `builtin/walken.c`: > > + > > +---- > > +static int git_walken_config(const char *var, const char *value, void *cb) > > +{ > > + /* For now, let's not bother with anything. */ > > + return git_default_config(var, value, cb); > > +} > > Comment is somewhat confusing. Perhaps say instead "We don't currently > have custom configuration, so fall back to git_default_config()" or > something. Done. > > > +==== Setting Up `rev_info` > > + > > +Now that we've gathered external configuration and options, it's time to > > +initialize the `rev_info` object which we will use to perform the walk. This is > > +typically done by calling `repo_init_revisions()` with the repository you intend > > +to target, as well as the prefix and your `rev_info` struct. > > Maybe: s/the prefix/the `&` argument of `cmd_walken`/ Done. > > > +Add the `struct rev_info` and the `repo_init_revisions()` call: > > +---- > > +int cmd_walken(int argc, const char **argv, const char *prefix) > > +{ > > + /* This can go wherever you like in your declarations.*/ > > + struct rev_info rev; > > + ... > > A less verbose way to indicate the same without using a /* comment */: > > ... > struct rev_info rev; > ... Per the earlier comment about losing "..." I'm not going to take this comment; I'll also be replacing the "..." after. > > > + /* This should go after the git_config() call. */ > > + repo_init_revisions(the_repository, &rev, prefix); > > +} > > +---- > > +static void final_rev_info_setup(struct rev_info *rev) > > +{ > > + /* We want to mimick the appearance of `git log --oneline`, so let's > > + * force oneline format. 
*/ > > s/mimick/mimic/ > > /* > * Multi-line > * comment. > */ Done. > > > +==== Preparing `rev_info` For the Walk > > + > > +Now that `rev` is all initialized and configured, we've got one more setup step > > +before we get rolling. We can do this in a helper, which will both prepare the > > +`rev_info` for the walk, and perform the walk itself. Let's start the helper > > +with the call to `prepare_revision_walk()`. > > + > > +---- > > +static int walken_commit_walk(struct rev_info *rev) > > +{ > > + /* prepare_revision_walk() gets the final steps ready for a revision > > + * walk. We check the return value for errors. */ > > Not at all sure what this comment is trying to say. Also, the second > sentence adds no value to what the code itself already says clearly by > actually checking the return value. Attempted to rephrase. I ended up with: /* * prepare_revision_walk() does the final setup needed by revision.h * before a walk. It may return an error if there is a problem. */ Maybe the second sentence still doesn't serve a purpose, but I was trying to express that prepare_revision_walk() won't die() on its own. > > > + if (prepare_revision_walk(rev)) > > + die(_("revision walk setup failed")); > > +} > > +==== Performing the Walk! > > + > > +Finally! We are ready to begin the walk itself. Now we can see that `rev_info` > > +can also be used as an iterator; we move to the next item in the walk by using > > +`get_revision()` repeatedly. Add the listed variable declarations at the top and > > +the walk loop below the `prepare_revision_walk()` call within your > > +`walken_commit_walk()`: > > + > > +---- > > +static int walken_commit_walk(struct rev_info *rev) > > +{ > > + struct commit *commit; > > + struct strbuf prettybuf; > > + strbuf_init(&prettybuf, 0); > > More idiomatic: > > struct strbuf prettybuf = STRBUF_INIT; Ok, I'll change it. I wasn't sure which one was preferred, so this is super helpful. Thanks. 
> > > + while ((commit = get_revision(rev)) != NULL) { > > + if (commit == NULL) > > + continue; > > Idiomatic Git code doesn't mention NULL explicitly in conditionals, so: > > while ((commit = get_revision(rev))) { > if (!commit) > continue; Done, thanks. > > > + strbuf_reset(&prettybuf); > > + pp_commit_easy(CMIT_FMT_ONELINE, commit, &prettybuf); > > Earlier, you talked about calling get_commit_format("oneline",...) to > get "oneline" output, so what is the purpose of CMIT_FMT_ONELINE here? > The text should explain more clearly what these two different > "oneline"-related bits mean. Thanks. I've got to research a little on this one. I'll clarify it before the next reroll. > > > + printf(_("%s\n"), prettybuf.buf); > > There is nothing here to localize, so drop _(...): > > printf("%s\n", prettybuf.buf); > > or perhaps just: > > puts(prettybuf.buf); Sure, I'll use this one. > > > + } > > + > > + return 0; > > +} > > What does the return value signify? Will double check that I don't use it for anything; I can probably drop it and make this a void function instead. > > > +=== Adding a Filter > > + > > +Next, we can modify the `grep_filter`. This is done with convenience functions > > +found in `grep.h`. For fun, we're filtering to only commits from folks using a > > +gmail.com email address - a not-very-precise guess at who may be working on Git > > Perhaps? s/gmail.com/`&`/ Done. > > > +=== Changing the Order > > + > > +Let's see what happens when we run with `REV_SORT_BY_COMMIT_DATE` as opposed to > > +`REV_SORT_BY_AUTHOR_DATE`. Add the following: > > + > > +static void final_rev_info_setup(int argc, const char **argv, > > + const char *prefix, struct rev_info *rev) > > +{ > > + ...
> > + > > + rev->topo_order = 1; > > + rev->sort_order = REV_SORT_BY_COMMIT_DATE; > > The assignment to rev->sort_order is obvious enough, but the > rev->topo_order assignment is quite mysterious to someone coming to > this tutorial to learn about revision walking, thus some commentary > explaining 'topo_order' would be a good idea. Will do. > > > +Finally, compare the two. This is a little less helpful without object names or > > +dates, but hopefully we get the idea. > > + > > +---- > > +$ diff -u commit-date.txt author-date.txt > > +---- > > + > > +This display is an indicator for the latency between publishing a commit for > > +review the first time, and getting it actually merged into master. > > Perhaps: s/master/`&`/ > > Even as a long-time contributor to the project, I had to pause over > this statement for several seconds before figuring out what it was > talking about. Without a long-winded explanation of how topics > progress from submission through 'pu' through 'next' through 'master' > and finally into a release, the above statement is likely to be > mystifying to a newcomer. Perhaps it should be dropped. Such an explanation exists in MyFirstContribution.txt. I will include a shameless plug to that document here. :) > > > +Let's try one more reordering of commits. `rev_info` exposes a `reverse` flag. > > +However, it needs to be applied after `add_head_to_pending()` is called. Find > > This leaves the reader hanging, wondering why 'reverse' needs to be > assigned after add_head_to_pending(). Will address. > > > +== Basic Object Walk > > + > > +static void walken_show_commit(struct commit *cmt, void *buf) > > +{ > > + commit_count++; > > +} > > +---- > > + > > +Since we have the `struct commit` object, we can look at all the same parts that > > +we looked at in our earlier commit-only walk. For the sake of this tutorial, > > +though, we'll just increment the commit counter and move on. 
> > This leaves the reader wondering what 'buf' is and what it's used for. > Presumably this is the 'show_data' context mentioned earlier? If so, > perhaps name this 'ctxt' or 'context' or something and, because this > is a tutorial trying to teach revision walking, say a quick word about > how it might be used. > > > +static void walken_show_object(struct object *obj, const char *str, void *buf) > > +{ > > + switch (obj->type) { > > + [...] > > + case OBJ_COMMIT: > > + printf(_("Unexpectedly encountered a commit in " > > + "walken_show_object!\n")); > > + commit_count++; > > + break; > > + default: > > + printf(_("Unexpected object type %s!\n"), > > + type_name(obj->type)); > > + break; > > + } > > +} > > Modern practice in this project is to start error messages with > lowercase and to not punctuate the end (no need for "!"). Done. > Also, same complaint about the mysterious 'str' argument to the > callback as for 'buf' mentioned above. Will do. > > > +To help assure us that we aren't double-counting commits, we'll include some > > +complaining if a commit object is routed through our non-commit callback; we'll > > +also complain if we see an invalid object type. > > Are these two error cases "impossible" conditions or can they > actually arise in practice? If the former, use die() instead and drop > use of _(...) so as to avoid confusing the reader into thinking that > the behavior is indeterminate. Ah, these should be impossible. I'll turn them into die(). > > > +Our main object walk implementation is substantially different from our commit > > +walk implementation, so let's make a new function to perform the object walk. We > > +can perform setup which is applicable to all objects here, too, to keep separate > > +from setup which is applicable to commit-only walks. 
> > + > > +---- > > +static int walken_object_walk(struct rev_info *rev) > > +{ > > +} > > +---- > > This skeleton function definition is populated immediately below, so > it's not clear why it needs to be shown here. Yeah, you're right. Removed the skeleton snippet. > > > +We'll start by enabling all types of objects in the `struct rev_info`, and > > +asking to have our trees and blobs shown in commit order. We'll also exclude > > +promisors as the walk becomes more complicated with those types of objects. When > > +our settings are ready, we'll perform the normal revision walk setup and > > +initialize our tracking variables. > > + > > +---- > > +static int walken_object_walk(struct rev_info *rev) > > +{ > > + rev->tree_objects = 1; > > + rev->blob_objects = 1; > > + rev->tag_objects = 1; > > + rev->tree_blobs_in_commit_order = 1; > > + rev->exclude_promisor_objects = 1; > > + [...] > > +---- > > + > > +Unless you cloned or fetched your repository earlier with a filter, > > +`exclude_promisor_objects` is unlikely to make a difference, but we'll turn it > > +on just to make sure our lives are simple. We'll also turn on > > +`tree_blobs_in_commit_order`, which means that we will walk a commit's tree and > > +everything it points to immediately after we find each commit, as opposed to > > +waiting for the end and walking through all trees after the commit history has > > +been discovered. > > This paragraph is repeating much of the information in the paragraph > just above the code snippet. One or the other should be dropped or > thinned to avoid the duplication. We'll start by enabling all types of objects in the `struct rev_info`. Unless you cloned or fetched your repository earlier with a filter, `exclude_promisor_objects` is unlikely to make a difference, but we'll turn it on just to make sure our lives are simple. 
We'll also turn on `tree_blobs_in_commit_order`, which means that we will walk a commit's tree and everything it points to immediately after we find each commit, as opposed to waiting for the end and walking through all trees after the commit history has been discovered. With the appropriate settings configured, we are ready to call `prepare_revision_walk()`. > > > +Let's start by calling just the unfiltered walk and reporting our counts. > > +Complete your implementation of `walken_object_walk()`: > > + > > +---- > > + traverse_commit_list(rev, walken_show_commit, walken_show_object, NULL); > > + > > + printf(_("Object walk completed. Found %d commits, %d blobs, %d tags, " > > + "and %d trees.\n"), commit_count, blob_count, tag_count, > > + tree_count); > > Or make the output more useful by having it be machine-parseable (and > not localized): > > printf("commits %d\nblobs %d\ntags %d\ntrees %d\n", > commit_count, blob_count, tag_count, tree_count); I'm not sure whether I agree, since it's a useless toy command only for human parsing. > > > + return 0; > > +} > > What does the return value signify? Yeah, again I think I can get rid of this; I'll take a look at the final sample code and make sure it can go. > > > +Now we can try to run our command! It should take noticeably longer than the > > +commit walk, but an examination of the output will give you an idea why - for > > +example: > > + > > +---- > > +Object walk completed. Found 55733 commits, 100274 blobs, 0 tags, and 104210 trees. > > +---- > > + > > +This makes sense. We have more trees than commits because the Git project has > > +lots of subdirectories which can change, plus at least one tree per commit. We > > +have no tags because we started on a commit (`HEAD`) and while tags can point to > > +commits, commits can't point to tags. > > + > > +NOTE: You will have different counts when you run this yourself! The number of > > +objects grows along with the Git project.
> > Not sure if this NOTE is useful; after all, you introduced the output > by saying "for example". I think you're probably right, but I'll try to fix this by slightly fleshing out the "for example" phrasing. > > > +=== Adding a Filter > > + > > +There are a handful of filters that we can apply to the object walk laid out in > > +`Documentation/rev-list-options.txt`. These filters are typically useful for > > +operations such as creating packfiles or performing a partial or shallow clone. > > +They are defined in `list-objects-filter-options.h`. For the purposes of this > > +tutorial we will use the "tree:1" filter, which causes the walk to omit all > > +trees and blobs which are not directly referenced by commits reachable from the > > +commit in `pending` when the walk begins. (In our case, that means we omit trees > > +and blobs not directly referenced by HEAD or HEAD's history.) > > Need some explanation of what 'pending' is, as it's just mysterious as written. Done. I've tried to explain it by drawing a parallel to BFS tree traversal, although that might be even more confusing as the DAG isn't quite the same. > > > +First, we'll need to `#include "list-objects-filter-options.h`". Then, we can > > +set up the `struct list_objects_filter_options` and `struct oidset` at the top > > +of `walken_object_walk()`: > > + > > +---- > > +static int walken_object_walk(struct rev_info *rev) > > +{ > > + struct list_objects_filter_options filter_options = {}; > > + struct oidset omitted; > > + oidset_init(&omitted, 0); > > + ... > > This 'omitted' is so far removed from the description of the 'omitted' > argument to traverse_commit_list_filtered() way earlier in the > tutorial that a reader is likely to have forgotten what it's about > (indeed, I did). Some explanation, even if superficial, is likely > warranted here or at least mention that it is explained in more detail > below (as I discovered). 
> > > +After we run `traverse_commit_list_filtered()` we would also be able to examine > > +`omitted`, which is a linked-list of all objects we did not include in our walk. > > +Since all omitted objects are included, the performance of > > +`traverse_commit_list_filtered()` with a non-null `omitted` arument is equitable > > s/arument/argument/ > > > +with the performance of `traverse_commit_list()`; so for our purposes, we leave > > +it null. It's easy to provide one and iterate over it, though - check `oidset.h` > > +for the declaration of the accessor methods for `oidset`. > > I'm confused. What are we leaving NULL here? Yeah, this isn't very well written. I'll try to rephrase it; I think I meant to leave `omitted` out of the arglist to `traverse_commit_list_filtered()` but looks like I didn't manage to do so in the actual impl. I think I'll break out an additional section to show how `--filter-print-omitted` works, instead of just leaving this with an RTFM at the end. (This will also end up with a reroll of the example patchset, too.) > > > +=== Changing the Order > > + > > +Finally, let's demonstrate that you can also reorder walks of all objects, not > > +just walks of commits. First, we'll make our handlers chattier - modify > > +`walken_show_commit()` and `walken_show_object` to print the object as they go: > > s/walken_show_object/&()/ Done. > > > +static void walken_show_commit(struct commit *cmt, void *buf) > > +{ > > + printf(_("commit: %s\n"), oid_to_hex(&cmt->object.oid)); > > + commit_count++; > > +} > > Is there a bunch of trailing whitespace on these lines of the code > sample (and in some lines below)? Oh no, there might be. Bad on me for my copy/paste between vim windows workflow; I thought I had trimmed from all of them but guess not. I'll check over the whole doc and fix it up.
> > > +static void walken_show_object(struct object *obj, const char *str, void *buf) > > +{ > > + printf(_("%s: %s\n"), type_name(obj->type), oid_to_hex(&obj->oid)); > > Localizing "%s: %s\n" via _(...) probably doesn't add value, which > implies that you might not want to be localizing "commit" above > either. This is closer to machine-readable, so I'll remove the locale. > > > +(Try to leave the counter increment logic in place in `walken_show_object()`.) > > + > > +With only that change, run again (but save yourself some scrollback): > > + > > +---- > > +$ ./bin-wrappers/git walken | head -n 10 > > +---- > > + > > +Take a look at the top commit with `git show` and the OID you printed; it should > > +be the same as the output of `git show HEAD`. > > I think this is the first use of "OID", which might be mysterious and > confusing to a newcomer. Earlier, you used SHA-1 and I suggested > "object ID" instead. Perhaps use the same here, or define OID earlier > in the document in place of SHA-1. Yeah, I ended up replacing it above with "object ID (OID)" but this is far enough along that I think I'll replace it with "object ID" here too. > > > +Next, let's change a setting on our `struct rev_info` within > > +`walken_object_walk()`. Find where you're changing the other settings on `rev`, > > +such as `rev->tree_objects` and `rev->tree_blobs_in_commit_order`, and add > > +another setting at the bottom: > > Instead of nebulous "another setting", mentioning 'reverse' explicitly > would make this clearer. Done. > > > + rev->tree_objects = 1; > > + rev->blob_objects = 1; > > + rev->tag_objects = 1; > > + rev->tree_blobs_in_commit_order = 1; > > + rev->exclude_promisor_objects = 1; > > + rev->reverse = 1; Thank you so much for taking the time to do a detailed review of this. This is great feedback. - Emily
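For readers following this thread without the Git tree at hand, the iterator pattern discussed above (a list of pending commits, and a `get_revision()`-style call that hands back one commit at a time while queueing its parents) can be sketched in standalone C. This is only a toy under stated assumptions: `struct toy_commit`, `struct toy_walk`, `toy_add_pending()`, `toy_get_revision()` and `toy_count_commits()` are hypothetical names invented for illustration, and the code is loosely modeled on the idea of the walk, not on Git's actual `revision.c`.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS 2
#define MAX_PENDING 16

/* Hypothetical stand-in for a commit; NOT the layout of Git's struct commit. */
struct toy_commit {
	const char *subject;
	long date;
	struct toy_commit *parents[MAX_PARENTS];
	int seen;
};

/* Hypothetical stand-in for the "pending" list a walk starts from. */
struct toy_walk {
	struct toy_commit *pending[MAX_PENDING];
	size_t nr;
};

/* Queue a commit unless it is NULL or was already queued once. */
static void toy_add_pending(struct toy_walk *walk, struct toy_commit *c)
{
	if (!c || c->seen)
		return;
	assert(walk->nr < MAX_PENDING);
	c->seen = 1;
	walk->pending[walk->nr++] = c;
}

/*
 * Hand back the newest pending commit and queue its parents.
 * Returns NULL once nothing is pending, which ends the walk.
 */
static struct toy_commit *toy_get_revision(struct toy_walk *walk)
{
	size_t i, newest = 0;
	struct toy_commit *c;

	if (!walk->nr)
		return NULL;
	for (i = 1; i < walk->nr; i++)
		if (walk->pending[i]->date > walk->pending[newest]->date)
			newest = i;
	c = walk->pending[newest];
	walk->pending[newest] = walk->pending[--walk->nr];
	for (i = 0; i < MAX_PARENTS; i++)
		toy_add_pending(walk, c->parents[i]);
	return c;
}

/* Walk everything reachable from `head` and count the commits visited. */
static size_t toy_count_commits(struct toy_commit *head)
{
	struct toy_walk walk = { { NULL }, 0 };
	size_t count = 0;

	toy_add_pending(&walk, head);
	while (toy_get_revision(&walk))
		count++;
	return count;
}
```

Starting the toy walk from a merge of two branches that share a root visits the merge first, then the two branch tips newest-first, then the shared root exactly once; the `seen` flag is what keeps the root from being counted twice.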
On Mon, Jun 10, 2019 at 01:49:41PM -0700, Junio C Hamano wrote: > Emily Shaffer <emilyshaffer@google.com> writes: > > > +My First Revision Walk > > +====================== > > + > > +== What's a Revision Walk? > > + > > +The revision walk is a key concept in Git - this is the process that underpins > > +operations like `git log`, `git blame`, and `git reflog`. Beginning at HEAD, the > > +list of objects is found by walking parent relationships between objects. The > > +revision walk can also be usedto determine whether or not a given object is > > +reachable from the current HEAD pointer. > > s/usedto/used to/; Done. > > > +We'll put our fiddling into a new command. For fun, let's name it `git walken`. > > +Open up a new file `builtin/walken.c` and set up the command handler: > > + > > +---- > > +/* > > + * "git walken" > > + * > > + * Part of the "My First Revision Walk" tutorial. > > + */ > > + > > +#include <stdio.h> > > Bad idea. In the generic part of the codebase, system headers are > supposed to be supplied by including git-compat-util.h (or cache.h > or builtin.h, that are common header files that begin by including > it and are allowed by CodingGuidelines to be used as such). Done. > > > +#include "builtin.h" > > + > > +int cmd_walken(int argc, const char **argv, const char *prefix) > > +{ > > + printf(_("cmd_walken incoming...\n")); > > + return 0; > > +} > > +---- > > I wonder if it makes sense to use trace instead of printf, as our > reader has already seen the psuh example for doing the above. Hmmm. I will think about it and look into the intended use of each. I hadn't considered using a different logging method. > > > +Add usage text and `-h` handling, in order to pass the test suite: > > It is not wrong per-se, and it indeed is a very good practice to > make sure that our subcommands consistently gives usage text and > short usage. Encouraging them early is a good idea. 
> > But "in order to pass the test suite" invites "eh, the test suite > does not pass without usage and -h? why?". > > Either drop the mention of "the test suite", or perhaps say > something like > > Add usage text and `-h` handling, like all the subcommands > should consistently do (our test suite will notice and > complain if you fail to do so). > > i.e. the real purpose is consistency and usability; test suite is > merely an enforcement mechanism. Yeah, you're right. I'll reword this. > > > +---- > > +{ "walken", cmd_walken, RUN_SETUP }, > > +---- > > + > > +Add it to the `Makefile` near the line for `builtin\worktree.o`: > > Backslash intended? Nope, typo. Thanks for the comments, Junio. - Emily
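The point about `-h` handling being a consistency matter can be illustrated with a small standalone sketch. It assumes, as the test suite appears to, that a builtin invoked with a lone `-h` prints a usage line and exits with status 129; `toy_handle_help()`, `toy_walken_usage` and `TOY_USAGE_EXIT` are hypothetical names, and a real builtin would rely on `parse_options()` rather than hand-rolled argument checks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed exit status for usage output; hypothetical constant name. */
#define TOY_USAGE_EXIT 129

static const char toy_walken_usage[] = "usage: git walken\n";

/*
 * If the only argument is "-h", print the usage text to `out` and
 * return the usage exit status; otherwise return -1 so the caller
 * carries on with normal processing.
 */
static int toy_handle_help(int argc, const char **argv, FILE *out)
{
	assert(argv);
	if (argc == 2 && !strcmp(argv[1], "-h")) {
		fputs(toy_walken_usage, out);
		return TOY_USAGE_EXIT;
	}
	return -1;
}
```

Keeping the check in one helper means every toy subcommand answers `-h` the same way, which is the usability property the enforcement in the test suite is there to protect.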
On Mon, Jun 10, 2019 at 01:25:14PM -0700, Junio C Hamano wrote: > Emily Shaffer <emilyshaffer@google.com> writes: > > > I'll also be mailing an RFC patchset In-Reply-To this message; the RFC > > patchset should not be merged to Git, as I intend to host it in my own > > mirror as an example. I hosted a similar example for the > > MyFirstContribution tutorial; it's visible at > > https://github.com/nasamuffin/git/tree/psuh. There might be a better > > place to host these so I don't "own" them but I'm not sure what it is; > > keeping them as a live branch somewhere struck me as an okay way to keep > > them from getting stale. > > Yes, writing the initial version is one thing, but keeping it alive > is more work and more important. As the underlying API changes over > time, it will become necessary to update the sample implementation, > but for a newbie who wants to learn by building "walken" on top of > the then-current codebase and API, it would not be so helpful to > show "these 7 patches were for older codebase, and the tip 2 are > incremental updates to adjust to the newer API", so the maintenance > of these sample patches may need different paradigm than the norm > for our main codebase that values incremental polishing. > I'm trying to think of how it would end up working if I tried to use a Github workflow. I think it wouldn't - someone would open a PR, and then I'd have to rewrite that change into the appropriate commit in the live branch and push the entire branch anew. Considering that workflow leaves me doubly convinced that leaving it in my personal fork indefinitely might not be wise (what if I become unable to continue maintaining it)? I wonder if this is something that might fit well in one of the more closely-associated mirrors, like gitster/git or gitgitgadget/git - although I wonder if those count as "owned" by Junio and Johannes, respectively. Hmmmm. 
Maybe there's a case for storing them as a set of patch files that are revision-controlled somewhere within Documentation/? There was some discussion on the IRC a few weeks ago about trying to organize these tutorials into their own directory to form a sort of "Git Contribution 101" course, maybe it makes sense to store there? Documentation/contributing/myfirstcontrib/MyFirstContrib.txt Documentation/contributing/myfirstcontrib/sample/*.patch Documentation/contributing/myfirstrevwalk/MyFirstRevWalk.txt Documentation/contributing/myfirstrevwalk/sample/*.patch I don't love the idea of maintaining text patches with the expectation that they should cleanly apply always, but it might make the idea that they shouldn't contain 2 patches on the tip for API adjustment more clear. And it would be probably pretty easy to inflate and build them with a build target or something. Hmmmmmmmmm. - Emily
On Mon, Jun 17, 2019 at 7:20 PM Emily Shaffer <emilyshaffer@google.com> wrote: > On Fri, Jun 07, 2019 at 02:21:07AM -0400, Eric Sunshine wrote: > > On Thu, Jun 6, 2019 at 9:08 PM Emily Shaffer <emilyshaffer@google.com> wrote: > > > +int cmd_walken(int argc, const char **argv, const char *prefix) > > > +{ > > > + struct option options[] = { > > > + OPT_END() > > > + }; > > > + > > > + argc = parse_options(argc, argv, prefix, options, walken_usage, 0); > > > + > > > + ... > > > > Perhaps comment out the "..." or remove it altogether to avoid having > > the compiler barf when the below instructions tell the reader to build > > the command. > > Hmm. That part I'm not so sure about. I like to use the "..." to > indicate where the code in the snippet should be added around the other > code already in the file - which I suppose it does just as clearly if > it's commented - but I also hope folks are not simply copy-pasting > blindly from the tutorial. > > It seems like including uncommented "..." in code tutorials is pretty > common. You're right, and that's not what I was "complaining" about. Looking back at your original email, I see that I somehow got confused and didn't realize or (quickly) forgot that you had already presented a _complete_ cmd_walken() snippet just above that spot, and that the cmd_walken() snippet upon which I was commenting was _incomplete_, thus the "..." was perfectly justified. Not realizing that the incomplete cmd_walken() example was just that (incomplete), I "complained" that the following "compile the project" instructions would barf on "...". Maybe I got confused because the tiny cmd_walken() snippets followed one another so closely (or because I got interrupted several times during the review), but one way to avoid that would be to present a single _complete_ snippet from the start, followed by a bit of explanation. 
That is, something like this:

Open up a new file `builtin/walken.c` and set up the command handler:

----
/* "git walken" -- Part of the "My First Revision Walk" tutorial. */
#include "builtin.h"
#include "parse-options.h"

int cmd_walken(int argc, const char **argv, const char *prefix)
{
	const char * const usage[] = {
		N_("git walken"),
		NULL,
	};
	struct option options[] = {
		OPT_END()
	};

	argc = parse_options(argc, argv, prefix, options, usage, 0);

	printf(_("cmd_walken incoming...\n"));
	return 0;
}
----

`usage` is the usage message presented by `git walken -h`, and `options` will eventually specify command-line options.

> I don't think I have a good reason to push back on this except that I
> think "/* ... */" is ugly :)
>
> I'll go through and replace "..." with some actual hints about what's
> supposed to go there; for example, here I'll replace with "/* print and
> return */".

Seeing as my initial review comment was in error, I'm not sure that you ought to replace "..." with anything else.

> > "invoke anything" is pretty nebulous, as is the earlier "components
> > you may invoke". A newcomer is unlikely to know what this means, so
> > perhaps it needs an example (even if just a short parenthetical
> > comment).
>
> I have tried to reword this; I hope this is a little clearer.
>
> Before you begin to examine user configuration for your revision walk, it's
> common practice for you to initialize to default any switches that your command
> may have, as well as ask any other components you may invoke to initialize as
> well (for example, how `git log` also uses the `grep` and `diff` components).
> `git log` does this in `init_log_defaults()`; in that case, one global
> `decoration_style` is initialized, as well as the grep and diff-UI components.

By trying to express too many things at once, it's still difficult to follow.
Perhaps use shorter, more easily digestible sentences, like this: Before examining configuration files which may modify command behavior, set up default state for switches or options your command may have. If your command utilizes other Git components, ask them to set up their default states, as well. For instance, `git log` takes advantage of `grep` and `diff` functionality; its init_log_defaults() sets its own state (`decoration_style`) and asks `grep` and `diff` to initialize themselves by calling their initialization functions. > > > +static int walken_commit_walk(struct rev_info *rev) > > > +{ > > > + /* prepare_revision_walk() gets the final steps ready for a revision > > > + * walk. We check the return value for errors. */ > > > > Not at all sure what this comment is trying to say. Also, the second > > sentence adds no value to what the code itself already says clearly by > > actually checking the return value. > > Attempted to rephrase. I ended up with: > > /* > * prepare_revision_walk() does the final setup needed by revision.h > * before a walk. It may return an error if there is a problem. > */ > > Maybe the second sentence still doesn't serve a purpose, but I was > trying to express that prepare_revision_walk() won't die() on its own. > > > > > > + if (prepare_revision_walk(rev)) > > > + die(_("revision walk setup failed")); As this is just a toy example, I don't care too strongly about the unnecessary second sentence. On the other hand, the tutorial is trying to teach people how to contribute to this project, and on this project, that sort of pointless comment is likely to be called out in review. In fact, given that view, the entire comment block is unnecessary (it doesn't add any value for anyone reviewing or reading the code), so it might make more sense to drop the comment from the code entirely, and just do a better job explaining in prose above the snippet why you are calling that function. For instance: ... 
Let's start the helper with the call to `prepare_revision_walk()`, which does the final setup of the `rev_info` structure before it can be used. The above observation may be more widely applicable than to just this one instance. Don't use in-code comments for what should be explained in prose if the in-code comment adds no value to the code itself (to wit, if a reviewer would say "don't repeat in a comment what the code already says clearly" or "don't use a comment to state the obvious"). > > > +This display is an indicator for the latency between publishing a commit for > > > +review the first time, and getting it actually merged into master. > > > > Perhaps: s/master/`&`/ > > > > Even as a long-time contributor to the project, I had to pause over > > this statement for several seconds before figuring out what it was > > talking about. Without a long-winded explanation of how topics > > progress from submission through 'pu' through 'next' through 'master' > > and finally into a release, the above statement is likely to be > > mystifying to a newcomer. Perhaps it should be dropped. > > Such an explanation exists in MyFirstContribution.txt. I will include a > shameless plug to that document here. :) I found that this sort of tangential reference disturbed the flow of the tutorial, leading the mind astray from the otherwise natural progression of the presentation. So, I'm not convinced that talking about the migration of a topic in the Git project itself adds value to this tutorial. The same effect could be seen when commits have been re-ordered via git-rebase, too, right? Perhaps mention that instead? > > > + printf(_("Object walk completed. 
Found %d commits, %d blobs, %d tags, " > > > + "and %d trees.\n"), commit_count, blob_count, tag_count, > > > + tree_count); > > > > Or make the output more useful by having it be machine-parseable (and > > not localized): > > > > printf("commits %d\nblobs %d\ntags %d\ntrees %d\n", > > commit_count, blob_count, tag_count, tree_count); > > I'm not sure whether I agree, since it's a useless toy command only for human > parsing. True, it's not a big deal, and I don't insist upon it. But, if you mention in prose that this output is easily machine-parseable, then perhaps that nudges the reader a bit in the direction of thinking about porcelain vs. plumbing, which is something a contributor to this project eventually has to be concerned with (the sooner, the better).
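To make the porcelain vs. plumbing point concrete, here is a minimal standalone sketch of the `name value` per-line output suggested above, together with a reader that parses it back. The struct and function names (`toy_counts`, `toy_format_counts()`, `toy_parse_counts()`) are hypothetical and exist only for this illustration; the sample numbers in the note below are the ones quoted earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the four counters the object walk keeps. */
struct toy_counts {
	int commits, blobs, tags, trees;
};

/*
 * Write one "name value" pair per line, unlocalized, so that a script
 * can consume the output. Returns the number of characters written.
 */
static int toy_format_counts(char *buf, size_t len, const struct toy_counts *c)
{
	int n = snprintf(buf, len, "commits %d\nblobs %d\ntags %d\ntrees %d\n",
			 c->commits, c->blobs, c->tags, c->trees);

	assert(n >= 0 && (size_t)n < len);
	return n;
}

/* Parse the format above; returns 0 on success, -1 on malformed input. */
static int toy_parse_counts(const char *buf, struct toy_counts *c)
{
	int matched = sscanf(buf, "commits %d blobs %d tags %d trees %d",
			     &c->commits, &c->blobs, &c->tags, &c->trees);

	return matched == 4 ? 0 : -1;
}
```

Formatting the counts 55733, 100274, 0 and 104210 and feeding the result to `toy_parse_counts()` recovers the same four numbers, which is the round trip a translated, sentence-style message cannot guarantee.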
Emily Shaffer <emilyshaffer@google.com> writes: > Maybe there's a case for storing them as a set of patch files that are > revision-controlled somewhere within Documentation/? There was some > discussion on the IRC a few weeks ago about trying to organize these > tutorials into their own directory to form a sort of "Git Contribution > 101" course, maybe it makes sense to store there? > > Documentation/contributing/myfirstcontrib/MyFirstContrib.txt > Documentation/contributing/myfirstcontrib/sample/*.patch > Documentation/contributing/myfirstrevwalk/MyFirstRevWalk.txt > Documentation/contributing/myfirstrevwalk/sample/*.patch > > I don't love the idea of maintaining text patches with the expectation > that they should cleanly apply always,... Well, I actually think the above organization does match the intent of the "My first contribution codelab" perfectly. When the codebase, the workflow used by the project, and/or the coding or documentation guideline gets updated, the text that documents how to contribute to the project as well as the sample patches must be updated to match the updated reality. I agree with you that maintaining the *.patch files to always cleanly apply is less than ideal. A topic to update the sample patches and tutorial text may be competing with another topic that updates the very API the tutorials are teaching, and the sample patches may not apply cleanly when two topics are merged together, even if the "update sample patches and tutorial text" topic does update them to match the API at the tip of the topic branch itself. One thing we _could_ do is to pin the target version of the codebase for the sake of tutorial. IOW, the sample/*.patch may not apply cleanly to the version of the tree these patches were taken from, but would always apply cleanly to the most recent released version before the last update to the tutorial, or something like that. Also having to review the patch to sample/*.patch files will be unpleasant.
On Wed, Jun 19, 2019 at 04:13:35AM -0400, Eric Sunshine wrote: > On Mon, Jun 17, 2019 at 7:20 PM Emily Shaffer <emilyshaffer@google.com> wrote: > > On Fri, Jun 07, 2019 at 02:21:07AM -0400, Eric Sunshine wrote: > > > On Thu, Jun 6, 2019 at 9:08 PM Emily Shaffer <emilyshaffer@google.com> wrote: > > > > +int cmd_walken(int argc, const char **argv, const char *prefix) > > > > +{ > > > > + struct option options[] = { > > > > + OPT_END() > > > > + }; > > > > + > > > > + argc = parse_options(argc, argv, prefix, options, walken_usage, 0); > > > > + > > > > + ... > > > > > > Perhaps comment out the "..." or remove it altogether to avoid having > > > the compiler barf when the below instructions tell the reader to build > > > the command. > > > > Hmm. That part I'm not so sure about. I like to use the "..." to > > indicate where the code in the snippet should be added around the other > > code already in the file - which I suppose it does just as clearly if > > it's commented - but I also hope folks are not simply copy-pasting > > blindly from the tutorial. > > > > It seems like including uncommented "..." in code tutorials is pretty > > common. > > You're right, and that's not what I was "complaining" about. Looking > back at your original email, I see that I somehow got confused and > didn't realize or (quickly) forgot that you had already presented a > _complete_ cmd_walken() snippet just above that spot, and that the > cmd_walken() snippet upon which I was commenting was _incomplete_, > thus the "..." was perfectly justified. Not realizing that the > incomplete cmd_walken() example was just that (incomplete), I > "complained" that the following "compile the project" instructions > would barf on "...". 
>
> Maybe I got confused because the tiny cmd_walken() snippets followed
> one another so closely (or because I got interrupted several times
> during the review), but one way to avoid that would be to present a
> single _complete_ snippet from the start, followed by a bit of
> explanation. That is, something like this:
>
>     Open up a new file `builtin/walken.c` and set up the command handler:
>
>     ----
>     /* "git walken" -- Part of the "My First Revision Walk" tutorial. */
>     #include "builtin.h"
>
>     int cmd_walken(int argc, const char **argv, const char *prefix)
>     {
>         const char * const usage[] = {
>             N_("git walken"),
>             NULL,
>         };
>         struct option options[] = {
>             OPT_END()
>         };
>
>         argc = parse_options(argc, argv, prefix, options, usage, 0);
>
>         printf(_("cmd_walken incoming...\n"));
>         return 0;
>     }
>     ----
>
>     `usage` is the usage message presented by `git walken -h`, and
>     `options` will eventually specify command-line options.

Hmm. I can say that I personally would find that much more difficult to
follow interactively, and I'd be tempted to copy-and-paste and skim
through the wall of text if I was presented with such a snippet.
However, I could also imagine the reverse - someone becoming tired of
having their hand held through a fairly straightforward implementation,
when they're perfectly capable of reading a long description and would
just like to get on with it.

I'm really curious about what others think in this scenario, since I
imagine it boils down to individual learning styles.

(Maybe we can split the difference and present a complete patch or new
function, followed by a breakdown? That would end up even more verbose
than the current approach, though.)

...
Now that I'm thinking more about this, and reading some of your later comments on this mail, I think it might be valuable to lean on the sample patchset for complete code samples, especially if we figure a good way to distribute the patchset near the tutorial (as Junio and I are discussing in another branch of this thread). Then we can keep the tutorial concise, but have the complete code available for those who prefer to look there. > > > I don't think I have a good reason to push back on this except that I > > think "/* ... */" is ugly :) > > > > I'll go through and replace "..." with some actual hints about what's > > supposed to go there; for example, here I'll replace with "/* print and > > return */". > > Seeing as my initial review comment was in error, I'm not sure that > you ought to replace "..." with anything else. > > > > "invoke anything" is pretty nebulous, as is the earlier "components > > > you may invoke". A newcomer is unlikely to know what this means, so > > > perhaps it needs an example (even if just a short parenthetical > > > comment). > > > > I have tried to reword this; I hope this is a little clearer. > > > > Before you begin to examine user configuration for your revision walk, it's > > common practice for you to initialize to default any switches that your command > > may have, as well as ask any other components you may invoke to initialize as > > well (for example, how `git log` also uses the `grep` and `diff` components). > > `git log` does this in `init_log_defaults()`; in that case, one global > > `decoration_style` is initialized, as well as the grep and diff-UI components. > > By trying to express too many things at once, it's still difficult to > follow. Perhaps use shorter, more easily digestible sentences, like > this: > > Before examining configuration files which may modify command > behavior, set up default state for switches or options your > command may have. 
If your command utilizes other Git components, > ask them to set up their default states, as well. For instance, > `git log` takes advantage of `grep` and `diff` functionality; its > init_log_defaults() sets its own state (`decoration_style`) and > asks `grep` and `diff` to initialize themselves by calling their > initialization functions. Yeah, I like this a lot. Thanks! I took it word for word; will be adding you to the Helped-by line of the commit. > As this is just a toy example, I don't care too strongly about the > unnecessary second sentence. On the other hand, the tutorial is trying > to teach people how to contribute to this project, and on this > project, that sort of pointless comment is likely to be called out in > review. In fact, given that view, the entire comment block is > unnecessary (it doesn't add any value for anyone reviewing or reading > the code), so it might make more sense to drop the comment from the > code entirely, and just do a better job explaining in prose above the > snippet why you are calling that function. For instance: > > ... Let's start the helper with the call to `prepare_revision_walk()`, > which does the final setup of the `rev_info` structure before it can > be used. > > The above observation may be more widely applicable than to just this > one instance. Don't use in-code comments for what should be explained > in prose if the in-code comment adds no value to the code itself (to > wit, if a reviewer would say "don't repeat in a comment what the code > already says clearly" or "don't use a comment to state the obvious"). I'm of two minds about this. On the one hand, I'm somewhat in favor of leaving contextual, informational comments in the sample code, so the sample code can teach on its own without the tutorial (specifically, I mean the patchset that was sent alongside this one as RFC). On the other hand, you're right that adding these informational comments doesn't model best practices for real commits. 
I don't have a strong opposition to removing those comments from the in-place samples in the tutorial itself. But I do think it's useful to include them in the sample patchset, which is intended as an additional learning tool, rather than as a pristine code example - especially if we make it clear in the commit messages there. > > > > > +This display is an indicator for the latency between publishing a commit for > > > > +review the first time, and getting it actually merged into master. > > > > > > Perhaps: s/master/`&`/ > > > > > > Even as a long-time contributor to the project, I had to pause over > > > this statement for several seconds before figuring out what it was > > > talking about. Without a long-winded explanation of how topics > > > progress from submission through 'pu' through 'next' through 'master' > > > and finally into a release, the above statement is likely to be > > > mystifying to a newcomer. Perhaps it should be dropped. > > > > Such an explanation exists in MyFirstContribution.txt. I will include a > > shameless plug to that document here. :) > > I found that this sort of tangential reference disturbed the flow of > the tutorial, leading the mind astray from the otherwise natural > progression of the presentation. So, I'm not convinced that talking > about the migration of a topic in the Git project itself adds value to > this tutorial. The same effect could be seen when commits have been > re-ordered via git-rebase, too, right? Perhaps mention that instead? Yeah, that's a good point. I'll try to mention it in a more universally-applicable way, like you suggested. > > > > > + printf(_("Object walk completed. 
Found %d commits, %d blobs, %d tags, "
> > > > +                "and %d trees.\n"), commit_count, blob_count, tag_count,
> > > > +                tree_count);
> > >
> > > Or make the output more useful by having it be machine-parseable (and
> > > not localized):
> > >
> > >     printf("commits %d\nblobs %d\ntags %d\ntrees %d\n",
> > >         commit_count, blob_count, tag_count, tree_count);
> >
> > I'm not sure whether I agree, since it's a useless toy command only for human
> > parsing.
>
> True, it's not a big deal, and I don't insist upon it. But, if you
> mention in prose that this output is easily machine-parseable, then
> perhaps that nudges the reader a bit in the direction of thinking
> about porcelain vs. plumbing, which is something a contributor to this
> project eventually has to be concerned with (the sooner, the better).

Oh, that's a very good point. I'll frame it that way - that's a handy
place to slip in some bonus context about Git. Thanks.

  NOTE: We aren't localizing the printf here because we have purposefully
  formatted it in a machine-parseable way. Commands in Git are divided into
  "plumbing" and "porcelain"; the "plumbing" commands are machine-parseable and
  intended for use in scripts, while the "porcelain" commands are intended for
  human interaction. Output intended for script usage doesn't need to be
  localized; output intended for humans does.

Thanks again for the review effort.

 - Emily
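The "defaults first, configuration second" advice adopted earlier in this message can be sketched roughly as follows. This is standalone C for illustration only, not Git code: `struct walken_settings`, `walken_init_defaults()`, `walken_apply_config()`, and the `walken.*` keys are all hypothetical names. In Git itself the role of the first step is played by functions such as `init_log_defaults()`.

```c
#include <string.h>

/* Hypothetical settings for a toy command; not one of Git's real structs. */
struct walken_settings {
	int show_trees;         /* 0 = off, 1 = on */
	const char *decoration; /* for example "short" or "full" */
};

/* Step 1: put every switch into a known default state. */
void walken_init_defaults(struct walken_settings *s)
{
	s->show_trees = 0;
	s->decoration = "short";
}

/*
 * Step 2: apply one key/value pair from configuration on top of the
 * defaults. The key names are invented for this sketch.
 * Returns 0 if the key was recognized, -1 otherwise.
 */
int walken_apply_config(struct walken_settings *s,
			const char *key, const char *value)
{
	if (!strcmp(key, "walken.showtrees")) {
		s->show_trees = !strcmp(value, "true");
		return 0;
	}
	if (!strcmp(key, "walken.decoration")) {
		s->decoration = value;
		return 0;
	}
	return -1; /* unknown keys leave the defaults untouched */
}
```

Because the defaults are set before any configuration is examined, a key that never appears in configuration still leaves the command in a well-defined state.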
On Wed, Jun 19, 2019 at 08:17:29AM -0700, Junio C Hamano wrote: > Emily Shaffer <emilyshaffer@google.com> writes: > > > Maybe there's a case for storing them as a set of patch files that are > > revision-controlled somewhere within Documentation/? There was some > > discussion on the IRC a few weeks ago about trying to organize these > > tutorials into their own directory to form a sort of "Git Contribution > > 101" course, maybe it makes sense to store there? > > > > Documentation/contributing/myfirstcontrib/MyFirstContrib.txt > > Documentation/contributing/myfirstcontrib/sample/*.patch > > Documentation/contributing/myfirstrevwalk/MyFirstRevWalk.txt > > Documentation/contributing/myfirstrevwalk/sample/*.patch > > > > I don't love the idea of maintaining text patches with the expectation > > that they should cleanly apply always,... > > Well, I actually think the above organization does match the intent > of the "My first contribution codelab" perfectly. When the codebase, > the workflow used by the project, and/or the coding or documentation > guideline gets updated, the text that documents how to contribute to > the project as well as the sample patches must be updated to match > the updated reality. > > I agree with you that maintaining the *.patch files to always > cleanly apply is less than ideal. A topic to update the sample > patches and tutorial text may be competing with another topic that > updates the very API the tutorials are teaching, and the sample > patches may not apply cleanly when two topics are merged together, > even if the "update sample patches and tutorial text" topic does > update them to match the API at the tip of the topic branch itself. > One thing we _could_ do is to pin the target version of the codebase > for the sake of tutorial. 
IOW, the sample/*.patch may not apply
> cleanly to the version of the tree these patches were taken from,
> but would always apply cleanly to the most recent released version
> before the last update to the tutorial, or something like that.
>
> Also having to review the patch to sample/*.patch files will be
> unpleasant.

I wonder if we can ease some pain for both of the above issues by
including some scripts to "inflate" the patch files into a topic branch,
or figure out some more easily-reviewed (but more complicated, I
suppose) method for sending updates to the sample/*.patch files.

Imagining workflows like this:

Doing the tutorial:
 - In worktree a/.
 - Run a magic script which creates a worktree with the sample code, b/.
 - Read through a/Documentation/MyFirstContribution.txt and generate
   a/builtin/psuh.c, referring to b/builtin/psuh.c if confused.

Rebasing the tutorial patches:
 - In worktree a/.
 - Run a magic script which checks out a new branch at the last known
   good base for the patchset, then applies all the patches.
 - Now faced with, likely, a topic branch based on v<n-1> (where n is
   latest release).
 - `git rebase v<n> -x (make && ./bin-wrappers/git psuh)`
 - Interactively fix conflicts
 - Run a script to generate a magic interdiff from the old version of
   patches
 - Mail out magic interdiff to list and get approval
 - (Maybe maintainer does this when interdiff is happy? Maybe updater
   does this when review looks good?) Run a magic script to regenerate
   patches from rebased branch, and note somewhere they are based on
   v<n>
 - Mail sample/*.patch (based on v<n>) to list (if maintainer rolled the
   patches after interdiff approval, this step can be skipped)

(This seems to still be a lot of steps, even with the magic script..)

Alternatively, for the same process:
Updater: Run a magic script to create topic branch based on v<n-1>
  (like before)
U: `git rebase v<n> -x (make && ./bin-wrappers/git psuh)`
U: Interactively fix conflicts
U: Run a script to turn topic branch back into sample/*.patch
U: Send email with changes to sample/*.patch (this will be ugly and
  unreadable) - message ID <M1>
Reviewer: Run a magic script, providing <M1> argument, which grabs the
  diff-of-.patch and generates an interdiff, or a topic branch based
  on v<n>
R: Send comments explaining where issue is (tricky to find where to
  inline in the diff-of-.patch)
U: Reroll diff-of-.patch email
R: Accepts
Maintainer: Applies diff-of-.patch email normally

I suppose for the first suggestion, there ends up being quite a lot of
onus on the maintainer, and a lot of trust that there is no difference
between the RFC easy-to-read interdiff patchset. For the second
suggestion, there ends up being onus on the reviewers to run some
magical script. Maybe we can split the difference by expecting Updater
to provide the interdiff below the --- line? Maybe in practice the
diff-of-.patch isn't so unreadable, if it's only minor changes needed
to bring the tutorial up to latest?

I'm not sure there's a way to make this totally painless using email
tools.

 - Emily
On Wed, Jun 19, 2019 at 7:36 PM Emily Shaffer <emilyshaffer@google.com> wrote: > On Wed, Jun 19, 2019 at 04:13:35AM -0400, Eric Sunshine wrote: > > Maybe I got confused because the tiny cmd_walken() snippets followed > > one another so closely (or because I got interrupted several times > > during the review), but one way to avoid that would be to present a > > single _complete_ snippet from the start, followed by a bit of > > explanation. [...] > > Hmm. I can say that I personally would find that much more difficult to > follow interactively, and I'd be tempted to copy-and-paste and skim > through the wall of text if I was presented with such a snippet. > However, I could also imagine the reverse - someone becoming tired of > having their hand held through a fairly straightforward implementation, > when they're perfectly capable of reading a long description and would > just like to get on with it. > > (Maybe we can split the difference and present a complete patch or new > function, followed by a breakdown? That would end up even more verbose > than the current approach, though.) It might not be that important and may not need fixing considering that I read it correctly the second time, and don't know how I managed to get confused on the first read. > > As this is just a toy example, I don't care too strongly about the > > unnecessary second sentence. On the other hand, the tutorial is trying > > to teach people how to contribute to this project, and on this > > project, that sort of pointless comment is likely to be called out in > > review. In fact, given that view, the entire comment block is > > unnecessary (it doesn't add any value for anyone reviewing or reading > > the code), so it might make more sense to drop the comment from the > > code entirely, and just do a better job explaining in prose above the > > snippet why you are calling that function. For instance: > > > > ... 
Let's start the helper with the call to `prepare_revision_walk()`,
> > which does the final setup of the `rev_info` structure before it can
> > be used.
> >
> > The above observation may be more widely applicable than to just this
> > one instance. Don't use in-code comments for what should be explained
> > in prose if the in-code comment adds no value to the code itself (to
> > wit, if a reviewer would say "don't repeat in a comment what the code
> > already says clearly" or "don't use a comment to state the obvious").
>
> I'm of two minds about this. On the one hand, I'm somewhat in favor of
> leaving contextual, informational comments in the sample code, so the
> sample code can teach on its own without the tutorial (specifically, I
> mean the patchset that was sent alongside this one as RFC). On the other
> hand, you're right that adding these informational comments doesn't
> model best practices for real commits.
>
> I don't have a strong opposition to removing those comments from the
> in-place samples in the tutorial itself. But I do think it's useful to
> include them in the sample patchset, which is intended as an additional
> learning tool, rather than as a pristine code example - especially if we
> make it clear in the commit messages there.

Indeed, having the comments in the sample patch-set makes sense for
people who learn better that way (by seeing a complete piece of code).

> > > > Or make the output more useful by having it be machine-parseable (and
> > > > not localized):
> > > >
> > > >     printf("commits %d\nblobs %d\ntags %d\ntrees %d\n",
> > > >         commit_count, blob_count, tag_count, tree_count);
> > >
> > > I'm not sure whether I agree, since it's a useless toy command only for human
> > > parsing.
> >
> > True, it's not a big deal, and I don't insist upon it. But, if you
> > mention in prose that this output is easily machine-parseable, then
> > perhaps that nudges the reader a bit in the direction of thinking
> > about porcelain vs.
plumbing, which is something a contributor to this > > project eventually has to be concerned with (the sooner, the better). > > Oh, that's a very good point. I'll frame it that way - that's a handy > place to slip in some bonus context about Git. Thanks. > > NOTE: We aren't localizing the printf here because we have purposefully > formatted it in a machine-parseable way. Commands in Git are divided into > "plumbing" and "porcelain"; the "plumbing" commands are machine-parseable and > intended for use in scripts, while the "porcelain" commands are intended for > human interaction. Output intended for script usage doesn't need to be > localized; output intended for humans does. I'd go with stronger language than "doesn't need to be localized" and say instead that plumbing output "must not be localized" since scripts depend upon stable output (and stable API).
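To make the plumbing point concrete, here is a minimal standalone sketch of the machine-parseable form suggested earlier in the thread. `struct walk_counts` and `format_counts()` are hypothetical names introduced only for this illustration; the tutorial code under review keeps separate counters and calls `printf()` directly.

```c
#include <stdio.h>

/* Hypothetical container for the tallies gathered during the object walk. */
struct walk_counts {
	int commits;
	int blobs;
	int tags;
	int trees;
};

/*
 * Format the tallies as one "<name> <count>" pair per line.
 * Returns what snprintf() returns: the number of characters the full
 * output needs (excluding the terminating NUL), or a negative value
 * on error.
 */
int format_counts(char *buf, size_t len, const struct walk_counts *c)
{
	/* Deliberately not wrapped in _(): scripts rely on this exact text. */
	return snprintf(buf, len, "commits %d\nblobs %d\ntags %d\ntrees %d\n",
			c->commits, c->blobs, c->tags, c->trees);
}
```

A caller would fill in a `struct walk_counts`, format it into a buffer, and write the buffer to stdout. Since every line is a fixed keyword followed by a number, a script can split each line on the first space without caring about the user's locale.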
On 2019.06.20 14:06, Emily Shaffer wrote: > On Wed, Jun 19, 2019 at 08:17:29AM -0700, Junio C Hamano wrote: > > Emily Shaffer <emilyshaffer@google.com> writes: > > > > > Maybe there's a case for storing them as a set of patch files that are > > > revision-controlled somewhere within Documentation/? There was some > > > discussion on the IRC a few weeks ago about trying to organize these > > > tutorials into their own directory to form a sort of "Git Contribution > > > 101" course, maybe it makes sense to store there? > > > > > > Documentation/contributing/myfirstcontrib/MyFirstContrib.txt > > > Documentation/contributing/myfirstcontrib/sample/*.patch > > > Documentation/contributing/myfirstrevwalk/MyFirstRevWalk.txt > > > Documentation/contributing/myfirstrevwalk/sample/*.patch > > > > > > I don't love the idea of maintaining text patches with the expectation > > > that they should cleanly apply always,... > > > > Well, I actually think the above organization does match the intent > > of the "My first contribution codelab" perfectly. When the codebase, > > the workflow used by the project, and/or the coding or documentation > > guideline gets updated, the text that documents how to contribute to > > the project as well as the sample patches must be updated to match > > the updated reality. > > > > I agree with you that maintaining the *.patch files to always > > cleanly apply is less than ideal. A topic to update the sample > > patches and tutorial text may be competing with another topic that > > updates the very API the tutorials are teaching, and the sample > > patches may not apply cleanly when two topics are merged together, > > even if the "update sample patches and tutorial text" topic does > > update them to match the API at the tip of the topic branch itself. > > One thing we _could_ do is to pin the target version of the codebase > > for the sake of tutorial. 
IOW, the sample/*.patch may not apply > > cleanly to the version of the tree these patches were taken from, > > but would always apply cleanly to the most recent released version > > before the last update to the tutorial, or something like that. > > > > Also having to review the patch to sample/*.patch files will be > > unpleasant. > > I wonder if we can ease some pain for both of the above issues by > including some scripts to "inflate" the patch files into a topic branch, > or figure out some more easily-reviewed (but more complicated, I > suppose) method for sending updates to the sample/*.patch files. > > Imagining workflows like this: > > Doing the tutorial: > - In worktree a/. > - Run a magic script which creates a worktree with the sample code, b/. > - Read through a/Documentation/MyFirstContribution.txt and generate > a/builtins/psuh.c, referring to b/builtins/psuh.c if confused. > > Rebasing the tutorial patches: > - In worktree a/. > - Run a magic script which checks out a new branch at the last known > good base for the patchset, then applies all the patches. > - Now faced with, likely, a topic branch based on v<n-1> (where n is > latest release). > - `git rebase v<n> -x (make && ./bin-wrappers/git psuh)` > - Interactively fix conflicts > - Run a script to generate a magic interdiff from the old version of > patches > - Mail out magic interdiff to list and get approval > - (Maybe maintainer does this when interdiff is happy? Maybe updater > does this when review looks good?) Run a magic script to regenerate > patches from rebased branch, and note somewhere they are based on > v<n> > - Mail sample/*.patch (based on v<n>) to list (if maintainer rolled the > patches after interdiff approval, this step can be skipped) > > (This seems to still be a lot of steps, even with the magic script..) 
> > Alternatively, for the same process: > Updater: Run a magic script to create topic branch based on v<n-1> > (like before) > U: `git rebase v<n> -x (make && ./bin-wrappers/git psuh)` > U: Interactively fix conflicts > U: Run a script to turn topic branch back into sample/*.patch > U: Send email with changes to sample/*.patch (this will be ugly and > unreadable) - message ID <M1> > Reviewer: Run a magic script, providing <M1> argument, which grabs the > diff-of-.patch and generates an interdiff, or a topic branch based > on v<n> > R: Send comments explaining where issue is (tricky to find where to > inline in the diff-of-.patch) > U: Reroll diff-of-.patch email > R: Accepts > Maintainer: Applies diff-of-.patch email normally > > I suppose for the first suggestion, there ends up being quite a lot of > onus on the maintainer, and a lot of trust that there is no difference > between the RFC easy-to-read interdiff patchset. For the second > suggestion, there ends up being onus on the reviewers to run some > magical script. Maybe we can split the difference by expecting Updater > to provide the interdiff below the --- line? Maybe in practice the > diff-of-.patch isn't so unreadable, if it's only minor changes needed > to bring the tutorial up to latest? > > I'm not sure there's a way to make this totally painless using email > tools. Random thought about the "magic scripts": if we keep an mbox instead of a directory of *.patch files, then it seems like git-format-patch and git-am would solve the bulk of this. I don't think dealing with diffs-of-patches-in-mbox is much worse than dealing with diffs-of-patches-in-multiple-files. And for the "Doing the tutorial" workflow, it nudges the new contributor to learn git-am. But I guess the hard part here is the reviewing diffs-of-diffs part. I'm leaning towards the second option here; I personally would not feel too troubled as a reviewer by having to run an extra script. 
And as you say, diff-of-diffs may not be so bad in practice. Reviewers already see these whenever someone includes a range-diff in their v>=2 emails.
On Fri, Jul 12, 2019 at 05:39:48PM -0700, Josh Steadmon wrote: > On 2019.06.20 14:06, Emily Shaffer wrote: > > On Wed, Jun 19, 2019 at 08:17:29AM -0700, Junio C Hamano wrote: > > > Emily Shaffer <emilyshaffer@google.com> writes: > > > > > > > Maybe there's a case for storing them as a set of patch files that are > > > > revision-controlled somewhere within Documentation/? There was some > > > > discussion on the IRC a few weeks ago about trying to organize these > > > > tutorials into their own directory to form a sort of "Git Contribution > > > > 101" course, maybe it makes sense to store there? > > > > > > > > Documentation/contributing/myfirstcontrib/MyFirstContrib.txt > > > > Documentation/contributing/myfirstcontrib/sample/*.patch > > > > Documentation/contributing/myfirstrevwalk/MyFirstRevWalk.txt > > > > Documentation/contributing/myfirstrevwalk/sample/*.patch > > > > > > > > I don't love the idea of maintaining text patches with the expectation > > > > that they should cleanly apply always,... > > > > > > Well, I actually think the above organization does match the intent > > > of the "My first contribution codelab" perfectly. When the codebase, > > > the workflow used by the project, and/or the coding or documentation > > > guideline gets updated, the text that documents how to contribute to > > > the project as well as the sample patches must be updated to match > > > the updated reality. > > > > > > I agree with you that maintaining the *.patch files to always > > > cleanly apply is less than ideal. A topic to update the sample > > > patches and tutorial text may be competing with another topic that > > > updates the very API the tutorials are teaching, and the sample > > > patches may not apply cleanly when two topics are merged together, > > > even if the "update sample patches and tutorial text" topic does > > > update them to match the API at the tip of the topic branch itself. 
> > > One thing we _could_ do is to pin the target version of the codebase
> > > for the sake of tutorial. IOW, the sample/*.patch may not apply
> > > cleanly to the version of the tree these patches were taken from,
> > > but would always apply cleanly to the most recent released version
> > > before the last update to the tutorial, or something like that.
> > >
> > > Also having to review the patch to sample/*.patch files will be
> > > unpleasant.
> >
> > I wonder if we can ease some pain for both of the above issues by
> > including some scripts to "inflate" the patch files into a topic branch,
> > or figure out some more easily-reviewed (but more complicated, I
> > suppose) method for sending updates to the sample/*.patch files.
> >
> > Imagining workflows like this:
> >
> > Doing the tutorial:
> > - In worktree a/.
> > - Run a magic script which creates a worktree with the sample code, b/.
> > - Read through a/Documentation/MyFirstContribution.txt and generate
> >   a/builtins/psuh.c, referring to b/builtins/psuh.c if confused.
> >
> > Rebasing the tutorial patches:
> > - In worktree a/.
> > - Run a magic script which checks out a new branch at the last known
> >   good base for the patchset, then applies all the patches.
> > - Now faced with, likely, a topic branch based on v<n-1> (where n is
> >   latest release).
> > - `git rebase v<n> -x (make && ./bin-wrappers/git psuh)`
> > - Interactively fix conflicts
> > - Run a script to generate a magic interdiff from the old version of
> >   patches
> > - Mail out magic interdiff to list and get approval
> > - (Maybe maintainer does this when interdiff is happy? Maybe updater
> >   does this when review looks good?) Run a magic script to regenerate
> >   patches from rebased branch, and note somewhere they are based on
> >   v<n>
> > - Mail sample/*.patch (based on v<n>) to list (if maintainer rolled the
> >   patches after interdiff approval, this step can be skipped)
> >
> > (This seems to still be a lot of steps, even with the magic script..)
> >
> > Alternatively, for the same process:
> > Updater: Run a magic script to create topic branch based on v<n-1>
> >   (like before)
> > U: `git rebase v<n> -x (make && ./bin-wrappers/git psuh)`
> > U: Interactively fix conflicts
> > U: Run a script to turn topic branch back into sample/*.patch
> > U: Send email with changes to sample/*.patch (this will be ugly and
> >   unreadable) - message ID <M1>
> > Reviewer: Run a magic script, providing <M1> argument, which grabs the
> >   diff-of-.patch and generates an interdiff, or a topic branch based
> >   on v<n>
> > R: Send comments explaining where issue is (tricky to find where to
> >   inline in the diff-of-.patch)
> > U: Reroll diff-of-.patch email
> > R: Accepts
> > Maintainer: Applies diff-of-.patch email normally
> >
> > I suppose for the first suggestion, there ends up being quite a lot of
> > onus on the maintainer, and a lot of trust that there is no difference
> > between the RFC easy-to-read interdiff patchset. For the second
> > suggestion, there ends up being onus on the reviewers to run some
> > magical script. Maybe we can split the difference by expecting Updater
> > to provide the interdiff below the --- line? Maybe in practice the
> > diff-of-.patch isn't so unreadable, if it's only minor changes needed
> > to bring the tutorial up to latest?
> >
> > I'm not sure there's a way to make this totally painless using email
> > tools.
>
> Random thought about the "magic scripts": if we keep an mbox instead of
> a directory of *.patch files, then it seems like git-format-patch and
> git-am would solve the bulk of this. I don't think dealing with
> diffs-of-patches-in-mbox is much worse than dealing with
> diffs-of-patches-in-multiple-files. And for the "Doing the tutorial"
> workflow, it nudges the new contributor to learn git-am.
>
> But I guess the hard part here is the reviewing diffs-of-diffs part.
> I'm leaning towards the second option here; I personally would not feel
> too troubled as a reviewer by having to run an extra script. And as you
> say, diff-of-diffs may not be so bad in practice. Reviewers already see
> these whenever someone includes a range-diff in their v>=2 emails.

There was also some suggestion of instead checking in ed scripts or
similar to populate the changes. On one hand, it might be nicer, as
there aren't diff markers on the front of all the code... but on the
other hand, I'm not sure how many folks are familiar with ed (I know I'm
not) and it might be complex to indicate where to insert changes.

I have been in a position of reviewing diff-of-.patch in a past life,
albeit via Gerrit, and it's not the worst when the code is simple (as we
should always hope this example tutorial code would be).

 - Emily
Emily Shaffer <emilyshaffer@google.com> writes:

> I have been in a position of reviewing diff-of-.patch in a past life,
> albeit via Gerrit, and it's not the worst when the code is simple (as we
> should always hope this example tutorial code would be).

I personally think a directory full of patch files is OK. I am not
sure if they (together with this rev walk tutorial) belong to the
main part of the project, though.
diff --git a/Documentation/.gitignore b/Documentation/.gitignore
index 9022d48355..0e3df737c5 100644
--- a/Documentation/.gitignore
+++ b/Documentation/.gitignore
@@ -12,6 +12,7 @@ cmds-*.txt
 mergetools-*.txt
 manpage-base-url.xsl
 SubmittingPatches.txt
+MyFirstRevWalk.txt
 tmp-doc-diff/
 GIT-ASCIIDOCFLAGS
 /GIT-EXCLUDED-PROGRAMS
diff --git a/Documentation/Makefile b/Documentation/Makefile
index dbf5a0f276..d57b80962f 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -77,6 +77,7 @@ API_DOCS = $(patsubst %.txt,%,$(filter-out technical/api-index-skel.txt technica
 SP_ARTICLES += $(API_DOCS)

 TECH_DOCS += SubmittingPatches
+TECH_DOCS += MyFirstRevWalk
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/MyFirstRevWalk.txt b/Documentation/MyFirstRevWalk.txt
new file mode 100644
index 0000000000..494c09d1fa
--- /dev/null
+++ b/Documentation/MyFirstRevWalk.txt
@@ -0,0 +1,826 @@
+My First Revision Walk
+======================
+
+== What's a Revision Walk?
+
+The revision walk is a key concept in Git - this is the process that underpins
+operations like `git log`, `git blame`, and `git reflog`. Beginning at HEAD, the
+list of objects is found by walking parent relationships between objects. The
+revision walk can also be used to determine whether or not a given object is
+reachable from the current HEAD pointer.
+
+=== Related Reading
+
+- `Documentation/user-manual.txt` under "Hacking Git" contains some coverage of
+  the revision walker in its various incarnations.
+- `Documentation/technical/api-revision-walking.txt`
+- https://eagain.net/articles/git-for-computer-scientists/[Git for Computer Scientists]
+  gives a good overview of the types of objects in Git and what your revision
+  walk is really describing.
+
+== Setting Up
+
+Create a new branch from `master`.
+
+----
+git checkout -b revwalk origin/master
+----
+
+We'll put our fiddling into a new command. For fun, let's name it `git walken`.
+Open up a new file `builtin/walken.c` and set up the command handler:
+
+----
+/*
+ * "git walken"
+ *
+ * Part of the "My First Revision Walk" tutorial.
+ */
+
+#include "builtin.h"
+#include <stdio.h>
+
+int cmd_walken(int argc, const char **argv, const char *prefix)
+{
+	printf(_("cmd_walken incoming...\n"));
+	return 0;
+}
+----
+
+Add usage text and `-h` handling, in order to pass the test suite:
+
+----
+static const char * const walken_usage[] = {
+	N_("git walken"),
+	NULL,
+};
+
+int cmd_walken(int argc, const char **argv, const char *prefix)
+{
+	struct option options[] = {
+		OPT_END()
+	};
+
+	argc = parse_options(argc, argv, prefix, options, walken_usage, 0);
+
+	...
+}
+----
+
+Also add the relevant line in `builtin.h` near `cmd_whatchanged()`:
+
+----
+extern int cmd_walken(int argc, const char **argv, const char *prefix);
+----
+
+Include the command in `git.c` in `commands[]` near the entry for `whatchanged`:
+
+----
+{ "walken", cmd_walken, RUN_SETUP },
+----
+
+Add it to the `Makefile` near the line for `builtin/worktree.o`:
+
+----
+BUILTIN_OBJS += builtin/walken.o
+----
+
+Build and test out your command, without forgetting to ensure the `DEVELOPER`
+flag is set:
+
+----
+echo DEVELOPER=1 >config.mak
+make
+./bin-wrappers/git walken
+----
+
+NOTE: For a more exhaustive overview of the new command process, take a look at
+`Documentation/MyFirstContribution.txt`.
+
+NOTE: A reference implementation can be found at TODO LINK.
+
+=== `struct rev_cmdline_info`
+
+The definition of `struct rev_cmdline_info` can be found in `revision.h`.
+
+This struct is contained within the `rev_info` struct and is used to reflect
+parameters provided by the user over the CLI.
+
+`nr` represents the number of `rev_cmdline_entry` present in the array.
+
+`alloc` is used by the `ALLOC_GROW` macro. Check
+`Documentation/technical/api-allocation-growing.txt` - this variable is used to
+track the allocated size of the list.
+
+Per entry, we find:
+
+`item` is the object provided upon which to base the revision walk. Items in Git
+can be blobs, trees, commits, or tags. (See `Documentation/gittutorial-2.txt`.)
+
+`name` is the SHA-1 of the object - a 40-digit hex string you may be familiar
+with from using Git to organize your source in the past. Check the tutorial
+mentioned above towards the top for a discussion of where the SHA-1 can come
+from.
+
+`whence` indicates some information about what to do with the parents of the
+specified object. We'll explore this flag more later on; take a look at
+`Documentation/revisions.txt` to get an idea of what could set the `whence`
+value.
+
+`flags` are used to hint at the beginning of the revision walk and are the first
+block under the `#include`s in `revision.h`. The most likely ones to be set in
+the `rev_cmdline_info` are `UNINTERESTING` and `BOTTOM`, but these same flags
+can be used during the walk, as well.
+
+=== `struct rev_info`
+
+This one is quite a bit longer, and many fields are only used during the walk
+by `revision.c` - not configuration options. Most of the configurable flags in
+`struct rev_info` have a mirror in `Documentation/rev-list-options.txt`. It's a
+good idea to take some time and read through that document.
+
+== Basic Commit Walk
+
+First, let's see if we can replicate the output of `git log --oneline`. We'll
+refer back to the implementation frequently to discover norms when performing
+a revision walk of our own.
+
+We'll need all the commits, in order, which preceded our current commit. We will
+also need to know the name and subject.
+
+Ideally, we will also be able to find out which ones are currently at the tip of
+various branches.
+
+=== Setting Up
+
+Preparing for your revision walk has some distinct stages.
+
+1. Perform default setup for this mode, and others which may be invoked.
+2. Check configuration files for relevant settings.
+3. Set up the rev_info struct.
+4. Tweak the initialized rev_info to suit the current walk.
+5. Prepare the rev_info for the walk.
+6. Iterate over the objects, processing each one.
+
+==== Default Setups
+
+Before you begin to examine user configuration for your revision walk, it's
+common practice for you to initialize to default any switches that your command
+may have, as well as ask any other components you may invoke to initialize as
+well. `git log` does this in `init_log_defaults()`; in that case, one global
+`decoration_style` is initialized, as well as the grep and diff-UI components.
+
+For our first example within `git walken`, we don't intend to invoke any other
+components, and we don't have any configuration to do. However,
+we may want to add some later, so for now, we can add an empty placeholder.
+Create a new function in `builtin/walken.c`:
+
+----
+static void init_walken_defaults(void)
+{
+	/* We don't actually need the same components `git log` does; leave this
+	 * empty for now.
+	 */
+}
+----
+
+Make sure to add a line invoking it inside of `cmd_walken()`.
+
+----
+int cmd_walken(int argc, const char **argv, const char *prefix)
+{
+	init_walken_defaults();
+}
+----
+
+==== Configuring From `.gitconfig`
+
+Next, we should have a look at any relevant configuration settings (i.e.,
+settings readable and settable from `git config`). This is done by providing a
+callback to `git_config()`; within that callback, you can also invoke the config
+callbacks of other components which need to intercept these options. Your
+callback will be invoked once for each configuration value which Git knows about
+(global, local, worktree, etc.).
+
+Similarly to the default values, we don't have anything to do here yet
+ourselves; however, we should call `git_default_config()` if we aren't calling
+any other existing config callbacks.
+
+TODO: Use the "modern" configset API
+
+Add a new function to `builtin/walken.c`:
+
+----
+static int git_walken_config(const char *var, const char *value, void *cb)
+{
+	/* For now, let's not bother with anything. */
+	return git_default_config(var, value, cb);
+}
+----
+
+Make sure to invoke `git_config()` with it in your `cmd_walken()`:
+
+----
+int cmd_walken(int argc, const char **argv, const char *prefix)
+{
+	...
+
+	git_config(git_walken_config, NULL);
+}
+----
+
+// TODO: Checking CLI options
+
+==== Setting Up `rev_info`
+
+Now that we've gathered external configuration and options, it's time to
+initialize the `rev_info` object which we will use to perform the walk. This is
+typically done by calling `repo_init_revisions()` with the repository you intend
+to target, as well as the prefix and your `rev_info` struct.
+
+Add the `struct rev_info` and the `repo_init_revisions()` call:
+----
+int cmd_walken(int argc, const char **argv, const char *prefix)
+{
+	/* This can go wherever you like in your declarations. */
+	struct rev_info rev;
+	...
+
+	/* This should go after the git_config() call. */
+	repo_init_revisions(the_repository, &rev, prefix);
+}
+----
+
+==== Tweaking `rev_info` For the Walk
+
+We're getting close, but we're still not quite ready to go. Now that `rev` is
+initialized, we can modify it to fit our needs. This is usually done within a
+helper for clarity, so let's add one:
+
+----
+static void final_rev_info_setup(struct rev_info *rev)
+{
+	/* We want to mimic the appearance of `git log --oneline`, so let's
+	 * force oneline format. */
+	get_commit_format("oneline", rev);
+
+	/* Start our revision walk at HEAD. */
+	add_head_to_pending(rev);
+}
+----
+
+[NOTE]
+====
+Instead of using the shorthand `add_head_to_pending()`, you could do
+something like this:
+----
+	struct setup_revision_opt opt;
+
+	memset(&opt, 0, sizeof(opt));
+	opt.def = "HEAD";
+	opt.revarg_opt = REVARG_COMMITTISH;
+	setup_revisions(argc, argv, rev, &opt);
+----
+Using a `setup_revision_opt` gives you finer control over your walk's starting
+point.
+====
+
+Then let's invoke `final_rev_info_setup()` after the call to
+`repo_init_revisions()`:
+
+----
+int cmd_walken(int argc, const char **argv, const char *prefix)
+{
+	...
+
+	final_rev_info_setup(&rev);
+}
+----
+
+Later, we may wish to add more arguments to `final_rev_info_setup()`. But for
+now, this is all we need.
+
+==== Preparing `rev_info` For the Walk
+
+Now that `rev` is all initialized and configured, we've got one more setup step
+before we get rolling. We can do this in a helper, which will both prepare the
+`rev_info` for the walk, and perform the walk itself. Let's start the helper
+with the call to `prepare_revision_walk()`.
+
+----
+static int walken_commit_walk(struct rev_info *rev)
+{
+	/* prepare_revision_walk() gets the final steps ready for a revision
+	 * walk. We check the return value for errors. */
+	if (prepare_revision_walk(rev))
+		die(_("revision walk setup failed"));
+}
+----
+
+==== Performing the Walk!
+
+Finally! We are ready to begin the walk itself. Now we can see that `rev_info`
+can also be used as an iterator; we move to the next item in the walk by using
+`get_revision()` repeatedly. Add the listed variable declarations at the top and
+the walk loop below the `prepare_revision_walk()` call within your
+`walken_commit_walk()`:
+
+----
+static int walken_commit_walk(struct rev_info *rev)
+{
+	struct commit *commit;
+	struct strbuf prettybuf;
+	strbuf_init(&prettybuf, 0);
+
+	...
+
+	while ((commit = get_revision(rev)) != NULL) {
+		if (commit == NULL)
+			continue;
+
+		strbuf_reset(&prettybuf);
+		pp_commit_easy(CMIT_FMT_ONELINE, commit, &prettybuf);
+		printf(_("%s\n"), prettybuf.buf);
+	}
+
+	return 0;
+}
+----
+
+Give it a shot.
+
+----
+$ make
+$ ./bin-wrappers/git walken
+----
+
+You should see all of the subject lines of all the commits in
+your tree's history, in order, ending with the initial commit, "Initial revision
+of "git", the information manager from hell". Congratulations! You've written
+your first revision walk. You can play with printing some additional fields
+from each commit if you're curious; have a look at the functions available in
+`commit.h`.
+
+=== Adding a Filter
+
+Next, let's try to filter the commits we see based on their author. This is
+equivalent to running `git log --author=<pattern>`. We can add a filter by
+modifying `rev_info.grep_filter`, which is a `struct grep_opt`.
+
+First some setup. Add `init_grep_defaults()` to `init_walken_defaults()` and add
+`grep_config()` to `git_walken_config()`:
+
+----
+static void init_walken_defaults(void)
+{
+	init_grep_defaults(the_repository);
+}
+
+...
+
+static int git_walken_config(const char *var, const char *value, void *cb)
+{
+	grep_config(var, value, cb);
+	return git_default_config(var, value, cb);
+}
+----
+
+Next, we can modify the `grep_filter`. This is done with convenience functions
+found in `grep.h`. For fun, we're filtering to only commits from folks using a
+gmail.com email address - a not-very-precise guess at who may be working on Git
+as a hobby. Since we're checking the author, which is a specific line in the
+header, we'll use the `append_header_grep_pattern()` helper. We can use
+the `enum grep_header_field` to indicate which part of the commit header we want
+to search.
+
+In `final_rev_info_setup()`, add your filter line:
+
+----
+static void final_rev_info_setup(int argc, const char **argv,
+		const char *prefix, struct rev_info *rev)
+{
+	...
+
+	append_header_grep_pattern(&rev->grep_filter, GREP_HEADER_AUTHOR,
+		"gmail");
+	compile_grep_patterns(&rev->grep_filter);
+
+	...
+}
+----
+
+`append_header_grep_pattern()` adds your new "gmail" pattern to `rev_info`, but
+it won't work unless we compile it with `compile_grep_patterns()`.
+
+NOTE: If you are using `setup_revisions()` (for example, if you are passing a
+`setup_revision_opt` instead of using `add_head_to_pending()`), you don't need
+to call `compile_grep_patterns()` because `setup_revisions()` calls it for you.
+
+NOTE: We could add the same filter via the `append_grep_pattern()` helper if we
+wanted to, but `append_header_grep_pattern()` adds the `enum grep_context` and
+`enum grep_pat_token` for us.
+
+=== Changing the Order
+
+There are a few ways that we can change the order of the commits during a
+revision walk. Firstly, we can use the `enum rev_sort_order` to choose from some
+sane orderings.
+
+Let's see what happens when we run with `REV_SORT_BY_COMMIT_DATE` as opposed to
+`REV_SORT_BY_AUTHOR_DATE`. Add the following:
+
+----
+static void final_rev_info_setup(int argc, const char **argv,
+		const char *prefix, struct rev_info *rev)
+{
+	...
+
+	rev->topo_order = 1;
+	rev->sort_order = REV_SORT_BY_COMMIT_DATE;
+
+	...
+}
+----
+
+Let's output this into a file so we can easily diff it with the walk sorted by
+author date.
+
+----
+$ make
+$ ./bin-wrappers/git walken > commit-date.txt
+----
+
+Then, let's sort by author date and run it again.
+
+----
+static void final_rev_info_setup(int argc, const char **argv,
+		const char *prefix, struct rev_info *rev)
+{
+	...
+
+	rev->topo_order = 1;
+	rev->sort_order = REV_SORT_BY_AUTHOR_DATE;
+
+	...
+}
+----
+
+----
+$ make
+$ ./bin-wrappers/git walken > author-date.txt
+----
+
+Finally, compare the two. This is a little less helpful without object names or
+dates, but hopefully we get the idea.
+
+----
+$ diff -u commit-date.txt author-date.txt
+----
+
+This display is an indicator for the latency between publishing a commit for
+review the first time, and getting it actually merged into master.
+
+Let's try one more reordering of commits. `rev_info` exposes a `reverse` flag.
+However, it needs to be applied after `add_head_to_pending()` is called. Find
+the line where you call `add_head_to_pending()` and set the `reverse` flag right
+after:
+
+----
+static void final_rev_info_setup(int argc, const char **argv, const char *prefix,
+		struct rev_info *rev)
+{
+	...
+
+	add_head_to_pending(rev);
+	rev->reverse = 1;
+
+	...
+}
+----
+
+Run your walk again and note the difference in order. (If you remove the grep
+pattern, you should see the last commit this call gives you as your current
+HEAD.)
+
+== Basic Object Walk
+
+So far we've been walking only commits. But Git has more types of objects than
+that! Let's see if we can walk _all_ objects, and find out some information
+about each one.
+
+We can base our work on an example. `git pack-objects` prepares all kinds of
+objects for packing into a bitmap or packfile. The work we are interested in
+resides in `builtin/pack-objects.c:get_object_list()`; examination of that
+function shows that the all-object walk is being performed by
+`traverse_commit_list()` or `traverse_commit_list_filtered()`. Those two
+functions reside in `list-objects.c`; examining the source shows that, despite
+the name, these functions traverse all kinds of objects. Let's have a look at
+the arguments to `traverse_commit_list_filtered()`, which are a superset of the
+arguments to the unfiltered version.
+
+- `struct list_objects_filter_options *filter_options`: This is a struct which
+  stores a filter-spec as outlined in `Documentation/rev-list-options.txt`.
+- `struct rev_info *revs`: This is the `rev_info` used for the walk.
+- `show_commit_fn show_commit`: A callback which will be used to handle each
+  individual commit object.
+- `show_object_fn show_object`: A callback which will be used to handle each
+  non-commit object (so each blob, tree, or tag).
+- `void *show_data`: A context buffer which is passed in turn to `show_commit`
+  and `show_object`.
+- `struct oidset *omitted`: A set of object IDs which the provided
+  filter caused to be omitted.
+
+It looks like this `traverse_commit_list_filtered()` uses callbacks we provide
+instead of needing us to call it repeatedly ourselves. Cool! Let's add the
+callbacks first.
+
+For the sake of this tutorial, we'll simply keep track of how many of each kind
+of object we find. At file scope in `builtin/walken.c` add the following
+tracking variables:
+
+----
+static int commit_count;
+static int tag_count;
+static int blob_count;
+static int tree_count;
+----
+
+Commits are handled by a different callback than other objects; let's do that
+one first:
+
+----
+static void walken_show_commit(struct commit *cmt, void *buf)
+{
+	commit_count++;
+}
+----
+
+Since we have the `struct commit` object, we can look at all the same parts that
+we looked at in our earlier commit-only walk. For the sake of this tutorial,
+though, we'll just increment the commit counter and move on.
+
+The callback for non-commits is a little different, as we'll need to check
+which kind of object we're dealing with:
+
+----
+static void walken_show_object(struct object *obj, const char *str, void *buf)
+{
+	switch (obj->type) {
+	case OBJ_TREE:
+		tree_count++;
+		break;
+	case OBJ_BLOB:
+		blob_count++;
+		break;
+	case OBJ_TAG:
+		tag_count++;
+		break;
+	case OBJ_COMMIT:
+		printf(_("Unexpectedly encountered a commit in "
+			 "walken_show_object!\n"));
+		commit_count++;
+		break;
+	default:
+		printf(_("Unexpected object type %s!\n"),
+		       type_name(obj->type));
+		break;
+	}
+}
+----
+
+To help assure us that we aren't double-counting commits, we'll include some
+complaining if a commit object is routed through our non-commit callback; we'll
+also complain if we see an invalid object type.
+
+Our main object walk implementation is substantially different from our commit
+walk implementation, so let's make a new function to perform the object walk. We
+can perform setup which is applicable to all objects here, too, to keep separate
+from setup which is applicable to commit-only walks.
+
+----
+static int walken_object_walk(struct rev_info *rev)
+{
+}
+----
+
+We'll start by enabling all types of objects in the `struct rev_info`, and
+asking to have our trees and blobs shown in commit order. We'll also exclude
+promisors as the walk becomes more complicated with those types of objects. When
+our settings are ready, we'll perform the normal revision walk setup and
+initialize our tracking variables.
+
+----
+static int walken_object_walk(struct rev_info *rev)
+{
+	rev->tree_objects = 1;
+	rev->blob_objects = 1;
+	rev->tag_objects = 1;
+	rev->tree_blobs_in_commit_order = 1;
+	rev->exclude_promisor_objects = 1;
+
+	if (prepare_revision_walk(rev))
+		die(_("revision walk setup failed"));
+
+	commit_count = 0;
+	tag_count = 0;
+	blob_count = 0;
+	tree_count = 0;
+----
+
+Unless you cloned or fetched your repository earlier with a filter,
+`exclude_promisor_objects` is unlikely to make a difference, but we'll turn it
+on just to make sure our lives are simple. We'll also turn on
+`tree_blobs_in_commit_order`, which means that we will walk a commit's tree and
+everything it points to immediately after we find each commit, as opposed to
+waiting for the end and walking through all trees after the commit history has
+been discovered.
+
+Let's start by calling just the unfiltered walk and reporting our counts.
+Complete your implementation of `walken_object_walk()`:
+
+----
+	traverse_commit_list(rev, walken_show_commit, walken_show_object, NULL);
+
+	printf(_("Object walk completed. Found %d commits, %d blobs, %d tags, "
+		 "and %d trees.\n"), commit_count, blob_count, tag_count,
+		 tree_count);
+
+	return 0;
+}
+----
+
+Finally, we'll ask `cmd_walken()` to use the object walk instead. Discussing
+command line options is out of scope for this tutorial, so we'll just hardcode
+a branch we can change at compile time. Where you call `final_rev_info_setup()`
+and `walken_commit_walk()`, instead branch like so:
+
+----
+	if (1) {
+		add_head_to_pending(&rev);
+		walken_object_walk(&rev);
+	} else {
+		final_rev_info_setup(argc, argv, prefix, &rev);
+		walken_commit_walk(&rev);
+	}
+----
+
+NOTE: For simplicity, we've avoided all the filters and sorts we applied in
+`final_rev_info_setup()` and simply added `HEAD` to our pending queue. If you
+want, you can certainly use the filters we added before by moving
+`final_rev_info_setup()` out of the conditional and removing the call to
+`add_head_to_pending()`.
+
+Now we can try to run our command! It should take noticeably longer than the
+commit walk, but an examination of the output will give you an idea why - for
+example:
+
+----
+Object walk completed. Found 55733 commits, 100274 blobs, 0 tags, and 104210 trees.
+----
+
+This makes sense. We have more trees than commits because the Git project has
+lots of subdirectories which can change, plus at least one tree per commit. We
+have no tags because we started on a commit (`HEAD`) and while tags can point to
+commits, commits can't point to tags.
+
+NOTE: You will have different counts when you run this yourself! The number of
+objects grows along with the Git project.
+
+=== Adding a Filter
+
+There are a handful of filters that we can apply to the object walk laid out in
+`Documentation/rev-list-options.txt`. These filters are typically useful for
+operations such as creating packfiles or performing a partial or shallow clone.
+They are defined in `list-objects-filter-options.h`. For the purposes of this
+tutorial we will use the "tree:1" filter, which causes the walk to omit all
+trees and blobs which are not directly referenced by commits reachable from the
+commit in `pending` when the walk begins. (In our case, that means we omit trees
+and blobs not directly referenced by HEAD or HEAD's history.)
+
+First, we'll need to `#include "list-objects-filter-options.h"`. Then, we can
+set up the `struct list_objects_filter_options` and `struct oidset` at the top
+of `walken_object_walk()`:
+
+----
+static int walken_object_walk(struct rev_info *rev)
+{
+	struct list_objects_filter_options filter_options = { 0 };
+	struct oidset omitted;
+	oidset_init(&omitted, 0);
+	...
+----
+
+Then, for the sake of simplicity, we'll add a simple build-time branch to use
+our filter or not. Replace the line calling `traverse_commit_list()` with the
+following, which will remind us which kind of walk we've just performed:
+
+----
+	if (1) {
+		/* Unfiltered: */
+		printf(_("Unfiltered object walk.\n"));
+		traverse_commit_list(rev, walken_show_commit,
+				walken_show_object, NULL);
+	} else {
+		printf(_("Filtered object walk with filterspec 'tree:1'.\n"));
+		/*
+		 * We can parse a tree depth of 1 to demonstrate the kind of
+		 * filtering that could occur e.g. during shallow cloning.
+		 */
+		parse_list_objects_filter(&filter_options, "tree:1");
+
+		traverse_commit_list_filtered(&filter_options, rev,
+			walken_show_commit, walken_show_object, NULL, &omitted);
+	}
+----
+
+`struct list_objects_filter_options` is usually built directly from a command
+line argument, so the module provides an easy way to build one from a string.
+Even though we aren't taking user input right now, we can still build one with
+a hardcoded string using `parse_list_objects_filter()`.
+
+After we run `traverse_commit_list_filtered()` we would also be able to examine
+`omitted`, which is a set of all objects we did not include in our walk.
+Since all omitted objects are included, the performance of
+`traverse_commit_list_filtered()` with a non-null `omitted` argument is comparable
+with the performance of `traverse_commit_list()`; so for our purposes, we leave
+it null. It's easy to provide one and iterate over it, though - check `oidset.h`
+for the declaration of the accessor methods for `oidset`.
+
+With the filter spec "tree:1", we are expecting to see _only_ the root tree for
+each commit; therefore, the tree object count should be less than or equal to
+the number of commits. (For an example of why that's true: a commit made by
+`git revert HEAD` points to the same tree object as its grandparent.)
+
+=== Changing the Order
+
+Finally, let's demonstrate that you can also reorder walks of all objects, not
+just walks of commits. First, we'll make our handlers chattier - modify
+`walken_show_commit()` and `walken_show_object()` to print each object as they go:
+
+----
+static void walken_show_commit(struct commit *cmt, void *buf)
+{
+	printf(_("commit: %s\n"), oid_to_hex(&cmt->object.oid));
+	commit_count++;
+}
+
+static void walken_show_object(struct object *obj, const char *str, void *buf)
+{
+	printf(_("%s: %s\n"), type_name(obj->type), oid_to_hex(&obj->oid));
+	...
+}
+----
+
+(Try to leave the counter increment logic in place in `walken_show_object()`.)
+
+With only that change, run again (but save yourself some scrollback):
+
+----
+$ ./bin-wrappers/git walken | head -n 10
+----
+
+Take a look at the top commit with `git show` and the OID you printed; it should
+be the same as the output of `git show HEAD`.
+
+Next, let's change a setting on our `struct rev_info` within
+`walken_object_walk()`. Find where you're changing the other settings on `rev`,
+such as `rev->tree_objects` and `rev->tree_blobs_in_commit_order`, and add
+another setting at the bottom:
+
+----
+	...
+
+	rev->tree_objects = 1;
+	rev->blob_objects = 1;
+	rev->tag_objects = 1;
+	rev->tree_blobs_in_commit_order = 1;
+	rev->exclude_promisor_objects = 1;
+	rev->reverse = 1;
+
+	...
+----
+
+Now, run again, but this time, let's grab the last handful of objects instead
+of the first handful:
+
+----
+$ make
+$ ./bin-wrappers/git walken | tail -n 10
+----
+
+The last commit object given should have the same OID as the one we saw at the
+top before, and running `git show <oid>` with that OID should give you again
+the same results as `git show HEAD`. Furthermore, if you run and examine the
+first ten lines again (with `head` instead of `tail` like we did before applying
+the `reverse` setting), you should see that now the first commit printed is the
+initial commit, `e83c5163`.
+
+== Wrapping Up
+
+Let's review. In this tutorial, we:
+
+- Built a commit walk from the ground up
+- Enabled a grep filter for that commit walk
+- Changed the sort order of that filtered commit walk
+- Built an object walk (tags, commits, trees, and blobs) from the ground up
+- Learned how to add a filter-spec to an object walk
+- Changed the display order of the filtered object walk
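
For readers who want to see the shape of the commit walk from the tutorial above without building Git, the following is a minimal toy sketch in plain C. It is not Git code and uses none of Git's headers: `toy_commit`, `toy_walk`, `toy_prepare_walk()` and `toy_get_revision()` are hypothetical names invented here, and the fixed 64-entry buffer and single-parent history are simplifying assumptions. It only illustrates the "prepare once, then pull items until NULL" pattern that the tutorial builds with `prepare_revision_walk()` and `get_revision()`, plus a `reverse` switch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a commit; this is NOT Git's `struct commit`. */
struct toy_commit {
	const char *subject;
	struct toy_commit *parent;	/* single parent; NULL for the root */
};

/* Toy stand-in for the walk state that `struct rev_info` carries. */
struct toy_walk {
	struct toy_commit *pending;	/* starting point, playing the role of HEAD */
	int reverse;			/* when set, the oldest commit comes out first */
	struct toy_commit *list[64];	/* output order, filled in by the prepare step */
	size_t nr, next;
};

/*
 * Plays the role of the prepare step in this toy: follow parent
 * pointers from the starting commit, record the order, and flip it
 * if `reverse` is set. Returns -1 if the history does not fit.
 */
static int toy_prepare_walk(struct toy_walk *w)
{
	struct toy_commit *c;
	size_t i;

	w->nr = w->next = 0;
	for (c = w->pending; c; c = c->parent) {
		if (w->nr == sizeof(w->list) / sizeof(w->list[0]))
			return -1;
		w->list[w->nr++] = c;
	}
	if (w->reverse) {
		for (i = 0; i < w->nr / 2; i++) {
			struct toy_commit *tmp = w->list[i];
			w->list[i] = w->list[w->nr - 1 - i];
			w->list[w->nr - 1 - i] = tmp;
		}
	}
	return 0;
}

/* Iterator step: hand out the next commit, or NULL when the walk is done. */
static struct toy_commit *toy_get_revision(struct toy_walk *w)
{
	assert(w->next <= w->nr);
	return w->next < w->nr ? w->list[w->next++] : NULL;
}
```

A caller would set `pending` to its newest commit, call `toy_prepare_walk()`, and then loop on `toy_get_revision()` until it returns NULL, which is the same loop shape as `walken_commit_walk()` in the patch. Real revision walking also handles merges, date ordering and filtering, none of which this sketch attempts.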
Existing documentation on revision walks seems to be primarily intended
as a reference for those already familiar with the procedure. This
tutorial attempts to give an entry-level guide to a couple of bare-bones
revision walks so that new Git contributors can learn the concepts
without having to wade through options parsing or special casing.

The target audience is a Git contributor who is just getting started
with the concept of revision walking. The goal is to prepare this
contributor to be able to understand and modify existing commands which
perform revision walks more easily, although it will also prepare
contributors to create new commands which perform walks.

The tutorial covers a basic overview of the structs involved during
revision walk, setting up a basic commit walk, setting up a basic
all-object walk, and adding some configuration changes to both walk
types. It intentionally does not cover how to create new commands or
search for options from the command line or gitconfigs.

There is an associated patchset at
https://github.com/nasamuffin/git/tree/revwalk that contains a reference
implementation of the code generated by this tutorial.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---
This one is longer than the MyFirstContribution one, thanks in advance
to anybody with the wherewithal to review this.

I'll also be mailing an RFC patchset In-Reply-To this message; the RFC
patchset should not be merged to Git, as I intend to host it in my own
mirror as an example. I hosted a similar example for the
MyFirstContribution tutorial; it's visible at
https://github.com/nasamuffin/git/tree/psuh. There might be a better
place to host these so I don't "own" them but I'm not sure what it is;
keeping them as a live branch somewhere struck me as an okay way to keep
them from getting stale.

Looking forward to hearing everyone's comments!

 - Emily

 Documentation/.gitignore         |   1 +
 Documentation/Makefile           |   1 +
 Documentation/MyFirstRevWalk.txt | 826 +++++++++++++++++++++++++++++++
 3 files changed, 828 insertions(+)
 create mode 100644 Documentation/MyFirstRevWalk.txt
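
As a companion to the object walk half of the tutorial, here is a small toy sketch in plain C of the callback dispatch that the patch describes for `traverse_commit_list()`: commits go to one callback, every other object type goes to a second one, and a context pointer is handed to both. Everything named `toy_*` or `count_*` is hypothetical and invented for illustration; it is not Git's API, and the flat array of objects is an assumption standing in for a real traversal. One deliberate difference from the patch: the counters live in a struct passed through the context pointer instead of file-scope variables, which is only one possible design.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object kinds; these are NOT Git's `enum object_type` values. */
enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_object {
	enum toy_type type;
	const char *name;
};

/* Counters carried through the context pointer instead of globals. */
struct toy_counts {
	int commits, trees, blobs, tags;
};

typedef void (*toy_show_commit_fn)(const struct toy_object *, void *);
typedef void (*toy_show_object_fn)(const struct toy_object *, void *);

/*
 * Visit every object once: commits are routed to show_commit, all
 * other kinds to show_object, and show_data is passed to both.
 */
static void toy_traverse(const struct toy_object *objs, size_t nr,
			 toy_show_commit_fn show_commit,
			 toy_show_object_fn show_object, void *show_data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (objs[i].type == TOY_COMMIT)
			show_commit(&objs[i], show_data);
		else
			show_object(&objs[i], show_data);
	}
}

/* Commit callback: only ever sees commits, so it just counts them. */
static void count_commit(const struct toy_object *obj, void *data)
{
	struct toy_counts *counts = data;

	assert(obj->type == TOY_COMMIT);
	counts->commits++;
}

/* Non-commit callback: checks the kind and bumps the matching counter. */
static void count_object(const struct toy_object *obj, void *data)
{
	struct toy_counts *counts = data;

	switch (obj->type) {
	case TOY_TREE:
		counts->trees++;
		break;
	case TOY_BLOB:
		counts->blobs++;
		break;
	case TOY_TAG:
		counts->tags++;
		break;
	case TOY_COMMIT:
		/* A commit here would mean the dispatcher is broken. */
		assert(!"commit routed to the non-commit callback");
		break;
	}
}
```

Calling `toy_traverse()` with an array of objects, the two callbacks, and a zeroed `struct toy_counts` leaves the per-type totals in that struct, much as `walken_object_walk()` reports its counts after the traversal returns.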