diff mbox series

[v5,1/3] builtin/grep.c: add --sparse option

Message ID 20220908001854.206789-2-shaoxuan.yuan02@gmail.com (mailing list archive)
State Superseded
Headers show
Series grep: integrate with sparse index | expand

Commit Message

Shaoxuan Yuan Sept. 8, 2022, 12:18 a.m. UTC
Add a --sparse option to `git-grep`.

When the '--cached' option is used with the 'git grep' command, the
search is limited to the blobs found in the index, not in the worktree.
If the user has enabled sparse-checkout, this might present more results
than they would like, since the files outside of the sparse-checkout are
unlikely to be important to them.

Change the default behavior of 'git grep' to focus on the files within
the sparse-checkout definition. To enable the previous behavior, add a
'--sparse' option to 'git grep' that triggers the old behavior that
inspects paths outside of the sparse-checkout definition when paired
with the '--cached' option.

Suggested-by: Victoria Dye <vdye@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Victoria Dye <vdye@github.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
---
 Documentation/git-grep.txt      |  5 ++++-
 builtin/grep.c                  | 10 +++++++++-
 t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------
 3 files changed, 41 insertions(+), 8 deletions(-)

Comments

Victoria Dye Sept. 10, 2022, 1:07 a.m. UTC | #1
Shaoxuan Yuan wrote:
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index eb59564565..a9879cc980 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
>  	test_cmp expect actual
>  '
>  
> -test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
> +test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	EOF
> +	git grep --cached "text" >actual &&
> +	test_cmp expect actual &&
> +
>  	cat >expect <<-EOF &&
>  	a:text
>  	b:text
>  	dir/c:text
>  	EOF
> -	git grep --cached "text" >actual &&
> +	git grep --cached --sparse "text" >actual &&
>  	test_cmp expect actual
>  '

At first, seeing that all the test titles were changed from "grep --cached
<does something>" to "grep --cached and --sparse <does something>", I was
going to suggest that 'git grep --cached' (without '--sparse') should
receive some new tests in addition to updating existing ones (which now
require '--sparse' to work as before).

However, looking at the actual content of the tests like the one above, I
can see that you've added cases demonstrating the expected difference in
behavior between 'grep --cached' and 'grep --cached --sparse'. I can't think
of a clearer way to name the tests, though, so this looks okay to me.

The rest of the patch (namely, the implementation of '--sparse' and
corresponding documentation) looked good as well - I didn't have anything
specific to note on that.
Elijah Newren Sept. 14, 2022, 6:08 a.m. UTC | #2
Hi Shaoxuan,

Please note that it's customary to cc folks who have commented on
previous versions of your patch series when you re-roll.

On Wed, Sep 7, 2022 at 5:28 PM Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> wrote:
>
> Add a --sparse option to `git-grep`.

It's awesome you're working on this.  Adding more of "behavior A"
(restricting querying commands to the sparse cone) is something I've
wanted for a long time.

I think most of your code is beneficial, but I do have some issues
with high level direction you were implementing, which may require
some tweaks...

> When the '--cached' option is used with the 'git grep' command, the
> search is limited to the blobs found in the index, not in the worktree.
> If the user has enabled sparse-checkout, this might present more results
> than they would like, since the files outside of the sparse-checkout are
> unlikely to be important to them.

"files outside of the sparse-checkout are unlikely to be important to
[users]" is certainly an issue.  But it's *much* wider than this.
Beyond `grep --cached`, it also affects `grep REVISION`, `log`, `diff
[REVISION]`, and related things...perhaps even something like `blame`.
I think all those other commands probably deserve a mode where they
restrict output to the view associated with the user's cone.  I've
brought that up before[1].  I was skeptical of making it the default,
because it'd probably take a long time to implement it everywhere.
Slowly changing defaults of all commands over many git releases seems
like a poor strategy, but I'm afraid that's what it looks like we are
doing here.

I'm also worried that slowly changing the defaults without a
high-level plan will lead to users struggling to figure out what
flag(s) to pass.  Are we going to be stuck in a situation where users
have to remember that for a dense search, they use one flag for `grep
--cached`, a different one for  `grep [REVISION]`, no flag is needed
for `diff [REVISION]`, but yet a different flag is needed for `git
log`?

I'm also curious whether there shouldn't be a config option for
something like this, so folks don't have to specify it with every
invocation.  In particular, while I certainly have users that want to
just query git for information about the part of the history they are
interested in, there are other users who are fully aware they are
working in a bigger repository and want to search for additional
things to add to their sparse-checkout and predominantly use grep for
things like that.  They have even documented that `git grep --cached
<TERM>` can be used in sparse-checkouts for this purpose...and have
been using that for a few years.  (I did warn them at the time that
there was a risk they'd have to change their command, but it's still
going to be a behavioral change they might not expect.)  Further, when
I brought up changing the behavior of commands during sparse-checkouts
to limit to files matching the sparsity paths in that old thread at
[1], Stolee was a bit skeptical of making that the default.  That
suggests, at least, that two independent groups of users would want to
use the non-sparse searching frequently, and frequently enough that
they'd appreciate a config option.

I also brought up in that old thread that perhaps we want to avoid
adding a flag to every subcommand, and instead just having a
git-global flag for triggering this type of behavior.  (e.g. `git
--no-restrict grep --cached ...` or `git --dense grep --cached ...`).

[1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
and the responses to that email.

> Change the default behavior of 'git grep' to focus on the files within
> the sparse-checkout definition. To enable the previous behavior, add a
> '--sparse' option to 'git grep' that triggers the old behavior that
> inspects paths outside of the sparse-checkout definition when paired
> with the '--cached' option.

I still think the flag name of `--sparse` is totally backwards and
highly confusing for the described behavior.  I missed Stolee's email
at the time (wasn't cc'ed) where he brought up that "--sparse" had
already been added to "git-add" and "git-rm", but in those cases the
commands aren't querying and I just don't see how they lead to the
same level of user confusion.  This one seems glaringly wrong to me
and both Junio and I flagged it on v1 when we first saw it.  (Perhaps
it also helps that for the add/rm cases, that a user is often given an
error message with the suggested flag to use, which just doesn't make
sense here either.)  If there is concern that this flag should be the
same as add and rm, then I think we need to do the backward
compatibility dance and fix add and rm by adding an alias over there
so that grep's flag won't be so confusing.

I really don't want to have to deal with the backward compatibility
headache of "git grep --sparse" means do a non-sparse search for
backward compatibility reasons.  Here's the flag you should really
use..."

> Suggested-by: Victoria Dye <vdye@github.com>
> Helped-by: Derrick Stolee <derrickstolee@github.com>
> Helped-by: Victoria Dye <vdye@github.com>
> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
> ---
>  Documentation/git-grep.txt      |  5 ++++-
>  builtin/grep.c                  | 10 +++++++++-
>  t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------
>  3 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 58d944bd57..bdd3d5b8a6 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -28,7 +28,7 @@ SYNOPSIS
>            [-f <file>] [-e] <pattern>
>            [--and|--or|--not|(|)|-e <pattern>...]
>            [--recurse-submodules] [--parent-basename <basename>]
> -          [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
> +          [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
>            [--] [<pathspec>...]
>
>  DESCRIPTION
> @@ -45,6 +45,9 @@ OPTIONS
>         Instead of searching tracked files in the working tree, search
>         blobs registered in the index file.
>
> +--sparse::
> +       Use with --cached. Search outside of sparse-checkout definition.
> +
>  --no-index::
>         Search files in the current directory that is not managed by Git.
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index e6bcdf860c..12abd832fa 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
>
>  static int skip_first_line;
>
> +static int grep_sparse = 0;
> +
>  static void add_work(struct grep_opt *opt, struct grep_source *gs)
>  {
>         if (opt->binary != GREP_BINARY_TEXT)
> @@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (!cached && ce_skip_worktree(ce))
> +               /*
> +                * Skip entries with SKIP_WORKTREE unless both --sparse and
> +                * --cached are given.
> +                */
> +               if (!(grep_sparse && cached) && ce_skip_worktree(ce))
>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>                            PARSE_OPT_NOCOMPLETE),
>                 OPT_INTEGER('m', "max-count", &opt.max_count,
>                         N_("maximum number of results per file")),
> +               OPT_BOOL(0, "sparse", &grep_sparse,
> +                        N_("search the contents of files outside the sparse-checkout definition")),
>                 OPT_END()
>         };
>         grep_prefix = prefix;
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index eb59564565..a9879cc980 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
>         test_cmp expect actual
>  '
>
> -test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
> +test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual &&
> +
>         cat >expect <<-EOF &&
>         a:text
>         b:text
>         dir/c:text
>         EOF
> -       git grep --cached "text" >actual &&
> +       git grep --cached --sparse "text" >actual &&
>         test_cmp expect actual
>  '
>
> @@ -143,7 +149,15 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
>         test_cmp expect actual
>  '
>
> -test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
> +test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/B/b:text
> +       sub2/a:text
> +       EOF
> +       git grep --recurse-submodules --cached "text" >actual &&
> +       test_cmp expect actual &&
> +
>         cat >expect <<-EOF &&
>         a:text
>         b:text
> @@ -152,7 +166,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
>         sub/B/b:text
>         sub2/a:text
>         EOF
> -       git grep --recurse-submodules --cached "text" >actual &&
> +       git grep --recurse-submodules --cached --sparse "text" >actual &&
>         test_cmp expect actual
>  '
>
> @@ -166,7 +180,15 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
>         test_cmp expect actual
>  '
>
> -test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
> +test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       test_when_finished "git update-index --no-assume-unchanged b" &&
> +       git update-index --assume-unchanged b &&
> +       git grep --cached text >actual &&
> +       test_cmp expect actual &&
> +
>         cat >expect <<-EOF &&
>         a:text
>         b:text
> @@ -174,7 +196,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
>         EOF
>         test_when_finished "git update-index --no-assume-unchanged b" &&
>         git update-index --assume-unchanged b &&
> -       git grep --cached text >actual &&
> +       git grep --cached --sparse text >actual &&
>         test_cmp expect actual
>  '
>
> --
> 2.37.0

I read over this patch and the other two patches.  Other than things
like variable names propagating the sparse/dense confusion, and the
high level goals already discussed, I didn't spot any other issues.
Junio C Hamano Sept. 15, 2022, 2:57 a.m. UTC | #3
Elijah Newren <newren@gmail.com> writes:

> ... I think all those other commands probably deserve a mode where they
> restrict output to the view associated with the user's cone.  I've
> brought that up before[1].  I was skeptical of making it the default,
> because it'd probably take a long time to implement it everywhere.
> Slowly changing defaults of all commands over many git releases seems
> like a poor strategy, but I'm afraid that's what it looks like we are
> doing here.
>
> I'm also worried that slowly changing the defaults without a
> high-level plan will lead to users struggling to figure out what
> flag(s) to pass.  Are we going to be stuck in a situation where users
> have to remember that for a dense search, they use one flag for `grep
> --cached`, a different one for  `grep [REVISION]`, no flag is needed
> for `diff [REVISION]`, but yet a different flag is needed for `git
> log`?

In short, the default should be "everywhere in tree, regardless of
the current sparse-checkout settings", with commands opting into
implementing "limit only to sparse-checkout settings" as an option,
at least initially, with an eye to possibly flip the default later
when all commands support that position but not before?

I think that is a reasonable position to take.  I lean towards the
default of limiting the operations to inside sparse cone(s) for all
subcommands when all subcommands learn to be capable to do so, but I
also agree that using that default for only for subcommands that
have learned to do, which will happen over time, would be way too
confusing for our users.

By the way, I briefly wondered if "limit to sparse-checkout setting"
can be done by introducing a fake "attribute" and using the "attr"
pathspec magic, but it may probably be a bad match, and separate
option would be more appropriate.

>> Change the default behavior of 'git grep' to focus on the files within
>> the sparse-checkout definition. To enable the previous behavior, add a
>> '--sparse' option to 'git grep' that triggers the old behavior that
>> inspects paths outside of the sparse-checkout definition when paired
>> with the '--cached' option.
>
> I still think the flag name of `--sparse` is totally backwards and
> highly confusing for the described behavior.

Yeah, regardless of which between "--sparse" and "--no-sparse"
should be the default, I am in 100% agreement that "--sparse"
meaning "affect things both inside and outside the sparse cones" is
totally backwards.

How strongly ingrained is this UI mistake?  I have a feeling that
this may be something we still can undo and redo relatively easily,
i.e. "--sparse" may be that "limit to sparse-checkout setting"
option, not "--no-sparse".
Shaoxuan Yuan Sept. 17, 2022, 3:34 a.m. UTC | #4
On 9/13/2022 11:08 PM, Elijah Newren wrote:
> Hi Shaoxuan,
> 
> Please note that it's customary to cc folks who have commented on
> previous versions of your patch series when you re-roll.

Hi Elijah,

Sorry for the delay, I didn't have my computer with me during Merge 2022
and couldn't respond.

I'm sorry that I somehow lost you along the way :(

> On Wed, Sep 7, 2022 at 5:28 PM Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> wrote:
>>
>> Add a --sparse option to `git-grep`.
> 
> It's awesome you're working on this.  Adding more of "behavior A"
> (restricting querying commands to the sparse cone) is something I've
> wanted for a long time.

Thanks :)

> I think most of your code is beneficial, but I do have some issues
> with high level direction you were implementing, which may require
> some tweaks...

OK.

>> When the '--cached' option is used with the 'git grep' command, the
>> search is limited to the blobs found in the index, not in the worktree.
>> If the user has enabled sparse-checkout, this might present more results
>> than they would like, since the files outside of the sparse-checkout are
>> unlikely to be important to them.
> 
> "files outside of the sparse-checkout are unlikely to be important to
> [users]" is certainly an issue.  But it's *much* wider than this.
> Beyond `grep --cached`, it also affects `grep REVISION`, `log`, `diff
> [REVISION]`, and related things...perhaps even something like `blame`.

Agree. Keep reading...

> I think all those other commands probably deserve a mode where they
> restrict output to the view associated with the user's cone.  I've

Agree.

> brought that up before[1].  I was skeptical of making it the default,
> because it'd probably take a long time to implement it everywhere.
> Slowly changing defaults of all commands over many git releases seems
> like a poor strategy, but I'm afraid that's what it looks like we are
> doing here.

True.

> I'm also worried that slowly changing the defaults without a
> high-level plan will lead to users struggling to figure out what
> flag(s) to pass.  Are we going to be stuck in a situation where users
> have to remember that for a dense search, they use one flag for `grep
> --cached`, a different one for  `grep [REVISION]`, no flag is needed
> for `diff [REVISION]`, but yet a different flag is needed for `git
> log`?

I think the inconsistency is certainly unsettling.

> I'm also curious whether there shouldn't be a config option for
> something like this, so folks don't have to specify it with every
> invocation.  In particular, while I certainly have users that want to
> just query git for information about the part of the history they are
> interested in, there are other users who are fully aware they are
> working in a bigger repository and want to search for additional
> things to add to their sparse-checkout and predominantly use grep for
> things like that.  They have even documented that `git grep --cached
> <TERM>` can be used in sparse-checkouts for this purpose...and have
> been using that for a few years.  (I did warn them at the time that
> there was a risk they'd have to change their command, but it's still
> going to be a behavioral change they might not expect.)  Further, when
> I brought up changing the behavior of commands during sparse-checkouts
> to limit to files matching the sparsity paths in that old thread at
> [1], Stolee was a bit skeptical of making that the default.  That
> suggests, at least, that two independent groups of users would want to
> use the non-sparse searching frequently, and frequently enough that
> they'd appreciate a config option.

A config option sounds good. Though I think

1. If this option is for global behavior: users may better off turning
off sparse-checkout if they want a config to do things densely everywhere.

2. If this option is for a single subcommand (e.g. 'grep'): I don't have
much thoughts here. It certainly can be nice for users who need to do
non-sparse searching frequently. This design, if necessary, should
belong to a patch where this config is added for every single subcommand?

> I also brought up in that old thread that perhaps we want to avoid
> adding a flag to every subcommand, and instead just having a
> git-global flag for triggering this type of behavior.  (e.g. `git
> --no-restrict grep --cached ...` or `git --dense grep --cached ...`).

This looks more like the answer to me. It's a peace of mind for users if
they don't have to worry about whether a subcommand is sparse-aware, and
how may their behaviors differ. Though we still may need to update the
actual behavior in each subcommand over an extended period of time
(though may not be difficult?), which you mentioned above "seems like a
poor strategy".

> [1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
> and the responses to that email>
>> Change the default behavior of 'git grep' to focus on the files within
>> the sparse-checkout definition. To enable the previous behavior, add a
>> '--sparse' option to 'git grep' that triggers the old behavior that
>> inspects paths outside of the sparse-checkout definition when paired
>> with the '--cached' option.
> 
> I still think the flag name of `--sparse` is totally backwards and
> highly confusing for the described behavior.  I missed Stolee's email
> at the time (wasn't cc'ed) where he brought up that "--sparse" had
> already been added to "git-add" and "git-rm", but in those cases the
> commands aren't querying and I just don't see how they lead to the
> same level of user confusion.  This one seems glaringly wrong to me
> and both Junio and I flagged it on v1 when we first saw it.  (Perhaps
> it also helps that for the add/rm cases, that a user is often given an
> error message with the suggested flag to use, which just doesn't make
> sense here either.)  If there is concern that this flag should be the
> same as add and rm, then I think we need to do the backward
> compatibility dance and fix add and rm by adding an alias over there
> so that grep's flag won't be so confusing.

I guess I'm using "--sparse" here because "add", "rm" and "mv" all imply
that "when operating on a sparse path, ignores/warns unless '--sparse'
is used". I take it as an analogy so "when searching a sparse path,
ignores/warns unless '--sparse' is used". As the idea that "Git does
*not* care sparse contents unless '--[no-]sparse' is specified" is sort
of established through the implementations in "add", "rm", or "mv", I
don't see a big problem using "--sparse" here.

I *think*, as long as the users are informed that the default is to
ignore things outside of the sparse-checkout definition, and they have
to do something (using "--sparse" or a potential better name) to
override the default, we are safe to use a name that is famous (i.e.
"--sparse") even though its literal meaning is not perfectly descriptive.

One outlier I do find confusing though, is the "--sparse" option from
"git-ls-files". Without it, Git expands the index and show everything
outside of sparse-checkout definition, which seems a bit controversial.

...

> 
> I read over this patch and the other two patches.  Other than things
> like variable names propagating the sparse/dense confusion, and the
> high level goals already discussed, I didn't spot any other issues.

Thanks,
Shaoxuan
Shaoxuan Yuan Sept. 17, 2022, 3:45 a.m. UTC | #5
On 9/13/2022 11:08 PM, Elijah Newren wrote:

...

I think we are now at a point to make this UI decision, which may not be
easily (and should not be?) reverted once it's made in this patch.

So, is "--sparse" we want for "grep", even for "rm", "add", or "mv"?

Love to hear from other contributors :)

Thanks,
Shaoxuan
Elijah Newren Sept. 18, 2022, 2:14 a.m. UTC | #6
On Wed, Sep 14, 2022 at 7:57 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > ... I think all those other commands probably deserve a mode where they
> > restrict output to the view associated with the user's cone.  I've
> > brought that up before[1].  I was skeptical of making it the default,
> > because it'd probably take a long time to implement it everywhere.
> > Slowly changing defaults of all commands over many git releases seems
> > like a poor strategy, but I'm afraid that's what it looks like we are
> > doing here.
> >
> > I'm also worried that slowly changing the defaults without a
> > high-level plan will lead to users struggling to figure out what
> > flag(s) to pass.  Are we going to be stuck in a situation where users
> > have to remember that for a dense search, they use one flag for `grep
> > --cached`, a different one for  `grep [REVISION]`, no flag is needed
> > for `diff [REVISION]`, but yet a different flag is needed for `git
> > log`?
>
> In short, the default should be "everywhere in tree, regardless of
> the current sparse-checkout settings", with commands opting into
> implementing "limit only to sparse-checkout settings" as an option,
> at least initially, with an eye to possibly flip the default later
> when all commands support that position but not before?
>
> I think that is a reasonable position to take.  I lean towards the
> default of limiting the operations to inside sparse cone(s) for all
> subcommands when all subcommands learn to be capable to do so, but I
> also agree that using that default for only for subcommands that
> have learned to do, which will happen over time, would be way too
> confusing for our users.
>
> By the way, I briefly wondered if "limit to sparse-checkout setting"
> can be done by introducing a fake "attribute" and using the "attr"
> pathspec magic, but it may probably be a bad match, and separate
> option would be more appropriate.
>
> >> Change the default behavior of 'git grep' to focus on the files within
> >> the sparse-checkout definition. To enable the previous behavior, add a
> >> '--sparse' option to 'git grep' that triggers the old behavior that
> >> inspects paths outside of the sparse-checkout definition when paired
> >> with the '--cached' option.
> >
> > I still think the flag name of `--sparse` is totally backwards and
> > highly confusing for the described behavior.
>
> Yeah, regardless of which between "--sparse" and "--no-sparse"
> should be the default, I am in 100% agreement that "--sparse"
> meaning "affect things both inside and outside the sparse cones" is
> totally backwards.
>
> How strongly ingrained is this UI mistake?  I have a feeling that
> this may be something we still can undo and redo relatively easily,
> i.e. "--sparse" may be that "limit to sparse-checkout setting"
> option, not "--no-sparse".

It's gotten into a few commands, but I agree it seems like something
we can still undo.

In fact, not all uses of `--sparse` are backwards; two commands (clone
& ls-files) use `--sparse` to mean limit to sparsity specification.
There are three commands that use `--sparse` in a potentially
confusing or backwards way, though one is new to this cycle and isn't
even documented.  In more detail...

== clone --sparse ==

For clone, `--sparse` definitely means limit to the sparsity patterns.
That's the meaning we want.

== ls-files --sparse ==

For ls-files, the meaning of `--sparse` is "do not recurse into sparse
directory entries in order to print the traditional ls-files output,
just print the sparse directory entry".  So, I'd say that also has the
meaning we want; it's for restricting rather than expanding.

This one is also interesting in that it is the only command in the
list about querying for information rather than modifying the
worktree/index, and is thus the best precedent for grep.

If grep behaved similarly to ls-files, it would suggest that
Shaoxuan's series should default to searching the whole index (the
opposite of what his current series does) and that --sparse would be
used to restrict to the sparsity patterns (also the opposite of the
meaning for his flag).

== add --sparse ==

For add, `--sparse` affects the behavior of untracked files.  Its
usage allows untracked files to be added to the index despite the file
normally being outside the sparsity patterns.  There are two ways for
users to view this:
  * The file added is now tracked, and is present (or "checked-out").
Thus, the new file is part of the user's "sparse checkout" now.
Perhaps the flag makes sense viewed from this light?  (I had actually
looked at it this way previously).
  * We used the `--sparse` flag to allow git-add to operate on
something outside of the normal sparsity patterns.  The flag is
backwards.

It might be worth noting that the reason this flag was added was that
users are likely to be surprised later when some other command runs
and causes the file to vanish when they update the working tree to
match the sparsity patterns.

== rm --sparse ==

For rm, `--sparse` allows files to be removed from the index despite
normally being outside the sparsity patterns.  There's also a couple
ways to view this:
  * Any file being removed is not going to be part of the sparse
checkout anymore.  Thus there is no meaning to `--sparse`, but git-add
used it as a safety check to avoid surprises by operating outside the
normal patterns so perhaps we re-use that?
  * We used the `--sparse` flag to allow rm to operate on something
outside the normal sparsity patterns.  The flag is backwards.

Much like add, it might be worth noting that this flag was added for
cases like `git rm '*.jpg'` -- users probably only want such
expressions to operate on their sparse-checkout and they could be
negatively surprised by also removing stuff elsewhere.

== mv --sparse ==

For mv, `--sparse` feels like it's stretching the logic used for
`git-add` and isn't so clear that it could make sense anymore.  The
connection might be that when it moves files outside the sparsity
specification, it actually leaves them materialized, so in that sense
you could argue the files are still part of the sparse checkout, but
I'd say we're stretching that a bit.

However, the `mv` changes were made earlier this same cycle and aren't
part of a release yet.  It doesn't feel like this should be setting a
precedent for how grep should behave.  Especially since it's a
modification command, and grep is a querying command; ls-files seems
like a better precedent.

Also, the `--sparse` flag was not documented for mv for whatever reason.

== Overall ==

For existing querying commands (just ls-files), `--sparse` already
means restrict to the sparse cone.  If we keep using the existing flag
names, grep should follow suit.

For existing modification commands already released (add, rm), the
fact that the command is modifying actually gives a different way to
interpret things such that it's not clear `--sparse` was even a
problem.  However, perhaps the name of the flag is bad just because
there are multiple ways to view it and those who view it one way will
see it as counter-intuitive.

== Flag rename? ==

There's another reason to potentially rename the flag.  We already
have `--sparse` and `--dense` flags for rev-list and friends.  So,
when we want to enable those other commands to restrict to the
sparsity patterns, we probably need a different name.  So, perhaps, we
should rename our `--sparse/--dense` to `--restrict/--no-restrict`.
Such a rename would also likely clear up the ambiguity about which way
to interpret the command for the add & rm commands (though it'd pick
the second one and suggest we were using the wrong name after all).

(There are also two other commands that use `--sparse` -- pack-objects
and show-branch, though in a much different way and neither would ever
be affected by our new --sparse/--dense/--restrict/--no-restrict
flags.)

Other names are also possible.  Any suggestions?

== global flag vs subcommand flags ==

Do we want to make --[no-]restrict a flag for each subcommand, or just
make it a global git flag?  I kind of think it'd make sense to do the
latter

== Defaults ==

As discussed before, we probably want querying commands (ls-files,
grep, log, etc.) to default to --no-restrict for now, since we are
otherwise slowly changing the defaults.  We may want to swap that
default in the future.

However, for modification commands, I think we want the default to be
--restrict, regardless of the default for querying commands.  There
are some potentially very negative surprises for users if we don't,
and those surprises will be delayed rather than occur at the time the
user runs the command.  In fact, those negative surprises are likely
why those commands were the first to gain an option controlling
whether they operated on paths outside the sparsity specification.
(Also, the modification commands print a warning if they could have
affected other files but didn't due the the default of restricting, so
I think we have their default correct, even if the flag name is
suboptimal.)
Elijah Newren Sept. 18, 2022, 4:24 a.m. UTC | #7
On Fri, Sep 16, 2022 at 8:34 PM Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> wrote:
>
> > I'm also curious whether there shouldn't be a config option for
> > something like this, so folks don't have to specify it with every
> > invocation.  In particular, while I certainly have users that want to
> > just query git for information about the part of the history they are
> > interested in, there are other users who are fully aware they are
> > working in a bigger repository and want to search for additional
> > things to add to their sparse-checkout and predominantly use grep for
> > things like that.  They have even documented that `git grep --cached
> > <TERM>` can be used in sparse-checkouts for this purpose...and have
> > been using that for a few years.  (I did warn them at the time that
> > there was a risk they'd have to change their command, but it's still
> > going to be a behavioral change they might not expect.)  Further, when
> > I brought up changing the behavior of commands during sparse-checkouts
> > to limit to files matching the sparsity paths in that old thread at
> > [1], Stolee was a bit skeptical of making that the default.  That
> > suggests, at least, that two independent groups of users would want to
> > use the non-sparse searching frequently, and frequently enough that
> > they'd appreciate a config option.
>
> A config option sounds good. Though I think
>
> 1. If this option is for global behavior: users may better off turning
> off sparse-checkout if they want a config to do things densely everywhere.

Sorry, it sounds like I haven't explained the usecases to you very
well.  Let me try again.

There are people who want to do everything densely, as you say, and
those folks can just turn off sparse-checkout or not use it in the
first place.  Git has traditionally catered to these folks just fine.
However, it's not a subset of interest for this discussion and wasn't
what I was talking about.

There are (at least) two different usecases for people wanting to use
sparse-checkouts; I have users that fall under each category:


1) Working on a repository subset; users are _only_ interested in that subset.

This usecase is very poorly supported in Git right now, but I think
you understand it so I'll only briefly describe it.

These folks might know there are other things in the repository, but
don't care.  Not only should the working tree be sparse, but grep,
log, diff, etc. should be restricted to the subset of the tree they
are interested in.

Restricting operations to the sparsity specification is also important
for marrying partial clones with sparse checkouts while allowing
disconnected development.  Without such a restrict-to-sparsity-paths
feature, the partial clones will attempt to download objects the first
time they try to grep an old revision, or do log with a glob path.
The download will fail, causing the operation to fail, and break the
ability of the user to work in a disconnected manner.


2) The working directory is sparse, but users are working in a larger whole.

Stolee described this usecase this way[2]:

"I'm also focused on users that know that they are a part of a larger
whole. They know they are operating on a large repository but focus on
what they need to contribute their part. I expect multiple "roles" to
use very different, almost disjoint parts of the codebase. Some other
"architect" users operate across the entire tree or hop between different
sections of the codebase as necessary. In this situation, I'm wary of
scoping too many features to the sparse-checkout definition, especially
"git log," as it can be too confusing to have their view of the codebase
depend on your "point of view."

[2] https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/

I describe it very similarly, but I'd like to point out something
additional around this usecase and how it can be influenced by
dependencies.  The first cut for sparse-checkouts is usually the
directories you are interested in plus what those directories depend
upon within your repository.  But there's a monkey wrench here: if you
have integration tests, they invert the hierarchy: to run integration
tests, you need not only what you are interested in and its
dependencies, you also need everything that depends upon what you are
interested in or that depends upon one of your dependencies...AND you
need all the dependencies of that expanded group.  That can easily
change your sparse-checkout into a nearly dense one.  Naturally, that
tends to kill the benefits of sparse-checkouts.  There are a couple
solutions to this conundrum: either avoid grabbing dependencies (maybe
have built versions of your dependencies pulled from a CI cache
somewhere), or say that users shouldn't run integration tests directly
and instead do it on the CI server when they submit a code review.  Or
do both.  Regardless of whether you stub out your dependencies or stub
out the things that depend upon you, there is certainly a reason to
want to query and be aware of those other parts of the repository.
Thus, sparse-checkouts can be used to limit what you directly build
and modify, but these users do not want it to limit their queries of
history.


Once users pick either the first or the second usecase, they often
stick within it.  For either group, regardless of what Git's default
is, needing to specify an additional flag for *every*
grep/log/diff/etc. they run would just be a total annoyance.  Neither
wants a dense worktree, but one side wants a dense history query while
the other wants a sparse one.  Different groups should be able to
configure the default that works well for them, much like we allow
users to configure whether they want "git pull" to rebase or merge.

> 2. If this option is for a single subcommand (e.g. 'grep'): I don't have
> much thoughts here. It certainly can be nice for users who need to do
> non-sparse searching frequently. This design, if necessary, should
> belong to a patch where this config is added for every single subcommand?
>
> > I also brought up in that old thread that perhaps we want to avoid
> > adding a flag to every subcommand, and instead just having a
> > git-global flag for triggering this type of behavior.  (e.g. `git
> > --no-restrict grep --cached ...` or `git --dense grep --cached ...`).
>
> This looks more like the answer to me. It's a peace of mind for users if
> they don't have to worry about whether a subcommand is sparse-aware, and
> how may their behaviors differ. Though we still may need to update the
> actual behavior in each subcommand over an extended period of time
> (though may not be difficult?), which you mentioned above "seems like a
> poor strategy".
>
> > [1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
> > and the responses to that email>
> >> Change the default behavior of 'git grep' to focus on the files within
> >> the sparse-checkout definition. To enable the previous behavior, add a
> >> '--sparse' option to 'git grep' that triggers the old behavior that
> >> inspects paths outside of the sparse-checkout definition when paired
> >> with the '--cached' option.
> >
> > I still think the flag name of `--sparse` is totally backwards and
> > highly confusing for the described behavior.  I missed Stolee's email
> > at the time (wasn't cc'ed) where he brought up that "--sparse" had
> > already been added to "git-add" and "git-rm", but in those cases the
> > commands aren't querying and I just don't see how they lead to the
> > same level of user confusion.  This one seems glaringly wrong to me
> > and both Junio and I flagged it on v1 when we first saw it.  (Perhaps
> > it also helps that for the add/rm cases, that a user is often given an
> > error message with the suggested flag to use, which just doesn't make
> > sense here either.)  If there is concern that this flag should be the
> > same as add and rm, then I think we need to do the backward
> > compatibility dance and fix add and rm by adding an alias over there
> > so that grep's flag won't be so confusing.
>
> I guess I'm using "--sparse" here because "add", "rm" and "mv" all imply
> that "when operating on a sparse path, ignores/warns unless '--sparse'
> is used". I take it as an analogy so "when searching a sparse path,
> ignores/warns unless '--sparse' is used". As the idea that "Git does
> *not* care sparse contents unless '--[no-]sparse' is specified" is sort
> of established through the implementations in "add", "rm", or "mv", I
> don't see a big problem using "--sparse" here.

Well, I do.

In addition to just being utterly backwards and confusing in the
context of grep:
  * Both `clone` and `ls-files` use `--sparse` to mean to limit things
to the sparsity cone, so we're already kinda split-brained.
  * grep is more like ls-files (both being querying functions) than
add/rm/mv, so should really follow its lead instead of the one from
add/rm/mv.
  * There's another way to interpret `--sparse` for `add` and `rm`
such that it makes sense (at least to me); see my other email to Junio
in this thread.
  * `mv` is indeed using it backward, but the `mv` change is new to
this cycle (and undocumented) so I'm not sure it counts as much of a
precedent yet.

> I *think*, as long as the users are informed that the default is to
> ignore things outside of the sparse-checkout definition, and they have
> to do something (using "--sparse" or a potential better name) to
> override the default, we are safe to use a name that is famous (i.e.
> "--sparse") even though its literal meaning is not perfectly descriptive.
>
> One outlier I do find confusing though, is the "--sparse" option from
> "git-ls-files". Without it, Git expands the index and show everything
> outside of sparse-checkout definition, which seems a bit controversial.

Nah, that perfectly matches the expectation of users in the second
usecase above -- querying (ls-files/grep/log/diff) defaults to
non-restricted history, modifying (add/rm/mv) defaults to restricted
paths but warns if the arguments could have matched something else,
and the working tree is restricted to sparse paths.  It doesn't seem
too controversial to me, even if it's not what we want for the
long-term default.

The defaults for the first usecase would be defaulting to restricted
paths for everything, and perhaps not warn if arguments to a modifying
command could have matched something else.


Anyway, hope that helps you understand my perspective and framing.
Victoria Dye Sept. 18, 2022, 7:52 p.m. UTC | #8
Elijah Newren wrote:
> == Overall ==
> 
> For existing querying commands (just ls-files), `--sparse` already
> means restrict to the sparse cone.  If we keep using the existing flag
> names, grep should follow suit.
> 
> For existing modification commands already released (add, rm), the
> fact that the command is modifying actually gives a different way to
> interpret things such that it's not clear `--sparse` was even a
> problem.  However, perhaps the name of the flag is bad just because
> there are multiple ways to view it and those who view it one way will
> see it as counter-intuitive.
> 
> == Flag rename? ==
> 
> There's another reason to potentially rename the flag.  We already
> have `--sparse` and `--dense` flags for rev-list and friends.  So,
> when we want to enable those other commands to restrict to the
> sparsity patterns, we probably need a different name.  So, perhaps, we
> should rename our `--sparse/--dense` to `--restrict/--no-restrict`.
> Such a rename would also likely clear up the ambiguity about which way
> to interpret the command for the add & rm commands (though it'd pick
> the second one and suggest we were using the wrong name after all).
> 
> (There are also two other commands that use `--sparse` -- pack-objects
> and show-branch, though in a much different way and neither would ever
> be affected by our new --sparse/--dense/--restrict/--no-restrict
> flags.)
> 
> Other names are also possible.  Any suggestions?
> 
> == global flag vs subcommand flags ==
> 
> Do we want to make --[no-]restrict a flag for each subcommand, or just
> make it a global git flag?  I kind of think it'd make sense to do the
> latter
> 
> == Defaults ==
> 
> As discussed before, we probably want querying commands (ls-files,
> grep, log, etc.) to default to --no-restrict for now, since we are
> otherwise slowly changing the defaults.  We may want to swap that
> default in the future.
> 
> However, for modification commands, I think we want the default to be
> --restrict, regardless of the default for querying commands.  There
> are some potentially very negative surprises for users if we don't,
> and those surprises will be delayed rather than occur at the time the
> user runs the command.  In fact, those negative surprises are likely
> why those commands were the first to gain an option controlling
> whether they operated on paths outside the sparsity specification.
> (Also, the modification commands print a warning if they could have
> affected other files but didn't due the the default of restricting, so
> I think we have their default correct, even if the flag name is
> suboptimal.)

One of the things I've found myself a bit frustrated with while working on
these sparse index integrations is that we haven't had a clear set of
guidelines for times when we need to make UI/UX changes relating to
'sparse-checkout' compatibility. I think what you've outlined here is a good
start to a larger discussion on the topic, but in the middle of this series
might not be the best place for that discussion (at least in terms of
preserving for later reference). 

Elijah, would you be interested in compiling your thoughts into a document
in 'Documentation/technical'? If not, Stolee or I could do it. If we could
settle on some guidelines (option names, behavior, etc.) for better
incorporating 'sparse-checkout' support into existing commands, it'd make
future sparse index work substantially easier for everyone involved.

As for this series, I think the best way to move the sparse index work along
is to drop this patch ("builtin/grep.c: add --sparse option") altogether.
Shaoxuan's updates in patch 3 [1] make 'git grep' sparse index-compatible
for *all* invocations (not just those without '--sparse'), so we don't need
the new option for sparse index compatibility. It can then be re-introduced
later (possibly modified) in a series dedicated to unifying the
sparse-checkout UX.

[1] https://lore.kernel.org/git/20220908001854.206789-4-shaoxuan.yuan02@gmail.com/
Junio C Hamano Sept. 19, 2022, 1:23 a.m. UTC | #9
Victoria Dye <vdye@github.com> writes:

> Elijah Newren wrote:
> ...
>> However, for modification commands, I think we want the default to be
>> --restrict, regardless of the default for querying commands.  There
>> are some potentially very negative surprises for users if we don't,
>> and those surprises will be delayed rather than occur at the time the
>> user runs the command.  In fact, those negative surprises are likely
>> why those commands were the first to gain an option controlling
>> whether they operated on paths outside the sparsity specification.
>> (Also, the modification commands print a warning if they could have
>> affected other files but didn't due the the default of restricting, so
>> I think we have their default correct, even if the flag name is
>> suboptimal.)
>
> One of the things I've found myself a bit frustrated with while working on
> these sparse index integrations is that we haven't had a clear set of
> guidelines for times when we need to make UI/UX changes relating to
> 'sparse-checkout' compatibility. I think what you've outlined here is a good
> start to a larger discussion on the topic, but in the middle of this series
> might not be the best place for that discussion (at least in terms of
> preserving for later reference). 

Yup, I think we were a bit too quick to add the "hide outside sparse
cones" feature without first coming up with a reasonable guideline
that is designed to keep things consistent.

It might have been nice if we did this "make X sparse checkout
aware" effort in two separate steps.  The first step will not change
any behaviour, i.e. no optional or default "hide outside sparse
cones" at all, just "we do not upfront expand the index fully;
instead as we discover we need to inspect the contents in a
subdirectory that is compacted to a tree in the index, we lazily
expand it" as performance optimization.  And once we made sure we
taught all commands that used to expand the index fully upfront not
to do so, we do the "guideline" design for UI to "hide outside
sparse cones", and add that feature to the commands in the second
step.

Unfortunately we all get excited too much when we find a new shiny
toy, and we ended up getting ahead of ourselves before designing a
consistent end user experience.  But better late than never ;-)

> As for this series, I think the best way to move the sparse index work along
> is to drop this patch ("builtin/grep.c: add --sparse option") altogether.

Does that roughly correspond to the first step in my "It would have
been nice if we did these in two steps" above?  That would be a
sensible thing to do, as it would be less surprises to the users, I
hope.

Thanks.
Shaoxuan Yuan Sept. 19, 2022, 4:13 a.m. UTC | #10
On 9/17/2022 9:24 PM, Elijah Newren wrote:
> On Fri, Sep 16, 2022 at 8:34 PM Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> wrote:
>>
>>> I'm also curious whether there shouldn't be a config option for
>>> something like this, so folks don't have to specify it with every
>>> invocation.  In particular, while I certainly have users that want to
>>> just query git for information about the part of the history they are
>>> interested in, there are other users who are fully aware they are
>>> working in a bigger repository and want to search for additional
>>> things to add to their sparse-checkout and predominantly use grep for
>>> things like that.  They have even documented that `git grep --cached
>>> <TERM>` can be used in sparse-checkouts for this purpose...and have
>>> been using that for a few years.  (I did warn them at the time that
>>> there was a risk they'd have to change their command, but it's still
>>> going to be a behavioral change they might not expect.)  Further, when
>>> I brought up changing the behavior of commands during sparse-checkouts
>>> to limit to files matching the sparsity paths in that old thread at
>>> [1], Stolee was a bit skeptical of making that the default.  That
>>> suggests, at least, that two independent groups of users would want to
>>> use the non-sparse searching frequently, and frequently enough that
>>> they'd appreciate a config option.
>>
>> A config option sounds good. Though I think
>>
>> 1. If this option is for global behavior: users may better off turning
>> off sparse-checkout if they want a config to do things densely everywhere.
> 
> Sorry, it sounds like I haven't explained the usecases to you very
> well.  Let me try again.
> 
> There are people who want to do everything densely, as you say, and
> those folks can just turn off sparse-checkout or not use it in the
> first place.  Git has traditionally catered to these folks just fine.
> However, it's not a subset of interest for this discussion and wasn't
> what I was talking about.

OK, reading...

> There are (at least) two different usecases for people wanting to use
> sparse-checkouts; I have users that fall under each category:
> 
> 
> 1) Working on a repository subset; users are _only_ interested in that subset.
> 
> This usecase is very poorly supported in Git right now, but I think
> you understand it so I'll only briefly describe it.
> 
> These folks might know there are other things in the repository, but
> don't care.  Not only should the working tree be sparse, but grep,
> log, diff, etc. should be restricted to the subset of the tree they
> are interested in.

Right, this is the usecase I am familiar with.

> Restricting operations to the sparsity specification is also important
> for marrying partial clones with sparse checkouts while allowing
> disconnected development.  Without such a restrict-to-sparsity-paths
> feature, the partial clones will attempt to download objects the first
> time they try to grep an old revision, or do log with a glob path.
> The download will fail, causing the operation to fail, and break the
> ability of the user to work in a disconnected manner.

OK, I'm still learning about partial clone, didn't get a chance to look
at it. Will try to figure out what this means :)

> 2) The working directory is sparse, but users are working in a larger whole.
> 
> Stolee described this usecase this way[2]:
> 
> "I'm also focused on users that know that they are a part of a larger
> whole. They know they are operating on a large repository but focus on
> what they need to contribute their part. I expect multiple "roles" to
> use very different, almost disjoint parts of the codebase. Some other
> "architect" users operate across the entire tree or hop between different
> sections of the codebase as necessary. In this situation, I'm wary of
> scoping too many features to the sparse-checkout definition, especially
> "git log," as it can be too confusing to have their view of the codebase
> depend on your "point of view."
> 
> [2] https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
> 
> I describe it very similarly, but I'd like to point out something
> additional around this usecase and how it can be influenced by
> dependencies.  The first cut for sparse-checkouts is usually the
> directories you are interested in plus what those directories depend
> upon within your repository.  But there's a monkey wrench here: if you
> have integration tests, they invert the hierarchy: to run integration
> tests, you need not only what you are interested in and its
> dependencies, you also need everything that depends upon what you are
> interested in or that depends upon one of your dependencies...AND you
> need all the dependencies of that expanded group.  That can easily
> change your sparse-checkout into a nearly dense one.  Naturally, that
> tends to kill the benefits of sparse-checkouts.  There are a couple
> solutions to this conundrum: either avoid grabbing dependencies (maybe
> have built versions of your dependencies pulled from a CI cache
> somewhere), or say that users shouldn't run integration tests directly
> and instead do it on the CI server when they submit a code review.  Or
> do both.  Regardless of whether you stub out your dependencies or stub
> out the things that depend upon you, there is certainly a reason to
> want to query and be aware of those other parts of the repository.
> Thus, sparse-checkouts can be used to limit what you directly build
> and modify, but these users do not want it to limit their queries of
> history.
> 
> 
> Once users pick either the first or the second usecase, they often
> stick within it.  For either group, regardless of what Git's default
> is, needing to specify an additional flag for *every*
> grep/log/diff/etc. they run would just be a total annoyance.  Neither
> wants a dense worktree, but one side wants a dense history query while
> the other wants a sparse one.  Different groups should be able to
> configure the default that works well for them, much like we allow
> users to configure whether they want "git pull" to rebase or merge.

OK, now I get it:

Case A: users only interested in a subset, so they need only sparse
history and a sparse worktree.

v.s.

Case B: users works within a subset but needs a larger context, so they
need a dense history/query (that's why we should let grep default to
--no-restrict, as you suggested?), though still a sparse worktree.

> 
>> 2. If this option is for a single subcommand (e.g. 'grep'): I don't have
>> much thoughts here. It certainly can be nice for users who need to do
>> non-sparse searching frequently. This design, if necessary, should
>> belong to a patch where this config is added for every single subcommand?
>>
>>> I also brought up in that old thread that perhaps we want to avoid
>>> adding a flag to every subcommand, and instead just having a
>>> git-global flag for triggering this type of behavior.  (e.g. `git
>>> --no-restrict grep --cached ...` or `git --dense grep --cached ...`).
>>
>> This looks more like the answer to me. It's a peace of mind for users if
>> they don't have to worry about whether a subcommand is sparse-aware, and
>> how may their behaviors differ. Though we still may need to update the
>> actual behavior in each subcommand over an extended period of time
>> (though may not be difficult?), which you mentioned above "seems like a
>> poor strategy".
>>
>>> [1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
>>> and the responses to that email>
>>>> Change the default behavior of 'git grep' to focus on the files within
>>>> the sparse-checkout definition. To enable the previous behavior, add a
>>>> '--sparse' option to 'git grep' that triggers the old behavior that
>>>> inspects paths outside of the sparse-checkout definition when paired
>>>> with the '--cached' option.
>>>
>>> I still think the flag name of `--sparse` is totally backwards and
>>> highly confusing for the described behavior.  I missed Stolee's email
>>> at the time (wasn't cc'ed) where he brought up that "--sparse" had
>>> already been added to "git-add" and "git-rm", but in those cases the
>>> commands aren't querying and I just don't see how they lead to the
>>> same level of user confusion.  This one seems glaringly wrong to me
>>> and both Junio and I flagged it on v1 when we first saw it.  (Perhaps
>>> it also helps that for the add/rm cases, that a user is often given an
>>> error message with the suggested flag to use, which just doesn't make
>>> sense here either.)  If there is concern that this flag should be the
>>> same as add and rm, then I think we need to do the backward
>>> compatibility dance and fix add and rm by adding an alias over there
>>> so that grep's flag won't be so confusing.
>>
>> I guess I'm using "--sparse" here because "add", "rm" and "mv" all imply
>> that "when operating on a sparse path, ignores/warns unless '--sparse'
>> is used". I take it as an analogy so "when searching a sparse path,
>> ignores/warns unless '--sparse' is used". As the idea that "Git does
>> *not* care sparse contents unless '--[no-]sparse' is specified" is sort
>> of established through the implementations in "add", "rm", or "mv", I
>> don't see a big problem using "--sparse" here.
> 
> Well, I do.
> 
> In addition to just being utterly backwards and confusing in the
> context of grep:
>   * Both `clone` and `ls-files` use `--sparse` to mean to limit things
> to the sparsity cone, so we're already kinda split-brained.

Agree.

>   * grep is more like ls-files (both being querying functions) than
> add/rm/mv, so should really follow its lead instead of the one from
> add/rm/mv.

Agree.

>   * There's another way to interpret `--sparse` for `add` and `rm`
> such that it makes sense (at least to me); see my other email to Junio
> in this thread.

According to the spirit of your points, I think they should be
defaulting to --restrict (a rename perhaps) right now.

>   * `mv` is indeed using it backward, but the `mv` change is new to
> this cycle (and undocumented) so I'm not sure it counts as much of a
> precedent yet.

Oops, I was making the modifications to `mv` and forgot to add
documentation to it. Though the --sparse of `mv` was not documented
before I touching it. Perhaps it can be added later if we are going to
rename --sparse/--dense to --restrict/--no-restrict.

>> I *think*, as long as the users are informed that the default is to
>> ignore things outside of the sparse-checkout definition, and they have
>> to do something (using "--sparse" or a potential better name) to
>> override the default, we are safe to use a name that is famous (i.e.
>> "--sparse") even though its literal meaning is not perfectly descriptive.
>>
>> One outlier I do find confusing though, is the "--sparse" option from
>> "git-ls-files". Without it, Git expands the index and show everything
>> outside of sparse-checkout definition, which seems a bit controversial.
> 
> Nah, that perfectly matches the expectation of users in the second
> usecase above -- querying (ls-files/grep/log/diff) defaults to
> non-restricted history, modifying (add/rm/mv) defaults to restricted
> paths but warns if the arguments could have matched something else,
> and the working tree is restricted to sparse paths.  It doesn't seem
> too controversial to me, even if it's not what we want for the
> long-term default.

OK. After the reasoning you gave above, now the --sparse of ls-files
looks good.

> 
> The defaults for the first usecase would be defaulting to restricted
> paths for everything, and perhaps not warn if arguments to a modifying
> command could have matched something else.
> 
> 
> Anyway, hope that helps you understand my perspective and framing.

Thanks for the explanations, now I get it and agree with your points :)

Thanks,
Shaoxuan
Shaoxuan Yuan Sept. 19, 2022, 4:27 a.m. UTC | #11
Hi Victoria, :-)

On 9/18/2022 12:52 PM, Victoria Dye wrote:
> Elijah Newren wrote:
>> == Overall ==
>>
>> For existing querying commands (just ls-files), `--sparse` already
>> means restrict to the sparse cone.  If we keep using the existing flag
>> names, grep should follow suit.
>>
>> For existing modification commands already released (add, rm), the
>> fact that the command is modifying actually gives a different way to
>> interpret things such that it's not clear `--sparse` was even a
>> problem.  However, perhaps the name of the flag is bad just because
>> there are multiple ways to view it and those who view it one way will
>> see it as counter-intuitive.
>>
>> == Flag rename? ==
>>
>> There's another reason to potentially rename the flag.  We already
>> have `--sparse` and `--dense` flags for rev-list and friends.  So,
>> when we want to enable those other commands to restrict to the
>> sparsity patterns, we probably need a different name.  So, perhaps, we
>> should rename our `--sparse/--dense` to `--restrict/--no-restrict`.
>> Such a rename would also likely clear up the ambiguity about which way
>> to interpret the command for the add & rm commands (though it'd pick
>> the second one and suggest we were using the wrong name after all).
>>
>> (There are also two other commands that use `--sparse` -- pack-objects
>> and show-branch, though in a much different way and neither would ever
>> be affected by our new --sparse/--dense/--restrict/--no-restrict
>> flags.)
>>
>> Other names are also possible.  Any suggestions?
>>
>> == global flag vs subcommand flags ==
>>
>> Do we want to make --[no-]restrict a flag for each subcommand, or just
>> make it a global git flag?  I kind of think it'd make sense to do the
>> latter
>>
>> == Defaults ==
>>
>> As discussed before, we probably want querying commands (ls-files,
>> grep, log, etc.) to default to --no-restrict for now, since we are
>> otherwise slowly changing the defaults.  We may want to swap that
>> default in the future.
>>
>> However, for modification commands, I think we want the default to be
>> --restrict, regardless of the default for querying commands.  There
>> are some potentially very negative surprises for users if we don't,
>> and those surprises will be delayed rather than occur at the time the
>> user runs the command.  In fact, those negative surprises are likely
>> why those commands were the first to gain an option controlling
>> whether they operated on paths outside the sparsity specification.
>> (Also, the modification commands print a warning if they could have
>> affected other files but didn't due the the default of restricting, so
>> I think we have their default correct, even if the flag name is
>> suboptimal.)
> 
> One of the things I've found myself a bit frustrated with while working on
> these sparse index integrations is that we haven't had a clear set of
> guidelines for times when we need to make UI/UX changes relating to
> 'sparse-checkout' compatibility. I think what you've outlined here is a good
> start to a larger discussion on the topic, but in the middle of this series
> might not be the best place for that discussion (at least in terms of
> preserving for later reference). 
> 
> Elijah, would you be interested in compiling your thoughts into a document
> in 'Documentation/technical'? If not, Stolee or I could do it. If we could
> settle on some guidelines (option names, behavior, etc.) for better
> incorporating 'sparse-checkout' support into existing commands, it'd make
> future sparse index work substantially easier for everyone involved.

This sounds good! I am always confused about the inconsistency of the
meaning of "--sparse" across a variety of commands. A guideline
definitely corrects prior integrations and helps future ones.

> As for this series, I think the best way to move the sparse index work along
> is to drop this patch ("builtin/grep.c: add --sparse option") altogether.
> Shaoxuan's updates in patch 3 [1] make 'git grep' sparse index-compatible
> for *all* invocations (not just those without '--sparse'), so we don't need
> the new option for sparse index compatibility. It can then be re-introduced
> later (possibly modified) in a series dedicated to unifying the
> sparse-checkout UX.

Are you suggesting that we should still follow the original "use --cache
to search within the index and show SKIP_WORKTREE entries found"? I'm
asking because the tests in the second patch [2] are still using the
lately-introduced "--sparse". If yes, then I think it sounds good to
re-introduce the (potentially) modified UI in the future :-).

[2]
https://lore.kernel.org/git/20220908001854.206789-3-shaoxuan.yuan02@gmail.com/

> 
> [1] https://lore.kernel.org/git/20220908001854.206789-4-shaoxuan.yuan02@gmail.com/

Thanks,
Shaoxuan
Ævar Arnfjörð Bjarmason Sept. 19, 2022, 11:03 a.m. UTC | #12
On Sun, Sep 18 2022, Victoria Dye wrote:

> Elijah, would you be interested in compiling your thoughts into a document
> in 'Documentation/technical'? If not, Stolee or I could do it. If we could
> settle on some guidelines (option names, behavior, etc.) for better
> incorporating 'sparse-checkout' support into existing commands, it'd make
> future sparse index work substantially easier for everyone involved.

This sounds good. I'd just like to suggest that incorporating a table
similar to the one I made for checkout/switch in would be useful for
such documentation:

	https://lore.kernel.org/git/211021.86wnm6l1ip.gmgdl@evledraar.gmail.com/

We ended up dropping the ball on that topic, but for cross-command UX I
think it's a very useful way to present how a "meta option", or an
option shared across many commands is expected to behave.
Elijah Newren Sept. 20, 2022, 7:13 a.m. UTC | #13
On Sun, Sep 18, 2022 at 12:52 PM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren wrote:
> > == Overall ==
> >
> > For existing querying commands (just ls-files), `--sparse` already
> > means restrict to the sparse cone.  If we keep using the existing flag
> > names, grep should follow suit.
> >
> > For existing modification commands already released (add, rm), the
> > fact that the command is modifying actually gives a different way to
> > interpret things such that it's not clear `--sparse` was even a
> > problem.  However, perhaps the name of the flag is bad just because
> > there are multiple ways to view it and those who view it one way will
> > see it as counter-intuitive.
> >
> > == Flag rename? ==
> >
> > There's another reason to potentially rename the flag.  We already
> > have `--sparse` and `--dense` flags for rev-list and friends.  So,
> > when we want to enable those other commands to restrict to the
> > sparsity patterns, we probably need a different name.  So, perhaps, we
> > should rename our `--sparse/--dense` to `--restrict/--no-restrict`.
> > Such a rename would also likely clear up the ambiguity about which way
> > to interpret the command for the add & rm commands (though it'd pick
> > the second one and suggest we were using the wrong name after all).
> >
> > (There are also two other commands that use `--sparse` -- pack-objects
> > and show-branch, though in a much different way and neither would ever
> > be affected by our new --sparse/--dense/--restrict/--no-restrict
> > flags.)
> >
> > Other names are also possible.  Any suggestions?
> >
> > == global flag vs subcommand flags ==
> >
> > Do we want to make --[no-]restrict a flag for each subcommand, or just
> > make it a global git flag?  I kind of think it'd make sense to do the
> > latter
> >
> > == Defaults ==
> >
> > As discussed before, we probably want querying commands (ls-files,
> > grep, log, etc.) to default to --no-restrict for now, since we are
> > otherwise slowly changing the defaults.  We may want to swap that
> > default in the future.
> >
> > However, for modification commands, I think we want the default to be
> > --restrict, regardless of the default for querying commands.  There
> > are some potentially very negative surprises for users if we don't,
> > and those surprises will be delayed rather than occur at the time the
> > user runs the command.  In fact, those negative surprises are likely
> > why those commands were the first to gain an option controlling
> > whether they operated on paths outside the sparsity specification.
> > (Also, the modification commands print a warning if they could have
> > affected other files but didn't due the the default of restricting, so
> > I think we have their default correct, even if the flag name is
> > suboptimal.)
>
> One of the things I've found myself a bit frustrated with while working on
> these sparse index integrations is that we haven't had a clear set of
> guidelines for times when we need to make UI/UX changes relating to
> 'sparse-checkout' compatibility. I think what you've outlined here is a good
> start to a larger discussion on the topic, but in the middle of this series
> might not be the best place for that discussion (at least in terms of
> preserving for later reference).

Yeah, that's fair, and I apologize for the problems.  I should have
pushed for a resolution and/or documentation of these issues at some
point; particularly since I was the one to bring it up in the first
place.  Between Stolee asking us to defer for a year-ish on UI/UX
changes in sparse-checkout while he got sparse-index into place, and
various other things coming up in the meantime, I just didn't get back
to it.  I probably should have, especially since we also had other
similar discussions going back to when git-sparse-checkout was first
introduced, but we've often focused on just solving the next subset of
usecases that were within reach rather than getting a bigger design
document.  Knowing that these kinds of issues were lurking was part of
the reason I insisted on having the big scary warning in the docs:

"""
THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
THE FUTURE.
"""

I'm glad I at least had the foresight to insist on that small measure...  :-)

> Elijah, would you be interested in compiling your thoughts into a document
> in 'Documentation/technical'? If not, Stolee or I could do it. If we could
> settle on some guidelines (option names, behavior, etc.) for better
> incorporating 'sparse-checkout' support into existing commands, it'd make
> future sparse index work substantially easier for everyone involved.

Sure, I'll take a stab at it this week.

> As for this series, I think the best way to move the sparse index work along
> is to drop this patch ("builtin/grep.c: add --sparse option") altogether.
> Shaoxuan's updates in patch 3 [1] make 'git grep' sparse index-compatible
> for *all* invocations (not just those without '--sparse'), so we don't need
> the new option for sparse index compatibility. It can then be re-introduced
> later (possibly modified) in a series dedicated to unifying the
> sparse-checkout UX.

Seems reasonable.

> [1] https://lore.kernel.org/git/20220908001854.206789-4-shaoxuan.yuan02@gmail.com/
diff mbox series

Patch

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 58d944bd57..bdd3d5b8a6 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -28,7 +28,7 @@  SYNOPSIS
 	   [-f <file>] [-e] <pattern>
 	   [--and|--or|--not|(|)|-e <pattern>...]
 	   [--recurse-submodules] [--parent-basename <basename>]
-	   [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
+	   [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
 	   [--] [<pathspec>...]
 
 DESCRIPTION
@@ -45,6 +45,9 @@  OPTIONS
 	Instead of searching tracked files in the working tree, search
 	blobs registered in the index file.
 
+--sparse::
+	Use with --cached. Search outside of sparse-checkout definition.
+
 --no-index::
 	Search files in the current directory that is not managed by Git.
 
diff --git a/builtin/grep.c b/builtin/grep.c
index e6bcdf860c..12abd832fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -96,6 +96,8 @@  static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int grep_sparse = 0;
+
 static void add_work(struct grep_opt *opt, struct grep_source *gs)
 {
 	if (opt->binary != GREP_BINARY_TEXT)
@@ -525,7 +527,11 @@  static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (!cached && ce_skip_worktree(ce))
+		/*
+		 * Skip entries with SKIP_WORKTREE unless both --sparse and
+		 * --cached are given.
+		 */
+		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -963,6 +969,8 @@  int cmd_grep(int argc, const char **argv, const char *prefix)
 			   PARSE_OPT_NOCOMPLETE),
 		OPT_INTEGER('m', "max-count", &opt.max_count,
 			N_("maximum number of results per file")),
+		OPT_BOOL(0, "sparse", &grep_sparse,
+			 N_("search the contents of files outside the sparse-checkout definition")),
 		OPT_END()
 	};
 	grep_prefix = prefix;
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index eb59564565..a9879cc980 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -118,13 +118,19 @@  test_expect_success 'grep searches unmerged file despite not matching sparsity p
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
 	dir/c:text
 	EOF
-	git grep --cached "text" >actual &&
+	git grep --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -143,7 +149,15 @@  test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -152,7 +166,7 @@  test_expect_success 'grep --recurse-submodules --cached searches entries with th
 	sub/B/b:text
 	sub2/a:text
 	EOF
-	git grep --recurse-submodules --cached "text" >actual &&
+	git grep --recurse-submodules --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -166,7 +180,15 @@  test_expect_success 'working tree grep does not search the index with CE_VALID a
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
+test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	test_when_finished "git update-index --no-assume-unchanged b" &&
+	git update-index --assume-unchanged b &&
+	git grep --cached text >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -174,7 +196,7 @@  test_expect_success 'grep --cached searches index entries with both CE_VALID and
 	EOF
 	test_when_finished "git update-index --no-assume-unchanged b" &&
 	git update-index --assume-unchanged b &&
-	git grep --cached text >actual &&
+	git grep --cached --sparse text >actual &&
 	test_cmp expect actual
 '