[RFC] grep: provide sane default to grep_source struct

Message ID	20190516020023.61161-1-emilyshaffer@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Wed, 15 May 2019 19:00:23 -0700 Message-Id: <20190516020023.61161-1-emilyshaffer@google.com> Mime-Version: 1.0 Subject: [RFC PATCH] grep: provide sane default to grep_source struct From: Emily Shaffer <emilyshaffer@google.com> To: git@vger.kernel.org Cc: jrn@google.com, Emily Shaffer <emilyshaffer@google.com> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	[RFC] grep: provide sane default to grep_source struct \| expand [RFC] grep: provide sane default to grep_source struct

Message ID

20190516020023.61161-1-emilyshaffer@google.com (mailing list archive)

State

New, archived

Headers

Date: Wed, 15 May 2019 19:00:23 -0700
Message-Id: <20190516020023.61161-1-emilyshaffer@google.com>
Mime-Version: 1.0
Subject: [RFC PATCH] grep: provide sane default to grep_source struct
From: Emily Shaffer <emilyshaffer@google.com>
To: git@vger.kernel.org
Cc: jrn@google.com, Emily Shaffer <emilyshaffer@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

[RFC] grep: provide sane default to grep_source struct | expand

Commit Message

Emily Shaffer May 16, 2019, 2 a.m. UTC

grep_buffer creates a struct grep_source gs and calls grep_source()
with it. However, gs.name is null, which eventually produces a
segmentation fault in
grep_source()->grep_source_1()->show_line() when grep_opt.status_only is
not set.

This seems to be unreachable from existing commands but is reachable in
the event that someone rolls their own revision walk and neglects to set
rev_info->grep_filter->status_only. Conceivably, someone might want to
print all changed lines of all commits which match some regex.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---
Hiya,

I ran into this issue while I was working on a tutorial on rolling your
own revision walk. I didn't want to print all the lines associated with
a matching change, but it didn't seem good that a seemingly-sane
grep_filter config was segfaulting.

I don't know if adding a placeholder name is the right answer (hence RFC
patch).

Jonathan Nieder proposed alternatively adding some check to grep_source()
to ensure that if opt->status_only is unset, gs->name must be non-NULL
(and yell about it if not), as well as some extra comments indicating
what assumptions are made about the data coming into functions like
grep_source(). I'm fine with that as well (although I'm not sure it
makes sense semantically to require a name which the user probably can't
easily set, or else ban the user from printing LOC during grep). Mostly
I'm happy with any solution besides a segfault with no error logging :)

Thanks in advance for everyone's thoughts.
Emily


 grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Junio C Hamano May 16, 2019, 3:07 a.m. UTC | #1

Emily Shaffer <emilyshaffer@google.com> writes:

> I don't know if adding a placeholder name is the right answer (hence RFC
> patch).
>
> Jonathan Nieder proposed alternatively adding some check to grep_source()
> to ensure that if opt->status_only is unset, gs->name must be non-NULL
> (and yell about it if not), as well as some extra comments indicating
> what assumptions are made about the data coming into functions like
> grep_source(). I'm fine with that as well (although I'm not sure it

That's an interesting problem.

Currently the only use of the function is to see if the log message
matches with the given pattern (yes/no), but it is conceivable that
new callers may want to prepare in-core data and use it to see if
that matches the pattern, or even to _show_ the lines that match the
pattern (possibly with its own .output callback).  Such a caller may
even make repeated calls to the function, and want to give hint in
the output to help users identify which in-core data produced hits,
which implies that a hardcoded "(in core)" is not such a great idea.
In a future where we support such a caller, the function signature
of grep_buffer() would change to include the name, and you would be
passing that to grep_source_init().

Until that happens, it would be OK to use a hardcoded "in-core"
constant here, but to futureproof for such a future, I think it
makes sense to have a check for a BUG() suggested by jrn---that
would become useful once callers of grep_buffer() can give names
prevent them from passing NULL when it later would be used.

> makes sense semantically to require a name which the user probably can't
> easily set, or else ban the user from printing LOC during grep). Mostly
> I'm happy with any solution besides a segfault with no error logging :)
>
> Thanks in advance for everyone's thoughts.
> Emily
>
>
>  grep.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/grep.c b/grep.c
> index 0d50598acd..fd84454faf 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -2045,7 +2045,7 @@ int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size)
>  	struct grep_source gs;
>  	int r;
>  
> -	grep_source_init(&gs, GREP_SOURCE_BUF, NULL, NULL, NULL);
> +	grep_source_init(&gs, GREP_SOURCE_BUF, _("(in memory)"), NULL, NULL);
>  	gs.buf = buf;
>  	gs.size = size;

Jonathan Nieder May 16, 2019, 3:11 a.m. UTC | #2

Hi,

Emily Shaffer wrote:

> grep_buffer creates a struct grep_source gs and calls grep_source()
> with it. However, gs.name is null, which eventually produces a
> segmentation fault in
> grep_source()->grep_source_1()->show_line() when grep_opt.status_only is
> not set.

Thanks for catching it.  Taking a step back, I think the problem is in
the definition of "struct grep_source":

	struct grep_source {
		char *name;

		enum grep_source_type {
			GREP_SOURCE_OID,
			GREP_SOURCE_FILE,
			GREP_SOURCE_BUF,
		} type;
		void *identifier;

		...
	};

What is the difference between a 'name' and an 'identifier'?  Who is
responsible for free()ing them?  Can they be NULL?  This is pretty
underdocumented for a public type.

If we take the point of view that 'name' should always be non-NULL,
then I wonder:

- can we document that?
- can grep_source_init enforce that?
- can we take advantage of that in grep_source as well, as a sanity
  check that the grep_source has been initialized?
- while we're here, can we describe what the field is used for
  (prefixing output with context before a ":", I believe)?

[...]
> This seems to be unreachable from existing commands but is reachable in
> the event that someone rolls their own revision walk and neglects to set
> rev_info->grep_filter->status_only.

init_revisions sets that field, so a caller would have to replace
grep_filter or unset status_only to trigger this.

>                                     Conceivably, someone might want to
> print all changed lines of all commits which match some regex.

Hm, does that work well?  The caller of grep_buffer in revision.c
doesn't seem to be careful about where in the output hits would be
printed, but I might be missing something.

[...]
> I ran into this issue while I was working on a tutorial on rolling your
> own revision walk. I didn't want to print all the lines associated with
> a matching change, but it didn't seem good that a seemingly-sane
> grep_filter config was segfaulting.

Good find, and thanks for taking the time to make it easier to debug for
the next person.

[...]
> Jonathan Nieder proposed alternatively adding some check to grep_source()
> to ensure that if opt->status_only is unset, gs->name must be non-NULL
> (and yell about it if not), as well as some extra comments indicating
> what assumptions are made about the data coming into functions like
> grep_source(). I'm fine with that as well (although I'm not sure it
> makes sense semantically to require a name which the user probably can't
> easily set, or else ban the user from printing LOC during grep). Mostly
> I'm happy with any solution besides a segfault with no error logging :)

Let's compare the two possibilities.

The advantage of "(in memory)" is that it Just Works, which should
make a nicer development experience with getting a new caller mostly
working on the way to getting them working just the way you want.

The disadvantage is that if we start outputting that in production, we
and static analyzers are less likely to notice.  In other words,
printing "(in memory)" is leaking details to the end user that do not
match what the end user asked for.  NULL would instead produce a
crash, prompting the author of the caller to fix it.

What was particularly pernicious about this example is that the NULL
dereference only occurs if the grep has a match.  So I suppose I'm
leaning toward (in addition to adding some comments to document the
struct) adding a check like

	if (!gs->name && !opt->status_only)
		BUG("grep calls that could print name require name");

to grep_source.

That would also sidestep the question of whether this debugging aid
should be translated. :)

Sensible?

Thanks,
Jonathan

Jonathan Nieder May 16, 2019, 3:13 a.m. UTC | #3

Junio C Hamano wrote:

> Currently the only use of the function is to see if the log message
> matches with the given pattern (yes/no), but it is conceivable that
> new callers may want to prepare in-core data and use it to see if
> that matches the pattern, or even to _show_ the lines that match the
> pattern (possibly with its own .output callback).

Oh, good point about .output.  That answers my question about getting
the output in the right place.

[...]
>> Thanks in advance for everyone's thoughts.
>> Emily

Thanks, both.

Jonathan

Emily Shaffer May 16, 2019, 9:05 p.m. UTC | #4

On Wed, May 15, 2019 at 08:11:52PM -0700, Jonathan Nieder wrote:
> Hi,
> 
> Emily Shaffer wrote:
> 
> > grep_buffer creates a struct grep_source gs and calls grep_source()
> > with it. However, gs.name is null, which eventually produces a
> > segmentation fault in
> > grep_source()->grep_source_1()->show_line() when grep_opt.status_only is
> > not set.
> 
> Thanks for catching it.  Taking a step back, I think the problem is in
> the definition of "struct grep_source":
> 
> 	struct grep_source {
> 		char *name;
> 
> 		enum grep_source_type {
> 			GREP_SOURCE_OID,
> 			GREP_SOURCE_FILE,
> 			GREP_SOURCE_BUF,
> 		} type;
> 		void *identifier;
> 
> 		...
> 	};
> 
> What is the difference between a 'name' and an 'identifier'?  Who is
> responsible for free()ing them?  Can they be NULL?  This is pretty
> underdocumented for a public type.
> 
> If we take the point of view that 'name' should always be non-NULL,
> then I wonder:
> 
> - can we document that?
> - can grep_source_init enforce that?

Today grep_source_init() defaults to NULL. So if we decide that 'name'
should be non-NULL it will be somewhat changing the intent.

	void grep_source_init(struct grep_source *gs, enum grep_source_type type,        
	                      const char *name, const char *path,                        
	                      const void *identifier)                                    
	{                                                                                
	        gs->type = type;                                                         
	        gs->name = xstrdup_or_null(name); 
	...

> - can we take advantage of that in grep_source as well, as a sanity
>   check that the grep_source has been initialized?
> - while we're here, can we describe what the field is used for
>   (prefixing output with context before a ":", I believe)?

In general the documentation for grep.[ch] is pretty light. There aren't
any header comments and `Documentation/technical/api-grep.txt` is a
todo. So I agree that we should document it anywhere we can.

> > Jonathan Nieder proposed alternatively adding some check to grep_source()
> > to ensure that if opt->status_only is unset, gs->name must be non-NULL
> > (and yell about it if not), as well as some extra comments indicating
> > what assumptions are made about the data coming into functions like
> > grep_source(). I'm fine with that as well (although I'm not sure it
> > makes sense semantically to require a name which the user probably can't
> > easily set, or else ban the user from printing LOC during grep). Mostly
> > I'm happy with any solution besides a segfault with no error logging :)
> 
> Let's compare the two possibilities.
> 
> The advantage of "(in memory)" is that it Just Works, which should
> make a nicer development experience with getting a new caller mostly
> working on the way to getting them working just the way you want.
> 
> The disadvantage is that if we start outputting that in production, we
> and static analyzers are less likely to notice.  In other words,
> printing "(in memory)" is leaking details to the end user that do not
> match what the end user asked for.  NULL would instead produce a
> crash, prompting the author of the caller to fix it.
> 
> What was particularly pernicious about this example is that the NULL
> dereference only occurs if the grep has a match.  So I suppose I'm
> leaning toward (in addition to adding some comments to document the
> struct) adding a check like
> 
> 	if (!gs->name && !opt->status_only)
> 		BUG("grep calls that could print name require name");
> 
> to grep_source.

Why not both? :)

But seriously, I am planning to push a second patch with both, per
Junio's reply.

I'll consider the documentation out of scope for now since I'm not sure
I know enough about grep.[ch]'s intent or history to document anything
yet. :)

> 
> That would also sidestep the question of whether this debugging aid
> should be translated. :)
> 
> Sensible?
> 
> Thanks,
> Jonathan

diff --git a/grep.c b/grep.c
index 0d50598acd..fd84454faf 100644
--- a/grep.c
+++ b/grep.c
@@ -2045,7 +2045,7 @@  int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size)
 	struct grep_source gs;
 	int r;
 
-	grep_source_init(&gs, GREP_SOURCE_BUF, NULL, NULL, NULL);
+	grep_source_init(&gs, GREP_SOURCE_BUF, _("(in memory)"), NULL, NULL);
 	gs.buf = buf;
 	gs.size = size;

[RFC] grep: provide sane default to grep_source struct

Commit Message

Comments

Patch