[v2,01/10] ls-files: add --json to dump the index

Message ID	20190624130226.17293-2-pclouds@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> From: =?utf-8?b?Tmd1eeG7hW4gVGjDoWkgTmfhu41jIER1eQ==?= <pclouds@gmail.com> To: git@vger.kernel.org Cc: Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>, Derrick Stolee <stolee@gmail.com>, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFy?= =?utf-8?b?bWFzb24=?= <avarab@gmail.com>, Johannes Schindelin <Johannes.Schindelin@gmx.de>, =?utf-8?b?Tmd1eeG7hW4g?= =?utf-8?b?VGjDoWkgTmfhu41jIER1eQ==?= <pclouds@gmail.com> Subject: [PATCH v2 01/10] ls-files: add --json to dump the index Date: Mon, 24 Jun 2019 20:02:17 +0700 Message-Id: <20190624130226.17293-2-pclouds@gmail.com> In-Reply-To: <20190624130226.17293-1-pclouds@gmail.com> References: <20190624130226.17293-1-pclouds@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	[v2,01/10] ls-files: add --json to dump the index \| expand [v2,01/10] ls-files: add --json to dump the index [v2,02/10] read-cache.c: dump common extension info in json [v2,03/10] cache-tree.c: dump "TREE" extension as json [v2,04/10] dir.c: dump "UNTR" extension as json [v2,05/10] split-index.c: dump "link" extension as json [v2,06/10] fsmonitor.c: dump "FSMN" extension as json [v2,07/10] resolve-undo.c: dump "REUC" extension as json [v2,08/10] read-cache.c: dump "EOIE" extension as json [v2,09/10] read-cache.c: dump "IEOT" extension as json [v2,10/10] t3008: use the new SINGLE_CPU prereq

Duy Nguyen June 24, 2019, 1:02 p.m. UTC

So far we don't have a command to basically dump the index file out,
with all its glory details. Checking some info, for example, stat
time, usually involves either writing new code or firing up "xxd" and
decoding values by yourself.

This --json is supposed to help that. It dumps the index in a human
readable format but also easy to be processed with tools. And it will
print almost enough info to reconstruct the index later.

In this patch we only dump the main part, not extensions. But at the
end of the series, the entire index is dumped. The end result could be
very verbose even on a small repository such as git.git.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 Documentation/git-ls-files.txt    |  5 +++
 builtin/ls-files.c                | 38 +++++++++++++---
 cache.h                           |  2 +
 json-writer.c                     | 22 ++++++++++
 json-writer.h                     | 23 ++++++++++
 read-cache.c                      | 72 ++++++++++++++++++++++++++++++-
 t/t3011-ls-files-json.sh (new +x) | 44 +++++++++++++++++++
 t/t3011/basic (new)               | 67 ++++++++++++++++++++++++++++
 8 files changed, 265 insertions(+), 8 deletions(-)

Jeff Hostetler June 24, 2019, 7:15 p.m. UTC | #1

On 6/24/2019 9:02 AM, Nguyễn Thái Ngọc Duy wrote:
> So far we don't have a command to basically dump the index file out,
> with all its glory details. Checking some info, for example, stat
> time, usually involves either writing new code or firing up "xxd" and
> decoding values by yourself.
> 
> This --json is supposed to help that. It dumps the index in a human
> readable format but also easy to be processed with tools. And it will
> print almost enough info to reconstruct the index later.
> 
> In this patch we only dump the main part, not extensions. But at the
> end of the series, the entire index is dumped. The end result could be
> very verbose even on a small repository such as git.git.
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---

...

> diff --git a/json-writer.c b/json-writer.c
> index aadb9dbddc..0608726512 100644
> --- a/json-writer.c
> +++ b/json-writer.c
> @@ -202,6 +202,28 @@ void jw_object_null(struct json_writer *jw, const char *key)
>   	strbuf_addstr(&jw->json, "null");
>   }
>   
> +void jw_object_filemode(struct json_writer *jw, const char *key, mode_t mode)
> +{
> +	object_common(jw, key);
> +	strbuf_addf(&jw->json, "\"%06o\"", mode);
> +}
> +
> +void jw_object_stat_data(struct json_writer *jw, const char *name,
> +			 const struct stat_data *sd)

Should this be in json_writer.c or in read-cache.c ?
Currently, json_writer.c is concerned with formatting
JSON on basic/scalar types.  Do we want to start
extending it to handle arbitrary structures?  Or would
it be better for the code that defines/manipulates the
structure to define a "stat_data_dump_json()" function.

I'm torn on the jw_object_filemode() function, JSON format
limits us to decimal integers and there are places where
I'd like to have hex, or in this case octal values.

I'm thinking it'd be better to have a helper function in
read-cache.c that formats a local strbuf and calls
js_object_string(&jw, key, buf);


> +{
> +	jw_object_inline_begin_object(jw, name);
> +	jw_object_intmax(jw, "ctime_sec", sd->sd_ctime.sec);
> +	jw_object_intmax(jw, "ctime_nsec", sd->sd_ctime.nsec);
> +	jw_object_intmax(jw, "mtime_sec", sd->sd_mtime.sec);
> +	jw_object_intmax(jw, "mtime_nsec", sd->sd_mtime.nsec);

It'd be nice if we could also have a formatted date
for the mtime and ctime in addition to the integer
values.  (I'm not sure whether you'd always want them
or make it a verbose option.)

> +	jw_object_intmax(jw, "device", sd->sd_dev);
> +	jw_object_intmax(jw, "inode", sd->sd_ino);
> +	jw_object_intmax(jw, "uid", sd->sd_uid);
> +	jw_object_intmax(jw, "gid", sd->sd_gid);
> +	jw_object_intmax(jw, "size", sd->sd_size);
> +	jw_end(jw);
> +}
> +
>   static void increase_indent(struct strbuf *sb,
>   			    const struct json_writer *jw,
>   			    int indent)
> diff --git a/json-writer.h b/json-writer.h
> index 83906b09c1..c48c4cbf33 100644
> --- a/json-writer.h
> +++ b/json-writer.h
> @@ -42,8 +42,11 @@
>    * of the given strings.
>    */
>   
> +#include "git-compat-util.h"
>   #include "strbuf.h"
>   
> +struct stat_data;
> +
>   struct json_writer
>   {
>   	/*
> @@ -81,6 +84,9 @@ void jw_object_true(struct json_writer *jw, const char *key);
>   void jw_object_false(struct json_writer *jw, const char *key);
>   void jw_object_bool(struct json_writer *jw, const char *key, int value);
>   void jw_object_null(struct json_writer *jw, const char *key);
> +void jw_object_filemode(struct json_writer *jw, const char *key, mode_t value);
> +void jw_object_stat_data(struct json_writer *jw, const char *key,
> +			 const struct stat_data *sd);
>   void jw_object_sub_jw(struct json_writer *jw, const char *key,
>   		      const struct json_writer *value);
>   
> @@ -104,4 +110,21 @@ void jw_array_inline_begin_array(struct json_writer *jw);
>   int jw_is_terminated(const struct json_writer *jw);
>   void jw_end(struct json_writer *jw);
>   
> +/*
> + * These _gently versions accept NULL json_writer to reduce too much
> + * branching at the call site.
> + */
> +static inline void jw_object_inline_begin_array_gently(struct json_writer *jw,
> +						       const char *name)
> +{
> +	if (jw)
> +		jw_object_inline_begin_array(jw, name);
> +}
> +
> +static inline void jw_end_gently(struct json_writer *jw)
> +{
> +	if (jw)
> +		jw_end(jw);
> +}
> +
>   #endif /* JSON_WRITER_H */
> diff --git a/read-cache.c b/read-cache.c
> index 4dd22f4f6e..db5147d088 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -25,6 +25,7 @@
>   #include "fsmonitor.h"
>   #include "thread-utils.h"
>   #include "progress.h"
> +#include "json-writer.h"
>   
>   /* Mask for the name length in ce_flags in the on-disk index */
>   
> @@ -1952,6 +1953,49 @@ static void *load_index_extensions(void *_data)
>   	return NULL;
>   }
>   
> +static void dump_cache_entry(struct index_state *istate,
> +			     int index,
> +			     unsigned long offset,
> +			     const struct cache_entry *ce)
> +{
> +	struct json_writer *jw = istate->jw;
> +
> +	jw_array_inline_begin_object(jw);
> +
> +	/*
> +	 * this is technically redundant, but it's for easier
> +	 * navigation when there hundreds of entries
> +	 */
> +	jw_object_intmax(jw, "id", index);
> +
> +	jw_object_string(jw, "name", ce->name);
> +
> +	jw_object_filemode(jw, "mode", ce->ce_mode);
> +
> +	jw_object_intmax(jw, "flags", ce->ce_flags);

It would be nice to have the flags as a hex-formatted string
in addition to (or instead of) the decimal integer value.

> +	/*
> +	 * again redundant info, just so you don't have to decode
> +	 * flags values manually
> +	 */
> +	if (ce->ce_flags & CE_EXTENDED)
> +		jw_object_true(jw, "extended_flags");
> +	if (ce->ce_flags & CE_VALID)
> +		jw_object_true(jw, "assume_unchanged");
> +	if (ce->ce_flags & CE_INTENT_TO_ADD)
> +		jw_object_true(jw, "intent_to_add");
> +	if (ce->ce_flags & CE_SKIP_WORKTREE)
> +		jw_object_true(jw, "skip_worktree");
> +	if (ce_stage(ce))
> +		jw_object_intmax(jw, "stage", ce_stage(ce));
> +
> +	jw_object_string(jw, "oid", oid_to_hex(&ce->oid));
> +
> +	jw_object_stat_data(jw, "stat", &ce->ce_stat_data);
> +	jw_object_intmax(jw, "file_offset", offset);
> +
> +	jw_end(jw);
> +}
> +
>   /*
>    * A helper function that will load the specified range of cache entries
>    * from the memory mapped file and add them to the given index.
> @@ -1972,6 +2016,9 @@ static unsigned long load_cache_entry_block(struct index_state *istate,
>   		ce = create_from_disk(ce_mem_pool, istate->version, disk_ce, &consumed, previous_ce);
>   		set_index_entry(istate, i, ce);
>   
> +		if (istate->jw)
> +			dump_cache_entry(istate, i, src_offset, ce);
> +
>   		src_offset += consumed;
>   		previous_ce = ce;
>   	}
> @@ -1983,6 +2030,8 @@ static unsigned long load_all_cache_entries(struct index_state *istate,
>   {
>   	unsigned long consumed;
>   
> +	jw_object_inline_begin_array_gently(istate->jw, "entries");
> +
>   	if (istate->version == 4) {
>   		mem_pool_init(&istate->ce_mem_pool,
>   				estimate_cache_size_from_compressed(istate->cache_nr));
> @@ -1993,6 +2042,8 @@ static unsigned long load_all_cache_entries(struct index_state *istate,
>   
>   	consumed = load_cache_entry_block(istate, istate->ce_mem_pool,
>   					0, istate->cache_nr, mmap, src_offset, NULL);
> +
> +	jw_end_gently(istate->jw);
>   	return consumed;
>   }
>   
> @@ -2120,6 +2171,7 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
>   	size_t extension_offset = 0;
>   	int nr_threads, cpus;
>   	struct index_entry_offset_table *ieot = NULL;
> +	int jw_pretty = 1;
>   
>   	if (istate->initialized)
>   		return istate->cache_nr;
> @@ -2154,6 +2206,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
>   	istate->cache_nr = ntohl(hdr->hdr_entries);
>   	istate->cache_alloc = alloc_nr(istate->cache_nr);
>   	istate->cache = xcalloc(istate->cache_alloc, sizeof(*istate->cache));
> +	istate->timestamp.sec = st.st_mtime;
> +	istate->timestamp.nsec = ST_MTIME_NSEC(st);
>   	istate->initialized = 1;
>   
>   	p.istate = istate;
> @@ -2176,6 +2230,20 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
>   	if (!HAVE_THREADS)
>   		nr_threads = 1;
>   
> +	if (istate->jw) {
> +		jw_object_begin(istate->jw, jw_pretty);
> +		jw_object_intmax(istate->jw, "version", istate->version);
> +		jw_object_string(istate->jw, "oid", oid_to_hex(&istate->oid));
> +		jw_object_intmax(istate->jw, "mtime_sec", istate->timestamp.sec);
> +		jw_object_intmax(istate->jw, "mtime_nsec", istate->timestamp.nsec);

again, it would be nice to also have a formated version of the mtime.

> +
> +		/*
> +		 * Threading may mess up json writing. This is for
> +		 * debugging only, so performance is not a concern.
> +		 */
> +		nr_threads = 1;

yes. we should turn off threading when dumping to json.

> +	}
> +
>   	if (nr_threads > 1) {
>   		extension_offset = read_eoie_extension(mmap, mmap_size);
>   		if (extension_offset) {
> @@ -2204,8 +2272,6 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
>   		src_offset += load_all_cache_entries(istate, mmap, mmap_size, src_offset);
>   	}
>   
> -	istate->timestamp.sec = st.st_mtime;
> -	istate->timestamp.nsec = ST_MTIME_NSEC(st);
>   
>   	/* if we created a thread, join it otherwise load the extensions on the primary thread */
>   	if (extension_offset) {
> @@ -2216,6 +2282,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
>   		p.src_offset = src_offset;
>   		load_index_extensions(&p);
>   	}
> +	jw_end_gently(istate->jw);
> +
>   	munmap((void *)mmap, mmap_size);
>   
>   	/*
> 
...

Thanks
Jeff

Junio C Hamano June 24, 2019, 8:04 p.m. UTC | #2

Jeff Hostetler <git@jeffhostetler.com> writes:

> On 6/24/2019 9:02 AM, Nguyễn Thái Ngọc Duy wrote:
> ...
>> +void jw_object_stat_data(struct json_writer *jw, const char *name,
>> +			 const struct stat_data *sd)
>
> Should this be in json_writer.c or in read-cache.c ?
> Currently, json_writer.c is concerned with formatting
> JSON on basic/scalar types.

That's an interesting question.

In the longer run, is it reasonable to assume that the current
callers of these functions will stay to be the only places that
would want to say "For this path, various attributes like timestamps
and other stuff are these values"?  The answer is probably no.

Would it be reasonable to assume that the future callers are all
likely to be closely related to what functions that are currently in
read-cache.c do and would appear only in that file?  I think the
answer would also be no.  Functions in entry.c may want to report
"I've created/updated these filesystem entities and their stat info
now look like so", for example.  So (assuming that we would want to
represent "struct stat" the same way no matter who wants to write it
out), I think it is reasonable to have this function on the json
side, not on Git side.

I do not think it is a bad idea to have a layer that sits above
json-writer.c and below read-cache.c that give us standardized
mapping from in-core objects (like "struct stat") to objects in json
(they are set of key-value pairs after all).  Call it json-schema.c
or something?

What other "higher than scaler" types do we foresee that we'd need
to serialize in a standardised way?  If we do not have that many
yet, perhaps a split like that might be a bit premature and we can
start by having the function in json-writer.c side, not in the API
consumer side.

Thanks.

Thomas Gummerer June 25, 2019, 9:05 a.m. UTC | #3

On 06/24, Nguyễn Thái Ngọc Duy wrote:
> So far we don't have a command to basically dump the index file out,
> with all its glory details. Checking some info, for example, stat
> time, usually involves either writing new code or firing up "xxd" and
> decoding values by yourself.
> 
> This --json is supposed to help that. It dumps the index in a human
> readable format but also easy to be processed with tools. And it will
> print almost enough info to reconstruct the index later.
> 
> In this patch we only dump the main part, not extensions. But at the
> end of the series, the entire index is dumped. The end result could be
> very verbose even on a small repository such as git.git.
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  Documentation/git-ls-files.txt    |  5 +++
>  builtin/ls-files.c                | 38 +++++++++++++---
>  cache.h                           |  2 +
>  json-writer.c                     | 22 ++++++++++
>  json-writer.h                     | 23 ++++++++++
>  read-cache.c                      | 72 ++++++++++++++++++++++++++++++-
>  t/t3011-ls-files-json.sh (new +x) | 44 +++++++++++++++++++
>  t/t3011/basic (new)               | 67 ++++++++++++++++++++++++++++
>  8 files changed, 265 insertions(+), 8 deletions(-)
>
> [...]
>
> diff --git a/t/t3011-ls-files-json.sh b/t/t3011-ls-files-json.sh
> new file mode 100755
> index 0000000000..97bcd814be
> --- /dev/null
> +++ b/t/t3011-ls-files-json.sh
> @@ -0,0 +1,44 @@
> +#!/bin/sh
> +
> +test_description='ls-files dumping json'
> +
> +. ./test-lib.sh
> +
> +strip_number() {
> +	for name; do
> +		echo 's/\("'$name'":\) [0-9]\+/\1 <number>/' >>filter.sed
> +	done
> +}
> +
> +strip_string() {
> +	for name; do
> +		echo 's/\("'$name'":\) ".*"/\1 <string>/' >>filter.sed
> +	done
> +}
> +
> +compare_json() {
> +	git ls-files --debug-json >json &&
> +	sed -f filter.sed json >filtered &&
> +	test_cmp "$TEST_DIRECTORY"/t3011/"$1" filtered
> +}
> +
> +test_expect_success 'setup' '
> +	mkdir sub &&
> +	echo one >one &&
> +	git add one &&
> +	echo 2 >sub/two &&
> +	git add sub/two &&
> +
> +	echo intent-to-add >ita &&
> +	git add -N ita &&
> +
> +	strip_number ctime_sec ctime_nsec mtime_sec mtime_nsec &&
> +	strip_number device inode uid gid file_offset ext_size &&
> +	strip_string oid ident
> +'
> +
> +test_expect_success 'ls-files --json, main entries' '
> +	compare_json basic
> +'
> +
> +test_done
> diff --git a/t/t3011/basic b/t/t3011/basic
> new file mode 100644
> index 0000000000..9436445d90
> --- /dev/null
> +++ b/t/t3011/basic
> @@ -0,0 +1,67 @@
> +{
> +  "version": 3,

This will break the test suite when 'GIT_TEST_INDEX_VERSION' is set to
4 for example.  I think this applies to a few other tests in later
patches as well.

> +  "oid": <string>,
> +  "mtime_sec": <number>,
> +  "mtime_nsec": <number>,
> +  "entries": [
> +    {
> +      "id": 0,
> +      "name": "ita",
> +      "mode": "100644",
> +      "flags": 536887296,
> +      "extended_flags": true,
> +      "intent_to_add": true,
> +      "oid": <string>,
> +      "stat": {
> +        "ctime_sec": <number>,
> +        "ctime_nsec": <number>,
> +        "mtime_sec": <number>,
> +        "mtime_nsec": <number>,
> +        "device": <number>,
> +        "inode": <number>,
> +        "uid": <number>,
> +        "gid": <number>,
> +        "size": 0
> +      },
> +      "file_offset": <number>
> +    },
> +    {
> +      "id": 1,
> +      "name": "one",
> +      "mode": "100644",
> +      "flags": 0,
> +      "oid": <string>,
> +      "stat": {
> +        "ctime_sec": <number>,
> +        "ctime_nsec": <number>,
> +        "mtime_sec": <number>,
> +        "mtime_nsec": <number>,
> +        "device": <number>,
> +        "inode": <number>,
> +        "uid": <number>,
> +        "gid": <number>,
> +        "size": 4
> +      },
> +      "file_offset": <number>
> +    },
> +    {
> +      "id": 2,
> +      "name": "sub/two",
> +      "mode": "100644",
> +      "flags": 0,
> +      "oid": <string>,
> +      "stat": {
> +        "ctime_sec": <number>,
> +        "ctime_nsec": <number>,
> +        "mtime_sec": <number>,
> +        "mtime_nsec": <number>,
> +        "device": <number>,
> +        "inode": <number>,
> +        "uid": <number>,
> +        "gid": <number>,
> +        "size": 2
> +      },
> +      "file_offset": <number>
> +    }
> +  ]
> +}
> -- 
> 2.22.0.rc0.322.g2b0371e29a
>

Johannes Schindelin June 25, 2019, 9:21 a.m. UTC | #4

Hi Jeff,

On Mon, 24 Jun 2019, Jeff Hostetler wrote:

> On 6/24/2019 9:02 AM, Nguyễn Thái Ngọc Duy wrote:
> [...]
> > +{
> > +	jw_object_inline_begin_object(jw, name);
> > +	jw_object_intmax(jw, "ctime_sec", sd->sd_ctime.sec);
> > +	jw_object_intmax(jw, "ctime_nsec", sd->sd_ctime.nsec);
> > +	jw_object_intmax(jw, "mtime_sec", sd->sd_mtime.sec);
> > +	jw_object_intmax(jw, "mtime_nsec", sd->sd_mtime.nsec);
>
> It'd be nice if we could also have a formatted date
> for the mtime and ctime in addition to the integer
> values.  (I'm not sure whether you'd always want them
> or make it a verbose option.)

It would be more consistent with JSON conventions to have only the
formatted date, preferably in ISO-8601 format [*1*], I would think.

And for debugging, it would also make more sense, in my opinion.

> [...]
> > diff --git a/read-cache.c b/read-cache.c
> > index 4dd22f4f6e..db5147d088 100644
> > --- a/read-cache.c
> > +++ b/read-cache.c
> > @@ -25,6 +25,7 @@
> >   #include "fsmonitor.h"
> >   #include "thread-utils.h"
> >   #include "progress.h"
> > +#include "json-writer.h"
> >
> >   /* Mask for the name length in ce_flags in the on-disk index */
> >
> > @@ -1952,6 +1953,49 @@ static void *load_index_extensions(void *_data)
> >   	return NULL;
> >   }
> >
> > +static void dump_cache_entry(struct index_state *istate,
> > +			     int index,
> > +			     unsigned long offset,
> > +			     const struct cache_entry *ce)
> > +{
> > +	struct json_writer *jw = istate->jw;
> > +
> > +	jw_array_inline_begin_object(jw);
> > +
> > +	/*
> > +	 * this is technically redundant, but it's for easier
> > +	 * navigation when there hundreds of entries
> > +	 */
> > +	jw_object_intmax(jw, "id", index);
> > +
> > +	jw_object_string(jw, "name", ce->name);
> > +
> > +	jw_object_filemode(jw, "mode", ce->ce_mode);
> > +
> > +	jw_object_intmax(jw, "flags", ce->ce_flags);
>
> It would be nice to have the flags as a hex-formatted string
> in addition to (or instead of) the decimal integer value.

Instead of, please, prefixed with "0x" to make it clear that this is hex.

And "mode" in octal, please, prefixed with "0", as that is the convention
to display file modes.

Thanks,
Dscho

Footnote *1*: It is too bad that JSON leaves the exact date format
unspecified, leading to a lot of inconsistency and putting the burden on
parsers:
https://stackoverflow.com/questions/10286204/the-right-json-date-format

Johannes Schindelin June 25, 2019, 9:44 a.m. UTC | #5

Hi Duy,

On Mon, 24 Jun 2019, Nguyễn Thái Ngọc Duy wrote:

> diff --git a/t/t3011-ls-files-json.sh b/t/t3011-ls-files-json.sh
> new file mode 100755
> index 0000000000..97bcd814be
> --- /dev/null
> +++ b/t/t3011-ls-files-json.sh
> @@ -0,0 +1,44 @@
> +#!/bin/sh
> +
> +test_description='ls-files dumping json'
> +
> +. ./test-lib.sh
> +
> +strip_number() {
> +	for name; do
> +		echo 's/\("'$name'":\) [0-9]\+/\1 <number>/' >>filter.sed

This does not do what you think it does, in Ubuntu Xenial and on macOS:

https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11408&view=ms.vss-test-web.build-test-results-tab&runId=27736&paneView=debug&resultId=105613

The `\1` is expanded to the ASCII character 001. Therefore your test cases
fail on almost all platforms.

Funnily enough, they pass on Windows...

Ciao,
Johannes

Duy Nguyen June 25, 2019, 9:52 a.m. UTC | #6

On Tue, Jun 25, 2019 at 2:15 AM Jeff Hostetler <git@jeffhostetler.com> wrote:
> > @@ -202,6 +202,28 @@ void jw_object_null(struct json_writer *jw, const char *key)
> >       strbuf_addstr(&jw->json, "null");
> >   }
> >
> > +void jw_object_filemode(struct json_writer *jw, const char *key, mode_t mode)
> > +{
> > +     object_common(jw, key);
> > +     strbuf_addf(&jw->json, "\"%06o\"", mode);
> > +}
> > +
> > +void jw_object_stat_data(struct json_writer *jw, const char *name,
> > +                      const struct stat_data *sd)
>
> Should this be in json_writer.c or in read-cache.c ?
> Currently, json_writer.c is concerned with formatting
> JSON on basic/scalar types.  Do we want to start
> extending it to handle arbitrary structures?  Or would
> it be better for the code that defines/manipulates the
> structure to define a "stat_data_dump_json()" function.
>
> I'm torn on the jw_object_filemode() function, JSON format
> limits us to decimal integers and there are places where
> I'd like to have hex, or in this case octal values.
>
> I'm thinking it'd be better to have a helper function in
> read-cache.c that formats a local strbuf and calls
> js_object_string(&jw, key, buf);

I can move these back to read-cache.c. Though if we have a lot more jw
helpers like this (hard to tell at the moment) then perhaps we can
have json-writer-utils.c or something to group them together. That
keep the "boring" code out of main logic code in read-cache.c and
other call sites.

> > @@ -1952,6 +1953,49 @@ static void *load_index_extensions(void *_data)
> >       return NULL;
> >   }
> >
> > +static void dump_cache_entry(struct index_state *istate,
> > +                          int index,
> > +                          unsigned long offset,
> > +                          const struct cache_entry *ce)
> > +{
> > +     struct json_writer *jw = istate->jw;
> > +
> > +     jw_array_inline_begin_object(jw);
> > +
> > +     /*
> > +      * this is technically redundant, but it's for easier
> > +      * navigation when there hundreds of entries
> > +      */
> > +     jw_object_intmax(jw, "id", index);
> > +
> > +     jw_object_string(jw, "name", ce->name);
> > +
> > +     jw_object_filemode(jw, "mode", ce->ce_mode);
> > +
> > +     jw_object_intmax(jw, "flags", ce->ce_flags);
>
> It would be nice to have the flags as a hex-formatted string
> in addition to (or instead of) the decimal integer value.

I'm not against reformatting it in hex string, but is there a value in
it? ce_flags is expanded in the code below so that you don't have to
decode it yourself when you read each entry. The "flags" field here is
for further processing in tools. I'm trying to see if looking at hex
values helps, but I'm still not seeing it...

> > +     /*
> > +      * again redundant info, just so you don't have to decode
> > +      * flags values manually
> > +      */
> > +     if (ce->ce_flags & CE_EXTENDED)
> > +             jw_object_true(jw, "extended_flags");
> > +     if (ce->ce_flags & CE_VALID)
> > +             jw_object_true(jw, "assume_unchanged");
> > +     if (ce->ce_flags & CE_INTENT_TO_ADD)
> > +             jw_object_true(jw, "intent_to_add");
> > +     if (ce->ce_flags & CE_SKIP_WORKTREE)
> > +             jw_object_true(jw, "skip_worktree");
> > +     if (ce_stage(ce))
> > +             jw_object_intmax(jw, "stage", ce_stage(ce));

Johannes Schindelin June 25, 2019, 11:31 a.m. UTC | #7

Hi Duy,

On Tue, 25 Jun 2019, Johannes Schindelin wrote:

> On Mon, 24 Jun 2019, Nguyễn Thái Ngọc Duy wrote:
>
> > diff --git a/t/t3011-ls-files-json.sh b/t/t3011-ls-files-json.sh
> > new file mode 100755
> > index 0000000000..97bcd814be
> > --- /dev/null
> > +++ b/t/t3011-ls-files-json.sh
> > @@ -0,0 +1,44 @@
> > +#!/bin/sh
> > +
> > +test_description='ls-files dumping json'
> > +
> > +. ./test-lib.sh
> > +
> > +strip_number() {
> > +	for name; do
> > +		echo 's/\("'$name'":\) [0-9]\+/\1 <number>/' >>filter.sed
>
> This does not do what you think it does, in Ubuntu Xenial and on macOS:
>
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11408&view=ms.vss-test-web.build-test-results-tab&runId=27736&paneView=debug&resultId=105613
>
> The `\1` is expanded to the ASCII character 001. Therefore your test cases
> fail on almost all platforms.

The `strip_number()`/`strip_string()` approach might look elegant from a
design perspective, but from a readability perspective (and obviously,
when one wants to make those tests more robust and cross-platform), it
would be a lot better to do it more explicitly.

This patch on top of your patch series makes the test run correctly in my
Linux and Windows setup, and much easier to understand:

-- snipsnap --
diff --git a/t/t3011-ls-files-json.sh b/t/t3011-ls-files-json.sh
index 9f4ad4c9cf..8b782c48e0 100755
--- a/t/t3011-ls-files-json.sh
+++ b/t/t3011-ls-files-json.sh
@@ -4,18 +4,6 @@ test_description='ls-files dumping json'

 . ./test-lib.sh

-strip_number() {
-	for name; do
-		echo 's/\("'$name'":\) [0-9]\+/\1 <number>/' >>filter.sed
-	done
-}
-
-strip_string() {
-	for name; do
-		echo 's/\("'$name'":\) ".*"/\1 <string>/' >>filter.sed
-	done
-}
-
 compare_json() {
 	git ls-files --debug-json >json &&
 	sed -f filter.sed json >filtered &&
@@ -35,9 +23,21 @@ test_expect_success 'setup' '
 	echo intent-to-add >ita &&
 	git add -N ita &&

-	strip_number ctime_sec ctime_nsec mtime_sec mtime_nsec &&
-	strip_number device inode uid gid file_offset ext_size last_update &&
-	strip_string oid ident
+	cat >filter.sed <<-\EOF
+	s/\("ctime_sec":\) [0-9]\+/\1 <number>/
+	s/\("ctime_nsec":\) [0-9]\+/\1 <number>/
+	s/\("mtime_sec":\) [0-9]\+/\1 <number>/
+	s/\("mtime_nsec":\) [0-9]\+/\1 <number>/
+	s/\("device":\) [0-9]\+/\1 <number>/
+	s/\("inode":\) [0-9]\+/\1 <number>/
+	s/\("uid":\) [0-9]\+/\1 <number>/
+	s/\("gid":\) [0-9]\+/\1 <number>/
+	s/\("file_offset":\) [0-9]\+/\1 <number>/
+	s/\("ext_size":\) [0-9]\+/\1 <number>/
+	s/\("last_update":\) [0-9]\+/\1 <number>/
+	s/\("oid":\) ".*"/\1 <string>/
+	s/\("ident":\) ".*"/\1 <string>/
+	EOF
 '

 test_expect_success 'ls-files --json, main entries, UNTR and TREE' '
@@ -98,7 +98,9 @@ test_expect_success !SINGLE_CPU 'ls-files --json and multicore extensions' '
 		touch one two three four &&
 		git add . &&
 		cp ../filter.sed . &&
-		strip_number offset &&
+		cat >>filter.sed <<-\EOF &&
+		s/\("offset":\) [0-9]\+/\1 <number>/
+		EOF
 		compare_json eoie
 	)
 '

Johannes Schindelin June 25, 2019, 1:57 p.m. UTC | #8

Hi Duy,

On Tue, 25 Jun 2019, Johannes Schindelin wrote:

> diff --git a/t/t3011-ls-files-json.sh b/t/t3011-ls-files-json.sh
> index 9f4ad4c9cf..8b782c48e0 100755
> --- a/t/t3011-ls-files-json.sh
> +++ b/t/t3011-ls-files-json.sh
> @@ -4,18 +4,6 @@ test_description='ls-files dumping json'
>
>  . ./test-lib.sh
>
> -strip_number() {
> -	for name; do
> -		echo 's/\("'$name'":\) [0-9]\+/\1 <number>/' >>filter.sed
> -	done
> -}
> -
> -strip_string() {
> -	for name; do
> -		echo 's/\("'$name'":\) ".*"/\1 <string>/' >>filter.sed
> -	done
> -}
> -
>  compare_json() {
>  	git ls-files --debug-json >json &&
>  	sed -f filter.sed json >filtered &&
> @@ -35,9 +23,21 @@ test_expect_success 'setup' '
>  	echo intent-to-add >ita &&
>  	git add -N ita &&
>
> -	strip_number ctime_sec ctime_nsec mtime_sec mtime_nsec &&
> -	strip_number device inode uid gid file_offset ext_size last_update &&
> -	strip_string oid ident
> +	cat >filter.sed <<-\EOF
> +	s/\("ctime_sec":\) [0-9]\+/\1 <number>/

And of course, \+ still isn't POSIX, so you have to write [0-9][1-9]*
instead.

Ciao,
Johannes

> +	s/\("ctime_nsec":\) [0-9]\+/\1 <number>/
> +	s/\("mtime_sec":\) [0-9]\+/\1 <number>/
> +	s/\("mtime_nsec":\) [0-9]\+/\1 <number>/
> +	s/\("device":\) [0-9]\+/\1 <number>/
> +	s/\("inode":\) [0-9]\+/\1 <number>/
> +	s/\("uid":\) [0-9]\+/\1 <number>/
> +	s/\("gid":\) [0-9]\+/\1 <number>/
> +	s/\("file_offset":\) [0-9]\+/\1 <number>/
> +	s/\("ext_size":\) [0-9]\+/\1 <number>/
> +	s/\("last_update":\) [0-9]\+/\1 <number>/
> +	s/\("oid":\) ".*"/\1 <string>/
> +	s/\("ident":\) ".*"/\1 <string>/
> +	EOF
>  '
>
>  test_expect_success 'ls-files --json, main entries, UNTR and TREE' '
> @@ -98,7 +98,9 @@ test_expect_success !SINGLE_CPU 'ls-files --json and multicore extensions' '
>  		touch one two three four &&
>  		git add . &&
>  		cp ../filter.sed . &&
> -		strip_number offset &&
> +		cat >>filter.sed <<-\EOF &&
> +		s/\("offset":\) [0-9]\+/\1 <number>/
> +		EOF
>  		compare_json eoie
>  	)
>  '

Jeff Hostetler June 25, 2019, 3:37 p.m. UTC | #9

On 6/25/2019 5:52 AM, Duy Nguyen wrote:
> On Tue, Jun 25, 2019 at 2:15 AM Jeff Hostetler <git@jeffhostetler.com> wrote:
>>> @@ -202,6 +202,28 @@ void jw_object_null(struct json_writer *jw, const char *key)
>>>        strbuf_addstr(&jw->json, "null");
>>>    }
>>>
>>> +void jw_object_filemode(struct json_writer *jw, const char *key, mode_t mode)
>>> +{
>>> +     object_common(jw, key);
>>> +     strbuf_addf(&jw->json, "\"%06o\"", mode);
>>> +}
>>> +
>>> +void jw_object_stat_data(struct json_writer *jw, const char *name,
>>> +                      const struct stat_data *sd)
>>
>> Should this be in json_writer.c or in read-cache.c ?
>> Currently, json_writer.c is concerned with formatting
>> JSON on basic/scalar types.  Do we want to start
>> extending it to handle arbitrary structures?  Or would
>> it be better for the code that defines/manipulates the
>> structure to define a "stat_data_dump_json()" function.
>>
>> I'm torn on the jw_object_filemode() function, JSON format
>> limits us to decimal integers and there are places where
>> I'd like to have hex, or in this case octal values.
>>
>> I'm thinking it'd be better to have a helper function in
>> read-cache.c that formats a local strbuf and calls
>> js_object_string(&jw, key, buf);
> 
> I can move these back to read-cache.c. Though if we have a lot more jw
> helpers like this (hard to tell at the moment) then perhaps we can
> have json-writer-utils.c or something to group them together. That
> keep the "boring" code out of main logic code in read-cache.c and
> other call sites.

yeah, in an utils file or close to the "constructors" of the
structure types.  either one works.

> 
>>> @@ -1952,6 +1953,49 @@ static void *load_index_extensions(void *_data)
>>>        return NULL;
>>>    }
>>>
>>> +static void dump_cache_entry(struct index_state *istate,
>>> +                          int index,
>>> +                          unsigned long offset,
>>> +                          const struct cache_entry *ce)
>>> +{
>>> +     struct json_writer *jw = istate->jw;
>>> +
>>> +     jw_array_inline_begin_object(jw);
>>> +
>>> +     /*
>>> +      * this is technically redundant, but it's for easier
>>> +      * navigation when there hundreds of entries
>>> +      */
>>> +     jw_object_intmax(jw, "id", index);
>>> +
>>> +     jw_object_string(jw, "name", ce->name);
>>> +
>>> +     jw_object_filemode(jw, "mode", ce->ce_mode);
>>> +
>>> +     jw_object_intmax(jw, "flags", ce->ce_flags);
>>
>> It would be nice to have the flags as a hex-formatted string
>> in addition to (or instead of) the decimal integer value.
> 
> I'm not against reformatting it in hex string, but is there a value in
> it? ce_flags is expanded in the code below so that you don't have to
> decode it yourself when you read each entry. The "flags" field here is
> for further processing in tools. I'm trying to see if looking at hex
> values helps, but I'm still not seeing it...
> 

I guess I was thinking of the in-memory bits and thinking
it'd be useful to be able to dump the index immediately
after reading it and then later after some operation or
traversal and see the intermediate states.  But I realize
now that that's not what you're building.  This is a dump
it while you're reading it feature (and that's fine).

So, as long as you have all of the on-disk bits, we should
be fine as you suggest.

Jeff


>>> +     /*
>>> +      * again redundant info, just so you don't have to decode
>>> +      * flags values manually
>>> +      */
>>> +     if (ce->ce_flags & CE_EXTENDED)
>>> +             jw_object_true(jw, "extended_flags");
>>> +     if (ce->ce_flags & CE_VALID)
>>> +             jw_object_true(jw, "assume_unchanged");
>>> +     if (ce->ce_flags & CE_INTENT_TO_ADD)
>>> +             jw_object_true(jw, "intent_to_add");
>>> +     if (ce->ce_flags & CE_SKIP_WORKTREE)
>>> +             jw_object_true(jw, "skip_worktree");
>>> +     if (ce_stage(ce))
>>> +             jw_object_intmax(jw, "stage", ce_stage(ce));

Junio C Hamano June 25, 2019, 10:28 p.m. UTC | #10

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> +		echo 's/\("'$name'":\) [0-9]\+/\1 <number>/' >>filter.sed
>
> This does not do what you think it does, in Ubuntu Xenial and on macOS:
>
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11408&view=ms.vss-test-web.build-test-results-tab&runId=27736&paneView=debug&resultId=105613
>
> The `\1` is expanded to the ASCII character 001. Therefore your test cases
> fail on almost all platforms.
>
> Funnily enough, they pass on Windows...

bash, dash and /bin/echo behave differently given 

    $ echo 'foo \1 bar'

some 'echo' suffer from the "\<n>" interpolation.  Some don't.

I think your spelled-out version downthread (except for stepping out
of BRE which would break your sed script, as you realized) would be
a much readable alternative.

Thanks.

Junio C Hamano June 26, 2019, 7:51 p.m. UTC | #11

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> This --json is supposed to help that. It dumps the index in a human

This and another one on the title need to become "--debug-json".

Are we expecting a reroll to introduce another layer above the most
primitive json writer that knows the schema used to represent both
system standard and our application-specific structures, or is the
current arrangement to have them in json-writer.c until there are
enough of them to warrant such a split good enough?

[v2,01/10] ls-files: add --json to dump the index

Commit Message

Comments

Patch