[2/2] date.c: allow compact version of ISO-8601 datetime

Message ID	06e62c58d5accad7bbebbc51f9fb38fda83a73f6.1586856398.git.congdanhqx@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=cd4n=56=vger.kernel.org=git-owner@kernel.org> From: =?utf-8?b?xJBvw6BuIFRy4bqnbiBDw7RuZyBEYW5o?= <congdanhqx@gmail.com> To: git@vger.kernel.org Cc: =?utf-8?b?xJBvw6BuIFRy4bqnbiBDw7RuZyBEYW5o?= <congdanhqx@gmail.com> Subject: [PATCH 2/2] date.c: allow compact version of ISO-8601 datetime Date: Tue, 14 Apr 2020 16:31:55 +0700 Message-Id: <06e62c58d5accad7bbebbc51f9fb38fda83a73f6.1586856398.git.congdanhqx@gmail.com> In-Reply-To: <cover.1586856398.git.congdanhqx@gmail.com> References: <20200414000324.GA2751707@camp.crustytoothpaste.net> <cover.1586856398.git.congdanhqx@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	More ISO-8601 support \| expand [0/2] More ISO-8601 support [1/2] date.c: allow fractional second part of ISO-8601 [2/2] date.c: allow compact version of ISO-8601 datetime

Đoàn Trần Công Danh April 14, 2020, 9:31 a.m. UTC

Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>
---
 date.c          | 18 ++++++++++++++++++
 t/t0006-date.sh |  3 +++
 2 files changed, 21 insertions(+)

Jeff King April 14, 2020, 8:24 p.m. UTC | #1

On Tue, Apr 14, 2020 at 04:31:55PM +0700, Đoàn Trần Công Danh wrote:

> @@ -666,6 +666,24 @@ static int match_digit(const char *date, struct tm *tm, int *offset, int *tm_gmt
>  		n++;
>  	} while (isdigit(date[n]));
>  
> +	/* 8 digits, compact style of ISO-8601's date: YYYYmmDD */
> +	if (n == 8) {
> +		tm->tm_year = num / 10000 - 1900;
> +		tm->tm_mon = (num % 10000) / 100 - 1;
> +		tm->tm_mday = num % 100;
> +		return n;
> +	}

I worry a little this may conflict with other approxidate heuristics.
The only one I can think of is an actual unix timestamp, though, and we
already require that to have at least 9 digits (plus anybody wanting to
use one robustly really should be using @12345678).

And it looks like we'd exit early from the function for anything longer
than 4 digits anyway, ignoring the value.

We could probably tighten the heuristics a bit by insisting that the
month and day be sensible. Or even year (see the 1900 to 2100 magic for
the 4-digit year guess).

> +	/* 6 digits, compact style of ISO-8601's time: HHMMSS */
> +	if (n == 6) {
> +		tm->tm_hour = num / 10000;
> +		tm->tm_min = (num % 10000) / 100;
> +		tm->tm_sec = num % 100;
> +		if (*end == '.' && isdigit(end[1]))
> +			strtoul(end + 1, &end, 10);
> +		return end - date;
> +	}

And likewise here that the hour, minute, and second be reasonable.

-Peff

Đoàn Trần Công Danh April 15, 2020, 2:12 a.m. UTC | #2

On 2020-04-14 16:24:01-0400, Jeff King <peff@peff.net> wrote:
> On Tue, Apr 14, 2020 at 04:31:55PM +0700, Đoàn Trần Công Danh wrote:
> 
> > @@ -666,6 +666,24 @@ static int match_digit(const char *date, struct tm *tm, int *offset, int *tm_gmt
> >  		n++;
> >  	} while (isdigit(date[n]));
> >  
> > +	/* 8 digits, compact style of ISO-8601's date: YYYYmmDD */
> > +	if (n == 8) {
> > +		tm->tm_year = num / 10000 - 1900;
> > +		tm->tm_mon = (num % 10000) / 100 - 1;
> > +		tm->tm_mday = num % 100;
> > +		return n;
> > +	}
> 
> I worry a little this may conflict with other approxidate heuristics.
> The only one I can think of is an actual unix timestamp, though, and we
> already require that to have at least 9 digits (plus anybody wanting to
> use one robustly really should be using @12345678).
> 
> And it looks like we'd exit early from the function for anything longer
> than 4 digits anyway, ignoring the value.
> 
> We could probably tighten the heuristics a bit by insisting that the
> month and day be sensible. Or even year (see the 1900 to 2100 magic for
> the 4-digit year guess).

Yeah, It's make sense to tighten the heuristics.
While 1900 is lower bound for year makes sense to me,
but I don't think we should limit upper bound for tm_year.

> > +	/* 6 digits, compact style of ISO-8601's time: HHMMSS */
> > +	if (n == 6) {
> > +		tm->tm_hour = num / 10000;
> > +		tm->tm_min = (num % 10000) / 100;
> > +		tm->tm_sec = num % 100;
> > +		if (*end == '.' && isdigit(end[1]))
> > +			strtoul(end + 1, &end, 10);
> > +		return end - date;
> > +	}
> 
> And likewise here that the hour, minute, and second be reasonable.
> 
> -Peff

Junio C Hamano April 15, 2020, 3:03 p.m. UTC | #3

Jeff King <peff@peff.net> writes:

> On Tue, Apr 14, 2020 at 04:31:55PM +0700, Đoàn Trần Công Danh wrote:
>
>> @@ -666,6 +666,24 @@ static int match_digit(const char *date, struct tm *tm, int *offset, int *tm_gmt
>>  		n++;
>>  	} while (isdigit(date[n]));
>>  
>> +	/* 8 digits, compact style of ISO-8601's date: YYYYmmDD */
>> +	if (n == 8) {
>> +		tm->tm_year = num / 10000 - 1900;
>> +		tm->tm_mon = (num % 10000) / 100 - 1;
>> +		tm->tm_mday = num % 100;
>> +		return n;
>> +	}
>
> I worry a little this may conflict with other approxidate heuristics.
>
> The only one I can think of is an actual unix timestamp, though, and we
> already require that to have at least 9 digits (plus anybody wanting to
> use one robustly really should be using @12345678).

I am OK with 1/2, but I'd worry a LOT about this one, if we didn't
require 9 digits.  60/100 * 60/100 * 24/100 ~= 8.6% so limiting the
hour/min/sec to sensible values is not a useful protection against
ambiguity.  We'd essentially be declaring that a raw UNIX timestamp
without @ prefix is now hit-and-miss if we take this change, were
there not the "must be at least 9 digits" requirement.

Somebody was thinking when he wrote the very beginning part of the
match_digit() function ;-)

> We could probably tighten the heuristics a bit by insisting that the
> month and day be sensible. Or even year (see the 1900 to 2100 magic for
> the 4-digit year guess).

I do agree that we'd want 0 <= hr <= 24, 0 <= min <= 59, and 0 <=
sec <= 60 (or should this be 61?), but it is for correctness of the
new code.  It wouldn't have any value in disambiguation from other
formats, I would think.  So, from that point of view, I would buy
something like 1969 as a lower bound, but I am not sure if we
necessarily want an upper bound for the year.  What mistake are we
protecting us against?

Thanks.

>> +	/* 6 digits, compact style of ISO-8601's time: HHMMSS */
>> +	if (n == 6) {
>> +		tm->tm_hour = num / 10000;
>> +		tm->tm_min = (num % 10000) / 100;
>> +		tm->tm_sec = num % 100;
>> +		if (*end == '.' && isdigit(end[1]))
>> +			strtoul(end + 1, &end, 10);
>> +		return end - date;
>> +	}
>
> And likewise here that the hour, minute, and second be reasonable.
>
> -Peff

Jeff King April 15, 2020, 3:41 p.m. UTC | #4

On Wed, Apr 15, 2020 at 08:03:11AM -0700, Junio C Hamano wrote:

> > I worry a little this may conflict with other approxidate heuristics.
> >
> > The only one I can think of is an actual unix timestamp, though, and we
> > already require that to have at least 9 digits (plus anybody wanting to
> > use one robustly really should be using @12345678).
> 
> I am OK with 1/2, but I'd worry a LOT about this one, if we didn't
> require 9 digits.  60/100 * 60/100 * 24/100 ~= 8.6% so limiting the
> hour/min/sec to sensible values is not a useful protection against
> ambiguity.  We'd essentially be declaring that a raw UNIX timestamp
> without @ prefix is now hit-and-miss if we take this change, were
> there not the "must be at least 9 digits" requirement.

Yeah, that was exactly why I poked at it.

Curiously, a1a5a6347b (Accept dates before 2000/01/01 when specified as
seconds since the epoch, 2007-06-06), the commit which introduced the
9-digit rule, specifically says:

  There is now still a limit, 100000000, which is 1973/03/03 09:46:40
  UTC, in order to allow that dates are represented as 8 digits.

but running "test-date 20100403" even back at that commit does not seem
to work (nor does it work now).

Doubly curious, some nearby code blames to 9f2b6d2936 (date/time: do not
get confused by fractional seconds, 2008-08-16). So why don't fractional
seconds work right now?

They sure don't seem to:

  [the 17 becomes a day-of-month]
  $ t/helper/test-tool date approxidate 12:43:50.17
  12:43:50.17 -> 2020-04-17 16:43:50 +0000

But I wonder if the new patch breaks the example from that commit
message, which does work now:

  [current version; the 17 attaches to "days ago"]
  $ t/helper/test-tool date approxidate 12:43:50.17.days.ago
  12:43:50.17.days.ago -> 2020-03-29 16:43:50 +0000

It looks like the answer is yes:

  [with patch 1/2 applied; now it's a fractional second]
  $ t/helper/test-tool date approxidate 12:43:50.17.days.ago
  12:43:50.17.days.ago -> 2020-04-15 16:43:50 +0000

TBH I'm not sure how big a deal it is. The input is rather ambiguous for
a strict left-to-right parser, and I'm not sure this case is worth doing
more look-ahead parsing. But it's worth making a conscious decision
about it.

One alternative would be to restrict the fractional second handling only
to the parse_date_basic(), which is quite a bit stricter. It shares
match_digit() with approxidate(), so we'd probably have to pass in an
extra flag to match_digit() to change the rules. It also means that:

  2020-04-14T12:43:50.17

would parse a fractional second, but:

  yesterday at 12:43:50.17

wouldn't.

> > We could probably tighten the heuristics a bit by insisting that the
> > month and day be sensible. Or even year (see the 1900 to 2100 magic for
> > the 4-digit year guess).
> 
> I do agree that we'd want 0 <= hr <= 24, 0 <= min <= 59, and 0 <=
> sec <= 60 (or should this be 61?), but it is for correctness of the
> new code.  It wouldn't have any value in disambiguation from other
> formats, I would think.  So, from that point of view, I would buy
> something like 1969 as a lower bound, but I am not sure if we
> necessarily want an upper bound for the year.  What mistake are we
> protecting us against?

No particular mistake. I was just suggesting to keep the heuristics in
the new code as tight as possible to leave room for later changes (since
approxidate by its nature is heuristics and hacks piled atop each
other).

I do agree that year bounds are a questionable restriction. Right now
Git's date code can only handle dates between 1970 and 2099, but it
would be nice to change that.

-Peff

Junio C Hamano April 15, 2020, 3:58 p.m. UTC | #5

Jeff King <peff@peff.net> writes:

> But I wonder if the new patch breaks the example from that commit
> message, which does work now:
>
>   [current version; the 17 attaches to "days ago"]
>   $ t/helper/test-tool date approxidate 12:43:50.17.days.ago
>   12:43:50.17.days.ago -> 2020-03-29 16:43:50 +0000
>
> It looks like the answer is yes:
>
>   [with patch 1/2 applied; now it's a fractional second]
>   $ t/helper/test-tool date approxidate 12:43:50.17.days.ago
>   12:43:50.17.days.ago -> 2020-04-15 16:43:50 +0000
>
> TBH I'm not sure how big a deal it is. The input is rather ambiguous for
> a strict left-to-right parser, and I'm not sure this case is worth doing
> more look-ahead parsing. But it's worth making a conscious decision
> about it.

True.  I do agree that this would qualify as a regression, but I do
not think we are breaking an important use case here.

> One alternative would be to restrict the fractional second handling only
> to the parse_date_basic(), which is quite a bit stricter. It shares
> match_digit() with approxidate(), so we'd probably have to pass in an
> extra flag to match_digit() to change the rules. It also means that:
>
>   2020-04-14T12:43:50.17
>
> would parse a fractional second, but:
>
>   yesterday at 12:43:50.17
>
> wouldn't.

That sounds less useful.  Somehow "yesterday at 12.43:50.17" looks a
lot more plausible real-user input that we would want to support
better than "12:43:50.17.day.ago".

> I do agree that year bounds are a questionable restriction. Right now
> Git's date code can only handle dates between 1970 and 2099, but it
> would be nice to change that.

Sure.  In both directions.

Đoàn Trần Công Danh April 16, 2020, 11:16 a.m. UTC | #6

On 2020-04-15 11:41:57-0400, Jeff King <peff@peff.net> wrote:
> On Wed, Apr 15, 2020 at 08:03:11AM -0700, Junio C Hamano wrote:
> 
> > > I worry a little this may conflict with other approxidate heuristics.
> > >
> > > The only one I can think of is an actual unix timestamp, though, and we
> > > already require that to have at least 9 digits (plus anybody wanting to
> > > use one robustly really should be using @12345678).
> > 
> > I am OK with 1/2, but I'd worry a LOT about this one, if we didn't
> > require 9 digits.  60/100 * 60/100 * 24/100 ~= 8.6% so limiting the
> > hour/min/sec to sensible values is not a useful protection against
> > ambiguity.  We'd essentially be declaring that a raw UNIX timestamp
> > without @ prefix is now hit-and-miss if we take this change, were
> > there not the "must be at least 9 digits" requirement.
> 
> Yeah, that was exactly why I poked at it.
> 
> Curiously, a1a5a6347b (Accept dates before 2000/01/01 when specified as
> seconds since the epoch, 2007-06-06), the commit which introduced the
> 9-digit rule, specifically says:
> 
>   There is now still a limit, 100000000, which is 1973/03/03 09:46:40
>   UTC, in order to allow that dates are represented as 8 digits.
> 
> but running "test-date 20100403" even back at that commit does not seem
> to work (nor does it work now).

check-parse 2010-04-03 doesn't work either.
check-approxiate is going through different code path via
approxidate_digit, but I'm not really sure which action should be take

> Doubly curious, some nearby code blames to 9f2b6d2936 (date/time: do not
> get confused by fractional seconds, 2008-08-16). So why don't fractional
> seconds work right now?

It doesn't work in that time, either:

	$ git checkout 9f2b6d2936
	$ make
	$ ./git --version
	git version 1.6.0.3.g9f2b6
	$ GIT_AUTHOR_DATE='2019-04-16 00:02:03.19-04:00' ./git-commit --allow-empty -m msg
	$ ./git-log --pretty=format:%aD -1
	Fri, 19 Apr 2019 00:02:03 +0700

The different is whether we stick the timezone (west of UTC) into the
fractional or not. If we have both those conditions:

- Specify the timezone west of UTC
- Stick it with the fractional part
- The fractional part is parsed to  number less than 32

The the parser will think it's US style of month/day

Judging from the test code, it's likely people that write the code and
the test get used to separate time and timezone.

And it's unlikely anyone will  write fractional parts with only
1 digit (yield 100% positive with sticking timezone), or 2 digits
(which yields 32% positive with sticking timezone),
for 3 digits, the positive rate is 3.2%

> 
> They sure don't seem to:
> 
>   [the 17 becomes a day-of-month]
>   $ t/helper/test-tool date approxidate 12:43:50.17
>   12:43:50.17 -> 2020-04-17 16:43:50 +0000
> 
> But I wonder if the new patch breaks the example from that commit
> message, which does work now:
> 
>   [current version; the 17 attaches to "days ago"]
>   $ t/helper/test-tool date approxidate 12:43:50.17.days.ago
>   12:43:50.17.days.ago -> 2020-03-29 16:43:50 +0000
> 
> It looks like the answer is yes:
> 
>   [with patch 1/2 applied; now it's a fractional second]
>   $ t/helper/test-tool date approxidate 12:43:50.17.days.ago
>   12:43:50.17.days.ago -> 2020-04-15 16:43:50 +0000
> 
> TBH I'm not sure how big a deal it is. The input is rather ambiguous for
> a strict left-to-right parser, and I'm not sure this case is worth doing
> more look-ahead parsing. But it's worth making a conscious decision
> about it.

Hm, I don't think "12:43:50.17.days.ago" is a normal input from user
but I think "12:43:50.3.days.ago" on Monday wouldn't have much less
user when compare with "yesterday at 12:43:50.17"

> One alternative would be to restrict the fractional second handling only
> to the parse_date_basic(), which is quite a bit stricter. It shares
> match_digit() with approxidate(), so we'd probably have to pass in an
> extra flag to match_digit() to change the rules. It also means that:
> 
>   2020-04-14T12:43:50.17
> 
> would parse a fractional second, but:
> 
>   yesterday at 12:43:50.17
> 
> wouldn't.
> 
> > > We could probably tighten the heuristics a bit by insisting that the
> > > month and day be sensible. Or even year (see the 1900 to 2100 magic for
> > > the 4-digit year guess).
> > 
> > I do agree that we'd want 0 <= hr <= 24, 0 <= min <= 59, and 0 <=
> > sec <= 60 (or should this be 61?), but it is for correctness of the
> > new code.  It wouldn't have any value in disambiguation from other
> > formats, I would think.  So, from that point of view, I would buy
> > something like 1969 as a lower bound, but I am not sure if we
> > necessarily want an upper bound for the year.  What mistake are we
> > protecting us against?
> 
> No particular mistake. I was just suggesting to keep the heuristics in
> the new code as tight as possible to leave room for later changes (since
> approxidate by its nature is heuristics and hacks piled atop each
> other).
> 
> I do agree that year bounds are a questionable restriction. Right now
> Git's date code can only handle dates between 1970 and 2099, but it
> would be nice to change that.

2100 is used inside is_date, which is the place we try to guess what
     does those number could mean in multiple format.

I don't think we have ambiguity when parsing ISO-8601 compact date.

[2/2] date.c: allow compact version of ISO-8601 datetime

Commit Message

Comments

Patch