diff mbox series

[v4,1/4] gitformat-commit-graph: describe version 2 of BDAT

Message ID a3b52af4c96db9a8164119e8ace0541db47b6815.1686677910.git.jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series Changed path filter hash fix and version bump | expand

Commit Message

Jonathan Tan June 13, 2023, 5:39 p.m. UTC
The code change to Git to support version 2 will be done in subsequent
commits.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/gitformat-commit-graph.txt | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Comments

Junio C Hamano June 13, 2023, 9:58 p.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> The code change to Git to support version 2 will be done in subsequent
> commits.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/gitformat-commit-graph.txt | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
> index 31cad585e2..112e6d36a6 100644
> --- a/Documentation/gitformat-commit-graph.txt
> +++ b/Documentation/gitformat-commit-graph.txt
> @@ -142,13 +142,16 @@ All multi-byte numbers are in network byte order.
>  
>  ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>      * It starts with header consisting of three unsigned 32-bit integers:
> -      - Version of the hash algorithm being used. We currently only support
> -	value 1 which corresponds to the 32-bit version of the murmur3 hash
> +      - Version of the hash algorithm being used. We currently support
> +	value 2 which corresponds to the 32-bit version of the murmur3 hash
>  	implemented exactly as described in
>  	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
>  	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
>  	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> -	in Probabilistic Verification"
> +	in Probabilistic Verification". Version 1 bloom filters have a bug that appears

"bloom" -> "Bloom", probably, as the name comes from the name of its
inventor (just like we spell "Boolean", not "boolean").

> +	when char is signed and the repository has path names that have characters >=
> +	0x80; Git supports reading and writing them, but this ability will be removed
> +	in a future version of Git.

Makes sense.

I wonder if we want to mention what the undesired misbehaviour the
"bug" causes and what we do to avoid getting affected by the bug
here.  If we can say something like "When querying for a pathname
with a byte with high-bit set, the buggy filter may produce false
negative, making the filter unusable, but asking for a pathname
without such a byte produces no false negatives (even though we may
get false positives).  When Git reads version 1 filter data, it
refrains from using it for processing paths with high-bit set to
avoid triggering the bug", then it would be ideal.  Or "When the
repository has even a single pathname with high-bit set anywhere in
its history, version 1 Bloom can give false negative when querying
any paths and becomes unusable.  You can use $THIS configuration
variable to disable use of Bloom filter data in such a case" would
also be fine.  The point is to give actionable piece of information
to the readers.

Thanks.
Derrick Stolee June 20, 2023, 1:22 p.m. UTC | #2
On 6/13/2023 5:58 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:

>> +	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
> 
> "bloom" -> "Bloom", probably, as the name comes from the name of its
> inventor (just like we spell "Boolean", not "boolean").

The ultimate recognition comes from when the term named after you
becomes lower-case ("boolean" is sometimes in this category, but
definitely "abelian" is an example).

In this case, you are right that we should capitalize Bloom.

>> +	when char is signed and the repository has path names that have characters >=
>> +	0x80; Git supports reading and writing them, but this ability will be removed
>> +	in a future version of Git.
> 
> Makes sense.

I also like how you organized this: "We support version 2. 1 is
still around but not for long."
 
> I wonder if we want to mention what the undesired misbehaviour the
> "bug" causes and what we do to avoid getting affected by the bug
> here.  If we can say something like "When querying for a pathname
> with a byte with high-bit set, the buggy filter may produce false
> negative, making the filter unusable, but asking for a pathname
> without such a byte produces no false negatives (even though we may
> get false positives).  When Git reads version 1 filter data, it
> refrains from using it for processing paths with high-bit set to
> avoid triggering the bug", then it would be ideal.  Or "When the
> repository has even a single pathname with high-bit set anywhere in
> its history, version 1 Bloom can give false negative when querying
> any paths and becomes unusable.  You can use $THIS configuration
> variable to disable use of Bloom filter data in such a case" would
> also be fine.  The point is to give actionable piece of information
> to the readers.

This is definitely helpful, but if someone is having issues we
would say "try version 2 and see if it still happens" and not
over-index on the underlying reason.

That's to say, I'm OK with the shorter description of the problem.
Feel free to expand if you're interested, though.

Thanks,
-Stolee
Taylor Blau June 21, 2023, 12:08 p.m. UTC | #3
On Tue, Jun 13, 2023 at 02:58:24PM -0700, Junio C Hamano wrote:
> "bloom" -> "Bloom", probably, as the name comes from the name of its
> inventor (just like we spell "Boolean", not "boolean").

Indeed.

> > +	when char is signed and the repository has path names that have characters >=
> > +	0x80; Git supports reading and writing them, but this ability will be removed
> > +	in a future version of Git.
>
> Makes sense.
>
> I wonder if we want to mention what the undesired misbehaviour the
> "bug" causes and what we do to avoid getting affected by the bug
> here.  If we can say something like "When querying for a pathname
> with a byte with high-bit set, the buggy filter may produce false
> negative, making the filter unusable, but asking for a pathname
> without such a byte produces no false negatives (even though we may
> get false positives).  When Git reads version 1 filter data, it
> refrains from using it for processing paths with high-bit set to
> avoid triggering the bug", then it would be ideal.

Your description of the bug matches my understanding of the issue, that
a corrupt filter would produce false negatives and thus be unusable.

I skimmed through the rest of the series, and couldn't find a spot where
we do the latter, i.e. still use v1 filters as long as we don't have any
characters in the path with high-order bits set.

I think this would be as simple as modifying the Bloom filter query
function to return "maybe" before even trying to hash a path with at
least one character with its high-bit set.

Apologies if this functionality is implemented and I just missed it.

Thanks,
Taylor
Jonathan Tan June 22, 2023, 10:26 p.m. UTC | #4
Taylor Blau <me@ttaylorr.com> writes:
> > I wonder if we want to mention what the undesired misbehaviour the
> > "bug" causes and what we do to avoid getting affected by the bug
> > here.  If we can say something like "When querying for a pathname
> > with a byte with high-bit set, the buggy filter may produce false
> > negative, making the filter unusable, but asking for a pathname
> > without such a byte produces no false negatives (even though we may
> > get false positives).  When Git reads version 1 filter data, it
> > refrains from using it for processing paths with high-bit set to
> > avoid triggering the bug", then it would be ideal.
> 
> Your description of the bug matches my understanding of the issue, that
> a corrupt filter would produce false negatives and thus be unusable.
> 
> I skimmed through the rest of the series, and couldn't find a spot where
> we do the latter, i.e. still use v1 filters as long as we don't have any
> characters in the path with high-order bits set.
> 
> I think this would be as simple as modifying the Bloom filter query
> function to return "maybe" before even trying to hash a path with at
> least one character with its high-bit set.
> 
> Apologies if this functionality is implemented and I just missed it.
> 
> Thanks,
> Taylor

Thanks for the suggestion - yeah, this might work.
Derrick Stolee June 23, 2023, 1:05 p.m. UTC | #5
On 6/22/2023 6:26 PM, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>>> I wonder if we want to mention what the undesired misbehaviour the
>>> "bug" causes and what we do to avoid getting affected by the bug
>>> here.  If we can say something like "When querying for a pathname
>>> with a byte with high-bit set, the buggy filter may produce false
>>> negative, making the filter unusable, but asking for a pathname
>>> without such a byte produces no false negatives (even though we may
>>> get false positives).  When Git reads version 1 filter data, it
>>> refrains from using it for processing paths with high-bit set to
>>> avoid triggering the bug", then it would be ideal.
>>
>> Your description of the bug matches my understanding of the issue, that
>> a corrupt filter would produce false negatives and thus be unusable.
>>
>> I skimmed through the rest of the series, and couldn't find a spot where
>> we do the latter, i.e. still use v1 filters as long as we don't have any
>> characters in the path with high-order bits set.
>>
>> I think this would be as simple as modifying the Bloom filter query
>> function to return "maybe" before even trying to hash a path with at
>> least one character with its high-bit set.
>>
>> Apologies if this functionality is implemented and I just missed it.
>>
>> Thanks,
>> Taylor
> 
> Thanks for the suggestion - yeah, this might work.

If I understand the situation correctly, the high bits can make the
hashes "not very random" but they are still effective at identifying
the "maybe" case consistently for the inputs it is given (it would not
present a "no" when it should not, but it might say "maybe" more often
than it should). The behavior is only incorrect if the same commit-graph
file is used with two different Git versions that were compiled with
different signed-ness.

If that is the case, then ignoring the Bloom filters when you see a
high bit would change the performance implication from "probably
slower" to "definitely slower" but not affect the correctness in a
system that doesn't have competing Git versions with different
compiler semantics.

That is to say, doing this extra work doesn't seem to be critical to
making this change. The ROI seems too low.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/Documentation/gitformat-commit-graph.txt b/Documentation/gitformat-commit-graph.txt
index 31cad585e2..112e6d36a6 100644
--- a/Documentation/gitformat-commit-graph.txt
+++ b/Documentation/gitformat-commit-graph.txt
@@ -142,13 +142,16 @@  All multi-byte numbers are in network byte order.
 
 ==== Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
     * It starts with header consisting of three unsigned 32-bit integers:
-      - Version of the hash algorithm being used. We currently only support
-	value 1 which corresponds to the 32-bit version of the murmur3 hash
+      - Version of the hash algorithm being used. We currently support
+	value 2 which corresponds to the 32-bit version of the murmur3 hash
 	implemented exactly as described in
 	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
 	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
 	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
-	in Probabilistic Verification"
+	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
+	when char is signed and the repository has path names that have characters >=
+	0x80; Git supports reading and writing them, but this ability will be removed
+	in a future version of Git.
       - The number of times a path is hashed and hence the number of bit positions
 	      that cumulatively determine whether a file is present in the commit.
       - The minimum number of bits 'b' per entry in the Bloom filter. If the filter