diff mbox series

test-tool: bloom: generate_filter for multiple string?

Message ID 20201231035438.22533-1-congdanhqx@gmail.com (mailing list archive)
State New, archived
Headers show
Series test-tool: bloom: generate_filter for multiple string? | expand

Commit Message

Đoàn Trần Công Danh Dec. 31, 2020, 3:54 a.m. UTC
Hello,

I'm reading the code for Bloom Filter to see if arXiv:2012.00472
could be an improvement.

I'm not sure if I'm missing something or it's our test-bloom generate_filter
doesn't really support testing for multiple inputs.

If I understand correctly, we should either:
* allocate more entry for inputs; or
* abort if argc != 3

Comments

Derrick Stolee Dec. 31, 2020, 11:31 a.m. UTC | #1
On 12/30/2020 10:54 PM, Đoàn Trần Công Danh wrote:
> 
> Hello,
> 
> I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> could be an improvement.
> 
> I'm not sure if I'm missing something or it's our test-bloom generate_filter
> doesn't really support testing for multiple inputs.

It doesn't support creating a _realistic_ filter of the given size. The tests
at this level are more about keeping the filters consistent so we can trust
that we are not accidentally changing the hashing algorithm and its file format.

The test works if we provide multiple inputs, the problem is that the resulting
filter has a higher density of bits than we expect, since we didn't size it
according to the number of inputs.

> If I understand correctly, we should either:
> * allocate more entry for inputs; or
> * abort if argc != 3

Either approach would be fine. I wonder what your goal is for testing the
multiple inputs. Are you expecting a larger filter to demonstrate some
behavior that is not tested by a small filter?

> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 46e97b04eb..1026facc59 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv)
>  	if (!strcmp(argv[1], "generate_filter")) {
>  		struct bloom_filter filter;
>  		int i = 2;
> -		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> -		filter.data = xcalloc(filter.len, sizeof(unsigned char));
>  
>  		if (argc - 1 < i)
>  			usage(bloom_usage);
>  
> +		filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1)
> +			/ BITS_PER_WORD;
> +		filter.data = xcalloc(filter.len, sizeof(unsigned char));
> +

Whitespace issues aside, this is the right approach to make the filter grow with
the input.

>  		while (argv[i]) {
>  			add_string_to_filter(argv[i], &filter);
>  			i++;
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 7e4ab1795f..6d83927c86 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'compute bloom key for multiple string' '
> +	cat >expect <<-\EOF &&
> +	Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d|
> +	Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585|

Multiple hash outputs work already, helping us build confidence that our
hash algorithm hasn't changed.

> +	Filter_Length:3
> +	Filter_Data:45|ba|ac|

And this is the part of the output that you are changing.

> +	EOF
> +	test-tool bloom generate_filter "Hello world!" file.txt >actual &&
> +	test_cmp expect actual
> +'
> +

So, your suggested change has merit. I just wonder what value it provides on top
of the current implementation? If your goal is to create a way to inspect a full-sized
filter, then that would be interesting to explore.

Thanks,
-Stolee
Đoàn Trần Công Danh Jan. 5, 2021, 1:34 p.m. UTC | #2
On 2020-12-31 06:31:52-0500, Derrick Stolee <stolee@gmail.com> wrote:
> On 12/30/2020 10:54 PM, Đoàn Trần Công Danh wrote:
> > I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> > could be an improvement.
> > 
> > I'm not sure if I'm missing something or it's our test-bloom generate_filter
> > doesn't really support testing for multiple inputs.
> 
> It doesn't support creating a _realistic_ filter of the given size. The tests
> at this level are more about keeping the filters consistent so we can trust
> that we are not accidentally changing the hashing algorithm and its file format.

(Sorry for this late reply)

Yes, make sense.

> The test works if we provide multiple inputs, the problem is that the resulting
> filter has a higher density of bits than we expect, since we didn't size it
> according to the number of inputs.

OK.

> > If I understand correctly, we should either:
> > * allocate more entry for inputs; or
> > * abort if argc != 3
> 
> Either approach would be fine. I wonder what your goal is for testing the
> multiple inputs. Are you expecting a larger filter to demonstrate some
> behavior that is not tested by a small filter?

I was expecting a larger filter to trying with some ideas that I had
in mind before implement it in real code.
However, after toying with my idea, I think it's more of a solution
looking for problem.
(The original post was posted before I realised that).

> > diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> > index 46e97b04eb..1026facc59 100644
> > --- a/t/helper/test-bloom.c
> > +++ b/t/helper/test-bloom.c
> > @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv)
> >  	if (!strcmp(argv[1], "generate_filter")) {
> >  		struct bloom_filter filter;
> >  		int i = 2;
> > -		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> > -		filter.data = xcalloc(filter.len, sizeof(unsigned char));
> >  
> >  		if (argc - 1 < i)
> >  			usage(bloom_usage);
> >  
> > +		filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1)
> > +			/ BITS_PER_WORD;
> > +		filter.data = xcalloc(filter.len, sizeof(unsigned char));
> > +
> 
> Whitespace issues aside, this is the right approach to make the filter grow with
> the input.
> 
> >  		while (argv[i]) {
> >  			add_string_to_filter(argv[i], &filter);
> >  			i++;
> > diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> > index 7e4ab1795f..6d83927c86 100755
> > --- a/t/t0095-bloom.sh
> > +++ b/t/t0095-bloom.sh
> > @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' '
> >  	test_cmp expect actual
> >  '
> >  
> > +test_expect_success 'compute bloom key for multiple string' '
> > +	cat >expect <<-\EOF &&
> > +	Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d|
> > +	Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585|
> 
> Multiple hash outputs work already, helping us build confidence that our
> hash algorithm hasn't changed.
> 
> > +	Filter_Length:3
> > +	Filter_Data:45|ba|ac|
> 
> And this is the part of the output that you are changing.
> 
> > +	EOF
> > +	test-tool bloom generate_filter "Hello world!" file.txt >actual &&
> > +	test_cmp expect actual
> > +'
> > +
> 
> So, your suggested change has merit. I just wonder what value it provides on top
> of the current implementation? If your goal is to create a way to inspect a full-sized
> filter, then that would be interesting to explore.

Yes, originally, I was trying to creat some way to inspect a full-size
filter, i.e. a bloom filter for some set of object_id during fetch
negotiation.
But, I realise it's not a good idea, now.
Taylor Blau Jan. 12, 2021, 7:53 p.m. UTC | #3
On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote:
>
> Hello,
>
> I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> could be an improvement.

I'm late to the party, but I'm curious to hear which part of this
article you think would help out the Bloom filter implementation.

(I skimmed the article you're referencing during my time off, but
haven't come back to read it in detail since, so it's possible that the
part I'm missing is just totally obvious.)

> I'm not sure if I'm missing something or it's our test-bloom generate_filter
> doesn't really support testing for multiple inputs.

The rest of this discussion downthread seems sane to me.

Thanks,
Taylor
Đoàn Trần Công Danh Jan. 13, 2021, 11:59 a.m. UTC | #4
On 2021-01-12 14:53:32-0500, Taylor Blau <me@ttaylorr.com> wrote:
> On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote:
> > I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> > could be an improvement.
> 
> I'm late to the party, but I'm curious to hear which part of this
> article you think would help out the Bloom filter implementation.

Uhm, no. The article doesn't help the Bloom filter implementation.
The article was suggesting using Bloom filter to speed-up the
negotiation in fetch-pack and upload-pack. Which, in my own quick
experience, doesn't help much. Maybe it's me not understand the
article idea or I have made a naive implementation. However, I'm not
convinced to pursued further.

If you are curious, I'm attaching 2 quick-and-low-quality patches with
this email for your consideration.
Derrick Stolee Jan. 13, 2021, 12:06 p.m. UTC | #5
On 1/13/2021 6:59 AM, Đoàn Trần Công Danh wrote:
> On 2021-01-12 14:53:32-0500, Taylor Blau <me@ttaylorr.com> wrote:
>> On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote:
>>> I'm reading the code for Bloom Filter to see if arXiv:2012.00472
>>> could be an improvement.
>>
>> I'm late to the party, but I'm curious to hear which part of this
>> article you think would help out the Bloom filter implementation.
> 
> Uhm, no. The article doesn't help the Bloom filter implementation.
> The article was suggesting using Bloom filter to speed-up the
> negotiation in fetch-pack and upload-pack. Which, in my own quick
> experience, doesn't help much. Maybe it's me not understand the
> article idea or I have made a naive implementation. However, I'm not
> convinced to pursued further.

I saw that idea, and was immediately skeptical. Bloom filters still
have size linear to the size of the set. By using a Bloom filter to
describe "these are ALL the objects I have" instead of "these are
the TIPS I have" the size will explode to be much larger than our
current negotiation algorithm.

Thanks,
-Stolee
Đoàn Trần Công Danh Jan. 13, 2021, 12:13 p.m. UTC | #6
On 2021-01-13 07:06:59-0500, Derrick Stolee <stolee@gmail.com> wrote:
> On 1/13/2021 6:59 AM, Đoàn Trần Công Danh wrote:
> > On 2021-01-12 14:53:32-0500, Taylor Blau <me@ttaylorr.com> wrote:
> >> On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote:
> >>> I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> >>> could be an improvement.
> >>
> >> I'm late to the party, but I'm curious to hear which part of this
> >> article you think would help out the Bloom filter implementation.
> > 
> > Uhm, no. The article doesn't help the Bloom filter implementation.
> > The article was suggesting using Bloom filter to speed-up the
> > negotiation in fetch-pack and upload-pack. Which, in my own quick
> > experience, doesn't help much. Maybe it's me not understand the
> > article idea or I have made a naive implementation. However, I'm not
> > convinced to pursued further.
> 
> I saw that idea, and was immediately skeptical. Bloom filters still
> have size linear to the size of the set. By using a Bloom filter to
> describe "these are ALL the objects I have" instead of "these are
> the TIPS I have" the size will explode to be much larger than our
> current negotiation algorithm.

Yes, I saw that problem, too.

My implementation was trying to use it from the second round,
downstream will say: these are ALL the objects I have from our common
objects.

The result is not much different, though.

Thanks,
Taylor Blau Jan. 13, 2021, 3:13 p.m. UTC | #7
On Wed, Jan 13, 2021 at 06:59:52PM +0700, Đoàn Trần Công Danh wrote:
> On 2021-01-12 14:53:32-0500, Taylor Blau <me@ttaylorr.com> wrote:
> > On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote:
> > > I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> > > could be an improvement.
> >
> > I'm late to the party, but I'm curious to hear which part of this
> > article you think would help out the Bloom filter implementation.
>
> Uhm, no. The article doesn't help the Bloom filter implementation.
> The article was suggesting using Bloom filter to speed-up the
> negotiation in fetch-pack and upload-pack. Which, in my own quick
> experience, doesn't help much. Maybe it's me not understand the
> article idea or I have made a naive implementation. However, I'm not
> convinced to pursued further.

I see. I read your "reading the code for Bloom Filter to see if ...
could be an improvement" as trying to improve the Bloom implementation.
Which after skimming the article, made me quite curious, since I didn't
understand what you were getting at.

But trying to speed up the negotiation makes sense, and is in line with
the goal of the article. It's too bad that you weren't able to produce
the same benefits here, but I understand why.

> If you are curious, I'm attaching 2 quick-and-low-quality patches with
> this email for your consideration.

Thanks. They were an interesting read.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 46e97b04eb..1026facc59 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -68,12 +68,14 @@  int cmd__bloom(int argc, const char **argv)
 	if (!strcmp(argv[1], "generate_filter")) {
 		struct bloom_filter filter;
 		int i = 2;
-		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
-		filter.data = xcalloc(filter.len, sizeof(unsigned char));
 
 		if (argc - 1 < i)
 			usage(bloom_usage);
 
+		filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1)
+			/ BITS_PER_WORD;
+		filter.data = xcalloc(filter.len, sizeof(unsigned char));
+
 		while (argv[i]) {
 			add_string_to_filter(argv[i], &filter);
 			i++;
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 7e4ab1795f..6d83927c86 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -67,6 +67,17 @@  test_expect_success 'compute bloom key for test string 2' '
 	test_cmp expect actual
 '
 
+test_expect_success 'compute bloom key for multiple string' '
+	cat >expect <<-\EOF &&
+	Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d|
+	Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585|
+	Filter_Length:3
+	Filter_Data:45|ba|ac|
+	EOF
+	test-tool bloom generate_filter "Hello world!" file.txt >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'get bloom filters for commit with no changes' '
 	git init &&
 	git commit --allow-empty -m "c0" &&