Message ID | 20201231035438.22533-1-congdanhqx@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | test-tool: bloom: generate_filter for multiple string? | expand |
On 12/30/2020 10:54 PM, Đoàn Trần Công Danh wrote: > > Hello, > > I'm reading the code for Bloom Filter to see if arXiv:2012.00472 > could be an improvement. > > I'm not sure if I'm missing something or it's our test-bloom generate_filter > doesn't really support testing for multiple inputs. It doesn't support creating a _realistic_ filter of the given size. The tests at this level are more about keeping the filters consistent so we can trust that we are not accidentally changing the hashing algorithm and its file format. The test works if we provide multiple inputs, the problem is that the resulting filter has a higher density of bits than we expect, since we didn't size it according to the number of inputs. > If I understand correctly, we should either: > * allocate more entry for inputs; or > * abort if argc != 3 Either approach would be fine. I wonder what your goal is for testing the multiple inputs. Are you expecting a larger filter to demonstrate some behavior that is not tested by a small filter? > diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c > index 46e97b04eb..1026facc59 100644 > --- a/t/helper/test-bloom.c > +++ b/t/helper/test-bloom.c > @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv) > if (!strcmp(argv[1], "generate_filter")) { > struct bloom_filter filter; > int i = 2; > - filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD; > - filter.data = xcalloc(filter.len, sizeof(unsigned char)); > > if (argc - 1 < i) > usage(bloom_usage); > > + filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1) > + / BITS_PER_WORD; > + filter.data = xcalloc(filter.len, sizeof(unsigned char)); > + Whitespace issues aside, this is the right approach to make the filter grow with the input. > while (argv[i]) { > add_string_to_filter(argv[i], &filter); > i++; > diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh > index 7e4ab1795f..6d83927c86 100755 > --- a/t/t0095-bloom.sh > +++ b/t/t0095-bloom.sh > @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' ' > test_cmp expect actual > ' > > +test_expect_success 'compute bloom key for multiple string' ' > + cat >expect <<-\EOF && > + Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d| > + Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585| Multiple hash outputs work already, helping us build confidence that our hash algorithm hasn't changed. > + Filter_Length:3 > + Filter_Data:45|ba|ac| And this is the part of the output that you are changing. > + EOF > + test-tool bloom generate_filter "Hello world!" file.txt >actual && > + test_cmp expect actual > +' > + So, your suggested change has merit. I just wonder what value it provides on top of the current implementation? If your goal is to create a way to inspect a full-sized filter, then that would be interesting to explore. Thanks, -Stolee
On 2020-12-31 06:31:52-0500, Derrick Stolee <stolee@gmail.com> wrote: > On 12/30/2020 10:54 PM, Đoàn Trần Công Danh wrote: > > I'm reading the code for Bloom Filter to see if arXiv:2012.00472 > > could be an improvement. > > > > I'm not sure if I'm missing something or it's our test-bloom generate_filter > > doesn't really support testing for multiple inputs. > > It doesn't support creating a _realistic_ filter of the given size. The tests > at this level are more about keeping the filters consistent so we can trust > that we are not accidentally changing the hashing algorithm and its file format. (Sorry for this late reply) Yes, make sense. > The test works if we provide multiple inputs, the problem is that the resulting > filter has a higher density of bits than we expect, since we didn't size it > according to the number of inputs. OK. > > If I understand correctly, we should either: > > * allocate more entry for inputs; or > > * abort if argc != 3 > > Either approach would be fine. I wonder what your goal is for testing the > multiple inputs. Are you expecting a larger filter to demonstrate some > behavior that is not tested by a small filter? I was expecting a larger filter to trying with some ideas that I had in mind before implement it in real code. However, after toying with my idea, I think it's more of a solution looking for problem. (The original post was posted before I realised that). > > diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c > > index 46e97b04eb..1026facc59 100644 > > --- a/t/helper/test-bloom.c > > +++ b/t/helper/test-bloom.c > > @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv) > > if (!strcmp(argv[1], "generate_filter")) { > > struct bloom_filter filter; > > int i = 2; > > - filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD; > > - filter.data = xcalloc(filter.len, sizeof(unsigned char)); > > > > if (argc - 1 < i) > > usage(bloom_usage); > > > > + filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1) > > + / BITS_PER_WORD; > > + filter.data = xcalloc(filter.len, sizeof(unsigned char)); > > + > > Whitespace issues aside, this is the right approach to make the filter grow with > the input. > > > while (argv[i]) { > > add_string_to_filter(argv[i], &filter); > > i++; > > diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh > > index 7e4ab1795f..6d83927c86 100755 > > --- a/t/t0095-bloom.sh > > +++ b/t/t0095-bloom.sh > > @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' ' > > test_cmp expect actual > > ' > > > > +test_expect_success 'compute bloom key for multiple string' ' > > + cat >expect <<-\EOF && > > + Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d| > > + Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585| > > Multiple hash outputs work already, helping us build confidence that our > hash algorithm hasn't changed. > > > + Filter_Length:3 > > + Filter_Data:45|ba|ac| > > And this is the part of the output that you are changing. > > > + EOF > > + test-tool bloom generate_filter "Hello world!" file.txt >actual && > > + test_cmp expect actual > > +' > > + > > So, your suggested change has merit. I just wonder what value it provides on top > of the current implementation? If your goal is to create a way to inspect a full-sized > filter, then that would be interesting to explore. Yes, originally, I was trying to creat some way to inspect a full-size filter, i.e. a bloom filter for some set of object_id during fetch negotiation. But, I realise it's not a good idea, now.
On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote: > > Hello, > > I'm reading the code for Bloom Filter to see if arXiv:2012.00472 > could be an improvement. I'm late to the party, but I'm curious to hear which part of this article you think would help out the Bloom filter implementation. (I skimmed the article you're referencing during my time off, but haven't come back to read it in detail since, so it's possible that the part I'm missing is just totally obvious.) > I'm not sure if I'm missing something or it's our test-bloom generate_filter > doesn't really support testing for multiple inputs. The rest of this discussion downthread seems sane to me. Thanks, Taylor
On 2021-01-12 14:53:32-0500, Taylor Blau <me@ttaylorr.com> wrote: > On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote: > > I'm reading the code for Bloom Filter to see if arXiv:2012.00472 > > could be an improvement. > > I'm late to the party, but I'm curious to hear which part of this > article you think would help out the Bloom filter implementation. Uhm, no. The article doesn't help the Bloom filter implementation. The article was suggesting using Bloom filter to speed-up the negotiation in fetch-pack and upload-pack. Which, in my own quick experience, doesn't help much. Maybe it's me not understand the article idea or I have made a naive implementation. However, I'm not convinced to pursued further. If you are curious, I'm attaching 2 quick-and-low-quality patches with this email for your consideration.
On 1/13/2021 6:59 AM, Đoàn Trần Công Danh wrote: > On 2021-01-12 14:53:32-0500, Taylor Blau <me@ttaylorr.com> wrote: >> On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote: >>> I'm reading the code for Bloom Filter to see if arXiv:2012.00472 >>> could be an improvement. >> >> I'm late to the party, but I'm curious to hear which part of this >> article you think would help out the Bloom filter implementation. > > Uhm, no. The article doesn't help the Bloom filter implementation. > The article was suggesting using Bloom filter to speed-up the > negotiation in fetch-pack and upload-pack. Which, in my own quick > experience, doesn't help much. Maybe it's me not understand the > article idea or I have made a naive implementation. However, I'm not > convinced to pursued further. I saw that idea, and was immediately skeptical. Bloom filters still have size linear to the size of the set. By using a Bloom filter to describe "these are ALL the objects I have" instead of "these are the TIPS I have" the size will explode to be much larger than our current negotiation algorithm. Thanks, -Stolee
On 2021-01-13 07:06:59-0500, Derrick Stolee <stolee@gmail.com> wrote: > On 1/13/2021 6:59 AM, Đoàn Trần Công Danh wrote: > > On 2021-01-12 14:53:32-0500, Taylor Blau <me@ttaylorr.com> wrote: > >> On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote: > >>> I'm reading the code for Bloom Filter to see if arXiv:2012.00472 > >>> could be an improvement. > >> > >> I'm late to the party, but I'm curious to hear which part of this > >> article you think would help out the Bloom filter implementation. > > > > Uhm, no. The article doesn't help the Bloom filter implementation. > > The article was suggesting using Bloom filter to speed-up the > > negotiation in fetch-pack and upload-pack. Which, in my own quick > > experience, doesn't help much. Maybe it's me not understand the > > article idea or I have made a naive implementation. However, I'm not > > convinced to pursued further. > > I saw that idea, and was immediately skeptical. Bloom filters still > have size linear to the size of the set. By using a Bloom filter to > describe "these are ALL the objects I have" instead of "these are > the TIPS I have" the size will explode to be much larger than our > current negotiation algorithm. Yes, I saw that problem, too. My implementation was trying to use it from the second round, downstream will say: these are ALL the objects I have from our common objects. The result is not much different, though. Thanks,
On Wed, Jan 13, 2021 at 06:59:52PM +0700, Đoàn Trần Công Danh wrote: > On 2021-01-12 14:53:32-0500, Taylor Blau <me@ttaylorr.com> wrote: > > On Thu, Dec 31, 2020 at 10:54:38AM +0700, Đoàn Trần Công Danh wrote: > > > I'm reading the code for Bloom Filter to see if arXiv:2012.00472 > > > could be an improvement. > > > > I'm late to the party, but I'm curious to hear which part of this > > article you think would help out the Bloom filter implementation. > > Uhm, no. The article doesn't help the Bloom filter implementation. > The article was suggesting using Bloom filter to speed-up the > negotiation in fetch-pack and upload-pack. Which, in my own quick > experience, doesn't help much. Maybe it's me not understand the > article idea or I have made a naive implementation. However, I'm not > convinced to pursued further. I see. I read your "reading the code for Bloom Filter to see if ... could be an improvement" as trying to improve the Bloom implementation. Which after skimming the article, made me quite curious, since I didn't understand what you were getting at. But trying to speed up the negotiation makes sense, and is in line with the goal of the article. It's too bad that you weren't able to produce the same benefits here, but I understand why. > If you are curious, I'm attaching 2 quick-and-low-quality patches with > this email for your consideration. Thanks. They were an interesting read. Thanks, Taylor
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c index 46e97b04eb..1026facc59 100644 --- a/t/helper/test-bloom.c +++ b/t/helper/test-bloom.c @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv) if (!strcmp(argv[1], "generate_filter")) { struct bloom_filter filter; int i = 2; - filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD; - filter.data = xcalloc(filter.len, sizeof(unsigned char)); if (argc - 1 < i) usage(bloom_usage); + filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1) + / BITS_PER_WORD; + filter.data = xcalloc(filter.len, sizeof(unsigned char)); + while (argv[i]) { add_string_to_filter(argv[i], &filter); i++; diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh index 7e4ab1795f..6d83927c86 100755 --- a/t/t0095-bloom.sh +++ b/t/t0095-bloom.sh @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' ' test_cmp expect actual ' +test_expect_success 'compute bloom key for multiple string' ' + cat >expect <<-\EOF && + Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d| + Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585| + Filter_Length:3 + Filter_Data:45|ba|ac| + EOF + test-tool bloom generate_filter "Hello world!" file.txt >actual && + test_cmp expect actual +' + test_expect_success 'get bloom filters for commit with no changes' ' git init && git commit --allow-empty -m "c0" &&