diff mbox series

[v3,8/8] clean/smudge: allow clean filters to process extremely large files

Message ID acc5591517fa519fc45c07c27defb309c891dea0.1635515959.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Allow clean/smudge filters to handle huge files in the LLP64 data model | expand

Commit Message

Matt Cooper Oct. 29, 2021, 1:59 p.m. UTC
From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
moved between the database and the worktree. We already made sure that
it is possible for smudge filters to produce contents that are larger
than `unsigned long` can represent (which matters on systems where
`unsigned long` is narrower than `size_t`, most notably 64-bit Windows).
Now we make sure that clean filters can _consume_ contents that are
larger than that.

Note that this commit only allows clean filters' _input_ to be larger
than can be represented by `unsigned long`.

This change makes only a very minute dent into the much larger project
to teach Git to use `size_t` instead of `unsigned long` wherever
appropriate.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 convert.c                   |  2 +-
 t/t1051-large-conversion.sh | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Oct. 29, 2021, 11:17 p.m. UTC | #1
"Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +# This clean filter writes down the size of input it receives. By checking against
> +# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
> +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +		'files over 4GB convert on input' '
> +	test-tool genzeros $((5*1024*1024*1024)) >big &&
> +	test_config filter.checklarge.clean "wc -c >big.size" &&
> +	echo "big filter=checklarge" >.gitattributes &&
> +	git add big &&
> +	test $(test_file_size big) -eq $(cat big.size)
> +'

I would have expected that the clean filter to be sending the count
to its standard output (to be hashed and made into a blob object),
and the test wuld be doing "git cat-file blob :big" to read the
contents of the raw blob, bypassing the filter system.  But we are
testing with only a single path anyway, use of this single extra
file is OK.

Looking good.
Johannes Schindelin Nov. 2, 2021, 2:59 p.m. UTC | #2
Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> "Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +# This clean filter writes down the size of input it receives. By checking against
> > +# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
> > +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> > +		'files over 4GB convert on input' '
> > +	test-tool genzeros $((5*1024*1024*1024)) >big &&
> > +	test_config filter.checklarge.clean "wc -c >big.size" &&
> > +	echo "big filter=checklarge" >.gitattributes &&
> > +	git add big &&
> > +	test $(test_file_size big) -eq $(cat big.size)
> > +'
>
> I would have expected that the clean filter to be sending the count
> to its standard output (to be hashed and made into a blob object),
> and the test wuld be doing "git cat-file blob :big" to read the
> contents of the raw blob, bypassing the filter system.

That was exactly what Matt had in his first iteration. But I dislike
unnecessarily spawned processes, they are not "free" on Windows, so I
shortened the design to take this shortcut.

> But we are testing with only a single path anyway, use of this single
> extra file is OK.

Precisely.

Ciao,
Dscho
diff mbox series

Patch

diff --git a/convert.c b/convert.c
index fd9c84b0257..5ad6dfc08a0 100644
--- a/convert.c
+++ b/convert.c
@@ -613,7 +613,7 @@  static int crlf_to_worktree(const char *src, size_t len, struct strbuf *buf,
 
 struct filter_params {
 	const char *src;
-	unsigned long size;
+	size_t size;
 	int fd;
 	const char *cmd;
 	const char *path;
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 8b23d862600..d4cfe8bf5de 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -97,4 +97,15 @@  test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
 	test "$size" -ge $((5 * 1024 * 1024 * 1024))
 '
 
+# This clean filter writes down the size of input it receives. By checking against
+# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
+test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+		'files over 4GB convert on input' '
+	test-tool genzeros $((5*1024*1024*1024)) >big &&
+	test_config filter.checklarge.clean "wc -c >big.size" &&
+	echo "big filter=checklarge" >.gitattributes &&
+	git add big &&
+	test $(test_file_size big) -eq $(cat big.size)
+'
+
 test_done