Message ID | 20231012160930.330618-3-sandals@crustytoothpaste.net (mailing list archive) |
---|---|
State | Accepted |
Commit | 5143ac07b17e2b025865378fce24cc11ac7bf8b1 |
Headers | show |
Series | Prevent re-reading 4 GiB files on every status | expand |
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > An example would be to have a 2^32 sized file in the index of > patched git. Patched git would save the file as 2^31 in the cache. > An unpatched git would very much see the file has changed in size > and force it to rehash the file, which is safe. The reason why this is "safe" is because an older Git will would keep rehashing whether 2^31 or 0 is stored as its sd_size, so the change is not making things worse? With older git, "git diff-files" will report that such a file is not up to date, and then the user will refresh the index, which will store 0 as its sd_file, so tentatively "git status" may give a wrong information, but we probalby do not care? Is that how the reasoning goes? > +/* > + * Munge st_size into an unsigned int. > + */ > +static unsigned int munge_st_size(off_t st_size) { > + unsigned int sd_size = st_size; > + > + /* > + * If the file is an exact multiple of 4 GiB, modify the value so it > + * doesn't get marked as racily clean (zero). > + */ > + if (!sd_size && st_size) > + return 0x80000000; > + else > + return sd_size; > +} This assumes typeof(sd_size) aka "unsigned int" is always 32-bit, which does not sound reasonable. Reference to 4GiB, 2^32 and 2^31 in the code and the proposed commit log message need to be qualified with "on a platform whose uint is 32-bit" or something, or better yet, phrased in a way that is agnostic to the integer size. At the very least, the hardcoded 0x80000000 needs to be rethought, I am afraid. Other than that, the workaround for the racy-git avoidance code does sound good. I actually wonder if we should always use 1 regardless of integer size.
On 2023-10-12 at 17:58:42, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > An example would be to have a 2^32 sized file in the index of > > patched git. Patched git would save the file as 2^31 in the cache. > > An unpatched git would very much see the file has changed in size > > and force it to rehash the file, which is safe. > > The reason why this is "safe" is because an older Git will would > keep rehashing whether 2^31 or 0 is stored as its sd_size, so the > change is not making things worse? With older git, "git diff-files" > will report that such a file is not up to date, and then the user > will refresh the index, which will store 0 as its sd_file, so > tentatively "git status" may give a wrong information, but we > probalby do not care? Is that how the reasoning goes? Correct. An older Git will rehash it either way, on every git status. If you run git diff-files on an older Git, it will always show it as modified (hence why I used that in the tests). The git status will rehash it and then realize that it isn't modified, not printing any changes, so the behaviour is not wrong, it's just extremely slow (SHA-1 DC on 4 GiB of data). If you use a new Git with an old index, git status will still rehash it the first time and correctly determine that it isn't changed, then write the 0x80000000 to the index. It just won't rehash it on subsequent calls to git status. The only incorrect behaviour (assuming that slow is not incorrect) that I'm aware of on older Git is that git diff-files marks it as modified when it's not. There is no incorrect behaviour on a fixed Git. > > +/* > > + * Munge st_size into an unsigned int. > > + */ > > +static unsigned int munge_st_size(off_t st_size) { > > + unsigned int sd_size = st_size; > > + > > + /* > > + * If the file is an exact multiple of 4 GiB, modify the value so it > > + * doesn't get marked as racily clean (zero). > > + */ > > + if (!sd_size && st_size) > > + return 0x80000000; > > + else > > + return sd_size; > > +} > > This assumes typeof(sd_size) aka "unsigned int" is always 32-bit, > which does not sound reasonable. Reference to 4GiB, 2^32 and 2^31 > in the code and the proposed commit log message need to be qualified > with "on a platform whose uint is 32-bit" or something, or better > yet, phrased in a way that is agnostic to the integer size. At > the very least, the hardcoded 0x80000000 needs to be rethought, I > am afraid. unsigned int is 32-bit on every modern 32- or 64-bit platform known to exist today. I don't believe we support MS-DOS or systems of its era, nor should we support 8- or 16-bit systems. If you'd prefer "uint32_t", I can do that. Note that the problem is reproducible on a stock Ubuntu amd64 system, so it is definitely a problem on all modern systems. > Other than that, the workaround for the racy-git avoidance code does > sound good. I actually wonder if we should always use 1 regardless > of integer size. I think using 2^31 is better because it's far away from very small values and very large values, which means that it's a hard to modify a file which used to have its value as a multiple of 4 GiB and accidentally make Git think it was unchanged. Using 1 would make a simple "echo > foo" possibly think things were unchanged in some cases, which we should avoid.
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > nor should we support 8- or 16-bit systems. If you'd prefer "uint32_t", > I can do that. Going into statinfo.h and updating sd_size to uint32_t is totally outside the scope of this fix. > I think using 2^31 is better because it's far away from very small > values and very large values, which means that it's a hard to modify a > file which used to have its value as a multiple of 4 GiB and > accidentally make Git think it was unchanged. Using 1 would make a > simple "echo > foo" possibly think things were unchanged in some cases, > which we should avoid. The reason I gave the extreme "1-byte" was to gauge how much actual "size" we are relying on the correctness of this change. As mtime is giving the primary protection from false matching of cached stat information, I do not think "echo >foo" would be a huge issue. IOW my stance is 1U<<31 is as good as 1U<<0, so I do not oppose to the former, either. But in a few years, 64-bit integers may cease to be too odd, who knows ;-)
On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote: > +static unsigned int munge_st_size(off_t st_size) { > + unsigned int sd_size = st_size; > + > + /* > + * If the file is an exact multiple of 4 GiB, modify the value so it > + * doesn't get marked as racily clean (zero). > + */ > + if (!sd_size && st_size) > + return 0x80000000; > + else > + return sd_size; > +} Coverity complained that the "true" side of this conditional is unreachable, since sd_size is assigned from st_size, so the two values cannot be both true and false. But obviously we are depending here on the truncation of off_t to "unsigned int". I'm not sure if Coverity is just dumb, or if it somehow has a different size for off_t. I don't _think_ this would ever cause confusion in a real compiler, as assignment from a larger type to a smaller has well-defined truncation, as far as I know. But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious what is happening (which would also do the right thing if in some hypothetical platform "unsigned int" ended up larger than 32 bits). -Peff
From: Jeff King <peff@peff.net> > On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote: > > > +static unsigned int munge_st_size(off_t st_size) { > > + unsigned int sd_size = st_size; > > + > > + /* > > + * If the file is an exact multiple of 4 GiB, modify the value so it > > + * doesn't get marked as racily clean (zero). > > + */ > > + if (!sd_size && st_size) > > + return 0x80000000; > > + else > > + return sd_size; > > +} > > Coverity complained that the "true" side of this conditional is > unreachable, since sd_size is assigned from st_size, so the two values > cannot be both true and false. But obviously we are depending here on > the truncation of off_t to "unsigned int". I'm not sure if Coverity is > just dumb, or if it somehow has a different size for off_t. > > I don't _think_ this would ever cause confusion in a real compiler, as > assignment from a larger type to a smaller has well-defined truncation, > as far as I know. > > But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious > what is happening (which would also do the right thing if in some > hypothetical platform "unsigned int" ended up larger than 32 bits). > > -Peff I originally wrote the code this way to work exactly like the original code with one exception: Never truncate a nonzero st_size to a zero sd_size. The original code is here in fill_stat_data: I was attempting to use exactly the same implicit type conversion and types as the original. We could probably write the below to do the same thing. void fill_stat_data(struct stat_data *sd, struct stat *st) { sd->sd_ctime.sec = (unsigned int)st->st_ctime; sd->sd_mtime.sec = (unsigned int)st->st_mtime; sd->sd_ctime.nsec = ST_CTIME_NSEC(*st); sd->sd_mtime.nsec = ST_MTIME_NSEC(*st); sd->sd_dev = st->st_dev; sd->sd_ino = st->st_ino; sd->sd_uid = st->st_uid; sd->sd_gid = st->st_gid; sd->sd_size = st->st_size; + if (sd->sd_size == 0 && st->st_size!= 0) { + sd->sd_size = 1; + } } - Jason D. Hatton
Jason Hatton <jhatton@globalfinishing.com> writes: > We could probably write the below to do the same thing. > > void fill_stat_data(struct stat_data *sd, struct stat *st) > { > sd->sd_ctime.sec = (unsigned int)st->st_ctime; > sd->sd_mtime.sec = (unsigned int)st->st_mtime; > sd->sd_ctime.nsec = ST_CTIME_NSEC(*st); > sd->sd_mtime.nsec = ST_MTIME_NSEC(*st); > sd->sd_dev = st->st_dev; > sd->sd_ino = st->st_ino; > sd->sd_uid = st->st_uid; > sd->sd_gid = st->st_gid; > sd->sd_size = st->st_size; > + if (sd->sd_size == 0 && st->st_size!= 0) { > + sd->sd_size = 1; > + } > } The above is a fairly straight-forward inlining (except that it does explicit comparisons with zero) of the helper function the version of patch under discussion added, and uses 1 instead of (1<<31) as an arbigrary nonzero number that can be used to work around the issue. So I agree with you that it would do the same thing. I am not surprised if it also gets scolded by Coverity the same way, though.
On 2023-10-17 at 00:00:19, Jeff King wrote: > On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote: > > > +static unsigned int munge_st_size(off_t st_size) { > > + unsigned int sd_size = st_size; > > + > > + /* > > + * If the file is an exact multiple of 4 GiB, modify the value so it > > + * doesn't get marked as racily clean (zero). > > + */ > > + if (!sd_size && st_size) > > + return 0x80000000; > > + else > > + return sd_size; > > +} > > Coverity complained that the "true" side of this conditional is > unreachable, since sd_size is assigned from st_size, so the two values > cannot be both true and false. But obviously we are depending here on > the truncation of off_t to "unsigned int". I'm not sure if Coverity is > just dumb, or if it somehow has a different size for off_t. Technically, on 32-bit machines, you can have a 32-bit off_t if you don't compile with -D_FILE_OFFSET_BITS=64. However, such a program is not very useful on modern systems, since it wouldn't be able to handle files that are 2 GiB or larger, and so everyone considers that a silly and buggy way to compile programs. > I don't _think_ this would ever cause confusion in a real compiler, as > assignment from a larger type to a smaller has well-defined truncation, > as far as I know. > > But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious > what is happening (which would also do the right thing if in some > hypothetical platform "unsigned int" ended up larger than 32 bits). Such a hypothetical platform is of course allowed by the C standard, but it's also going to run near zero real Unix programs or kernels. I am at this point reflecting on the prudent decision Rust made to make almost everything an explicit bit size (e.g., u32, i64). Nonetheless, I'll reroll this with the other things you mentioned, and I'll switch that "unsigned int" above to "uint32_t", which I think makes this more obvious. I don't, however, intend to change the constant from 0x8000000 to 1, because I think that's a poorer choice, but I'll try to explain it better in the commit message. (I had originally aimed to avoid editing it as much as possible since it's not my name on the commit to avoid Jason getting blamed unnecessarily for any mistakes I might make.)
diff --git a/statinfo.c b/statinfo.c index 17bb8966c3..9367ca099c 100644 --- a/statinfo.c +++ b/statinfo.c @@ -2,6 +2,22 @@ #include "environment.h" #include "statinfo.h" +/* + * Munge st_size into an unsigned int. + */ +static unsigned int munge_st_size(off_t st_size) { + unsigned int sd_size = st_size; + + /* + * If the file is an exact multiple of 4 GiB, modify the value so it + * doesn't get marked as racily clean (zero). + */ + if (!sd_size && st_size) + return 0x80000000; + else + return sd_size; +} + void fill_stat_data(struct stat_data *sd, struct stat *st) { sd->sd_ctime.sec = (unsigned int)st->st_ctime; @@ -12,7 +28,7 @@ void fill_stat_data(struct stat_data *sd, struct stat *st) sd->sd_ino = st->st_ino; sd->sd_uid = st->st_uid; sd->sd_gid = st->st_gid; - sd->sd_size = st->st_size; + sd->sd_size = munge_st_size(st->st_size); } int match_stat_data(const struct stat_data *sd, struct stat *st) @@ -51,7 +67,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st) changed |= INODE_CHANGED; #endif - if (sd->sd_size != (unsigned int) st->st_size) + if (sd->sd_size != munge_st_size(st->st_size)) changed |= DATA_CHANGED; return changed; diff --git a/t/t7508-status.sh b/t/t7508-status.sh index 6928fd89f5..6c46648e11 100755 --- a/t/t7508-status.sh +++ b/t/t7508-status.sh @@ -1745,4 +1745,20 @@ test_expect_success 'slow status advice when core.untrackedCache true, and fsmon ) ' +test_expect_success EXPENSIVE 'status does not re-read unchanged 4 or 8 GiB file' ' + ( + mkdir large-file && + cd large-file && + # Files are 2 GiB, 4 GiB, and 8 GiB sparse files. + test-tool truncate file-a 0x080000000 && + test-tool truncate file-b 0x100000000 && + test-tool truncate file-c 0x200000000 && + # This will be slow. + git add file-a file-b file-c && + git commit -m "add large files" && + git diff-index HEAD file-a file-b file-c >actual && + test_must_be_empty actual + ) +' + test_done