Message ID | 20241230-b4-pks-object-file-racy-collision-check-v1-1-11571294e60a@pks.im (mailing list archive)
---|---
State | Accepted
Commit | 0ad3d656521aa16a6496aa855bbde97160a2b2bc
Series | object-file: fix race in object collision check
Patrick Steinhardt <ps@pks.im> writes:

> By definition, two files cannot collide with each other when one of them
> has been removed. We can thus trivially fix the issue by ignoring ENOENT
> when opening either of the files we're about to check for collision.

Thanks for digging it down to the cause.

It is more like even if these two files collided (i.e. have the same
name based on what the hash function says, with different contents),
when one of them has been removed, we have no way to check if the
collision is benign, and even if it were not, we cannot do anything
about it, isn't it?

I do like the simplicity of the solution. I wonder given bad enough
race, we could fall into a case where both files are missing?

Thanks, will queue.

> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Hi,
>
> this patch is the follow-up for [1], where I've mentioned a couple of CI
> flakes that happen rather regularly. As it turns out, this race was a
> real bug hiding in the newly introduced object collision check in case
> one of the files got unlinked while performing the check.
>
> Thanks!
>
> Patrick
>
> [1]: <Z2-2dbYVuuLxpNmK@pks.im>
> ---
>  object-file.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/object-file.c b/object-file.c
> index 5b792b3dd42cecde43a1b18abc164fd368cbcd69..f84dcd2f2a7b88716ab47bc00ee7a605a82e8d21 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1978,13 +1978,15 @@ static int check_collision(const char *filename_a, const char *filename_b)
>
>          fd_a = open(filename_a, O_RDONLY);
>          if (fd_a < 0) {
> -                ret = error_errno(_("unable to open %s"), filename_a);
> +                if (errno != ENOENT)
> +                        ret = error_errno(_("unable to open %s"), filename_a);
>                  goto out;
>          }
>
>          fd_b = open(filename_b, O_RDONLY);
>          if (fd_b < 0) {
> -                ret = error_errno(_("unable to open %s"), filename_b);
> +                if (errno != ENOENT)
> +                        ret = error_errno(_("unable to open %s"), filename_b);
>                  goto out;
>          }
>
> ---
> base-commit: 306ab352f4e98f6809ce52fc4e5d63fb947d0635
> change-id: 20241230-b4-pks-object-file-racy-collision-check-62a8d1588116
On Mon, Dec 30, 2024 at 06:40:53AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > By definition, two files cannot collide with each other when one of them
> > has been removed. We can thus trivially fix the issue by ignoring ENOENT
> > when opening either of the files we're about to check for collision.
>
> Thanks for digging it down to the cause.
>
> It is more like even if these two files collided (i.e. have the same
> name based on what the hash function says, with different contents),
> when one of them has been removed, we have no way to check if the
> collision is benign, and even if it were not, we cannot do anything
> about it, isn't it?

Depends on what "benign" means in this context, I guess. We can only
assert the most trivial case of it being "benign", namely that we have
computed a packfile that actually is the exact same. This is also going
to be the most common case, as everything else would depend on a
cryptographic collision of the packfile contents. And in that case... we
cannot do anything about it, yes.

> I do like the simplicity of the solution. I wonder given bad enough
> race, we could fall into a case where both files are missing?

I was wondering about that, too, but it would very much feel like a bug
to me if that were ever to happen. So I briefly considered whether I
should treat the passed-in filenames differently:

  - One that must exist non-racily. This is our temporary object or
    packfile that we want to move into place.

  - And one that may have been removed racily. This is our target file
    path that we want to overwrite, unless there is a collision.

The idea would be to only handle ENOENT for the second case. But in the
end I don't think it's worth the complexity because `check_collision()`
is used before rename(3p)ing the former into place, and that function
would already notice ENOENT anyway. So we would eventually just die the
same.

Patrick
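To make that asymmetry concrete, here is a minimal standalone sketch of the treatment described above. It is illustrative only and not git.git code: the function name and error reporting are made up for this example, and only the filename_a/filename_b roles come from the patch.

```c
/*
 * Sketch (not git.git code): "filename_a" is the tempfile we wrote
 * ourselves and must exist; "filename_b" is the preexisting target
 * that may legitimately vanish under us.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int check_collision_sketch(const char *filename_a, const char *filename_b)
{
	int fd_a, fd_b, ret = 0;

	fd_a = open(filename_a, O_RDONLY);
	if (fd_a < 0) {
		/* Our own tempfile vanishing is unexpected: always an error. */
		fprintf(stderr, "unable to open %s: %s\n",
			filename_a, strerror(errno));
		return -1;
	}

	fd_b = open(filename_b, O_RDONLY);
	if (fd_b < 0) {
		if (errno != ENOENT) {
			fprintf(stderr, "unable to open %s: %s\n",
				filename_b, strerror(errno));
			ret = -1;
		}
		/* ENOENT: the target was removed, so there is nothing to collide with. */
		close(fd_a);
		return ret;
	}

	/* ... a byte-for-byte comparison of fd_a and fd_b would follow here ... */

	close(fd_a);
	close(fd_b);
	return ret;
}
```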
Patrick Steinhardt <ps@pks.im> writes:

> On Mon, Dec 30, 2024 at 06:40:53AM -0800, Junio C Hamano wrote:
>> Patrick Steinhardt <ps@pks.im> writes:
>>
>> > By definition, two files cannot collide with each other when one of them
>> > has been removed. We can thus trivially fix the issue by ignoring ENOENT
>> > when opening either of the files we're about to check for collision.
>>
>> Thanks for digging it down to the cause.
>>
>> It is more like even if these two files collided (i.e. have the same
>> name based on what the hash function says, with different contents),
>> when one of them has been removed, we have no way to check if the
>> collision is benign, and even if it were not, we cannot do anything
>> about it, isn't it?
>
> Depends on what "benign" means in this context, I guess. We can only
> assert the most trivial case of it being "benign", namely that we have
> computed a packfile that actually is the exact same. This is also going
> to be the most common case, as everything else would depend on a
> cryptographic collision of the packfile contents. And in that case... we
> cannot do anything about it, yes.

Yes, the whole point of this collision check is to notice the rare
case where we are under attack with cryptographic collision, so "most
of the time it is OK so not being able to check is fine" is not an
impression we want to give readers.

> The idea would be to only handle ENOENT for the second case. But in the
> end I don't think it's worth the complexity because `check_collision()`
> is used before rename(3p)ing the former into place, and that function
> would already notice ENOENT anyway. So we would eventually just die the
> same.

Thanks for thinking it through. Queued.
On Mon, Dec 30, 2024 at 03:50:04PM +0100, Patrick Steinhardt wrote:

> On Mon, Dec 30, 2024 at 06:40:53AM -0800, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> >
> > > By definition, two files cannot collide with each other when one of them
> > > has been removed. We can thus trivially fix the issue by ignoring ENOENT
> > > when opening either of the files we're about to check for collision.
> >
> > Thanks for digging it down to the cause.
> >
> > It is more like even if these two files collided (i.e. have the same
> > name based on what the hash function says, with different contents),
> > when one of them has been removed, we have no way to check if the
> > collision is benign, and even if it were not, we cannot do anything
> > about it, isn't it?
>
> Depends on what "benign" means in this context, I guess. We can only
> assert the most trivial case of it being "benign", namely that we have
> computed a packfile that actually is the exact same. This is also going
> to be the most common case, as everything else would depend on a
> cryptographic collision of the packfile contents. And in that case... we
> cannot do anything about it, yes.

There is one gotcha here, though. We call this collision check only if
we got EEXIST trying to move the tempfile into place. If the destination
file then goes away, we can't do the collision check. But is it right to
quietly return success?

If the contents of the two were the same, that's fine. We don't need the
extra copy.

But if the contents were not the same, we'd prefer either to actually
copy the contents into place, or to return an error.

Of course we can't know, because the destination file has gone away. In
the common case they will be the same, but the whole point of this check
is to allow loosening the cryptographic collision of the packfile
contents. So the safest thing would be to retain the tempfile, copying
it into the destination file. That errs on the side of keeping data when
we cannot make a determination.

IOW, if we see ENOENT on filename_b, should we then loop back in the
caller to try the link() again?

> > I do like the simplicity of the solution. I wonder given bad enough
> > race, we could fall into a case where both files are missing?
>
> I was wondering about that, too, but it would very much feel like a bug
> to me if that were ever to happen. So I briefly considered whether I
> should treat the passed-in filenames differently:
>
>   - One that must exist non-racily. This is our temporary object or
>     packfile that we want to move into place.
>
>   - And one that may have been removed racily. This is our target file
>     path that we want to overwrite, unless there is a collision.
>
> The idea would be to only handle ENOENT for the second case. But in the
> end I don't think it's worth the complexity because `check_collision()`
> is used before rename(3p)ing the former into place, and that function
> would already notice ENOENT anyway. So we would eventually just die the
> same.

I think check_collision() is used _after_ the attempt to rename() into
place. So there's a race when the tempfile goes away, but I think the
outcome is made a bit worse by your patch.

Consider a sequence like this:

  a. Process A writes tmp_pack_foo.

  b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
     already exists.

  c. Process A opens both tmp_pack_foo and pack-<hash>.

  d. Process A compares the two byte-for-byte, and then returns
     success/failure based on whether they were actually identical.

Now imagine there is a process B that deletes the file (maybe an
over-zealous "gc --prune=now" deletes the in-use temporary file):

  - if process B deletes it between steps (a) and (b), process A returns
    an error (there is nothing to link). The caller knows that the data
    was not stored.

  - if process B deletes it between (b) and (c), then before your patch
    we see an error (because we can't compare the files). After your
    patch, we continue on and return success. The caller knows the data
    was stored (via the original file, not our new copy).

  - if process B deletes it between (c) and (d), then process A has no
    idea. But at this point it does not matter. If the files were
    identical, we return success (and in fact, process A deletes the file
    itself). And if not identical, then we return error, and the caller
    knows the data was not stored.

So even though the exact behavior may depend on where we hit the race, I
think ignoring an ENOENT open() error on the tempfile meaningfully
changes what happens in the middle case.

In practice I don't really expect this to happen, and "gc --prune=now"
is inherently risky in a live repository. But I think we're probably
better off to continue treating it as an error if we can't open our own
tempfile.

-Peff
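For illustration, here is a sketch of the link-then-compare flow from steps (a) through (d), extended with the "loop back and retry the link()" idea for a destination that vanishes mid-check. Everything in it is hypothetical: the helper names are invented and the comparison is simplified, so it is only the shape of the suggestion, not the actual finalize_object_file() code.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Compare two open files byte-for-byte; 0 = identical, 1 = different,
 * -1 = error. Simplified: assumes read() fills the buffer except at EOF.
 */
static int bytes_differ(int fd_a, int fd_b)
{
	char buf_a[4096], buf_b[4096];
	for (;;) {
		ssize_t n_a = read(fd_a, buf_a, sizeof(buf_a));
		ssize_t n_b = read(fd_b, buf_b, sizeof(buf_b));
		if (n_a < 0 || n_b < 0)
			return -1;
		if (n_a != n_b || memcmp(buf_a, buf_b, (size_t)n_a))
			return 1;
		if (!n_a)
			return 0; /* both files hit EOF at the same time */
	}
}

static int move_into_place_sketch(const char *tmpfile, const char *destination)
{
	for (;;) {
		int fd_a, fd_b, differ;

		if (!link(tmpfile, destination))
			return 0;          /* step (b) succeeded: data is stored */
		if (errno != EEXIST)
			return -1;         /* e.g. our own tempfile is gone: hard error */

		/* steps (c)+(d): something is already there, compare contents */
		fd_a = open(tmpfile, O_RDONLY);
		if (fd_a < 0)
			return -1;         /* never ignore a missing tempfile */

		fd_b = open(destination, O_RDONLY);
		if (fd_b < 0) {
			close(fd_a);
			if (errno == ENOENT)
				continue;  /* destination vanished racily: retry link() */
			return -1;
		}

		differ = bytes_differ(fd_a, fd_b);
		close(fd_a);
		close(fd_b);
		return differ ? -1 : 0;    /* identical is fine; anything else is a collision */
	}
}
```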
Jeff King <peff@peff.net> writes:

> There is one gotcha here, though. We call this collision check only if
> we got EEXIST trying to move the tempfile into place. If the destination
> file then goes away, we can't do the collision check. But is it right to
> quietly return success?
>
> If the contents of the two were the same, that's fine. We don't need the
> extra copy.
>
> But if the contents were not the same, we'd prefer either to actually
> copy the contents into place, or to return an error.
>
> Of course we can't know, because the destination file has gone away. In
> the common case they will be the same, but the whole point of this check
> is to allow loosening the cryptographic collision of the packfile
> contents. So the safest thing would be to retain the tempfile, copying
> it into the destination file. That errs on the side of keeping data when
> we cannot make a determination.
>
> IOW, if we see ENOENT on filename_b, should we then loop back in the
> caller to try the link() again?

Yuck, I think you're absolutely right.

> I think check_collision() is used _after_ the attempt to rename() into
> place. So there's a race when the tempfile goes away, but I think the
> outcome is made a bit worse by your patch.
>
> Consider a sequence like this:
>
>   a. Process A writes tmp_pack_foo.
>
>   b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
>      already exists.
>
>   c. Process A opens both tmp_pack_foo and pack-<hash>.
>
>   d. Process A compares the two byte-for-byte, and then returns
>      success/failure based on whether they were actually identical.
>
> Now imagine there is a process B that deletes the file (maybe an
> over-zealous "gc --prune=now" deletes the in-use temporary file):
>
>   - if process B deletes it between steps (a) and (b), process A returns
>     an error (there is nothing to link). The caller knows that the data
>     was not stored.
>
>   - if process B deletes it between (b) and (c), then before your patch
>     we see an error (because we can't compare the files). After your
>     patch, we continue on and return success. The caller knows the data
>     was stored (via the original file, not our new copy).
>
>   - if process B deletes it between (c) and (d), then process A has no
>     idea. But at this point it does not matter. If the files were
>     identical, we return success (and in fact, process A deletes the file
>     itself). And if not identical, then we return error, and the caller
>     knows the data was not stored.
>
> So even though the exact behavior may depend on where we hit the race, I
> think ignoring an ENOENT open() error on the tempfile meaningfully
> changes what happens in the middle case.
>
> In practice I don't really expect this to happen, and "gc --prune=now"
> is inherently risky in a live repository. But I think we're probably
> better off to continue treating it as an error if we can't open our own
> tempfile.

So we'd ignore the racy and flaky tests, as hiding the flake by
ignoring the error would only hurt the real world users.
On Wed, Jan 01, 2025 at 08:50:51AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> > There is one gotcha here, though. We call this collision check only if
> > we got EEXIST trying to move the tempfile into place. If the destination
> > file then goes away, we can't do the collision check. But is it right to
> > quietly return success?
> >
> > If the contents of the two were the same, that's fine. We don't need the
> > extra copy.
> >
> > But if the contents were not the same, we'd prefer either to actually
> > copy the contents into place, or to return an error.
> >
> > Of course we can't know, because the destination file has gone away. In
> > the common case they will be the same, but the whole point of this check
> > is to allow loosening the cryptographic collision of the packfile
> > contents. So the safest thing would be to retain the tempfile, copying
> > it into the destination file. That errs on the side of keeping data when
> > we cannot make a determination.
> >
> > IOW, if we see ENOENT on filename_b, should we then loop back in the
> > caller to try the link() again?
>
> Yuck, I think you're absolutely right.

So I think this part, if adjusted as I suggested, would fix the race in
the tests without making anything worse (it's just more code).

And then this...

> > I think check_collision() is used _after_ the attempt to rename() into
> > place. So there's a race when the tempfile goes away, but I think the
> > outcome is made a bit worse by your patch.
> >
> > Consider a sequence like this:
> >
> >   a. Process A writes tmp_pack_foo.
> >
> >   b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
> >      already exists.
> >
> >   c. Process A opens both tmp_pack_foo and pack-<hash>.
> >
> >   d. Process A compares the two byte-for-byte, and then returns
> >      success/failure based on whether they were actually identical.
> >
> > Now imagine there is a process B that deletes the file (maybe an
> > over-zealous "gc --prune=now" deletes the in-use temporary file):
> >
> >   - if process B deletes it between steps (a) and (b), process A returns
> >     an error (there is nothing to link). The caller knows that the data
> >     was not stored.
> >
> >   - if process B deletes it between (b) and (c), then before your patch
> >     we see an error (because we can't compare the files). After your
> >     patch, we continue on and return success. The caller knows the data
> >     was stored (via the original file, not our new copy).
> >
> >   - if process B deletes it between (c) and (d), then process A has no
> >     idea. But at this point it does not matter. If the files were
> >     identical, we return success (and in fact, process A deletes the file
> >     itself). And if not identical, then we return error, and the caller
> >     knows the data was not stored.
> >
> > So even though the exact behavior may depend on where we hit the race, I
> > think ignoring an ENOENT open() error on the tempfile meaningfully
> > changes what happens in the middle case.
> >
> > In practice I don't really expect this to happen, and "gc --prune=now"
> > is inherently risky in a live repository. But I think we're probably
> > better off to continue treating it as an error if we can't open our own
> > tempfile.
>
> So we'd ignore the racy and flaky tests, as hiding the flake by
> ignoring the error would only hurt the real world users.

...is all about ignoring ENOENT on the tmpfile itself. And I think we
can just drop that hunk entirely. The tests do not care here (they are
running simultaneous gc, but _not_ a simultaneous "--prune=now" gc).

-Peff
On Wed, Jan 01, 2025 at 01:19:52PM -0500, Jeff King wrote:

> On Wed, Jan 01, 2025 at 08:50:51AM -0800, Junio C Hamano wrote:
>
> > Jeff King <peff@peff.net> writes:
> >
> > > There is one gotcha here, though. We call this collision check only if
> > > we got EEXIST trying to move the tempfile into place. If the destination
> > > file then goes away, we can't do the collision check. But is it right to
> > > quietly return success?
> > >
> > > If the contents of the two were the same, that's fine. We don't need the
> > > extra copy.
> > >
> > > But if the contents were not the same, we'd prefer either to actually
> > > copy the contents into place, or to return an error.
> > >
> > > Of course we can't know, because the destination file has gone away. In
> > > the common case they will be the same, but the whole point of this check
> > > is to allow loosening the cryptographic collision of the packfile
> > > contents. So the safest thing would be to retain the tempfile, copying
> > > it into the destination file. That errs on the side of keeping data when
> > > we cannot make a determination.
> > >
> > > IOW, if we see ENOENT on filename_b, should we then loop back in the
> > > caller to try the link() again?
> >
> > Yuck, I think you're absolutely right.
>
> So I think this part, if adjusted as I suggested, would fix the race in
> the tests without making anything worse (it's just more code).

Hm, okay, I think that makes sense. Instead of being optimistic like my
first version was we should be pessimistic and assume the worst. Which I
guess is a sensible stance to take in a collision check.

> And then this...
>
> > > I think check_collision() is used _after_ the attempt to rename() into
> > > place. So there's a race when the tempfile goes away, but I think the
> > > outcome is made a bit worse by your patch.
> > >
> > > Consider a sequence like this:
> > >
> > >   a. Process A writes tmp_pack_foo.
> > >
> > >   b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
> > >      already exists.
> > >
> > >   c. Process A opens both tmp_pack_foo and pack-<hash>.
> > >
> > >   d. Process A compares the two byte-for-byte, and then returns
> > >      success/failure based on whether they were actually identical.
> > >
> > > Now imagine there is a process B that deletes the file (maybe an
> > > over-zealous "gc --prune=now" deletes the in-use temporary file):
> > >
> > >   - if process B deletes it between steps (a) and (b), process A returns
> > >     an error (there is nothing to link). The caller knows that the data
> > >     was not stored.
> > >
> > >   - if process B deletes it between (b) and (c), then before your patch
> > >     we see an error (because we can't compare the files). After your
> > >     patch, we continue on and return success. The caller knows the data
> > >     was stored (via the original file, not our new copy).
> > >
> > >   - if process B deletes it between (c) and (d), then process A has no
> > >     idea. But at this point it does not matter. If the files were
> > >     identical, we return success (and in fact, process A deletes the file
> > >     itself). And if not identical, then we return error, and the caller
> > >     knows the data was not stored.
> > >
> > > So even though the exact behavior may depend on where we hit the race, I
> > > think ignoring an ENOENT open() error on the tempfile meaningfully
> > > changes what happens in the middle case.
> > >
> > > In practice I don't really expect this to happen, and "gc --prune=now"
> > > is inherently risky in a live repository. But I think we're probably
> > > better off to continue treating it as an error if we can't open our own
> > > tempfile.
> >
> > So we'd ignore the racy and flaky tests, as hiding the flake by
> > ignoring the error would only hurt the real world users.
>
> ...is all about ignoring ENOENT on the tmpfile itself. And I think we
> can just drop that hunk entirely. The tests do not care here (they are
> running simultaneous gc, but _not_ a simultaneous "--prune=now" gc).

Yeah, we don't need this hunk then. I'll send a follow-up change.

Patrick
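The follow-up itself is not part of this thread. Purely as a sketch of the direction agreed on above, with invented names: the tempfile side stays a hard error, while a vanished destination is reported distinctly so that the caller can loop back to its link() attempt.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum collision_check_result {
	COLLISION_CHECK_DONE = 0,   /* both files open: proceed with the comparison */
	COLLISION_CHECK_GONE = 1,   /* destination vanished: caller retries link()  */
	COLLISION_CHECK_ERROR = -1, /* anything else, including a missing tempfile  */
};

static enum collision_check_result
open_for_collision_check(const char *tempfile, const char *destination,
			 int *fd_a, int *fd_b)
{
	*fd_a = open(tempfile, O_RDONLY);
	if (*fd_a < 0) {
		/* The tempfile is ours; it going missing is always an error. */
		fprintf(stderr, "unable to open %s: %s\n", tempfile, strerror(errno));
		return COLLISION_CHECK_ERROR;
	}

	*fd_b = open(destination, O_RDONLY);
	if (*fd_b < 0) {
		close(*fd_a);
		if (errno == ENOENT)
			return COLLISION_CHECK_GONE; /* let the caller link() again */
		fprintf(stderr, "unable to open %s: %s\n", destination, strerror(errno));
		return COLLISION_CHECK_ERROR;
	}

	return COLLISION_CHECK_DONE;
}
```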
diff --git a/object-file.c b/object-file.c
index 5b792b3dd42cecde43a1b18abc164fd368cbcd69..f84dcd2f2a7b88716ab47bc00ee7a605a82e8d21 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1978,13 +1978,15 @@ static int check_collision(const char *filename_a, const char *filename_b)
 
         fd_a = open(filename_a, O_RDONLY);
         if (fd_a < 0) {
-                ret = error_errno(_("unable to open %s"), filename_a);
+                if (errno != ENOENT)
+                        ret = error_errno(_("unable to open %s"), filename_a);
                 goto out;
         }
 
         fd_b = open(filename_b, O_RDONLY);
         if (fd_b < 0) {
-                ret = error_errno(_("unable to open %s"), filename_b);
+                if (errno != ENOENT)
+                        ret = error_errno(_("unable to open %s"), filename_b);
                 goto out;
         }
 
One of the tests in t5616 asserts that git-fetch(1) with `--refetch`
triggers repository maintenance with the correct set of arguments. This
test is flaky and causes us to fail sometimes:

  ++ git -c protocol.version=0 -c gc.autoPackLimit=0 -c maintenance.incremental-repack.auto=1234 -C pc1 fetch --refetch origin
  error: unable to open .git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack: No such file or directory
  fatal: unable to rename temporary file to '.git/objects/pack/pack-029d08823bd8a8eab510ad6ac75c823cfd3ed31e.pack'
  fatal: could not finish pack-objects to repack local links
  fatal: index-pack failed
  error: last command exited with $?=128

The error message is quite confusing as it talks about trying to rename
a temporary packfile. A first hunch would thus be that this packfile
gets written by git-fetch(1), but removed by git-maintenance(1) while it
hasn't yet been finalized, which shouldn't ever happen. And indeed, when
looking closer one notices that the file that is supposedly of temporary
nature does not have the typical `tmp_pack_` prefix.

As it turns out, the "unable to rename temporary file" fatal error is a
red herring and the real error is "unable to open". That error is raised
by `check_collision()`, which is called by `finalize_object_file()` when
moving the new packfile into place. Because t5616 re-fetches objects, we
end up with the exact same pack as we already have in the repository. So
when the concurrent git-maintenance(1) process rewrites the preexisting
pack and unlinks it exactly at the point in time where git-fetch(1)
wants to check the old and new packfiles for equality we will see ENOENT
and thus `check_collision()` returns an error, which gets bubbled up by
`finalize_object_file()` and is then handled by `rename_tmp_packfile()`.
That function does not know about the exact root cause of the error and
instead just claims that the rename has failed.

This race is thus caused by b1b8dfde69 (finalize_object_file():
implement collision check, 2024-09-26), where we have newly introduced
the collision check. By definition, two files cannot collide with each
other when one of them has been removed. We can thus trivially fix the
issue by ignoring ENOENT when opening either of the files we're about to
check for collision.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Hi,

this patch is the follow-up for [1], where I've mentioned a couple of CI
flakes that happen rather regularly. As it turns out, this race was a
real bug hiding in the newly introduced object collision check in case
one of the files got unlinked while performing the check.

Thanks!

Patrick

[1]: <Z2-2dbYVuuLxpNmK@pks.im>
---
 object-file.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
---
base-commit: 306ab352f4e98f6809ce52fc4e5d63fb947d0635
change-id: 20241230-b4-pks-object-file-racy-collision-check-62a8d1588116