Message ID | dd65718814011eb93ccc4428f9882e0f025224a6.1636029491.git.ps@pks.im
State      | Superseded
Series     | refs: sync loose refs to disk before committing them
On Thu, Nov 04 2021, Patrick Steinhardt wrote:

> When writing loose refs, we first create a lockfile, write the new ref
> into that lockfile, close it and then rename the lockfile into place
> such that the actual update is atomic for that single ref. While this
> works as intended under normal circumstances, at GitLab we infrequently
> encounter corrupt loose refs in repositories after a machine encountered
> a hard reset. The corruption is always of the same type: the ref has
> been committed into place, but it is completely empty.
>
> The root cause of this is likely that we don't sync contents of the
> lockfile to disk before renaming it into place. As a result, it's not
> guaranteed that the contents are properly persisted and one may observe
> weird in-between states on hard resets. Quoting ext4 documentation [1]:
>
>     Many broken applications don't use fsync() when replacing existing
>     files via patterns such as
>     fd = open("foo.new")/write(fd,..)/close(fd)/rename("foo.new", "foo"),
>     or worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If
>     auto_da_alloc is enabled, ext4 will detect the replace-via-rename
>     and replace-via-truncate patterns and force that any delayed
>     allocation blocks are allocated such that at the next journal
>     commit, in the default data=ordered mode, the data blocks of the new
>     file are forced to disk before the rename() operation is committed.
>     This provides roughly the same level of guarantees as ext3, and
>     avoids the "zero-length" problem that can happen when a system
>     crashes before the delayed allocation blocks are forced to disk.
>
> This explicitly points out that one must call fsync(3P) before doing the
> rename(3P) call, or otherwise data may not be correctly persisted to
> disk.
>
> Fix this by always flushing refs to disk before committing them into
> place to avoid this class of corruption.
>
> [1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  refs/files-backend.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/refs/files-backend.c b/refs/files-backend.c
> index 151b0056fe..06a3f0bdea 100644
> --- a/refs/files-backend.c
> +++ b/refs/files-backend.c
> @@ -1749,6 +1749,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
>  	fd = get_lock_file_fd(&lock->lk);
>  	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
>  	    write_in_full(fd, &term, 1) < 0 ||
> +	    fsync(fd) < 0 ||
>  	    close_ref_gently(lock) < 0) {
>  		strbuf_addf(err,
>  			    "couldn't write '%s'", get_lock_file_path(&lock->lk));

Yeah, that really does seem like it's the cause of such zeroing-out
issues.

This has a semantic conflict with some other changes in flight, see:

    git log -p origin/master..origin/seen -- write-or-die.c

I.e. here you do want to not die, so fsync_or_die() doesn't make sense
per se, but in those changes that function has grown to mean
fsync_with_configured_strategy_or_die().

Also we need the loop around fsync, see cccdfd22436 (fsync(): be
prepared to see EINTR, 2021-06-04).

I think it would probably be best to create a git_fsync_fd() function
which is non-fatal and has that config/while loop, and have
fsync_or_die() be a "..or die()" wrapper around that; then you could
call that git_fsync_fd() here.

On the change more generally there's some performance numbers quoted at,
so re the recent discussions about fsync() performance I wonder how this
changes things.

I've also noted in those threads recently that our overall use of fsync
is quite bad, especially when it comes to assuming that we don't need to
fsync dir entries, which we still don't do here.

The ext4 docs seem to suggest that this will be the right thing to do in
either case, but I wonder if this won't increase the odds of corruption
on some other fs's.

I.e. before we'd write() && rename() without the fsync(), so on systems
that deferred fsync() until some global sync point we might have been
able to rely on those happening atomically (although clearly not on
others, e.g. ext4).

But now we'll fsync() the data explicitly, then do a rename(), but we
don't fsync the dir entry, so per POSIX an external application can't
rely on seeing that rename yet. Will that bite us still, but just in
another way on some other systems?

1. https://stackoverflow.com/questions/7433057/is-rename-without-fsync-safe
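To make the shape of that suggestion concrete, here is a minimal sketch of a
non-fatal fsync helper with the EINTR retry loop, plus a companion that
flushes the containing directory entry (the concern raised just above). The
name git_fsync_fd() comes from the suggestion itself, git_fsync_dirname() is
invented, and the config half of the "config/while loop" is omitted; this is
plain POSIX C for illustration, not the helper any in-flight series actually
adds.

    /*
     * Sketch only: a non-fatal fsync with the EINTR retry loop from
     * cccdfd22436, plus a helper that also flushes the directory entry.
     * git_fsync_dirname() is a hypothetical name, not existing Git API,
     * and the config-driven fsync strategy is left out.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int git_fsync_fd(int fd)
    {
            while (fsync(fd) < 0)
                    if (errno != EINTR)
                            return -1;      /* caller decides whether to die() */
            return 0;
    }

    /* Flush the directory containing `path` so a later rename() sticks. */
    static int git_fsync_dirname(const char *path)
    {
            char dir[4096];
            const char *slash = strrchr(path, '/');
            int fd, ret;

            if (!slash || (size_t)(slash - path) >= sizeof(dir))
                    return -1;
            memcpy(dir, path, slash - path);
            dir[slash - path] = '\0';

            fd = open(dir, O_RDONLY);
            if (fd < 0)
                    return -1;
            ret = git_fsync_fd(fd);
            close(fd);
            return ret;
    }

With something like this in place, fsync_or_die() could become a thin
"..or die()" wrapper around the non-fatal helper, as suggested above.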
On Thu, Nov 04, 2021 at 02:14:54PM +0100, Ævar Arnfjörð Bjarmason wrote: > > On Thu, Nov 04 2021, Patrick Steinhardt wrote: > > > When writing loose refs, we first create a lockfile, write the new ref > > into that lockfile, close it and then rename the lockfile into place > > such that the actual update is atomic for that single ref. While this > > works as intended under normal circumstences, at GitLab we infrequently > > encounter corrupt loose refs in repositories after a machine encountered > > a hard reset. The corruption is always of the same type: the ref has > > been committed into place, but it is completely empty. > > > > The root cause of this is likely that we don't sync contents of the > > lockfile to disk before renaming it into place. As a result, it's not > > guaranteed that the contents are properly persisted and one may observe > > weird in-between states on hard resets. Quoting ext4 documentation [1]: > > > > Many broken applications don't use fsync() when replacing existing > > files via patterns such as fd = > > open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", "foo"), or > > worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If > > auto_da_alloc is enabled, ext4 will detect the replace-via-rename > > and replace-via-truncate patterns and force that any delayed > > allocation blocks are allocated such that at the next journal > > commit, in the default data=ordered mode, the data blocks of the new > > file are forced to disk before the rename() operation is committed. > > This provides roughly the same level of guarantees as ext3, and > > avoids the "zero-length" problem that can happen when a system > > crashes before the delayed allocation blocks are forced to disk. > > > > This explicitly points out that one must call fsync(3P) before doing the > > rename(3P) call, or otherwise data may not be correctly persisted to > > disk. > > > > Fix this by always flushing refs to disk before committing them into > > place to avoid this class of corruption. > > > > [1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt > > > > Signed-off-by: Patrick Steinhardt <ps@pks.im> > > --- > > refs/files-backend.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/refs/files-backend.c b/refs/files-backend.c > > index 151b0056fe..06a3f0bdea 100644 > > --- a/refs/files-backend.c > > +++ b/refs/files-backend.c > > @@ -1749,6 +1749,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock, > > fd = get_lock_file_fd(&lock->lk); > > if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 || > > write_in_full(fd, &term, 1) < 0 || > > + fsync(fd) < 0 || > > close_ref_gently(lock) < 0) { > > strbuf_addf(err, > > "couldn't write '%s'", get_lock_file_path(&lock->lk)); > > Yeah, that really does seem like it's the cause of such zeroing out > issues. > > This has a semantic conflict with some other changes in flight, see: > > git log -p origin/master..origin/seen -- write-or-die.c > > I.e. here you do want to not die, so fsync_or_die() doesn't make sense > per-se, but in those changes that function has grown to mean > fsync_with_configured_strategy_or_die(). > > Also we need the loop around fsync, see cccdfd22436 (fsync(): be > prepared to see EINTR, 2021-06-04). > > I think it would probably be best to create a git_fsync_fd() function > which is non-fatal and has that config/while loop, and have > fsync_or_die() be a "..or die()" wrapper around that, then you could > call that git_fsync_fd() here. 
Thanks for pointing it out, I'll base v2 on next in that case.

> On the change more generally there's some performance numbers quoted at,
> so re the recent discussions about fsync() performance I wonder how this
> changes things.

Yeah, good question. I punted on doing benchmarks for this given that I
wasn't completely sure whether there are any preexisting ones which would
fit best here. No matter the results, I'd still take the stance that we
should by default try to do the right thing and try hard not to end up
with corrupt data, and if filesystem docs explicitly say we must fsync(3P)
then that's what we should be doing.

That being said, I wouldn't mind introducing something like
`core.fsyncObjectFiles` for refs, too, so that folks who want an escape
hatch have one.

> I've also noted in those threads recently that our overall use of fsync
> is quite bad, especially when it comes to assuming that we don't need to
> fsync dir entries, which we still don't do here.

Yeah. I also thought about not putting the fsync(3P) logic into the ref
logic, but instead into our lockfiles. In theory, we should always be
doing this before committing lockfiles into place, so it would fit in
there quite naturally.

> The ext4 docs seem to suggest that this will be the right thing to do in
> either case, but I wonder if this won't increase the odds of corruption
> on some other fs's.
>
> I.e. before we'd write() && rename() without the fsync(), so on systems
> that deferred fsync() until some global sync point we might have been
> able to rely on those happening atomically (although clearly not on
> others, e.g. ext4).
>
> But now we'll fsync() the data explicitly, then do a rename(), but we
> don't fsync the dir entry, so per POSIX an external application can't
> rely on seeing that rename yet. Will that bite us still, but just in
> another way on some other systems?
>
> 1. https://stackoverflow.com/questions/7433057/is-rename-without-fsync-safe

Good point. I'd be happy to extend this patch to also fsync(3P) the dir
entry. But it does sound like even more of a reason to move the logic
into the lockfiles such that we don't have to duplicate it wherever we
really don't want to end up with corrupted data.

Patrick
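For illustration, here is roughly what such an escape hatch could look like,
using a hypothetical core.fsyncRefFiles knob; neither the config key nor the
fsync_ref_files flag exists in Git, and the fragment is meant to slot into
refs/files-backend.c rather than compile on its own.

    /*
     * Sketch only: a hypothetical `core.fsyncRefFiles` escape hatch.
     * The config key and fsync_ref_files flag are invented; the fragment
     * assumes Git's config API and write_ref_to_lockfile().
     */
    static int fsync_ref_files = 1; /* default on: flush refs as in this patch */

    static int refs_config(const char *var, const char *value, void *cb)
    {
            if (!strcmp(var, "core.fsyncreffiles")) {
                    fsync_ref_files = git_config_bool(var, value);
                    return 0;
            }
            return git_default_config(var, value, cb);
    }

    /* ... and in write_ref_to_lockfile(), guard the new call: */
            if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
                write_in_full(fd, &term, 1) < 0 ||
                (fsync_ref_files && fsync(fd) < 0) ||
                close_ref_gently(lock) < 0) {
                    /* report via `err` and return, exactly as before */
            }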
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I think it would probably be best to create a git_fsync_fd() function
> which is non-fatal and has that config/while loop, and have
> fsync_or_die() be a "..or die()" wrapper around that, then you could
> call that git_fsync_fd() here.

Adding git_fsync_fd() smells like a poor taste, as git_fsync() takes
an fd already. How about making the current one into a static helper

    -int git_fsync(int fd, enum fsync_action action)
    +static int git_fsync_once(int fd, enum fsync_action action)
     ...

and then hide the looping behavior behind git_fsync() proper?

    int git_fsync(int fd, enum fsync_action action)
    {
            while (git_fsync_once(fd, action) < 0)
                    if (errno != EINTR)
                            return -1;
            return 0;
    }

fsync_or_die() can be simplified by getting rid of its loop.

None of that needs to block Patrick's work, I think. A version that
uses raw fsync() and punts on EINTR can graduate first, which makes
the situation better than the status quo, and all the ongoing work
that deals with fsync can be extended with an eye to make it also
usable to replace the fsync() call Patrick's fix adds.
On Thu, Nov 04, 2021 at 02:24:26PM -0700, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > > > I think it would probably be best to create a git_fsync_fd() function > > which is non-fatal and has that config/while loop, and have > > fsync_or_die() be a "..or die()" wrapper around that, then you could > > call that git_fsync_fd() here. > > Adding git_fsync_fd() smells like a poor taste, as git_fsync() takes > an fd already. How about making the current one into a static helper > > -int git_fsync(int fd, enum fsync_action action) > +static int git_fsync_once(int fd, enum fsync_action action) > ... > > and then hide the looping behavior behind git_fsync() proper? > > int git_fsync(int fd, enum fsync_action action) > { > while (git_fsync_once(fd, action) < 0) > if (errno != EINTR) > return -1; > return 0; > } > > fsync_or_die() can be simplified by getting rid of its loop. > > None of that needs to block Patrick's work, I think. A version that > uses raw fsync() and punts on EINTR can graduate first, which makes > the situation better than the status quo, and all the ongoing work > that deal with fsync can be extended with an eye to make it also > usable to replace the fsync() call Patrick's fix adds. Is there some reason we shouldn't die if writing the ref fails? We are already accustomed to dying if fsyncing a packfile or the index fails. I assume the number of refs updated is not that high on any given git operation, so it's not worth having a batch mode for this eventually. Thanks, Neeraj
Neeraj Singh <nksingh85@gmail.com> writes:

> Is there some reason we shouldn't die if writing the ref fails? We are
> already accustomed to dying if fsyncing a packfile or the index fails.

If we look at the surrounding code and the callers of the function,
this caller of fsync() is about to signal a failure to its callers
by returning an error(). The callers are prepared to see an error
and cope with it.

Introducing die() to such a library-ish part of the code deep in the
callchain is not exactly a good change, especially when it is
obvious how we can avoid it.
On Thu, Nov 04, 2021 at 06:40:22PM -0700, Junio C Hamano wrote:

> Neeraj Singh <nksingh85@gmail.com> writes:
>
> > Is there some reason we shouldn't die if writing the ref fails? We are
> > already accustomed to dying if fsyncing a packfile or the index fails.
>
> If we look at the surrounding code and the callers of the function,
> this caller of fsync() is about to signal a failure to its callers
> by returning an error(). The callers are prepared to see an error
> and cope with it.
>
> Introducing die() to such a library-ish part of the code deep in the
> callchain is not exactly a good change, especially when it is
> obvious how we can avoid it.

And here's a good concrete example that relies on it: when receive-pack
is taking in a push, if a ref update fails it should report that via the
protocol back to the client, rather than just hanging up the connection.
That leaves the client less confused, and gives it the opportunity to
attempt and report status on other ref updates.

-Peff
On Thu, Nov 04, 2021 at 01:38:22PM +0100, Patrick Steinhardt wrote:

> When writing loose refs, we first create a lockfile, write the new ref
> into that lockfile, close it and then rename the lockfile into place
> such that the actual update is atomic for that single ref. While this
> works as intended under normal circumstances, at GitLab we infrequently
> encounter corrupt loose refs in repositories after a machine encountered
> a hard reset. The corruption is always of the same type: the ref has
> been committed into place, but it is completely empty.

Yes, I've seen this quite a bit, too, and I agree that more fsync()-ing
will probably help. There are two things that have made me hesitate,
though:

  1. This doesn't quite make the write durable. Doing that would also
     require fsyncing the surrounding directory (and so on up the tree).
     This is a much trickier problem, because we often don't even have
     open descriptors for those.

     I think we can ignore that, though. Doing an fsync() here makes
     things strictly better with respect to robustness (especially if
     you care less about "has my ref update been committed to disk" and
     more about "does this end up corrupt after a power failure").

  2. It's not clear what the performance implications will be,
     especially on a busy server doing a lot of ref updates, or on a
     filesystem where fsync() ends up syncing everything, not just the
     one file (my impression is ext3 is such a system, but not ext4).
     Whereas another solution may be journaling data and metadata writes
     in order without worrying about the durability of writing them to
     disk.

     I suspect for small updates (say, a push of one or two refs), this
     will have little impact. We'd generally fsync the incoming packfile
     and its idx anyway, so we're adding maybe one or two fsyncs on top
     of that. But if you're pushing 100 refs, that will be 100 sequential
     fsyncs, which may add up to quite a bit of latency. It would be
     nice if we could batch these somehow (e.g., by opening up all of
     the lockfiles, writing and fsyncing them, and then renaming one by
     one).

So I think we want to at least add a config knob here. That would make
it easier to experiment, and provides an escape hatch.

> diff --git a/refs/files-backend.c b/refs/files-backend.c
> index 151b0056fe..06a3f0bdea 100644
> --- a/refs/files-backend.c
> +++ b/refs/files-backend.c
> @@ -1749,6 +1749,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
>  	fd = get_lock_file_fd(&lock->lk);
>  	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
>  	    write_in_full(fd, &term, 1) < 0 ||
> +	    fsync(fd) < 0 ||
>  	    close_ref_gently(lock) < 0) {
>  		strbuf_addf(err,
>  			    "couldn't write '%s'", get_lock_file_path(&lock->lk));

I agree with the sentiment elsewhere that this would probably make sense
for any lockfile. That does probably make batching a bit more difficult,
though it could be layered on top (i.e., we could still fsync in the ref
code, and the lockfile fsync-ing should then become a noop).

-Peff
On Fri, Nov 05, 2021 at 03:07:18AM -0400, Jeff King wrote:

>   2. It's not clear what the performance implications will be,
>      especially on a busy server doing a lot of ref updates, or on a
>      filesystem where fsync() ends up syncing everything, not just the
>      one file (my impression is ext3 is such a system, but not ext4).
>      Whereas another solution may be journaling data and metadata writes
>      in order without worrying about the durability of writing them to
>      disk.
>
>      I suspect for small updates (say, a push of one or two refs), this
>      will have little impact. We'd generally fsync the incoming packfile
>      and its idx anyway, so we're adding maybe one or two fsyncs on top
>      of that. But if you're pushing 100 refs, that will be 100 sequential
>      fsyncs, which may add up to quite a bit of latency. It would be
>      nice if we could batch these somehow (e.g., by opening up all of
>      the lockfiles, writing and fsyncing them, and then renaming one by
>      one).

So here's a quick experiment that shows a worst case: a small push that
updates a bunch of refs. After building Git with and without your patch,
I set up a small repo like:

    git init
    git commit --allow-empty -m foo
    for i in $(seq 100); do
        git update-ref refs/heads/$i HEAD
    done

To give a clean slate between runs, I stuck this in a script called
"setup":

    #!/bin/sh
    rm -rf dst.git
    git init --bare dst.git
    sync

And then ran:

    $ hyperfine -L v orig,fsync -p ./setup '/tmp/{v}/bin/git push dst.git refs/heads/*'
    Benchmark 1: /tmp/orig/bin/git push dst.git refs/heads/*
      Time (mean ± σ):       9.9 ms ±  0.2 ms    [User: 6.3 ms, System: 4.7 ms]
      Range (min … max):     9.5 ms … 10.5 ms    111 runs

    Benchmark 2: /tmp/fsync/bin/git push dst.git refs/heads/*
      Time (mean ± σ):     401.0 ms ±  7.7 ms    [User: 9.4 ms, System: 15.2 ms]
      Range (min … max):   389.4 ms … 412.4 ms    10 runs

    Summary
      '/tmp/orig/bin/git push dst.git refs/heads/*' ran
        40.68 ± 1.16 times faster than '/tmp/fsync/bin/git push dst.git refs/heads/*'

So it really does produce a noticeable impact (this is on a system with
a decent SSD and no other disk load, so I'd expect it to be about
average for modern hardware).

Now this test isn't entirely fair. 100 refs is a larger than average
number to be pushing, and the effect is out-sized because there's
virtually no time spent dealing with the objects themselves, nor is
there any network latency. But 400ms feels like a non-trivial amount of
time just in absolute numbers.

The numbers scale pretty linearly, as you'd expect. Pushing 10 refs
takes ~40ms, 100 takes ~400ms, and 1000 takes ~4s. The non-fsyncing
version gets slower, too (there's more work to do), but much more
slowly (6ms, 10ms, and 50ms respectively).

So this will definitely hurt at edge / pathological cases.

-Peff
On Thu, Nov 04 2021, Neeraj Singh wrote: > On Thu, Nov 04, 2021 at 02:24:26PM -0700, Junio C Hamano wrote: >> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> >> > I think it would probably be best to create a git_fsync_fd() function >> > which is non-fatal and has that config/while loop, and have >> > fsync_or_die() be a "..or die()" wrapper around that, then you could >> > call that git_fsync_fd() here. >> >> Adding git_fsync_fd() smells like a poor taste, as git_fsync() takes >> an fd already. How about making the current one into a static helper >> >> -int git_fsync(int fd, enum fsync_action action) >> +static int git_fsync_once(int fd, enum fsync_action action) >> ... >> >> and then hide the looping behavior behind git_fsync() proper? >> >> int git_fsync(int fd, enum fsync_action action) >> { >> while (git_fsync_once(fd, action) < 0) >> if (errno != EINTR) >> return -1; >> return 0; >> } >> >> fsync_or_die() can be simplified by getting rid of its loop. >> >> None of that needs to block Patrick's work, I think. A version that >> uses raw fsync() and punts on EINTR can graduate first, which makes >> the situation better than the status quo, and all the ongoing work >> that deal with fsync can be extended with an eye to make it also >> usable to replace the fsync() call Patrick's fix adds. > > Is there some reason we shouldn't die if writing the ref fails? We are > already accustomed to dying if fsyncing a packfile or the index fails. > > I assume the number of refs updated is not that high on any given git > operation, so it's not worth having a batch mode for this eventually. Others have mostly touched on this, but usually (always?) we're about to die anyway, but by not calling die() here we'll die with a less crappy message. I.e. this bit: strbuf_addf(err, [...] Is effectively a deferred die() in most (all?) cases, and be passed up to the ref update code. because instead of a nondescript message like: fsync: failed somewhere We want: couldn't update ref xyz to OID abc: fsync failed Or similar. Hrm, actually having written the above there's a really good reason not to die, which is that fsync can return ENOSPC. So if we're in the middle of a transaction and have created and written the lockfile we might only notice that the disk has is full when we do the fsync(). In that case we'll (or should, I didn't check just now) unroll the ref transaction, and delete the *.lock files we created, which presumably will succeed in that scenario. So calling die() at this level is the difference between leaving the repo in an inconsistent state due to a disk error, and something like "git fetch --atomic" gracefully failing.
On Fri, Nov 05, 2021 at 09:35:24AM +0100, Ævar Arnfjörð Bjarmason wrote: > So if we're in the middle of a transaction and have created and written > the lockfile we might only notice that the disk has is full when we do > the fsync(). > > In that case we'll (or should, I didn't check just now) unroll the ref > transaction, and delete the *.lock files we created, which presumably > will succeed in that scenario. > > So calling die() at this level is the difference between leaving the > repo in an inconsistent state due to a disk error, and something like > "git fetch --atomic" gracefully failing. We should rollback the lockfiles even if we call die(), via the atexit() handler. Ditto if we receive a fatal signal. (But I completely agree that if we have the opportunity to just pass the error up the stack, we should do so). -Peff
Hi Peff & Patrick, On Fri, 5 Nov 2021, Jeff King wrote: > On Fri, Nov 05, 2021 at 03:07:18AM -0400, Jeff King wrote: > > > 2. It's not clear what the performance implications will be, > > especially on a busy server doing a lot of ref updates, or on a > > filesystem where fsync() ends up syncing everything, not just the > > one file (my impression is ext3 is such a system, but not ext4). > > Whereas another solution may be journaling data and metadata writes > > in order without worrying about the durability of writing them to > > disk. > > > > I suspect for small updates (say, a push of one or two refs), this > > will have little impact. We'd generally fsync the incoming packfile > > and its idx anyway, so we're adding may one or two fsyncs on top of > > that. But if you're pushing 100 refs, that will be 100 sequential > > fsyncs, which may add up to quite a bit of latency. It would be > > nice if we could batch these by somehow (e.g., by opening up all of > > the lockfiles, writing and fsyncing them, and then renaming one by > > one). > > So here's a quick experiment that shows a worst case: a small push that > updates a bunch of refs. After building Git with and without your patch, > I set up a small repo like: > > git init > git commit --allow-empty -m foo > for i in $(seq 100); do > git update-ref refs/heads/$i HEAD > done > > To give a clean slate between runs, I stuck this in a script called > "setup": > > #!/bin/sh > rm -rf dst.git > git init --bare dst.git > sync > > And then ran: > > $ hyperfine -L v orig,fsync -p ./setup '/tmp/{v}/bin/git push dst.git refs/heads/*' > Benchmark 1: /tmp/orig/bin/git push dst.git refs/heads/* > Time (mean ± σ): 9.9 ms ± 0.2 ms [User: 6.3 ms, System: 4.7 ms] > Range (min … max): 9.5 ms … 10.5 ms 111 runs > > Benchmark 2: /tmp/fsync/bin/git push dst.git refs/heads/* > Time (mean ± σ): 401.0 ms ± 7.7 ms [User: 9.4 ms, System: 15.2 ms] > Range (min … max): 389.4 ms … 412.4 ms 10 runs > > Summary > '/tmp/orig/bin/git push dst.git refs/heads/*' ran > 40.68 ± 1.16 times faster than '/tmp/fsync/bin/git push dst.git refs/heads/*' > > So it really does produce a noticeable impact (this is on a system with > a decent SSD and no other disk load, so I'd expect it to be about > average for modern hardware). > > Now this test isn't entirely fair. 100 refs is a larger than average > number to be pushing, and the effect is out-sized because there's > virtually no time spent dealing with the objects themselves, nor is > there any network latency. But 400ms feels like a non-trivial amount of > time just in absolute numbers. > > The numbers scale pretty linearly, as you'd expect. Pushing 10 refs > takes ~40ms, 100 takes ~400ms, and 1000 takes ~4s. The non-fsyncing > version gets slower, too (there's more work to do), but much more slowly > (6ms, 10ms, and 50ms respectively). > > So this will definitely hurt at edge / pathological cases. Ouch. I wonder whether this could be handled similarly to the `core.fsyncObjectFiles=batch` mode that has been proposed in https://lore.kernel.org/git/pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com/ Essentially, we would have to find a better layer to do this, where we can synchronize after a potentially quite large number of ref updates has happened. That would definitely be a different layer than the file-based refs backend, of course, and would probably apply in a different way to other refs backends. Ciao, Dscho
On Fri, Nov 05, 2021 at 10:12:25AM +0100, Johannes Schindelin wrote: > Hi Peff & Patrick, > > On Fri, 5 Nov 2021, Jeff King wrote: > > > On Fri, Nov 05, 2021 at 03:07:18AM -0400, Jeff King wrote: > > > > > 2. It's not clear what the performance implications will be, > > > especially on a busy server doing a lot of ref updates, or on a > > > filesystem where fsync() ends up syncing everything, not just the > > > one file (my impression is ext3 is such a system, but not ext4). > > > Whereas another solution may be journaling data and metadata writes > > > in order without worrying about the durability of writing them to > > > disk. > > > > > > I suspect for small updates (say, a push of one or two refs), this > > > will have little impact. We'd generally fsync the incoming packfile > > > and its idx anyway, so we're adding may one or two fsyncs on top of > > > that. But if you're pushing 100 refs, that will be 100 sequential > > > fsyncs, which may add up to quite a bit of latency. It would be > > > nice if we could batch these by somehow (e.g., by opening up all of > > > the lockfiles, writing and fsyncing them, and then renaming one by > > > one). > > > > So here's a quick experiment that shows a worst case: a small push that > > updates a bunch of refs. After building Git with and without your patch, > > I set up a small repo like: > > > > git init > > git commit --allow-empty -m foo > > for i in $(seq 100); do > > git update-ref refs/heads/$i HEAD > > done > > > > To give a clean slate between runs, I stuck this in a script called > > "setup": > > > > #!/bin/sh > > rm -rf dst.git > > git init --bare dst.git > > sync > > > > And then ran: > > > > $ hyperfine -L v orig,fsync -p ./setup '/tmp/{v}/bin/git push dst.git refs/heads/*' > > Benchmark 1: /tmp/orig/bin/git push dst.git refs/heads/* > > Time (mean ± σ): 9.9 ms ± 0.2 ms [User: 6.3 ms, System: 4.7 ms] > > Range (min … max): 9.5 ms … 10.5 ms 111 runs > > > > Benchmark 2: /tmp/fsync/bin/git push dst.git refs/heads/* > > Time (mean ± σ): 401.0 ms ± 7.7 ms [User: 9.4 ms, System: 15.2 ms] > > Range (min … max): 389.4 ms … 412.4 ms 10 runs > > > > Summary > > '/tmp/orig/bin/git push dst.git refs/heads/*' ran > > 40.68 ± 1.16 times faster than '/tmp/fsync/bin/git push dst.git refs/heads/*' > > > > So it really does produce a noticeable impact (this is on a system with > > a decent SSD and no other disk load, so I'd expect it to be about > > average for modern hardware). > > > > Now this test isn't entirely fair. 100 refs is a larger than average > > number to be pushing, and the effect is out-sized because there's > > virtually no time spent dealing with the objects themselves, nor is > > there any network latency. But 400ms feels like a non-trivial amount of > > time just in absolute numbers. > > > > The numbers scale pretty linearly, as you'd expect. Pushing 10 refs > > takes ~40ms, 100 takes ~400ms, and 1000 takes ~4s. The non-fsyncing > > version gets slower, too (there's more work to do), but much more slowly > > (6ms, 10ms, and 50ms respectively). > > > > So this will definitely hurt at edge / pathological cases. > > Ouch. > > I wonder whether this could be handled similarly to the > `core.fsyncObjectFiles=batch` mode that has been proposed in > https://lore.kernel.org/git/pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com/ > > Essentially, we would have to find a better layer to do this, where we > can synchronize after a potentially quite large number of ref updates has > happened. 
> That would definitely be a different layer than the file-based
> refs backend, of course, and would probably apply in a different way to
> other refs backends.
>
> Ciao,
> Dscho

Good question. We'd likely have to make this part of the ref_transaction
layer to make this work without having to update all callers. It would
likely work something like the following:

  1. The caller queues up all ref updates.

  2. The caller moves the transaction into prepared state, which creates
     all the lockfiles for us.

  3. After all lockfiles have been created, we must sync them to disk.
     This may either happen when preparing the transaction, or when
     committing it. I think it's better located in prepare though, so
     that the likelihood that the transaction succeeds after a successful
     prepare is high.

  4. Now we can rename all files into place.

Naturally, we'd only benefit from this batching in case the caller
actually queues up all ref updates into a single transaction. And there's
going to be at least some which don't by default. The first that comes to
my mind is git-fetch(1), which explicitly requires the user to use
`--atomic` to queue them all up into a single transaction. And I guess
there's going to be others which use one transaction per ref.

Patrick
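To make the ordering of those four steps concrete, here is a small
standalone sketch: write and flush every lockfile first, and only rename
once everything has been synced. It is plain POSIX C with invented names
(write_and_sync_lock, the refs-demo paths), not the files backend or
ref_transaction code, and note that it still issues one fsync() per ref,
which is exactly the cost discussed elsewhere in the thread.

    /*
     * Sketch only: "prepare" writes and flushes all lockfiles, "commit"
     * renames them into place. Not Git code; names and paths are made up.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_and_sync_lock(const char *lockpath, const char *value)
    {
            int fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0666);
            if (fd < 0)
                    return -1;
            if (write(fd, value, strlen(value)) < 0 || fsync(fd) < 0) {
                    close(fd);
                    unlink(lockpath);
                    return -1;
            }
            return close(fd);
    }

    int main(void)
    {
            const char *refs[] = { "refs-demo/a", "refs-demo/b", "refs-demo/c" };
            char lock[256];
            size_t i, nr = sizeof(refs) / sizeof(refs[0]);

            if (mkdir("refs-demo", 0777) < 0)
                    return 1;

            /* Steps 1-3: take all "locks", write the values, flush them. */
            for (i = 0; i < nr; i++) {
                    snprintf(lock, sizeof(lock), "%s.lock", refs[i]);
                    if (write_and_sync_lock(lock, "0123abcd\n") < 0)
                            return 1;       /* a real backend would roll back here */
            }

            /* Step 4: rename the lockfiles into place one by one. */
            for (i = 0; i < nr; i++) {
                    snprintf(lock, sizeof(lock), "%s.lock", refs[i]);
                    if (rename(lock, refs[i]) < 0)
                            return 1;
            }
            return 0;
    }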
On Fri, Nov 05, 2021 at 10:12:25AM +0100, Johannes Schindelin wrote: > > So this will definitely hurt at edge / pathological cases. > > Ouch. > > I wonder whether this could be handled similarly to the > `core.fsyncObjectFiles=batch` mode that has been proposed in > https://lore.kernel.org/git/pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com/ Yeah, that was along the lines I was thinking. I hadn't really looked at the details of the batch-fsync there. The big trick seems to be doing a pagecache writeback for each file, and then stimulating an actual disk write (by fsyncing a tempfile) after the batch is done. That would be pretty easy to do for the refs (it's git_fsync() while closing each file where Patrick is calling fsync(), followed by a single do_batch_fsync() after everything is closed but before we rename). > Essentially, we would have to find a better layer to do this, where we > can synchronize after a potentially quite large number of ref updates has > happened. That would definitely be a different layer than the file-based > refs backend, of course, and would probably apply in a different way to > other refs backends. We do have the concept of a ref_transaction, so that would be the natural place for it. Not every caller uses it, though, because it implies atomicity of the transaction (so some may do a sequence of N independent transactions, because they don't want failure of one to impact others). I think that could be changed, if the ref_transaction learned about non-atomicity, but it may take some surgery. I expect that reftables would similarly benefit; it is probably much more efficient to write a table slice with N entries than it is to write N slices, even before accounting for fsync(). And once doing that, then the fsync() part becomes trivial. -Peff
On Fri, Nov 05, 2021 at 05:34:44AM -0400, Jeff King wrote: > On Fri, Nov 05, 2021 at 10:12:25AM +0100, Johannes Schindelin wrote: > > > > So this will definitely hurt at edge / pathological cases. > > > > Ouch. > > > > I wonder whether this could be handled similarly to the > > `core.fsyncObjectFiles=batch` mode that has been proposed in > > https://lore.kernel.org/git/pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com/ > > Yeah, that was along the lines I was thinking. > > I hadn't really looked at the details of the batch-fsync there. The big > trick seems to be doing a pagecache writeback for each file, and then > stimulating an actual disk write (by fsyncing a tempfile) after the > batch is done. > > That would be pretty easy to do for the refs (it's git_fsync() while > closing each file where Patrick is calling fsync(), followed by a single > do_batch_fsync() after everything is closed but before we rename). > > > Essentially, we would have to find a better layer to do this, where we > > can synchronize after a potentially quite large number of ref updates has > > happened. That would definitely be a different layer than the file-based > > refs backend, of course, and would probably apply in a different way to > > other refs backends. > > We do have the concept of a ref_transaction, so that would be the > natural place for it. Not every caller uses it, though, because it > implies atomicity of the transaction (so some may do a sequence of N > independent transactions, because they don't want failure of one to > impact others). I think that could be changed, if the ref_transaction > learned about non-atomicity, but it may take some surgery. > > I expect that reftables would similarly benefit; it is probably much > more efficient to write a table slice with N entries than it is to write > N slices, even before accounting for fsync(). And once doing that, then > the fsync() part becomes trivial. > > -Peff So I've finally found the time to have another look at massaging this into the ref_transaction mechanism. If we do want to batch the fsync(3P) calls, then we basically have two different alternatives: 1. We first lock all loose refs by creating the respective lock files and writing the updated ref into that file. We keep the file descriptor open such that we can then flush them all in one go. 2. Same as before, we lock all refs and write the updated pointers into the lockfiles, but this time we close each lockfile after having written to it. Later, we reopen them all to fsync(3P) them to disk. I'm afraid both alternatives aren't any good: the first alternative risks running out of file descriptors if you queue up lots of refs. And the second alternative is going to be slow, especially so on Windows if I'm not mistaken. So with both not being feasible, we'll likely have to come up with a more complex scheme if we want to batch-sync files. One idea would be to fsync(3P) all lockfiles every $n refs, but it adds complexity in a place where I'd really like to have things as simple as possible. It also raises the question what $n would have to be. Does anybody else have better ideas than I do? Patrick
On Tue, Nov 09, 2021 at 12:25:46PM +0100, Patrick Steinhardt wrote: > So I've finally found the time to have another look at massaging this > into the ref_transaction mechanism. If we do want to batch the fsync(3P) > calls, then we basically have two different alternatives: > > 1. We first lock all loose refs by creating the respective lock > files and writing the updated ref into that file. We keep the > file descriptor open such that we can then flush them all in one > go. > > 2. Same as before, we lock all refs and write the updated pointers > into the lockfiles, but this time we close each lockfile after > having written to it. Later, we reopen them all to fsync(3P) them > to disk. > > I'm afraid both alternatives aren't any good: the first alternative > risks running out of file descriptors if you queue up lots of refs. And > the second alternative is going to be slow, especially so on Windows if > I'm not mistaken. I agree the first is a dead end. I had imagined something like the second, but I agree that we'd have to measure the cost of re-opening. It's too bad there is not a syscall to sync a particular set of paths (or even better, a directory tree recursively). There is another option, though: the batch-fsync code for objects does a "cheap" fsync while we have the descriptor open, and then later triggers a to-disk sync on a separate file. My understanding is that this works because modern filesystems will make sure the data write is in the journal on the cheap sync, and then the separate-file sync makes sure the journal goes to disk. We could do something like that here. In fact, if you don't care about durability and just filesystem corruption, then you only care about the first sync (because the bad case is when the rename gets journaled but the data write doesn't). In fact, even if you did choose to re-open and fsync each one, that's still sequential. We'd need some way to tell the kernel to sync them all at once. The batch-fsync trickery above is one such way (I haven't tried, but I wonder if making a bunch of fsync calls in parallel would work similarly). > So with both not being feasible, we'll likely have to come up with a > more complex scheme if we want to batch-sync files. One idea would be to > fsync(3P) all lockfiles every $n refs, but it adds complexity in a place > where I'd really like to have things as simple as possible. It also > raises the question what $n would have to be. I do think syncing every $n would not be too hard to implement. It could all be hidden behind a state machine API that collects lock files and flushes when it sees fit. You'd just call a magic "batch_fsync_and_close" instead of "fsync() and close()", though you would have to remember to do a final flush call to tell it there are no more coming. -Peff
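For a sense of what the "cheap writeback now, one real flush later" trick
can look like, here is a hedged sketch. sync_file_range() is Linux-only, the
helper names are invented, and whether the final single fsync() really makes
the earlier writes crash-safe depends on the filesystem journaling behavior
described above; treat it purely as an illustration, not as the batch-fsync
series' actual approach.

    /*
     * Sketch only (Linux): start writeback for each lockfile while it is
     * still open, then later force everything to stable storage with one
     * fsync() of a scratch file on the same filesystem. Helper names are
     * invented; this is not Git's batch-fsync code.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Call while each lockfile is still open, instead of a full fsync(). */
    static int start_cheap_writeback(int fd)
    {
    #ifdef SYNC_FILE_RANGE_WRITE
            return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    #else
            return fsync(fd);       /* no cheap primitive: fall back to a real flush */
    #endif
    }

    /* Call once after all lockfiles are written, before renaming them. */
    static int flush_batch(const char *scratch_path)
    {
            int ret, fd = open(scratch_path, O_RDWR | O_CREAT, 0600);
            if (fd < 0)
                    return -1;
            ret = fsync(fd);        /* relies on the journal ordering described above */
            close(fd);
            return ret;
    }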
On Wed, Nov 10, 2021 at 03:36:59AM -0500, Jeff King wrote: > On Tue, Nov 09, 2021 at 12:25:46PM +0100, Patrick Steinhardt wrote: > > > So I've finally found the time to have another look at massaging this > > into the ref_transaction mechanism. If we do want to batch the fsync(3P) > > calls, then we basically have two different alternatives: > > > > 1. We first lock all loose refs by creating the respective lock > > files and writing the updated ref into that file. We keep the > > file descriptor open such that we can then flush them all in one > > go. > > > > 2. Same as before, we lock all refs and write the updated pointers > > into the lockfiles, but this time we close each lockfile after > > having written to it. Later, we reopen them all to fsync(3P) them > > to disk. > > > > I'm afraid both alternatives aren't any good: the first alternative > > risks running out of file descriptors if you queue up lots of refs. And > > the second alternative is going to be slow, especially so on Windows if > > I'm not mistaken. > > I agree the first is a dead end. I had imagined something like the > second, but I agree that we'd have to measure the cost of re-opening. > It's too bad there is not a syscall to sync a particular set of paths > (or even better, a directory tree recursively). > > There is another option, though: the batch-fsync code for objects does a > "cheap" fsync while we have the descriptor open, and then later triggers > a to-disk sync on a separate file. My understanding is that this works > because modern filesystems will make sure the data write is in the > journal on the cheap sync, and then the separate-file sync makes sure > the journal goes to disk. > > We could do something like that here. In fact, if you don't care about > durability and just filesystem corruption, then you only care about the > first sync (because the bad case is when the rename gets journaled but > the data write doesn't). Ah, interesting. That does sound like a good way forward to me, thanks for the pointers! Patrick > In fact, even if you did choose to re-open and fsync each one, that's > still sequential. We'd need some way to tell the kernel to sync them all > at once. The batch-fsync trickery above is one such way (I haven't > tried, but I wonder if making a bunch of fsync calls in parallel would > work similarly). > > > So with both not being feasible, we'll likely have to come up with a > > more complex scheme if we want to batch-sync files. One idea would be to > > fsync(3P) all lockfiles every $n refs, but it adds complexity in a place > > where I'd really like to have things as simple as possible. It also > > raises the question what $n would have to be. > > I do think syncing every $n would not be too hard to implement. It could > all be hidden behind a state machine API that collects lock files and > flushes when it sees fit. You'd just call a magic "batch_fsync_and_close" > instead of "fsync() and close()", though you would have to remember to > do a final flush call to tell it there are no more coming. > > -Peff
diff --git a/refs/files-backend.c b/refs/files-backend.c
index 151b0056fe..06a3f0bdea 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -1749,6 +1749,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
 	fd = get_lock_file_fd(&lock->lk);
 	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
 	    write_in_full(fd, &term, 1) < 0 ||
+	    fsync(fd) < 0 ||
 	    close_ref_gently(lock) < 0) {
 		strbuf_addf(err,
 			    "couldn't write '%s'", get_lock_file_path(&lock->lk));
When writing loose refs, we first create a lockfile, write the new ref
into that lockfile, close it and then rename the lockfile into place
such that the actual update is atomic for that single ref. While this
works as intended under normal circumstances, at GitLab we infrequently
encounter corrupt loose refs in repositories after a machine encountered
a hard reset. The corruption is always of the same type: the ref has
been committed into place, but it is completely empty.

The root cause of this is likely that we don't sync contents of the
lockfile to disk before renaming it into place. As a result, it's not
guaranteed that the contents are properly persisted and one may observe
weird in-between states on hard resets. Quoting ext4 documentation [1]:

    Many broken applications don't use fsync() when replacing existing
    files via patterns such as
    fd = open("foo.new")/write(fd,..)/close(fd)/rename("foo.new", "foo"),
    or worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If
    auto_da_alloc is enabled, ext4 will detect the replace-via-rename
    and replace-via-truncate patterns and force that any delayed
    allocation blocks are allocated such that at the next journal
    commit, in the default data=ordered mode, the data blocks of the new
    file are forced to disk before the rename() operation is committed.
    This provides roughly the same level of guarantees as ext3, and
    avoids the "zero-length" problem that can happen when a system
    crashes before the delayed allocation blocks are forced to disk.

This explicitly points out that one must call fsync(3P) before doing the
rename(3P) call, or otherwise data may not be correctly persisted to
disk.

Fix this by always flushing refs to disk before committing them into
place to avoid this class of corruption.

[1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 refs/files-backend.c | 1 +
 1 file changed, 1 insertion(+)