Message ID: 12cad737635663ed596e52f89f0f4f22f58bfe38.1632176111.git.gitgitgadget@gmail.com
State: Superseded
Series: Implement a batched fsync option for core.fsyncObjectFiles
On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote:

> When the new mode is enabled we do the following for new objects:
>
> 1. Create a tmp_obj_XXXX file and write the object data to it.
> 2. Issue a pagecache writeback request and wait for it to complete.
> 3. Record the tmp name and the final name in the bulk-checkin state for
>    later rename.
>
> At the end of the entire transaction we:
> 1. Issue a fsync against the lock file to flush the hardware writeback
>    cache, which should by now have processed the tmp file writes.
> 2. Rename all of the temp files to their final names.
> 3. When updating the index and/or refs, we assume that Git will issue
>    another fsync internal to that operation.

Perhaps note too that:

4. For loose objects, refs etc. we may or may not create directories,
   and most certainly will be updating metadata on the immediate
   directory containing the file, but none of that's fsync()'d.

> On a filesystem with a singular journal that is updated during name
> operations (e.g. create, link, rename, etc), such as NTFS and HFS+, we
> would expect the fsync to trigger a journal writeout so that this
> sequence is enough to ensure that the user's data is durable by the time
> the git command returns.
>
> This change also updates the macOS code to trigger a real hardware flush
> via fcntl(fd, F_FULLFSYNC) when fsync_or_die is called. Previously, on
> macOS there was no guarantee of durability since a simple fsync(2) call
> does not flush any hardware caches.

There's no discussion of whether this is or isn't known to also work on
some Linux FS's, and for the OS's where this does work, is it only the
object files themselves that are made durable, or does metadata also
"ride along"?

> _Performance numbers_:
>
> Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD.
> Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD.
> Windows - Same host as Linux, a preview version of Windows 11.
>   This number is from a patch later in the series.
> Adding 500 files to the repo with 'git add'. Times reported in seconds.
>
> core.fsyncObjectFiles | Linux | Mac   | Windows
> ----------------------|-------|-------|--------
> false                 | 0.06  | 0.35  | 0.61
> true                  | 1.88  | 11.18 | 2.47
> batch                 | 0.15  | 0.41  | 1.53

Per my https://lore.kernel.org/git/87mtp5cwpn.fsf@evledraar.gmail.com
and 6/6 in this series we've got perf tests for add/stash, but it would
be really interesting to see how this is impacted by transfer.unpackLimit
in cases where we may be writing packs or loose objects.

> [...]
> core.fsyncObjectFiles::
> -	This boolean will enable 'fsync()' when writing object files.
> -+
> -This is a total waste of time and effort on a filesystem that orders
> -data writes properly, but can be useful for filesystems that do not use
> -journalling (traditional UNIX filesystems) or that only journal metadata
> -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> +	A value indicating the level of effort Git will expend in
> +	trying to make objects added to the repo durable in the event
> +	of an unclean system shutdown. This setting currently only
> +	controls the object store, so updates to any refs or the
> +	index may not be equally durable.

All these mentions of "object" should really clarify that it's "loose
objects", i.e. we always fsync pack files.

> +* `false` allows data to remain in file system caches according to
> +  operating system policy, whence it may be lost if the system loses
> +  power or crashes.

As noted in point #4 of
https://lore.kernel.org/git/87mtp5cwpn.fsf@evledraar.gmail.com/ while
this direction is overall an improvement over the previously flippant
docs, they at least alluded to the context that the assumption behind
"false" is that you don't really care about loose objects, you care
about loose objects *and* the ref update or whatever.
As I think (this is from memory) we've covered already, this may have
all been based on some old ext3 assumption, but it's probably worth
summarizing that here, i.e. if you've got an FS with globally ordered
operations you can probably skip this, but probably not, etc.

> +* `true` triggers a data integrity flush for each object added to the
> +  object store. This is the safest setting that is likely to ensure
> +  durability across all operating systems and file systems that honor
> +  the 'fsync' system call. However, this setting comes with a
> +  significant performance cost on common hardware.

This is really overpromising things by omitting the fact that even if
we're getting this feature you've hacked up right, we're still not
fsyncing dir entries etc. (also noted above).

So something that describes the narrow scope here, along with "loose
objects" etc....

> +* `batch` enables an experimental mode that uses interfaces available
> +  in some operating systems to write object data with a minimal set of
> +  FLUSH CACHE (or equivalent) commands sent to the storage controller.
> +  If the operating system interfaces are not available, this mode
> +  behaves the same as `true`. This mode is expected to be safe on macOS
> +  for repos stored on HFS+ or APFS filesystems and on Windows for repos
> +  stored on NTFS or ReFS.

Again, even if it's called "core.fsyncObjectFiles", if we're going to say
"safe" we really need to say safe in what sense. Having written and
fsync()'d the file is helping nobody if the metadata never arrives....
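[Editor's note: the cost model behind `batch` (one pagecache writeout per
object, a single hardware flush at unplug time, renames only afterwards)
can be sketched as a toy simulation. The counters and helper names below
are invented stand-ins for the real I/O calls, not Git code.]

```c
/* counters standing in for the real I/O primitives: */
static int writeouts;	/* sync_file_range()-style pagecache writeback */
static int flushes;	/* fsync()/F_FULLFSYNC-style hardware flush */
static int renames;	/* finalize_object_file()-style rename */

/* per-object step: tmp file written, then writeback only (no flush) */
static void add_object(int fd)
{
	(void)fd;	/* a real version would pass the tmp file's fd */
	writeouts++;
}

/* end-of-transaction step: one flush, then rename everything */
static void unplug_batch(int nr_objects)
{
	int i;

	flushes++;	/* a single FLUSH CACHE covers the whole batch */
	for (i = 0; i < nr_objects; i++)
		renames++;	/* renames happen only after data is durable */
}
```

For 500 objects this model issues 500 cheap writeouts but only one
expensive flush, which is why `batch` lands near `false` in the table
above rather than near `true`.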
> +static void do_sync_and_rename(struct string_list *fsync_state, struct lock_file *lock_file)
> +{
> +	if (fsync_state->nr) {

I think less indentation here would be nice:

	if (!fsync_state->nr)
		return;
	/* rest of unindented body */

Or better yet do this check in unplug_bulk_checkin(), then here:

	fsync_or_die();
	for_each_string_list_item() { ... }
	string_list_clear(....);

> +		struct string_list_item *rename;
> +
> +		/*
> +		 * Issue a full hardware flush against the lock file to ensure
> +		 * that all objects are durable before any renames occur.
> +		 * The code in fsync_and_close_loose_object_bulk_checkin has
> +		 * already ensured that writeout has occurred, but it has not
> +		 * flushed any writeback cache in the storage hardware.
> +		 */
> +		fsync_or_die(get_lock_file_fd(lock_file), get_lock_file_path(lock_file));
> +
> +		for_each_string_list_item(rename, fsync_state) {
> +			const char *src = rename->string;
> +			const char *dst = rename->util;
> +
> +			if (finalize_object_file(src, dst))
> +				die_errno(_("could not rename '%s' to '%s'"), src, dst);
> +		}
> +
> +		string_list_clear(fsync_state, 1);
> +	}
> +}
> +
>  static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
>  {
>  	int i;
> @@ -256,6 +286,53 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
>  	return 0;
>  }
>
> +static void add_rename_bulk_checkin(struct string_list *fsync_state,
> +				    const char *src, const char *dst)
> +{
> +	string_list_insert(fsync_state, src)->util = xstrdup(dst);
> +}

Just has one caller, why not just inline the string_list_insert()
call...

> +int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile,
> +					      const char *filename, time_t mtime)
> +{
> +	int do_finalize = 1;
> +	int ret = 0;
> +
> +	if (fsync_object_files != FSYNC_OBJECT_FILES_OFF) {

Let's do positive enum comparisons, and with switch() statements, so the
compiler helps us to see if we've covered them all.
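[Editor's note: the switch-based shape suggested above could look
something like this sketch. `classify()` and its return convention are
invented for illustration, not part of the patch; the point is that a
switch with no default lets the compiler warn about an unhandled enum
value.]

```c
/* lowercase type name, per the style comment elsewhere in this review */
enum fsync_object_files_mode {
	FSYNC_OBJECT_FILES_OFF,
	FSYNC_OBJECT_FILES_ON,
	FSYNC_OBJECT_FILES_BATCH
};

/*
 * Hypothetical helper: decide what to do with a just-written loose
 * object. Returns -1 for "no sync needed", 1 for "defer to the batch",
 * and 0 for "fsync now".
 */
static int classify(enum fsync_object_files_mode mode, int plugged)
{
	switch (mode) {
	case FSYNC_OBJECT_FILES_OFF:
		return -1;
	case FSYNC_OBJECT_FILES_BATCH:
		if (plugged)
			return 1;
		/* fallthrough: no bulk checkin plugged, sync eagerly */
	case FSYNC_OBJECT_FILES_ON:
		return 0;
	}
	return 0;	/* not reached while the switch stays exhaustive */
}
```

Each mode now appears positively in exactly one case, so adding a fourth
mode would trigger a -Wswitch warning instead of silently falling into
the "not OFF" bucket.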
> +		/*
> +		 * If we have a plugged bulk checkin, we issue a call that
> +		 * cleans the filesystem page cache but avoids a hardware flush
> +		 * command. Later on we will issue a single hardware flush
> +		 * before renaming files as part of do_sync_and_rename.
> +		 */
> +		if (bulk_checkin_plugged &&
> +		    fsync_object_files == FSYNC_OBJECT_FILES_BATCH &&
> +		    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
> +			add_rename_bulk_checkin(&bulk_fsync_state, tmpfile, filename);
> +			do_finalize = 0;
> +
> +		} else {
> +			fsync_or_die(fd, "loose object file");
> +		}
> +	}

So nothing ever explicitly checks FSYNC_OBJECT_FILES_ON...?

> -extern int fsync_object_files;
> +enum FSYNC_OBJECT_FILES_MODE {
> +	FSYNC_OBJECT_FILES_OFF,
> +	FSYNC_OBJECT_FILES_ON,
> +	FSYNC_OBJECT_FILES_BATCH
> +};

Style: We don't use ALL_CAPS for type names in this codebase, just the
enum labels themselves....

> +extern enum FSYNC_OBJECT_FILES_MODE fsync_object_files;

...to the point where I had to rub my eyes to see what was going on here
... :)

> -	fsync_object_files = git_config_bool(var, value);
> +	if (value && !strcmp(value, "batch"))
> +		fsync_object_files = FSYNC_OBJECT_FILES_BATCH;
> +	else if (git_config_bool(var, value))
> +		fsync_object_files = FSYNC_OBJECT_FILES_ON;
> +	else
> +		fsync_object_files = FSYNC_OBJECT_FILES_OFF;

Since the point of this setting is safety, let's explicitly check
true/false here, use git_config_maybe_bool(), and perhaps issue a
warning on unknown values, but maybe that would get too verbose...

If we have a future "supersafe" mode, it'll get mapped to "false" on
older versions of git, probably not a good idea...
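[Editor's note: the parsing discipline being asked for above might look
like the following sketch. `parse_maybe_bool()` is a simplified
stand-in for git's real maybe-bool helper, and the "return -1 on
unknown" convention is invented here so a caller could warn() instead of
silently mapping a future keyword to false.]

```c
#include <string.h>
#include <strings.h>	/* strcasecmp(), POSIX */

enum fsync_object_files_mode {
	FSYNC_OBJECT_FILES_OFF,
	FSYNC_OBJECT_FILES_ON,
	FSYNC_OBJECT_FILES_BATCH
};

/* simplified stand-in for a maybe-bool parser: 1 = true, 0 = false, -1 = neither */
static int parse_maybe_bool(const char *value)
{
	if (!strcasecmp(value, "true") || !strcasecmp(value, "yes") ||
	    !strcasecmp(value, "on") || !strcmp(value, "1"))
		return 1;
	if (!strcasecmp(value, "false") || !strcasecmp(value, "no") ||
	    !strcasecmp(value, "off") || !strcmp(value, "0"))
		return 0;
	return -1;
}

/*
 * Returns 0 and sets *out on success; returns -1 on an unrecognized
 * value so the caller can emit a warning rather than treating a future
 * keyword (e.g. a hypothetical "supersafe") as false.
 */
static int parse_fsync_mode(const char *value,
			    enum fsync_object_files_mode *out)
{
	int b;

	if (!strcmp(value, "batch")) {
		*out = FSYNC_OBJECT_FILES_BATCH;
		return 0;
	}
	b = parse_maybe_bool(value);
	if (b < 0)
		return -1;
	*out = b ? FSYNC_OBJECT_FILES_ON : FSYNC_OBJECT_FILES_OFF;
	return 0;
}
```

The key property is that an unknown value is distinguishable from
`false`, which is exactly what the forward-compatibility concern above
is about.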
>  	return 0;
>  }
>
> diff --git a/config.mak.uname b/config.mak.uname
> index 76516aaa9a5..e6d482fbcc6 100644
> --- a/config.mak.uname
> +++ b/config.mak.uname
> @@ -53,6 +53,7 @@ ifeq ($(uname_S),Linux)
>  	HAVE_CLOCK_MONOTONIC = YesPlease
>  	# -lrt is needed for clock_gettime on glibc <= 2.16
>  	NEEDS_LIBRT = YesPlease
> +	HAVE_SYNC_FILE_RANGE = YesPlease
>  	HAVE_GETDELIM = YesPlease
>  	SANE_TEXT_GREP=-a
>  	FREAD_READS_DIRECTORIES = UnfortunatelyYes
> diff --git a/configure.ac b/configure.ac
> index 031e8d3fee8..c711037d625 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -1090,6 +1090,14 @@ AC_COMPILE_IFELSE([CLOCK_MONOTONIC_SRC],
>  	[AC_MSG_RESULT([no])
>  	HAVE_CLOCK_MONOTONIC=])
>  GIT_CONF_SUBST([HAVE_CLOCK_MONOTONIC])
> +
> +#
> +# Define HAVE_SYNC_FILE_RANGE=YesPlease if sync_file_range is available.
> +GIT_CHECK_FUNC(sync_file_range,
> +	[HAVE_SYNC_FILE_RANGE=YesPlease],
> +	[HAVE_SYNC_FILE_RANGE])
> +GIT_CONF_SUBST([HAVE_SYNC_FILE_RANGE])
> +
>  #
>  # Define NO_SETITIMER if you don't have setitimer.
>  GIT_CHECK_FUNC(setitimer,
> diff --git a/environment.c b/environment.c
> index d6b22ede7ea..3e23eafff80 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -43,7 +43,7 @@ const char *git_hooks_path;
>  int zlib_compression_level = Z_BEST_SPEED;
>  int core_compression_level;
>  int pack_compression_level = Z_DEFAULT_COMPRESSION;
> -int fsync_object_files;
> +enum FSYNC_OBJECT_FILES_MODE fsync_object_files;
>  size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
>  size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
>  size_t delta_base_cache_limit = 96 * 1024 * 1024;
> diff --git a/git-compat-util.h b/git-compat-util.h
> index b46605300ab..d14e2436276 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -1210,6 +1210,13 @@ __attribute__((format (printf, 1, 2))) NORETURN
>  void BUG(const char *fmt, ...);
>  #endif
>
> +enum fsync_action {
> +	FSYNC_WRITEOUT_ONLY,
> +	FSYNC_HARDWARE_FLUSH
> +};
> +
> +int git_fsync(int fd, enum fsync_action action);
> +
>  /*
>   * Preserves errno, prints a message, but gives no warning for ENOENT.
>   * Returns 0 on success, which includes trying to unlink an object that does
> diff --git a/object-file.c b/object-file.c
> index a8be8994814..ea14c3a3483 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1859,15 +1859,6 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf,
>  	return 0;
>  }
>
> -/* Finalize a file on disk, and close it. */
> -static void close_loose_object(int fd)
> -{
> -	if (fsync_object_files)
> -		fsync_or_die(fd, "loose object file");
> -	if (close(fd) != 0)
> -		die_errno(_("error when closing loose object file"));
> -}
> -
>  /* Size of directory component, including the ending '/' */
>  static inline int directory_size(const char *filename)
>  {
> @@ -1973,17 +1964,8 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
>  		die(_("confused by unstable object source data for %s"),
>  		    oid_to_hex(oid));
>
> -	close_loose_object(fd);
> -
> -	if (mtime) {
> -		struct utimbuf utb;
> -		utb.actime = mtime;
> -		utb.modtime = mtime;
> -		if (utime(tmp_file.buf, &utb) < 0)
> -			warning_errno(_("failed utime() on %s"), tmp_file.buf);
> -	}
> -
> -	return finalize_object_file(tmp_file.buf, filename.buf);
> +	return fsync_and_close_loose_object_bulk_checkin(fd, tmp_file.buf,
> +							 filename.buf, mtime);
>  }
>
>  static int freshen_loose_object(const struct object_id *oid)
> diff --git a/wrapper.c b/wrapper.c
> index 7c6586af321..bb4f9f043ce 100644
> --- a/wrapper.c
> +++ b/wrapper.c
> @@ -540,6 +540,50 @@ int xmkstemp_mode(char *filename_template, int mode)
>  	return fd;
>  }
>
> +int git_fsync(int fd, enum fsync_action action)
> +{
> +	switch (action) {
> +	case FSYNC_WRITEOUT_ONLY:
> +
> +#ifdef __APPLE__
> +		/*
> +		 * on macOS, fsync just causes filesystem cache writeback but
> +		 * does not flush hardware caches.
> +		 */
> +		return fsync(fd);
> +#endif
> +
> +#ifdef HAVE_SYNC_FILE_RANGE
> +		/*
> +		 * On linux 2.6.17 and above, sync_file_range is the way to
> +		 * issue a writeback without a hardware flush. An offset of 0
> +		 * and size of 0 indicates writeout of the entire file and the
> +		 * wait flags ensure that all dirty data is written to the disk
> +		 * (potentially in a disk-side cache) before we continue.
> +		 */
> +		return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
> +					SYNC_FILE_RANGE_WRITE |
> +					SYNC_FILE_RANGE_WAIT_AFTER);
> +#endif
> +
> +		errno = ENOSYS;
> +		return -1;
> +
> +	case FSYNC_HARDWARE_FLUSH:
> +
> +#ifdef __APPLE__
> +		return fcntl(fd, F_FULLFSYNC);
> +#else
> +		return fsync(fd);
> +#endif
> +
> +	default:
> +		BUG("unexpected git_fsync(%d) call", action);
> +	}
> +
> +}
> +
>  static int warn_if_unremovable(const char *op, const char *file, int rc)
>  {
>  	int err;
> diff --git a/write-or-die.c b/write-or-die.c
> index d33e68f6abb..8f53953d4ab 100644
> --- a/write-or-die.c
> +++ b/write-or-die.c
> @@ -57,7 +57,7 @@ void fprintf_or_die(FILE *f, const char *fmt, ...)
>
>  void fsync_or_die(int fd, const char *msg)
>  {
> -	while (fsync(fd) < 0) {
> +	while (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0) {
>  		if (errno != EINTR)
>  			die_errno("fsync error on '%s'", msg);
>  	}
On Tue, Sep 21, 2021 at 4:41 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote:
>
> > When the new mode is enabled we do the following for new objects:
> >
> > 1. Create a tmp_obj_XXXX file and write the object data to it.
> > 2. Issue a pagecache writeback request and wait for it to complete.
> > 3. Record the tmp name and the final name in the bulk-checkin state for
> >    later rename.
> >
> > At the end of the entire transaction we:
> > 1. Issue a fsync against the lock file to flush the hardware writeback
> >    cache, which should by now have processed the tmp file writes.
> > 2. Rename all of the temp files to their final names.
> > 3. When updating the index and/or refs, we assume that Git will issue
> >    another fsync internal to that operation.
>
> Perhaps note too that:
>
> 4. For loose objects, refs etc. we may or may not create directories,
>    and most certainly will be updating metadata on the immediate
>    directory containing the file, but none of that's fsync()'d.
>
> > On a filesystem with a singular journal that is updated during name
> > operations (e.g. create, link, rename, etc), such as NTFS and HFS+, we
> > would expect the fsync to trigger a journal writeout so that this
> > sequence is enough to ensure that the user's data is durable by the time
> > the git command returns.
> >
> > This change also updates the macOS code to trigger a real hardware flush
> > via fcntl(fd, F_FULLFSYNC) when fsync_or_die is called. Previously, on
> > macOS there was no guarantee of durability since a simple fsync(2) call
> > does not flush any hardware caches.
>
> There's no discussion of whether this is or isn't known to also work on
> some Linux FS's, and for these OS's where this does work is this only
> for the object files themselves, or does metadata also "ride along"?
>

I unfortunately can't examine Linux kernel source code, and the details
of metadata consistency behavior across files is not something that
anyone in that group wants to pin down. As far as I can tell, the only
thing that's really guaranteed is fsyncing every single file you write
down, and its parent directory if you're creating a new file (which we
always are). As came up in conversation with Christoph Hellwig elsewhere
on thread, Linux doesn't have any set of syscalls to make batch mode
safe. It does look like XFS would be safe if sync_file_range actually
promised to wait for all pagecache writeback definitively, since it
would do a "log force" to push all the dirty metadata to disk when we do
our final fsync.

I really didn't want to say something definitive about what Linux can or
will do, since I'm not in a position to really know or influence them.
Christoph did say that he would be interested in contributing a variant
to this patch that would be definitively safe on filesystems that honor
syncfs.

> > _Performance numbers_:
> >
> > Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD.
> > Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD.
> > Windows - Same host as Linux, a preview version of Windows 11.
> >   This number is from a patch later in the series.
> >
> > Adding 500 files to the repo with 'git add'. Times reported in seconds.
> >
> > core.fsyncObjectFiles | Linux | Mac   | Windows
> > ----------------------|-------|-------|--------
> > false                 | 0.06  | 0.35  | 0.61
> > true                  | 1.88  | 11.18 | 2.47
> > batch                 | 0.15  | 0.41  | 1.53
>
> Per my https://lore.kernel.org/git/87mtp5cwpn.fsf@evledraar.gmail.com
> and 6/6 in this series we've got perf tests for add/stash, but it would
> be really interesting to see how this is impacted by
> transfer.unpackLimit in cases where we may be writing packs or loose
> objects.

I'm having trouble understanding how unpackLimit is related to 'git
stash' or 'git add'. From code inspection, it doesn't look like we're
using those settings for adding objects except from across a transport.

Are you proposing that we have a similar setting for adding objects via
'add' using a packfile? I think that would be a good goal, but it might
be a bit tricky, since we've likely done a lot of the work to buffer the
input objects in order to compute their OIDs before we know how many
objects there are to add. If the policy were to "always add to a
packfile", it would be easier.

> > [...]
> > core.fsyncObjectFiles::
> > -	This boolean will enable 'fsync()' when writing object files.
> > -+
> > -This is a total waste of time and effort on a filesystem that orders
> > -data writes properly, but can be useful for filesystems that do not use
> > -journalling (traditional UNIX filesystems) or that only journal metadata
> > -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> > +	A value indicating the level of effort Git will expend in
> > +	trying to make objects added to the repo durable in the event
> > +	of an unclean system shutdown. This setting currently only
> > +	controls the object store, so updates to any refs or the
> > +	index may not be equally durable.
>
> All these mentions of "object" should really clarify that it's "loose
> objects", i.e. we always fsync pack files.
>
> > +* `false` allows data to remain in file system caches according to
> > +  operating system policy, whence it may be lost if the system loses
> > +  power or crashes.
>
> As noted in point #4 of
> https://lore.kernel.org/git/87mtp5cwpn.fsf@evledraar.gmail.com/ while
> this direction is overall an improvement over the previously flippant
> docs, they at least alluded to the context that the assumption behind
> "false" is that you don't really care about loose objects, you care
> about loose objects *and* the ref update or whatever.
>
> As I think (this is from memory) we've covered already this may have
> been all based on some old ext3 assumption, but it's probably worth
> summarizing that here, i.e. if you've got an FS with global ordered
> operations you can probably skip this, but probably not etc.
>
> > +* `true` triggers a data integrity flush for each object added to the
> > +  object store. This is the safest setting that is likely to ensure
> > +  durability across all operating systems and file systems that honor
> > +  the 'fsync' system call. However, this setting comes with a
> > +  significant performance cost on common hardware.
>
> This is really overpromising things by omitting the fact that even if
> we're getting this feature you've hacked up right, we're still not
> fsyncing dir entries etc. (also noted above).
>
> So something that describes the narrow scope here, along with "loose
> objects" etc....
>
> > +* `batch` enables an experimental mode that uses interfaces available
> > +  in some operating systems to write object data with a minimal set of
> > +  FLUSH CACHE (or equivalent) commands sent to the storage controller.
> > +  If the operating system interfaces are not available, this mode
> > +  behaves the same as `true`. This mode is expected to be safe on macOS
> > +  for repos stored on HFS+ or APFS filesystems and on Windows for repos
> > +  stored on NTFS or ReFS.
>
> Again, even if it's called "core.fsyncObjectFiles" if we're going to say
> "safe" we really need to say safe in what sense. Having written and
> fsync()'d the file is helping nobody if the metadata never arrives....
>

My concern with your feedback here is that this is user-facing
documentation. I'd assume that people who are not intimately familiar
with both their filesystem and Git's internals would just be completely
mystified by a long commentary on the specifics in the Config
documentation. I think over time Git should focus on making this setting
really guarantee durability in a meaningful way across the entire
repository.

> > +static void do_sync_and_rename(struct string_list *fsync_state, struct lock_file *lock_file)
> > +{
> > +	if (fsync_state->nr) {
>
> I think less indentation here would be nice:
>
>	if (!fsync_state->nr)
>		return;
>	/* rest of unindented body */
>

Will fix.

> Or better yet do this check in unplug_bulk_checkin(), then here:
>
>	fsync_or_die();
>	for_each_string_list_item() { ... }
>	string_list_clear(....);
>

I'd prefer to put it in the callee for reasons of separation-of-concerns.
I don't want to have the caller and callee partially implement the
contract. The compiler should do a good enough job, since it's only one
caller and will probably get totally inlined.

> > +		struct string_list_item *rename;
> > +
> > +		/*
> > +		 * Issue a full hardware flush against the lock file to ensure
> > +		 * that all objects are durable before any renames occur.
> > +		 * The code in fsync_and_close_loose_object_bulk_checkin has
> > +		 * already ensured that writeout has occurred, but it has not
> > +		 * flushed any writeback cache in the storage hardware.
> > +		 */
> > +		fsync_or_die(get_lock_file_fd(lock_file), get_lock_file_path(lock_file));
> > +
> > +		for_each_string_list_item(rename, fsync_state) {
> > +			const char *src = rename->string;
> > +			const char *dst = rename->util;
> > +
> > +			if (finalize_object_file(src, dst))
> > +				die_errno(_("could not rename '%s' to '%s'"), src, dst);
> > +		}
> > +
> > +		string_list_clear(fsync_state, 1);
> > +	}
> > +}
> > +
> >  static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
> >  {
> >  	int i;
> > @@ -256,6 +286,53 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
> >  	return 0;
> >  }
> >
> > +static void add_rename_bulk_checkin(struct string_list *fsync_state,
> > +				    const char *src, const char *dst)
> > +{
> > +	string_list_insert(fsync_state, src)->util = xstrdup(dst);
> > +}
>
> Just has one caller, why not just inline the string_list_insert()
> call...
>

I thought about doing that before. I'll do it.

> > +int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile,
> > +					      const char *filename, time_t mtime)
> > +{
> > +	int do_finalize = 1;
> > +	int ret = 0;
> > +
> > +	if (fsync_object_files != FSYNC_OBJECT_FILES_OFF) {
>
> Let's do positive enum comparisons, and with switch() statements, so the
> compiler helps us to see if we've covered them all.
>

Ok, will switch to switch.

> > +		/*
> > +		 * If we have a plugged bulk checkin, we issue a call that
> > +		 * cleans the filesystem page cache but avoids a hardware flush
> > +		 * command. Later on we will issue a single hardware flush
> > +		 * before renaming files as part of do_sync_and_rename.
> > +		 */
> > +		if (bulk_checkin_plugged &&
> > +		    fsync_object_files == FSYNC_OBJECT_FILES_BATCH &&
> > +		    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
> > +			add_rename_bulk_checkin(&bulk_fsync_state, tmpfile, filename);
> > +			do_finalize = 0;
> > +
> > +		} else {
> > +			fsync_or_die(fd, "loose object file");
> > +		}
> > +	}
>
> So nothing ever explicitly checks FSYNC_OBJECT_FILES_ON...?
>

Yeah, I did it this way to avoid any code duplication, but I can change
to a switch if it doesn't require too much repetition.

> > -extern int fsync_object_files;
> > +enum FSYNC_OBJECT_FILES_MODE {
> > +	FSYNC_OBJECT_FILES_OFF,
> > +	FSYNC_OBJECT_FILES_ON,
> > +	FSYNC_OBJECT_FILES_BATCH
> > +};
>
> Style: We don't use ALL_CAPS for type names in this codebase, just the
> enum labels themselves....
>
> > +extern enum FSYNC_OBJECT_FILES_MODE fsync_object_files;
>
> ...to the point where I had to rub my eyes to see what was going on here
> ... :)
>

Sorry, Windows Developer :). Will fix.

> > -	fsync_object_files = git_config_bool(var, value);
> > +	if (value && !strcmp(value, "batch"))
> > +		fsync_object_files = FSYNC_OBJECT_FILES_BATCH;
> > +	else if (git_config_bool(var, value))
> > +		fsync_object_files = FSYNC_OBJECT_FILES_ON;
> > +	else
> > +		fsync_object_files = FSYNC_OBJECT_FILES_OFF;
>
> Since the point of this setting is safety, let's explicitly check
> true/false here, use git_config_maybe_bool(), and perhaps issue a
> warning on unknown values, but maybe that would get too verbose...
>
> If we have a future "supersafe" mode, it'll get mapped to "false" on
> older versions of git, probably not a good idea...
>

I took Junio's suggestion verbatim. I'll try a warning if the value
exists, and is not 'batch' or <maybe bool>.

Thanks for looking at my changes so thoroughly!
-Neeraj
On Tue, Sep 21 2021, Neeraj Singh wrote: > On Tue, Sep 21, 2021 at 4:41 PM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> >> On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote: >> >> > When the new mode is enabled we do the following for new objects: >> > >> > 1. Create a tmp_obj_XXXX file and write the object data to it. >> > 2. Issue a pagecache writeback request and wait for it to complete. >> > 3. Record the tmp name and the final name in the bulk-checkin state for >> > later rename. >> > >> > At the end of the entire transaction we: >> > 1. Issue a fsync against the lock file to flush the hardware writeback >> > cache, which should by now have processed the tmp file writes. >> > 2. Rename all of the temp files to their final names. >> > 3. When updating the index and/or refs, we assume that Git will issue >> > another fsync internal to that operation. >> >> Perhaps note too that: >> >> 4. For loose objects, refs etc. we may or may not create directories, >> and most certainly will be updating metadata on the immediate >> directory containing the file, but none of that's fsync()'d. >> >> > On a filesystem with a singular journal that is updated during name >> > operations (e.g. create, link, rename, etc), such as NTFS and HFS+, we >> > would expect the fsync to trigger a journal writeout so that this >> > sequence is enough to ensure that the user's data is durable by the time >> > the git command returns. >> > >> > This change also updates the macOS code to trigger a real hardware flush >> > via fnctl(fd, F_FULLFSYNC) when fsync_or_die is called. Previously, on >> > macOS there was no guarantee of durability since a simple fsync(2) call >> > does not flush any hardware caches. >> >> There's no discussion of whether this is or isn't known to also work >> some Linux FS's, and for these OS's where this does work is this only >> for the object files themselves, or does metadata also "ride along"? 
>> > > I unfortunately can't examine Linux kernel source code and the details > of metadata > consistency behavior across files is not something that anyone in that > group wants > to pin down. As far as I can tell, the only thing that's really > guaranteed is fsyncing > every single file you write down and its parent directory if you're > creating a new file > (which we always are). As came up in conversation with Christoph > Hellwig elsewhere > on thread, Linux doesn't have any set of syscalls to make batch mode > safe. It does look > like XFS would be safe if sync_file_ranges actually promised to wait > for all pagecache > writeback definitively, since it would do a "log force" to push all > the dirty metadata to > disk when we do our final fsync. > > I really didn't want to say something definitive about what Linux can > or will do, since I'm > not in a position to really know or influence them. Christoph did say > that he would be > interested in contributing a variant to this patch that would be > definitively safe on filesystems > that honor syncfs. *nod*, it's fine if it's omitted. Just wondering if we knew but weren't saying etc. >> > _Performance numbers_: >> > >> > Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD. >> > Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD. >> > Windows - Same host as Linux, a preview version of Windows 11. >> > This number is from a patch later in the series. >> > >> > Adding 500 files to the repo with 'git add' Times reported in seconds. 
>> > >> > core.fsyncObjectFiles | Linux | Mac | Windows >> > ----------------------|-------|-------|-------- >> > false | 0.06 | 0.35 | 0.61 >> > true | 1.88 | 11.18 | 2.47 >> > batch | 0.15 | 0.41 | 1.53 >> >> Per my https://lore.kernel.org/git/87mtp5cwpn.fsf@evledraar.gmail.com >> and 6/6 in this series we've got perf tests for add/stash, but it would >> be really interesting to see how this is impacted by >> transfer.unpackLimit in cases where we may be writing packs or loose >> objects. > > I'm having trouble understanding how unpackLimit is related to 'git stash' > or 'git add'. From code inspection, it doesn't look like we're using > those settings > for adding objects except from across a transport. > > Are you proposing that we have a similar setting for adding objects > via 'add' using > a packfile? I think that would be a good goal, but it might be a bit > tricky since we've > likely done a lot of the work to buffer the input objects in order to > compute their OIDs, > before we know how many objects there are to add. If the policy were > to "always add to > a packfile", it would be easier. No, just that in the documentation that we should be explaining to the reader that this mode that optimizes for loose object writing benefits particular commands, but e.g. on the server-side that we'll probably never write 500 objects, but stream them to one pack. Which might also inform next steps for the commands this does help with, i.e. can we make more things stream to packs? I think having this mode is at worst a good transitory thing to have, but perhaps longer term we'll want to simply write fewer individual loose objects. In any case, pushing to a server with this configured and scaling that by transfer.unpackLimit should nicely demonstrate the pack v.s. loose object scenario at different fsck-settings. >> >> > [...] >> > core.fsyncObjectFiles:: >> > - This boolean will enable 'fsync()' when writing object files. 
>> > -+ >> > -This is a total waste of time and effort on a filesystem that orders >> > -data writes properly, but can be useful for filesystems that do not use >> > -journalling (traditional UNIX filesystems) or that only journal metadata >> > -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback"). >> > + A value indicating the level of effort Git will expend in >> > + trying to make objects added to the repo durable in the event >> > + of an unclean system shutdown. This setting currently only >> > + controls the object store, so updates to any refs or the >> > + index may not be equally durable. >> >> All these mentions of "object" should really clarify that it's "loose >> objects", i.e. we always fsync pack files. >> >> > +* `false` allows data to remain in file system caches according to >> > + operating system policy, whence it may be lost if the system loses power >> > + or crashes. >> >> As noted in point #4 of >> https://lore.kernel.org/git/87mtp5cwpn.fsf@evledraar.gmail.com/ while >> this direction is overall an improvement over the previously flippant >> docs, they at least alluded to the context that the assumption behind >> "false" is that you don't really care about loose objects, you care >> about loose objects *and* the ref update or whatever. >> >> As I think (this is from memory) we've covered already this may have >> been all based on some old ext3 assumption, but it's probably worth >> summarizing that here, i.e. if you've got an FS with global ordered >> operations you can probably skip this, but probably not etc. >> >> > +* `true` triggers a data integrity flush for each object added to the >> > + object store. This is the safest setting that is likely to ensure durability >> > + across all operating systems and file systems that honor the 'fsync' system >> > + call. However, this setting comes with a significant performance cost on >> > + common hardware. 
>>
>> This is really overpromising things by omitting the fact that even if
>> we're getting this feature you've hacked up right, we're still not
>> fsyncing dir entries etc. (also noted above).
>>
>> So something that describes the narrow scope here, along with "loose
>> objects" etc....
>>
>> > +* `batch` enables an experimental mode that uses interfaces available in some
>> > +  operating systems to write object data with a minimal set of FLUSH CACHE
>> > +  (or equivalent) commands sent to the storage controller. If the operating
>> > +  system interfaces are not available, this mode behaves the same as `true`.
>> > +  This mode is expected to be safe on macOS for repos stored on HFS+ or APFS
>> > +  filesystems and on Windows for repos stored on NTFS or ReFS.
>>
>> Again, even if it's called "core.fsyncObjectFiles", if we're going to say
>> "safe" we really need to say safe in what sense. Having written and
>> fsync()'d the file is helping nobody if the metadata never arrives....
>>

> My concern with your feedback here is that this is user-facing
> documentation. I'd assume that people who are not intimately familiar
> with both their filesystem and Git's internals would just be completely
> mystified by a long commentary on the specifics in the config
> documentation. I think over time Git should focus on making this setting
> really guarantee durability in a meaningful way across the entire
> repository.

Yeah, this setting though is probably going to be tweaked only by fairly
expert-level users of git. I think it's fine if it just explicitly punts
and says something like "this is what it does, this may or may not work
on your FS" etc.; my main issue with the current docs is that they give
off this vibe of knowing a lot more than they're telling you.
>> > +static void do_sync_and_rename(struct string_list *fsync_state, struct lock_file *lock_file)
>> > +{
>> > +	if (fsync_state->nr) {
>>
>> I think less indentation here would be nice:
>>
>>	if (!fsync_state->nr)
>>		return;
>>	/* rest of unindented body */
>>
>

> Will fix.
>

>> Or better yet do this check in unplug_bulk_checkin(), then here:
>>
>>	fsync_or_die();
>>	for_each_string_list_item() { ... }
>>	string_list_clear(....);
>>
>

> I'd prefer to put it in the callee for reasons of separation of
> concerns. I don't want to have the caller and callee partially
> implement the contract. The compiler should do a good enough job,
> since it's only one caller and it will probably get totally inlined.

*nod*

For what it's worth I meant the "inlined" just in terms of avoiding the
indirection for human readers; it won't matter to the machine,
especially since this is all I/O bound...
On Tue, Sep 21, 2021 at 6:23 PM Neeraj Singh <nksingh85@gmail.com> wrote:
>
> On Tue, Sep 21, 2021 at 4:41 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
> >
> >
> > On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote:
> > >
> > > -		fsync_object_files = git_config_bool(var, value);
> > > +		if (value && !strcmp(value, "batch"))
> > > +			fsync_object_files = FSYNC_OBJECT_FILES_BATCH;
> > > +		else if (git_config_bool(var, value))
> > > +			fsync_object_files = FSYNC_OBJECT_FILES_ON;
> > > +		else
> > > +			fsync_object_files = FSYNC_OBJECT_FILES_OFF;
> >
> > Since the point of this setting is safety, let's explicitly check
> > true/false here, use git_config_maybe_bool(), and perhaps issue a
> > warning on unknown values, but maybe that would get too verbose...
> >
> > If we have a future "supersafe" mode, it'll get mapped to "false" on
> > older versions of git, probably not a good idea...
> >
>
> I took Junio's suggestion verbatim. I'll try a warning if the value
> exists and is not 'batch' or <maybe bool>.

An update on this.
I tested out some values:

nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=batch add ./
fsync_object_files: 2
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=0 add ./
fsync_object_files: 0
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=1 add ./
fsync_object_files: 1
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=2 add ./
fsync_object_files: 1
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=barf add ./
fatal: bad boolean config value 'barf' for 'core.fsyncobjectfiles'
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=true add ./
fsync_object_files: 1
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=false add ./
fsync_object_files: 0
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=t add ./
fatal: bad boolean config value 't' for 'core.fsyncobjectfiles'
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=y add ./
fatal: bad boolean config value 'y' for 'core.fsyncobjectfiles'
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=yes add ./
fsync_object_files: 1
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=no add ./
fsync_object_files: 0
nksingh@neerajsi-x1:~/src/git$ ./git -c core.fsyncobjectfiles=nope add ./
fatal: bad boolean config value 'nope' for 'core.fsyncobjectfiles'

So I think the code already works like you are suggesting (thanks
Junio!).
diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..0006d90980d 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -548,12 +548,26 @@ core.whitespace::
 	errors. The default tab width is 8. Allowed values are 1 to 63.
 
 core.fsyncObjectFiles::
-	This boolean will enable 'fsync()' when writing object files.
-+
-This is a total waste of time and effort on a filesystem that orders
-data writes properly, but can be useful for filesystems that do not use
-journalling (traditional UNIX filesystems) or that only journal metadata
-and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
+	A value indicating the level of effort Git will expend in
+	trying to make objects added to the repo durable in the event
+	of an unclean system shutdown. This setting currently only
+	controls the object store, so updates to any refs or the
+	index may not be equally durable.
++
+* `false` allows data to remain in file system caches according to
+  operating system policy, whence it may be lost if the system loses power
+  or crashes.
+* `true` triggers a data integrity flush for each object added to the
+  object store. This is the safest setting that is likely to ensure durability
+  across all operating systems and file systems that honor the 'fsync' system
+  call. However, this setting comes with a significant performance cost on
+  common hardware.
+* `batch` enables an experimental mode that uses interfaces available in some
+  operating systems to write object data with a minimal set of FLUSH CACHE
+  (or equivalent) commands sent to the storage controller. If the operating
+  system interfaces are not available, this mode behaves the same as `true`.
+  This mode is expected to be safe on macOS for repos stored on HFS+ or APFS
+  filesystems and on Windows for repos stored on NTFS or ReFS.
 
 core.preloadIndex::
 	Enable parallel index preload for operations like 'git diff'
 
diff --git a/Makefile b/Makefile
index 429c276058d..326c7607e0f 100644
--- a/Makefile
+++ b/Makefile
@@ -406,6 +406,8 @@ all::
 #
 # Define HAVE_CLOCK_MONOTONIC if your platform has CLOCK_MONOTONIC.
 #
+# Define HAVE_SYNC_FILE_RANGE if your platform has sync_file_range.
+#
 # Define NEEDS_LIBRT if your platform requires linking with librt (glibc version
 # before 2.17) for clock_gettime and CLOCK_MONOTONIC.
 #
@@ -1896,6 +1898,10 @@ ifdef HAVE_CLOCK_MONOTONIC
 	BASIC_CFLAGS += -DHAVE_CLOCK_MONOTONIC
 endif
 
+ifdef HAVE_SYNC_FILE_RANGE
+	BASIC_CFLAGS += -DHAVE_SYNC_FILE_RANGE
+endif
+
 ifdef NEEDS_LIBRT
 	EXTLIBS += -lrt
 endif
diff --git a/builtin/add.c b/builtin/add.c
index 2244311d485..dda4bf093a0 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -678,7 +678,8 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 
 	if (chmod_arg && pathspec.nr)
 		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
-	unplug_bulk_checkin();
+
+	unplug_bulk_checkin(&lock_file);
 
 finish:
 	if (write_locked_index(&the_index, &lock_file,
diff --git a/bulk-checkin.c b/bulk-checkin.c
index f117d62c908..ddbab5e5c8c 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -3,15 +3,19 @@
  */
 #include "cache.h"
 #include "bulk-checkin.h"
+#include "lockfile.h"
 #include "repository.h"
 #include "csum-file.h"
 #include "pack.h"
 #include "strbuf.h"
+#include "string-list.h"
 #include "packfile.h"
 #include "object-store.h"
 
 static int bulk_checkin_plugged;
 
+static struct string_list bulk_fsync_state = STRING_LIST_INIT_DUP;
+
 static struct bulk_checkin_state {
 	char *pack_tmp_name;
 	struct hashfile *f;
@@ -62,6 +66,32 @@ clear_exit:
 	reprepare_packed_git(the_repository);
 }
 
+static void do_sync_and_rename(struct string_list *fsync_state, struct lock_file *lock_file)
+{
+	if (fsync_state->nr) {
+		struct string_list_item *rename;
+
+		/*
+		 * Issue a full hardware flush against the lock file to ensure
+		 * that all objects are durable before any renames occur.
+		 * The code in fsync_and_close_loose_object_bulk_checkin has
+		 * already ensured that writeout has occurred, but it has not
+		 * flushed any writeback cache in the storage hardware.
+		 */
+		fsync_or_die(get_lock_file_fd(lock_file), get_lock_file_path(lock_file));
+
+		for_each_string_list_item(rename, fsync_state) {
+			const char *src = rename->string;
+			const char *dst = rename->util;
+
+			if (finalize_object_file(src, dst))
+				die_errno(_("could not rename '%s' to '%s'"), src, dst);
+		}
+
+		string_list_clear(fsync_state, 1);
+	}
+}
+
 static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
 {
 	int i;
@@ -256,6 +286,53 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 	return 0;
 }
 
+static void add_rename_bulk_checkin(struct string_list *fsync_state,
+				    const char *src, const char *dst)
+{
+	string_list_insert(fsync_state, src)->util = xstrdup(dst);
+}
+
+int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile,
+					      const char *filename, time_t mtime)
+{
+	int do_finalize = 1;
+	int ret = 0;
+
+	if (fsync_object_files != FSYNC_OBJECT_FILES_OFF) {
+		/*
+		 * If we have a plugged bulk checkin, we issue a call that
+		 * cleans the filesystem page cache but avoids a hardware flush
+		 * command. Later on we will issue a single hardware flush
+		 * before renaming files as part of do_sync_and_rename.
+		 */
+		if (bulk_checkin_plugged &&
+		    fsync_object_files == FSYNC_OBJECT_FILES_BATCH &&
+		    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
+			add_rename_bulk_checkin(&bulk_fsync_state, tmpfile, filename);
+			do_finalize = 0;
+
+		} else {
+			fsync_or_die(fd, "loose object file");
+		}
+	}
+
+	if (close(fd))
+		die_errno(_("error when closing loose object file"));
+
+	if (mtime) {
+		struct utimbuf utb;
+		utb.actime = mtime;
+		utb.modtime = mtime;
+		if (utime(tmpfile, &utb) < 0)
+			warning_errno(_("failed utime() on %s"), tmpfile);
+	}
+
+	if (do_finalize)
+		ret = finalize_object_file(tmpfile, filename);
+
+	return ret;
+}
+
 int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags)
@@ -273,10 +350,12 @@ void plug_bulk_checkin(void)
 	bulk_checkin_plugged = 1;
 }
 
-void unplug_bulk_checkin(void)
+void unplug_bulk_checkin(struct lock_file *lock_file)
 {
 	assert(bulk_checkin_plugged);
 	bulk_checkin_plugged = 0;
 	if (bulk_checkin_state.f)
 		finish_bulk_checkin(&bulk_checkin_state);
+
+	do_sync_and_rename(&bulk_fsync_state, lock_file);
 }
diff --git a/bulk-checkin.h b/bulk-checkin.h
index b26f3dc3b74..4a3309c1531 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -6,11 +6,14 @@
 
 #include "cache.h"
 
+int fsync_and_close_loose_object_bulk_checkin(int fd, const char *tmpfile,
+					      const char *filename, time_t mtime);
+
 int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags);
 
 void plug_bulk_checkin(void);
-void unplug_bulk_checkin(void);
+void unplug_bulk_checkin(struct lock_file *);
 
 #endif
diff --git a/cache.h b/cache.h
index d23de693680..39b3a88181a 100644
--- a/cache.h
+++ b/cache.h
@@ -985,7 +985,13 @@ void reset_shared_repository(void);
 extern int read_replace_refs;
 extern char *git_replace_ref_base;
 
-extern int fsync_object_files;
+enum FSYNC_OBJECT_FILES_MODE {
+	FSYNC_OBJECT_FILES_OFF,
+	FSYNC_OBJECT_FILES_ON,
+	FSYNC_OBJECT_FILES_BATCH
+};
+
+extern enum FSYNC_OBJECT_FILES_MODE fsync_object_files;
 extern int core_preload_index;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index cb4a8058bff..1b403e00241 100644
--- a/config.c
+++ b/config.c
@@ -1509,7 +1509,12 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 	}
 
 	if (!strcmp(var, "core.fsyncobjectfiles")) {
-		fsync_object_files = git_config_bool(var, value);
+		if (value && !strcmp(value, "batch"))
+			fsync_object_files = FSYNC_OBJECT_FILES_BATCH;
+		else if (git_config_bool(var, value))
+			fsync_object_files = FSYNC_OBJECT_FILES_ON;
+		else
+			fsync_object_files = FSYNC_OBJECT_FILES_OFF;
 		return 0;
 	}
 
diff --git a/config.mak.uname b/config.mak.uname
index 76516aaa9a5..e6d482fbcc6 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -53,6 +53,7 @@ ifeq ($(uname_S),Linux)
 	HAVE_CLOCK_MONOTONIC = YesPlease
 	# -lrt is needed for clock_gettime on glibc <= 2.16
 	NEEDS_LIBRT = YesPlease
+	HAVE_SYNC_FILE_RANGE = YesPlease
 	HAVE_GETDELIM = YesPlease
 	SANE_TEXT_GREP=-a
 	FREAD_READS_DIRECTORIES = UnfortunatelyYes
diff --git a/configure.ac b/configure.ac
index 031e8d3fee8..c711037d625 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1090,6 +1090,14 @@ AC_COMPILE_IFELSE([CLOCK_MONOTONIC_SRC],
 	[AC_MSG_RESULT([no])
 	HAVE_CLOCK_MONOTONIC=])
 GIT_CONF_SUBST([HAVE_CLOCK_MONOTONIC])
+
+#
+# Define HAVE_SYNC_FILE_RANGE=YesPlease if sync_file_range is available.
+GIT_CHECK_FUNC(sync_file_range,
+	[HAVE_SYNC_FILE_RANGE=YesPlease],
+	[HAVE_SYNC_FILE_RANGE])
+GIT_CONF_SUBST([HAVE_SYNC_FILE_RANGE])
+
 #
 # Define NO_SETITIMER if you don't have setitimer.
 GIT_CHECK_FUNC(setitimer,
diff --git a/environment.c b/environment.c
index d6b22ede7ea..3e23eafff80 100644
--- a/environment.c
+++ b/environment.c
@@ -43,7 +43,7 @@ const char *git_hooks_path;
 int zlib_compression_level = Z_BEST_SPEED;
 int core_compression_level;
 int pack_compression_level = Z_DEFAULT_COMPRESSION;
-int fsync_object_files;
+enum FSYNC_OBJECT_FILES_MODE fsync_object_files;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 size_t delta_base_cache_limit = 96 * 1024 * 1024;
diff --git a/git-compat-util.h b/git-compat-util.h
index b46605300ab..d14e2436276 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1210,6 +1210,13 @@ __attribute__((format (printf, 1, 2)))
 NORETURN void BUG(const char *fmt, ...);
 #endif
 
+enum fsync_action {
+	FSYNC_WRITEOUT_ONLY,
+	FSYNC_HARDWARE_FLUSH
+};
+
+int git_fsync(int fd, enum fsync_action action);
+
 /*
  * Preserves errno, prints a message, but gives no warning for ENOENT.
  * Returns 0 on success, which includes trying to unlink an object that does
diff --git a/object-file.c b/object-file.c
index a8be8994814..ea14c3a3483 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1859,15 +1859,6 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf,
 	return 0;
 }
 
-/* Finalize a file on disk, and close it. */
-static void close_loose_object(int fd)
-{
-	if (fsync_object_files)
-		fsync_or_die(fd, "loose object file");
-	if (close(fd) != 0)
-		die_errno(_("error when closing loose object file"));
-}
-
 /* Size of directory component, including the ending '/' */
 static inline int directory_size(const char *filename)
 {
@@ -1973,17 +1964,8 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 		die(_("confused by unstable object source data for %s"),
 		    oid_to_hex(oid));
 
-	close_loose_object(fd);
-
-	if (mtime) {
-		struct utimbuf utb;
-		utb.actime = mtime;
-		utb.modtime = mtime;
-		if (utime(tmp_file.buf, &utb) < 0)
-			warning_errno(_("failed utime() on %s"), tmp_file.buf);
-	}
-
-	return finalize_object_file(tmp_file.buf, filename.buf);
+	return fsync_and_close_loose_object_bulk_checkin(fd, tmp_file.buf,
+							 filename.buf, mtime);
 }
 
 static int freshen_loose_object(const struct object_id *oid)
diff --git a/wrapper.c b/wrapper.c
index 7c6586af321..bb4f9f043ce 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -540,6 +540,50 @@ int xmkstemp_mode(char *filename_template, int mode)
 	return fd;
 }
 
+int git_fsync(int fd, enum fsync_action action)
+{
+	switch (action) {
+	case FSYNC_WRITEOUT_ONLY:
+
+#ifdef __APPLE__
+		/*
+		 * on macOS, fsync just causes filesystem cache writeback but does not
+		 * flush hardware caches.
+		 */
+		return fsync(fd);
+#endif
+
+#ifdef HAVE_SYNC_FILE_RANGE
+		/*
+		 * On linux 2.6.17 and above, sync_file_range is the way to issue
+		 * a writeback without a hardware flush. An offset of 0 and size of 0
+		 * indicates writeout of the entire file and the wait flags ensure that all
+		 * dirty data is written to the disk (potentially in a disk-side cache)
+		 * before we continue.
+		 */
+
+		return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
+						 SYNC_FILE_RANGE_WRITE |
+						 SYNC_FILE_RANGE_WAIT_AFTER);
+#endif
+
+		errno = ENOSYS;
+		return -1;
+
+	case FSYNC_HARDWARE_FLUSH:
+
+#ifdef __APPLE__
+		return fcntl(fd, F_FULLFSYNC);
+#else
+		return fsync(fd);
+#endif
+
+	default:
+		BUG("unexpected git_fsync(%d) call", action);
+	}
+
+}
+
 static int warn_if_unremovable(const char *op, const char *file, int rc)
 {
 	int err;
diff --git a/write-or-die.c b/write-or-die.c
index d33e68f6abb..8f53953d4ab 100644
--- a/write-or-die.c
+++ b/write-or-die.c
@@ -57,7 +57,7 @@ void fprintf_or_die(FILE *f, const char *fmt, ...)
 
 void fsync_or_die(int fd, const char *msg)
 {
-	while (fsync(fd) < 0) {
+	while (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0) {
 		if (errno != EINTR)
 			die_errno("fsync error on '%s'", msg);
 	}