[00/23,RFC] Builtin FSMonitor Feature

Message ID	pull.923.git.1617291666.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <pull.923.git.1617291666.gitgitgadget@gmail.com> From: "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> Date: Thu, 01 Apr 2021 15:40:42 +0000 Subject: [PATCH 00/23] [RFC] Builtin FSMonitor Feature Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Jeff Hostetler <jeffhost@microsoft.com> Precedence: bulk
Series	Builtin FSMonitor Feature \| expand [00/23,RFC] Builtin FSMonitor Feature [01/23] fsmonitor--daemon: man page and documentation [02/23] fsmonitor-ipc: create client routines for git-fsmonitor--daemon [03/23] config: FSMonitor is repository-specific [04/23] fsmonitor: introduce `core.useBuiltinFSMonitor` to call the daemon via IPC [05/23] fsmonitor--daemon: add a built-in fsmonitor daemon [06/23] fsmonitor--daemon: implement client command options [07/23] fsmonitor-fs-listen-win32: stub in backend for Windows [08/23] fsmonitor-fs-listen-macos: stub in backend for MacOS [09/23] fsmonitor--daemon: implement daemon command options [10/23] fsmonitor--daemon: add pathname classification [11/23] fsmonitor--daemon: define token-ids [12/23] fsmonitor--daemon: create token-based changed path cache [13/23] fsmonitor-fs-listen-win32: implement FSMonitor backend on Windows [14/23] fsmonitor-fs-listen-macos: add macos header files for FSEvent [15/23] fsmonitor-fs-listen-macos: implement FSEvent listener on MacOS [16/23] fsmonitor--daemon: implement handle_client callback [17/23] fsmonitor--daemon: periodically truncate list of modified files [18/23] fsmonitor--daemon:: introduce client delay for testing [19/23] fsmonitor--daemon: use a cookie file to sync with file system [20/23] fsmonitor: force update index when fsmonitor token advances [21/23] t7527: create test for fsmonitor--daemon [22/23] p7519: add fsmonitor--daemon [23/23] t7527: test status with untracked-cache and fsmonitor--daemon

Philippe Blain via GitGitGadget April 1, 2021, 3:40 p.m. UTC

This patch series adds a builtin FSMonitor daemon to Git.

This daemon uses platform-specific filesystem notifications to keep track of
changes to a working directory. It also listens over the "Simple IPC"
facility for client requests and responds with a list of files/directories
that have been recently modified.

Client commands, such as git status, already know how to request a list of
modified files via the FSMonitor Hook. This patch series teaches client
commands to talk directly to the daemon via IPC and avoid the overhead of
the hook API. (Hook process creation can be expensive on Windows.)

Since the daemon is a feature of Git, rather than a generic third-party tool
like Watchman, the daemon can format its response to be exactly what the
client needs, so there is no need for a hook process to proxy and reformat
the data. For example, when Watchman is used, Watchman responds in JSON and
the hook process (typically a PERL script) must parse it and convert it into
a simple NUL-delimited list. FSMonitor daemon responses are already in this
NUL-delimited format, so no processing is required.

The current daemon implementation is rather simple in that it just records
the set of files/directories that have changed. For example, it is not aware
of specific Git features, such as .gitignore and doesn't attempt to filter
out ignored files. Having a Git-specific daemon lets us explore such things
in the future.

Finally, having a builtin daemon eliminates the need for user to download
and install a third-party tool. This makes enterprise deployments simpler
since there are fewer parts to install, maintain, and updates to track.

This RFC version includes support for Windows and MacOS file system events.
A Linux version will be submitted in a later patch series.

This patch series is being previewed as an experimental feature in Git for
Windows v2.31.0.windows.1.

This patch series requires the jh/simple-ipc and jh/fsmonitor-prework patch
series.

Jeff Hostetler (21):
  fsmonitor--daemon: man page and documentation
  fsmonitor-ipc: create client routines for git-fsmonitor--daemon
  fsmonitor--daemon: add a built-in fsmonitor daemon
  fsmonitor--daemon: implement client command options
  fsmonitor-fs-listen-win32: stub in backend for Windows
  fsmonitor-fs-listen-macos: stub in backend for MacOS
  fsmonitor--daemon: implement daemon command options
  fsmonitor--daemon: add pathname classification
  fsmonitor--daemon: define token-ids
  fsmonitor--daemon: create token-based changed path cache
  fsmonitor-fs-listen-win32: implement FSMonitor backend on Windows
  fsmonitor-fs-listen-macos: add macos header files for FSEvent
  fsmonitor-fs-listen-macos: implement FSEvent listener on MacOS
  fsmonitor--daemon: implement handle_client callback
  fsmonitor--daemon: periodically truncate list of modified files
  fsmonitor--daemon:: introduce client delay for testing
  fsmonitor--daemon: use a cookie file to sync with file system
  fsmonitor: force update index when fsmonitor token advances
  t7527: create test for fsmonitor--daemon
  p7519: add fsmonitor--daemon
  t7527: test status with untracked-cache and fsmonitor--daemon

Johannes Schindelin (2):
  config: FSMonitor is repository-specific
  fsmonitor: introduce `core.useBuiltinFSMonitor` to call the daemon via
    IPC

 .gitignore                                   |    1 +
 Documentation/config/core.txt                |   45 +-
 Documentation/git-fsmonitor--daemon.txt      |  104 ++
 Documentation/git-update-index.txt           |    4 +-
 Documentation/githooks.txt                   |    3 +-
 Makefile                                     |   15 +
 builtin.h                                    |    1 +
 builtin/fsmonitor--daemon.c                  | 1611 ++++++++++++++++++
 builtin/update-index.c                       |    4 +-
 compat/fsmonitor/fsmonitor-fs-listen-macos.c |  484 ++++++
 compat/fsmonitor/fsmonitor-fs-listen-win32.c |  514 ++++++
 compat/fsmonitor/fsmonitor-fs-listen.h       |   49 +
 config.c                                     |    9 +-
 config.h                                     |    2 +-
 config.mak.uname                             |    4 +
 contrib/buildsystems/CMakeLists.txt          |    8 +
 fsmonitor--daemon.h                          |  142 ++
 fsmonitor-ipc.c                              |  153 ++
 fsmonitor-ipc.h                              |   48 +
 fsmonitor.c                                  |   32 +-
 git.c                                        |    1 +
 help.c                                       |    4 +
 repo-settings.c                              |    3 +
 repository.h                                 |    2 +
 t/perf/p7519-fsmonitor.sh                    |   37 +-
 t/t7527-builtin-fsmonitor.sh                 |  582 +++++++
 26 files changed, 3839 insertions(+), 23 deletions(-)
 create mode 100644 Documentation/git-fsmonitor--daemon.txt
 create mode 100644 builtin/fsmonitor--daemon.c
 create mode 100644 compat/fsmonitor/fsmonitor-fs-listen-macos.c
 create mode 100644 compat/fsmonitor/fsmonitor-fs-listen-win32.c
 create mode 100644 compat/fsmonitor/fsmonitor-fs-listen.h
 create mode 100644 fsmonitor--daemon.h
 create mode 100644 fsmonitor-ipc.c
 create mode 100644 fsmonitor-ipc.h
 create mode 100755 t/t7527-builtin-fsmonitor.sh


base-commit: f1725819714fbcd96c47ae5f14e00cc01045272f
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-923%2Fjeffhostetler%2Fbuiltin-fsmonitor-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-923/jeffhostetler/builtin-fsmonitor-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/923

Junio C Hamano April 16, 2021, 10:44 p.m. UTC | #1

"Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series adds a builtin FSMonitor daemon to Git.

This hasn't seen much (if any) activity for a few weeks.

Does that mean nobody (other than obviously the author and whoever
wanted to have this feature) is interested?

What does it need to get this topic unstuck?

> Finally, having a builtin daemon eliminates the need for user to download
> and install a third-party tool. This makes enterprise deployments simpler
> since there are fewer parts to install, maintain, and updates to track.
>
> This RFC version includes support for Windows and MacOS file system events.
> A Linux version will be submitted in a later patch series.

Johannes Schindelin April 20, 2021, 3:27 p.m. UTC | #2

Hi Junio,

On Fri, 16 Apr 2021, Junio C Hamano wrote:

> "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This patch series adds a builtin FSMonitor daemon to Git.
>
> This hasn't seen much (if any) activity for a few weeks.

It actually is a good sign: I integrated this into Git for Windows (as an
experimental feature) and am running with it for a couple of weeks already
(in _all_ worktrees, not just the massively large ones).

At first, I ran into a handful Blue Screens of Death, and I was worried
that they should be attributed to FSMonitor. But it turns out that this
issue was most likely caused by a Windows update, and semi-resolved with
another Windows update (and only happens because I use WSL extensively).
In other words, those crashes are not related to FSMonitor.

So yeah, I find the lack of activity pretty good news.

However, I would have hoped that this patch series would see a couple of
reviews in the meantime. Since I was involved in the development of this
patch series (I started it just before I got dragged into all that
security work that led to v2.24.1, and never quite got back to it after
that), I wondered whether it would be "self review" if I reviewed those
patches, which is something I'd rather avoid.

But if nobody else reviews the patches, I will.

> Does that mean nobody (other than obviously the author and whoever
> wanted to have this feature) is interested?

The most likely reason why this does not see more reviews is that it
matters most for massive worktrees, and I don't think anybody here works
with those. The closest to a massive worktree I have is the `git-sdk-64`
repository (which has pretty much nothing to do with source code at all,
it is just a matter of convenience that this is a Git repository; Think of
it as if somebody mirrored their Ubuntu installation by tracking it in a
Git repository and cloning it onto all of their machines). And that is not
really all that massive:

	$ git -C / ls-files | wc -l
	162975

That's tiny compared to some worktrees I saw.

But we should not mistake the needs of those on the Git mailing list (`git
ls-tree -r v2.31.1 | wc -l` says we have only 3901 files/symlinks) for the
needs of some of our biggest users.

So I would like to respectfully ask for this patch series to be kept under
consideration for `next`.

> What does it need to get this topic unstuck?

The same resource that you keep complaining about, and that seems to be
drained more quickly than it can be replenished: reviewers.

I am as guilty as the next person, of course, and it does not help that I
get Cc:ed on several dozen patches seemingly every couple of days: this is
just too much, and I cannot do it, so I admittedly neglect too many patch
series (even the ones that I _do_ want to review, such as the
`bisect-in-c` one). My inbox is seriously no fun place to visit right now.

> > Finally, having a builtin daemon eliminates the need for user to download
> > and install a third-party tool. This makes enterprise deployments simpler
> > since there are fewer parts to install, maintain, and updates to track.
> >
> > This RFC version includes support for Windows and MacOS file system events.
> > A Linux version will be submitted in a later patch series.

I guess this is another reason why this patch series did not see many
reviews: the lack of a Linux backend. And I fear that the statement "A
Linux version will be submitted in a later patch series" is a bit strong,
given that my original implementation of that backend does not really do
its job well: it uses `inotify` and therefore requires one handle _per
directory_, which in turn drains the number of file handles rather quickly
when your worktree has many directories. Meaning: It fails todoes not work in the
massive worktrees for which it was intended.

Now, I heard rumors that there is a saner way to monitor directory trees
in recent Linux kernel versions (Jeff, can you fill in where I am
blanking?) and it might be a good idea to solicit volunteers to tackle
this backend, so that the Linux-leaning crowd on this here mailing list
is interested a bit more?

Ciao,
Dscho

Jeff Hostetler April 20, 2021, 7:13 p.m. UTC | #3

On 4/20/21 11:27 AM, Johannes Schindelin wrote:
> Hi Junio,
...
>>> This RFC version includes support for Windows and MacOS file system events.
>>> A Linux version will be submitted in a later patch series.
> 
> I guess this is another reason why this patch series did not see many
> reviews: the lack of a Linux backend. And I fear that the statement "A
> Linux version will be submitted in a later patch series" is a bit strong,
> given that my original implementation of that backend does not really do
> its job well: it uses `inotify` and therefore requires one handle _per
> directory_, which in turn drains the number of file handles rather quickly
> when your worktree has many directories. Meaning: It fails todoes not work in the
> massive worktrees for which it was intended.
> 
> Now, I heard rumors that there is a saner way to monitor directory trees
> in recent Linux kernel versions (Jeff, can you fill in where I am
> blanking?) and it might be a good idea to solicit volunteers to tackle
> this backend, so that the Linux-leaning crowd on this here mailing list
> is interested a bit more?

Yes, I removed the early inotify-based version because the kernel limits
the number of inotify handles to 8k (at least on my Mint box) and that
is a global limit -- shared by any process wanting to use inotify.
The first monorepo that I tried it on had 120K directories in my
sparse checkout...

I'm told there is a newer "fanotify" facility available in newer
Linux kernels that behaves more like Windows and MacOS and handles
subdirectories.  I intend to jump into that shortly (unless someone
is already familiar with fanotify and and wants to try it).

Jeff

Derrick Stolee April 21, 2021, 1:17 p.m. UTC | #4

On 4/20/2021 11:27 AM, Johannes Schindelin wrote:
> Hi Junio,
> 
> On Fri, 16 Apr 2021, Junio C Hamano wrote:
> 
>> "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> This patch series adds a builtin FSMonitor daemon to Git.
>>
>> This hasn't seen much (if any) activity for a few weeks.
...
>> What does it need to get this topic unstuck?
> 
> The same resource that you keep complaining about, and that seems to be
> drained more quickly than it can be replenished: reviewers.

I purposefully stayed away from reviewing the series since we are on
the same team, but I have _not_ been involved in the development. At
least that lets me have fresh eyes.

If no external community members are willing to review it, then I will
dedicate time for a careful review this week.

Thanks,
-Stolee

Derrick Stolee April 27, 2021, 7:31 p.m. UTC | #5

On 4/1/21 11:40 AM, Jeff Hostetler via GitGitGadget wrote:
> This patch series adds a builtin FSMonitor daemon to Git.
> 
> This daemon uses platform-specific filesystem notifications to keep track of
> changes to a working directory. It also listens over the "Simple IPC"
> facility for client requests and responds with a list of files/directories
> that have been recently modified.
...
> This RFC version includes support for Windows and MacOS file system events.
> A Linux version will be submitted in a later patch series.

Similarly to my message about testing the Windows performance, I
repeated those tests on macOS.

The same testing procedure was used, except now I'm on a MacBook
Pro laptop instead of a desktop, so the CPU power is likely to be
significantly less.

However, I am pleased to report that the FS Monitor feature is
a clear winner in all scenarios. Using the untracked cache is
still highly recommended, but not necessary in order to get a
speed boost from the builtin FS Montiro.


Sparse Index Disabled, Untracked Cache Enabled
----------------------------------------------

Benchmark #1: none (clean)
  Time (mean ± σ):      3.980 s ±  0.026 s    [User: 919.1 ms, System: 1891.8 ms]
  Range (min … max):    3.940 s …  4.028 s    10 runs
 
Benchmark #2: builtin (clean)
  Time (mean ± σ):     477.9 ms ±   6.6 ms    [User: 772.9 ms, System: 379.7 ms]
  Range (min … max):   468.1 ms … 489.5 ms    10 runs
 
Summary
  'builtin (clean)' ran
    8.33 ± 0.13 times faster than 'none (clean)'

Benchmark #1: none (dirty)
  Time (mean ± σ):      5.411 s ±  0.199 s    [User: 2.993 s, System: 4.120 s]
  Range (min … max):    5.026 s …  5.756 s    10 runs
 
Benchmark #2: builtin (dirty)
  Time (mean ± σ):      2.588 s ±  0.025 s    [User: 3.752 s, System: 2.853 s]
  Range (min … max):    2.540 s …  2.628 s    10 runs
 
Summary
  'builtin (dirty)' ran
    2.09 ± 0.08 times faster than 'none (dirty)'

Sparse Index Disabled, Untracked Cache Disabled
-----------------------------------------------

Benchmark #1: none (clean)
  Time (mean ± σ):      2.993 s ±  0.115 s    [User: 1.562 s, System: 2.289 s]
  Range (min … max):    2.741 s …  3.167 s    10 runs
 
Benchmark #2: builtin (clean)
  Time (mean ± σ):     939.4 ms ±  10.1 ms    [User: 1.452 s, System: 1.519 s]
  Range (min … max):   925.1 ms … 961.0 ms    10 runs
 
Summary
  'builtin (clean)' ran
    3.19 ± 0.13 times faster than 'none (clean)'

Benchmark #1: none (dirty)
  Time (mean ± σ):      8.245 s ±  1.118 s    [User: 3.204 s, System: 5.684 s]
  Range (min … max):    5.927 s …  8.985 s    10 runs
 
Benchmark #2: builtin (dirty)
  Time (mean ± σ):      2.969 s ±  0.034 s    [User: 3.832 s, System: 3.160 s]
  Range (min … max):    2.927 s …  3.023 s    10 runs
 
Summary
  'builtin (dirty)' ran
    2.78 ± 0.38 times faster than 'none (dirty)'


Sparse Index Enabled, Untracked Cache Enabled
---------------------------------------------

Benchmark #1: none (clean)
  Time (mean ± σ):      1.250 s ±  0.050 s    [User: 216.9 ms, System: 1836.9 ms]
  Range (min … max):    1.177 s …  1.300 s    10 runs
 
Benchmark #2: builtin (clean)
  Time (mean ± σ):      89.3 ms ±   2.9 ms    [User: 51.3 ms, System: 22.6 ms]
  Range (min … max):    81.9 ms …  93.5 ms    31 runs
 
Summary
  'builtin (clean)' ran
   14.01 ± 0.72 times faster than 'none (clean)'

Benchmark #1: none (dirty)
  Time (mean ± σ):      2.087 s ±  0.095 s    [User: 320.9 ms, System: 3327.5 ms]
  Range (min … max):    1.943 s …  2.242 s    10 runs
 
Benchmark #2: builtin (dirty)
  Time (mean ± σ):     233.5 ms ±   2.7 ms    [User: 165.5 ms, System: 74.1 ms]
  Range (min … max):   227.8 ms … 237.1 ms    12 runs
 
Summary
  'builtin (dirty)' ran
    8.94 ± 0.42 times faster than 'none (dirty)'


Sparse Index Enabled, Untracked Cache Disabled
----------------------------------------------

Benchmark #1: none (clean)
  Time (mean ± σ):      1.277 s ±  0.101 s    [User: 215.5 ms, System: 1877.9 ms]
  Range (min … max):    1.138 s …  1.458 s    10 runs
 
Benchmark #2: builtin (clean)
  Time (mean ± σ):     300.0 ms ±   6.1 ms    [User: 119.4 ms, System: 183.1 ms]
  Range (min … max):   293.0 ms … 313.2 ms    10 runs
 
Summary
  'builtin (clean)' ran
    4.26 ± 0.35 times faster than 'none (clean)'
Benchmark #1: none (dirty)
  Time (mean ± σ):      2.488 s ±  0.088 s    [User: 432.6 ms, System: 3631.6 ms]
  Range (min … max):    2.328 s …  2.601 s    10 runs
 
Benchmark #2: builtin (dirty)
  Time (mean ± σ):     636.4 ms ±  12.8 ms    [User: 266.2 ms, System: 374.0 ms]
  Range (min … max):   624.4 ms … 671.0 ms    10 runs

Summary
  'builtin (dirty)' ran
    3.91 ± 0.16 times faster than 'none (dirty)'


Here are my results for the Git repository:

Untracked Cache Enabled
-----------------------

Benchmark #1: none (clean)
  Time (mean ± σ):      51.2 ms ±   4.0 ms    [User: 12.9 ms, System: 61.2 ms]
  Range (min … max):    46.2 ms …  65.7 ms    54 runs
 
Benchmark #2: builtin (clean)
  Time (mean ± σ):      38.6 ms ±   1.7 ms    [User: 9.9 ms, System: 9.7 ms]
  Range (min … max):    28.6 ms …  42.4 ms    75 runs
 
Summary
  'builtin (clean)' ran
    1.33 ± 0.12 times faster than 'none (clean)'

Benchmark #1: none (dirty)
  Time (mean ± σ):     108.1 ms ±   7.2 ms    [User: 27.2 ms, System: 126.9 ms]
  Range (min … max):    97.6 ms … 130.4 ms    25 runs
 
Benchmark #2: builtin (dirty)
  Time (mean ± σ):      91.7 ms ±   3.8 ms    [User: 25.4 ms, System: 27.0 ms]
  Range (min … max):    88.5 ms … 105.1 ms    32 runs
 
Summary
  'builtin (dirty)' ran
    1.18 ± 0.09 times faster than 'none (dirty)'


Untracked Cache Disabled
------------------------

Benchmark #1: none (clean)
  Time (mean ± σ):      59.5 ms ±   4.0 ms    [User: 15.2 ms, System: 67.7 ms]
  Range (min … max):    55.5 ms …  71.6 ms    46 runs
 
Benchmark #2: builtin (clean)
  Time (mean ± σ):      48.9 ms ±   1.0 ms    [User: 12.5 ms, System: 17.3 ms]
  Range (min … max):    46.7 ms …  51.3 ms    58 runs
 
Summary
  'builtin (clean)' ran
    1.22 ± 0.08 times faster than 'none (clean)'

Benchmark #1: none (dirty)
  Time (mean ± σ):     124.4 ms ±   6.8 ms    [User: 31.5 ms, System: 140.2 ms]
  Range (min … max):   116.8 ms … 140.0 ms    24 runs
 
Benchmark #2: builtin (dirty)
  Time (mean ± σ):     104.1 ms ±   1.7 ms    [User: 27.4 ms, System: 37.8 ms]
  Range (min … max):    99.7 ms … 106.6 ms    27 runs
 
Summary
  'builtin (dirty)' ran
    1.19 ± 0.07 times faster than 'none (dirty)'

I think it valuable to point out that in my initial tests I had forgotten
to disable the Watchman-based FS Monitor hook, and the results looked even
more impressive (on the small Git repository). Dropping the hook overhead is
a huge benefit here.

Thanks,
-Stolee

[00/23,RFC] Builtin FSMonitor Feature

Message

Comments