[v2,3/3] gitfaq: add entry about syncing working trees

Message ID	20211107225525.431138-4-sandals@crustytoothpaste.net (mailing list archive)
State	New, archived
Headers
    Return-Path: <git-owner@kernel.org>
    From: "brian m. carlson" <sandals@crustytoothpaste.net>
    To: <git@vger.kernel.org>
    Cc: Jeff King <peff@peff.net>, Johannes Schindelin <Johannes.Schindelin@gmx.de>, Bagas Sanjaya <bagasdotme@gmail.com>, Eric Sunshine <sunshine@sunshineco.com>, Junio C Hamano <gitster@pobox.com>, Derrick Stolee <dstolee@microsoft.com>
    Subject: [PATCH v2 3/3] gitfaq: add entry about syncing working trees
    Date: Sun, 7 Nov 2021 22:55:25 +0000
    Message-Id: <20211107225525.431138-4-sandals@crustytoothpaste.net>
    In-Reply-To: <20211107225525.431138-1-sandals@crustytoothpaste.net>
    References: <20211107225525.431138-1-sandals@crustytoothpaste.net>
    MIME-Version: 1.0
    Content-Transfer-Encoding: 8bit
    Precedence: bulk
Series	Additional FAQ entries ([v2,0/3] through [v2,3/3]; this is patch 3/3)

brian m. carlson Nov. 7, 2021, 10:55 p.m. UTC

Users very commonly want to sync their working tree with uncommitted
changes across machines, often to carry across in-progress work or
stashes.  Despite this not being a recommended approach, users want to
do it and are not dissuaded by suggestions not to, so let's recommend a
sensible technique.

The technique that many users are using is their preferred cloud syncing
service, which is a bad idea.  Users have reported problems where they
end up with duplicate files that won't go away (with names like "file.c
2"), broken references, oddly named references that have date stamps
appended to them, missing objects, and general corruption and data loss.
That's because almost all of these tools sync file by file, which is a
great technique if your project is a single word processing document or
spreadsheet, but is utterly abysmal for Git repositories because they
don't necessarily snapshot the entire repository correctly.  They also
tend to sync the files immediately instead of when the repository is
quiescent, so writing multiple files, as occurs during a commit or a gc,
can confuse the tools and lead to corruption.

We know that the old standby, rsync, is up to the task, provided that
the repository is quiescent, so let's suggest that and dissuade people
from using cloud syncing tools.  Let's tell people about common things
they should be aware of before doing this and that this is still
potentially risky.  Additionally, let's tell people that Git's security
model does not permit sharing working trees across users in case they
planned to do that.  While we'd still prefer users didn't try to do
this, hopefully this will lead them in a safer direction.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/gitfaq.txt | 47 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

Eric Sunshine Nov. 8, 2021, 12:12 a.m. UTC | #1

On Sun, Nov 7, 2021 at 5:55 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> Users very commonly want to sync their working tree with uncommitted
> changes across machines, often to carry across in-progress work or
> stashes.  Despite this not being a recommended approach, users want to
> do it and are not dissuaded by suggestions not to, so let's recommend a
> sensible technique.
> [...]
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
> @@ -185,6 +185,49 @@ Then, you can adjust your push URL to use `git@example_author` or
> +[[sync-working-tree]]
> +How do I sync a working tree across systems?::
> +       First, decide whether you want to do this at all.  Git usually works better
> +       when you push or pull your work using the typical `git push` and `git fetch`
> +       commands and isn't designed to share a working tree across systems.  This is
> +       potentially risky and in some cases can cause repository corruption or data
> +       loss.

The phrase "usually works better" makes this statement feel weak, so
it may not convey the potential severity of the issue. I wonder if
rewording it to something like this would make the statement more
forceful:

    Git works best when you `git push` and `git pull` your work
    between machines; it is not designed to share a working tree
    across systems. [...]

> +Usually, doing so will cause `git status` to need to re-read every file in the
> +working tree.  Additionally, Git's security model does not permit sharing a
> +working tree across untrusted users, so it is only safe to sync a working tree
> +if it will only be used by a single user across all machines.
> ++
> +It is important not to use a cloud syncing service to sync any portion of a Git
> +repository, since this can cause corruption, such as missing objects, changed
> +or added files, broken refs, and a wide variety of other corruption.  These
> +services tend to sync file by file and don't understand the structure of a Git
> +repository.  This is especially bad if they sync the repository in the middle of
> +it being updated, since that is very likely to cause incomplete or partial
> +updates and therefore data loss.

Taking into consideration that people who are experiencing such
corruption will likely include the name of the syncing service in
their search query, would it make sense to mention some well-known
services here in order to make it more likely that people will
actually find this entry? Something like this, perhaps:

    It is important not to use a cloud syncing service (such as Dropbox,
    FooBar, CowMoo, BuzzingBee, etc.) to sync any portion of a Git
    repository...

> +Therefore, it's better to push your work to either the other system or a central
> +server using the normal push and pull mechanism.  However, this doesn't always
> +preserve important data, like stashes, so some people prefer to share a working
> +tree across systems.
> ++
> +If you do this, the recommended approach is to use `rsync -a --delete-after`
> +(ideally with an encrypted connection such as with `ssh`) on the root of
> +repository.  You should ensure several things when you do this:
> ++
> +* There are no additional worktrees enabled for your repository.

I don't fully understand this restriction. Can you explain it (at
least here in the email discussion)?

> +* You are not using a separate Git directory outside of your repository root.

Same question about this restriction.

> +* You are comfortable with the destination directory being an exact copy of the
> +       source directory, _deleting any data that is already there_.
> +* The repository is in a quiescent state for the duration of the transfer (that
> +       is, no operations of any sort are taking place on it, including background
> +       operations like `git gc` and operations invoked by your editor).
> ++
> +Be aware that even with these recommendations, syncing in this way has some risk
> +since it bypasses Git's normal integrity checking for repositories, so having
> +backups is advised.  You may also with to do a `git fsck` to verify the
> +integrity of your data on the destination system after syncing.

s/with/wish/

In fact, as with "usually" above, "wish" may be too weak. Perhaps say
instead that it is "recommended" that you use `git fsck` to verify the
integrity.

Junio C Hamano Nov. 8, 2021, 10:09 p.m. UTC | #2

Eric Sunshine <sunshine@sunshineco.com> writes:

> Taking into consideration that people who are experiencing such
> corruption will likely include the name of the syncing service in
> their search query, would it make sense to mention some well-known
> services here in order to make it more likely that people will
> actually find this entry? Something like this, perhaps:
>
>     It is important not to use a cloud syncing service (such as Dropbox,
>     FooBar, CowMoo, BuzzingBee, etc.) to sync any portion of a Git
>     repository...

I do agree that in a repository being actively modified, any
backup/sync solution that works in a per-file fashion would not work
well.  But is "cloud" a good word to characterise and group these
per-file backup/sync solutions?

Doesn't rsync work in the same per-file fashion, and the only reason
why it is a better fit is because it is not continuous, not
attempting to "sync" while the repository is in use, until the user
explicitly says "OK, I am ready to go home, so let's stop working
here and send everything over to home with rsync"?

>> +* There are no additional worktrees enabled for your repository.
>
> I don't fully understand this restriction. Can you explain it (at
> least here in the email discussion)?
>
>> +* You are not using a separate Git directory outside of your repository root.
>
> Same question about this restriction.

As long as you know what you are doing and catch everything in
quiescent state, you should be fine, I would think.

Eric Sunshine Nov. 9, 2021, 12:02 a.m. UTC | #3

On Mon, Nov 8, 2021 at 5:34 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On 2021-11-08 at 00:12:14, Eric Sunshine wrote:
> > Taking into consideration that people who are experiencing such
> > corruption will likely include the name of the syncing service in
> > their search query, would it make sense to mention some well-known
> > services here in order to make it more likely that people will
> > actually find this entry? Something like this, perhaps:
> >
> >     It is important not to use a cloud syncing service (such as Dropbox,
> >     FooBar, CowMoo, BuzzingBee, etc.) to sync any portion of a Git
> >     repository...
>
> There are a lot of these services.  My preference is to not name
> specific ones here, just like we don't name specific ones when we state
> that you shouldn't use an antivirus or firewall other than the Windows
> default, mostly because I'm not interested in angering corporate
> lawyers.  My advice on this topic is always general: XYZ is a cloud
> syncing service, and you should not use any cloud syncing services for Git.

My "would it make sense" question arose from taking into consideration
how people are likely to use a search engine for a particular problem:

    BuzzingBee syncing corrupted git repository

Without naming specific services or tools, it seems much less likely
that people will find this FAQ entry, thus will end up posting to
StackOverflow (or wherever) anyhow -- despite your intention and
effort behind the FAQ in the first place. I'm thinking about
discoverability -- which is the same reason I suggested in my review
of the other patch that it might be a good idea to add actual error
messages a person might encounter with a CRLF shell script or an LF
batch file.

> Additionally, the popular options differ by region.  For example, there
> are some services which are probably popular in China which are not
> popular elsewhere, and I don't think it's a good idea to try to guess
> which ones happen to be most popular around the world.

The other way to look at it is that listing many popular services
(wherever they happen to be) makes it more likely that search engines
will lead them to this FAQ entry and alleviate the need to post a
question somewhere.

Anyhow, it was just a thought. I think I've said all I have to say on
this subject. As I mentioned in response to the cover letter, all of
my comments were of the bikeshedding variety, and I didn't see any
show-stoppers in the series as posted.

> > > +* There are no additional worktrees enabled for your repository.
> >
> > I don't fully understand this restriction. Can you explain it (at
> > least here in the email discussion)?
>
> If you sync the main repository and working tree, but not the worktrees
> themselves, then you end up with incorrect data in the .git directory,
> which contains information about the worktree.

I still don't follow. What incorrect data will be in the .git
directory? Do you mean the absolute path pointing from the .git
directory to the worktree?

> More importantly, it can
> contain absolute path information, which would be undesirable even if
> you did sync the worktrees, say, if you used two different usernames
> (and hence two different home directories) on the two systems.

Okay, I figured that that was probably one of your concerns, but it
was difficult to tell from the raw "no additional worktrees". On the
other hand, I can easily see people syncing between machines in which
they have the same username and same directory structure, so this
bullet point seems overly restrictive.

> > > +* You are not using a separate Git directory outside of your repository root.
> >
> > Same question about this restriction.
>
> Again, if you sync the root of the working tree but don't sync the
> separate Git directory, you won't have copied the index or any of the
> other data.

Okay, again, this is indeed what I thought you might have in mind, but
it was difficult to be sure. And, again, this seems overly restrictive
since there may be plenty of scenarios in which this works well.

I suppose both of these points would feel more reasonable if they
didn't sound like hard requirements, perhaps by explaining that the
simple case of no worktrees and no separate repository has the least
restrictions and is easiest to get up and running; more complex cases
can work too with the caveat that worktree and separate repository
directories ought to be synced too, and that pathnames need to be
identical on all machines.
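
As a rough sketch of how the "simple case" described above could be
detected before syncing, the check below reports additional worktrees and
a Git directory located outside the repository root. The function name
`check_simple_layout` and its messages are hypothetical; only the
`git rev-parse` and `git worktree list` calls are existing Git commands.

```shell
#!/bin/sh
# Sketch: succeed only for the simple layout (no linked worktrees and
# the Git directory at <root>/.git).  check_simple_layout is hypothetical.
check_simple_layout() {
    repo=$1
    top=$(git -C "$repo" rev-parse --show-toplevel) || return 1
    gitdir=$(git -C "$repo" rev-parse --absolute-git-dir) || return 1
    # "git worktree list --porcelain" prints one "worktree <path>" line
    # per working tree, including the main one.
    count=$(git -C "$repo" worktree list --porcelain | grep -c '^worktree ')
    if [ "$count" -gt 1 ]; then
        echo "additional worktrees present; sync them too and keep paths identical"
        return 1
    fi
    if [ "$gitdir" != "$top/.git" ]; then
        echo "Git directory is outside the repository root: $gitdir"
        return 1
    fi
    echo "simple layout"
}
```

A failing check does not mean that syncing cannot work. It only flags the
more complex cases, in which the worktree and separate repository
directories need to be synced as well.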

Junio C Hamano Nov. 9, 2021, 12:10 a.m. UTC | #4

Junio C Hamano <gitster@pobox.com> writes:

> Doesn't rsync work in the same per-file fashion, and the only reason
> why it is a better fit is because it is not continuous, not
> attempting to "sync" while the repository is in use, until the user
> explicitly says "OK, I am ready to go home, so let's stop working
> here and send everything over to home with rsync"?

OK, so not "per-file" but "continuous" is the root problem, and
"cloud" would be a good word because all the popular ones share that
"continuous" trait.

This part of the proposed patch text may need rethinking a bit.

> +or added files, broken refs, and a wide variety of other corruption.  These
> +services tend to sync file by file and don't understand the structure of a Git
> +repository.  This is especially bad if they sync the repository in the middle of

That is, "file by file" is not a problem per-se, "don't understand
the structure" is a huge problem, and "continuous" may contribute to
the problem further.

I wonder if you let the "cloud" services continuously sync your
repository, then go quiescent for a while and then start touching
the destination, it would be sufficient, though.  The refs with
funny "2" suffixes and the like are the symptom of two sides
simultaneously mucking with the same file (e.g. ".git/refs/main") and
the "cloud sync" could not decide what the right final outcome is,
right?

I also wonder if we add a way to transfer reflog entries, that imply
the object reachability, say "git push --with-reflog", over the
wire, it would be sufficient to do everything with Git.

Before you go home, you'd do

    git stash save --untracked && git stash apply
    git push --mirror --with-reflog --with-stash

to save away modified and untracked files to a stash entry [*], and push
all the refs with their reflog entries (including refs/stash which
normally gets refused because it has only two levels).

    Side note. If there were a variant of "git stash save" that only
    saves away without modifying the working tree and the index, I'd
    use that single command instead of "save and immediately restore
    by applying" kludge.

Then at the destination, you'd figure out what the current branch
was (the stash message should record the name of the branch), check
that branch out, and run "git stash pop", which will give you pretty
much the same environment.
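
The `--with-reflog` and `--with-stash` options above are proposals, not
existing options. As a partial approximation using commands that do exist,
`git stash create` records tracked, uncommitted changes as a commit
without modifying the working tree or the index, and that commit can then
be pushed to an ordinary ref. In the sketch below, the helper name
`push_wip` and the `refs/wip/<name>` namespace are assumptions made for
the example. Untracked files and reflog entries are not transferred.

```shell
#!/bin/sh
# Sketch: save tracked, uncommitted changes as a commit and push it.
# push_wip and refs/wip/<name> are hypothetical choices for this example.
# Run it from inside the working tree.
push_wip() {
    remote=$1
    name=$2
    # "git stash create" prints the ID of the new stash commit, or nothing
    # when there are no local changes.  It does not update refs/stash and
    # leaves the working tree and index as they are.
    commit=$(git stash create "wip on $(git rev-parse --abbrev-ref HEAD)") || return 1
    if [ -z "$commit" ]; then
        echo "nothing to save"
        return 0
    fi
    # A fully qualified destination ref is required when the source is a
    # raw object ID.  A later non-fast-forward update would need a force push.
    git push "$remote" "$commit:refs/wip/$name"
}
```

At the destination, fetching `refs/wip/<name>` and running
`git stash apply FETCH_HEAD` on the matching branch should bring back the
tracked changes, since `git stash apply` accepts any commit that has the
shape of a stash entry.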

brian m. carlson Nov. 14, 2021, 11:40 p.m. UTC | #5

On 2021-11-09 at 00:10:05, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Doesn't rsync work in the same per-file fashion, and the only reason
> > why it is a better fit is because it is not continuous, not
> > attempting to "sync" while the repository is in use, until the user
> > explicitly says "OK, I am ready to go home, so let's stop working
> > here and send everything over to home with rsync"?
> 
> OK, so not "per-file" but "continuous" is the root problem, and
> "cloud" would be a good word because all the popular ones share that
> "continuous" trait.
> 
> This part of the proposed patch text may need rethinking a bit.
> 
> > +or added files, broken refs, and a wide variety of other corruption.  These
> > +services tend to sync file by file and don't understand the structure of a Git
> > +repository.  This is especially bad if they sync the repository in the middle of
> 
> That is, "file by file" is not a problem per-se, "don't understand
> the structure" is a huge problem, and "continuous" may contribute to
> the problem further.

"File by file" is not a problem per se, but it is a problem if you don't
snapshot the repository or it's in use.  For example, if you used an LVM
snapshot on an in-use repository and then synced that, we'd guarantee
that the repository was fully consistent.  Practically nobody does that,
though.

But to clarify: the goal of this type of software is to sync single,
independent documents.  As such, these tools don't necessarily consider
that it's important to sync other files in the same directory at the
same time.  You can end up with cases where some parts of your
repository get synced, then some other files in other directories, then
other parts of the repository, resulting in different states over
different times.  There's just no guarantee about how they behave with
all of the files in a given repository because they aren't considered a
relevant unit.  That's why I mention that they sync file by file.  If
they did syncing directory by directory, we'd likely have a lot fewer
problems.

> I wonder if you let the "cloud" services continuously sync your
> repository, then go quiescent for a while and then start touching
> the destination, it would be sufficient, though.  The refs with
> funny "2" suffixes and the like are the symptom of two sides
> simultaneously mucking with the same file (e.g. ".git/refs/main") and
> the "cloud sync" could not decide what the right final outcome is,
> right?

I think this is very risky.  Yes, usually the duplicate files are caused
by competing changes on both sides, but having seen users with this
problem before, I'm not sure if we can safely guarantee this is the only
case.

> I also wonder if we add a way to transfer reflog entries, that imply
> the object reachability, say "git push --with-reflog", over the
> wire, it would be sufficient to do everything with Git.

Yes, please.  I'd love to see this.
