Message ID | 20211107225525.431138-4-sandals@crustytoothpaste.net (mailing list archive) |
---|---|
State | New, archived |
Series | Additional FAQ entries |
On Sun, Nov 7, 2021 at 5:55 PM brian m. carlson <sandals@crustytoothpaste.net> wrote:
> Users very commonly want to sync their working tree with uncommitted
> changes across machines, often to carry across in-progress work or
> stashes. Despite this not being a recommended approach, users want to
> do it and are not dissuaded by suggestions not to, so let's recommend a
> sensible technique.
> [...]
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
> @@ -185,6 +185,49 @@ Then, you can adjust your push URL to use `git@example_author` or
> +[[sync-working-tree]]
> +How do I sync a working tree across systems?::
> +	First, decide whether you want to do this at all. Git usually works better
> +	when you push or pull your work using the typical `git push` and `git fetch`
> +	commands and isn't designed to share a working tree across systems. This is
> +	potentially risky and in some cases can cause repository corruption or data
> +	loss.

The phrase "usually works better" makes this statement feel weak, thus
it may not convey the potential severity of the issue. I wonder if
rewording it something like this would make the statement more
forceful:

    Git works best when you `git push` and `git pull` your work
    between machines; it is not designed to share a working tree
    across systems. [...]

> +Usually, doing so will cause `git status` to need to re-read every file in the
> +working tree. Additionally, Git's security model does not permit sharing a
> +working tree across untrusted users, so it is only safe to sync a working tree
> +if it will only be used by a single user across all machines.
> ++
> +It is important not to use a cloud syncing service to sync any portion of a Git
> +repository, since this can cause corruption, such as missing objects, changed
> +or added files, broken refs, and a wide variety of other corruption. These
> +services tend to sync file by file and don't understand the structure of a Git
> +repository. This is especially bad if they sync the repository in the middle of
> +it being updated, since that is very likely to cause incomplete or partial
> +updates and therefore data loss.

Taking into consideration that people who are experiencing such
corruption will likely include the name of the syncing service in
their search query, would it make sense to mention some well-known
services here in order to make it more likely that people will
actually find this entry? Something like this, perhaps:

    It is important not to use a cloud syncing service (such as DropBox,
    FooBar, CowMoo, BuzzingBee, etc.) to sync any portion of a Git
    repository...

> +Therefore, it's better to push your work to either the other system or a central
> +server using the normal push and pull mechanism. However, this doesn't always
> +preserve important data, like stashes, so some people prefer to share a working
> +tree across systems.
> ++
> +If you do this, the recommended approach is to use `rsync -a --delete-after`
> +(ideally with an encrypted connection such as with `ssh`) on the root of
> +repository. You should ensure several things when you do this:
> ++
> +* There are no additional worktrees enabled for your repository.

I don't fully understand this restriction. Can you explain it (at
least here in the email discussion)?

> +* You are not using a separate Git directory outside of your repository root.

Same question about this restriction.

> +* You are comfortable with the destination directory being an exact copy of the
> +  source directory, _deleting any data that is already there_.
> +* The repository is in a quiescent state for the duration of the transfer (that
> +  is, no operations of any sort are taking place on it, including background
> +  operations like `git gc` and operations invoked by your editor).
> ++
> +Be aware that even with these recommendations, syncing in this way has some risk
> +since it bypasses Git's normal integrity checking for repositories, so having
> +backups is advised. You may also with to do a `git fsck` to verify the
> +integrity of your data on the destination system after syncing.

s/with/wish/

In fact, as with "usually" above, "wish" may be too weak. Perhaps say
instead that it is "recommended" that you use `git fsck` to verify the
integrity.
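
For reference, the procedure described in the quoted FAQ text (a one-shot `rsync -a --delete-after` of the repository root, followed by `git fsck` on the destination) might be sketched roughly as follows. The helper name `sync_tree` and its arguments are hypothetical and not part of Git or the patch. The sketch assumes the destination is a locally reachable path and that nothing touches the repository while it runs. For a remote machine the destination would be written as `host:path` (rsync normally uses ssh for that), and `git fsck` would have to be run on that host instead.

```shell
#!/bin/sh
# sync_tree SRC DEST  (hypothetical helper, not part of Git)
# Makes DEST an exact copy of the repository rooted at SRC, then checks
# the copy with git fsck.
# WARNING: --delete-after removes anything in DEST that is not in SRC.
sync_tree() {
    src=$1
    dest=$2

    # The trailing slashes make rsync copy the *contents* of SRC into
    # DEST, including the .git directory at the repository root.
    rsync -a --delete-after "$src/" "$dest/" || return 1

    # rsync bypasses Git's own integrity checking, so verify the copy.
    git -C "$dest" fsck
}
```

A call such as `sync_tree ~/src/project /mnt/backup/project` (both paths are placeholders) would then fail visibly if either the transfer or the integrity check reports a problem.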
Eric Sunshine <sunshine@sunshineco.com> writes:

> Taking into consideration that people who are experiencing such
> corruption will likely include the name of the syncing service in
> their search query, would it make sense to mention some well-known
> services here in order to make it more likely that people will
> actually find this entry? Something like this, perhaps:
>
>     It is important not to use a cloud syncing service (such as DropBox,
>     FooBar, CowMoo, BuzzingBee, etc.) to sync any portion of a Git
>     repository...

I do agree that in a repository being actively modified, any
backup/sync solution that works per-file fashion would not work
well.

But is "cloud" a good word to characterise and group these per-file
backup/sync solution?

Doesn't rsync work the same per-file fashion, and the only reason
why it is a better fit is because it is not continuous, not
attempting to "sync" while the repository is in use, until the user
explicitly says "OK, I am ready to go home, so let's stop working
here and send everything over to home with rsync"?

>> +* There are no additional worktrees enabled for your repository.
>
> I don't fully understand this restriction. Can you explain it (at
> least here in the email discussion)?
>
>> +* You are not using a separate Git directory outside of your repository root.
>
> Same question about this restriction.

As long as you know what you are doing and catch everything in
quiescent state, you should be fine, I would think.
On Mon, Nov 8, 2021 at 5:34 PM brian m. carlson <sandals@crustytoothpaste.net> wrote:
> On 2021-11-08 at 00:12:14, Eric Sunshine wrote:
> > Taking into consideration that people who are experiencing such
> > corruption will likely include the name of the syncing service in
> > their search query, would it make sense to mention some well-known
> > services here in order to make it more likely that people will
> > actually find this entry? Something like this, perhaps:
> >
> >     It is important not to use a cloud syncing service (such as DropBox,
> >     FooBar, CowMoo, BuzzingBee, etc.) to sync any portion of a Git
> >     repository...
>
> There are a lot of these services. My preference is to not name
> specific ones here, just like we don't name specific ones when we state
> that you shouldn't use an antivirus or firewall other than the Windows
> default, mostly because I'm not interested in angering corporate
> lawyers. My advice on this topic is always general: XYZ is a cloud
> syncing service, and you should not use any cloud syncing services for Git.

My "would it make sense" question arose from taking into consideration
how people are likely to use a search engine for a particular problem:

    BuzzingBee syncing corrupted git repository

Without naming specific services or tools, it seems much less likely
that people will find this FAQ entry, thus will end up posting to
StackOverflow (or wherever) anyhow -- despite your intention and
effort behind the FAQ in the first place.

I'm thinking about discoverability -- which is the same reason I
suggested in my review of the other patch that it might be a good idea
to add actual error messages a person might encounter with a CRLF
shell script or an LF batch file.

> Additionally, the popular options differ by region. For example, there
> are some services which are probably popular in China which are not
> popular elsewhere, and I don't think it's a good idea to try to guess
> which ones happen to be most popular around the world.

The other way to look at it is that listing many popular services
(wherever they happen to be) makes it more likely that search engines
will lead them to this FAQ entry and alleviate the need to post a
question somewhere.

Anyhow, it was just a thought. I think I've said all I have to say on
this subject. As I mentioned in response to the cover letter, all of
my comments were of the bikeshedding variety, and I didn't see any
show-stoppers in the series as posted.

> > > +* There are no additional worktrees enabled for your repository.
> >
> > I don't fully understand this restriction. Can you explain it (at
> > least here in the email discussion)?
>
> If you sync the main repository and working tree, but not the worktrees
> themselves, then you end up with incorrect data in the .git directory,
> which contains information about the worktree.

I still don't follow. What incorrect data will be in the .git
directory? Do you mean the absolute path pointing from the .git
directory to the worktree?

> More importantly, it can
> contain absolute path information, which would be undesirable even if
> you did sync the worktrees, say, if you used two different usernames
> (and hence two different home directories) on the two systems.

Okay, I figured that that was probably one of your concerns, but it
was difficult to tell from the raw "no additional worktrees". On the
other hand, I can easily see people syncing between machines in which
they have the same username and same directory structure, so this
bullet point seems overly restrictive.

> > > +* You are not using a separate Git directory outside of your repository root.
> >
> > Same question about this restriction.
>
> Again, if you sync the root of the working tree but don't sync the
> separate Git directory, you won't have copied the index or any of the
> other data.

Okay, again, this is indeed what I thought you might have in mind, but
it was difficult to be sure. And, again, this seems overly restrictive
since there may be plenty of scenarios in which this works well.

I suppose both of these points would feel more reasonable if they
didn't sound like hard requirements, perhaps by explaining that the
simple case of no worktrees and no separate repository has the least
restrictions and is easiest to get up and running; more complex cases
can work too with the caveat that worktree and separate repository
directories ought to be synced too, and that pathnames need to be
identical on all machines.
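
To make the "simple case" concrete, a pre-sync check for the two layout conditions discussed above might look like the sketch below. The helper name `check_simple_layout` is hypothetical. It relies on two facts about Git's on-disk layout: with a separate Git directory (or inside a linked worktree) `.git` is a file containing a `gitdir:` pointer instead of a directory, and linked worktrees are registered under `.git/worktrees/`. It does not check pathnames across machines, and it says nothing about whether the repository is quiescent.

```shell
#!/bin/sh
# check_simple_layout REPO_ROOT  (hypothetical helper, not part of Git)
# Succeeds only in the simple case: the Git directory is REPO_ROOT/.git
# and no linked worktrees are registered in it.
check_simple_layout() {
    root=$1

    # With a separate Git directory (or in a linked worktree), .git is
    # a file containing "gitdir: <path>" rather than a directory.
    if [ ! -d "$root/.git" ]; then
        echo "$root/.git is not a directory; the Git directory lives elsewhere" >&2
        return 1
    fi

    # Linked worktrees are recorded under .git/worktrees/<name>/, and
    # those records can hold absolute paths.
    if [ -n "$(ls -A "$root/.git/worktrees" 2>/dev/null)" ]; then
        echo "linked worktrees are registered; sync them too or remove them" >&2
        return 1
    fi
}
```

A failure here need not be a hard stop. As noted above, the more complex layouts can work as long as the extra directories are synced too and the pathnames are identical on all machines.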
Junio C Hamano <gitster@pobox.com> writes:

> Doesn't rsync work the same per-file fashion, and the only reason
> why it is a better fit is because it is not continuous, not
> attempting to "sync" while the repository is in use, until the user
> explicitly says "OK, I am ready to go home, so let's stop working
> here and send everything over to home with rsync"?

OK, so not "per-file" but "continuous" is the root problem, and
"cloud" would be a good word because all the popular ones share that
"continuous" trait.

This part of the proposed patch text may need rethinking a bit.

> +or added files, broken refs, and a wide variety of other corruption. These
> +services tend to sync file by file and don't understand the structure of a Git
> +repository. This is especially bad if they sync the repository in the middle of

That is, "file by file" is not a problem per-se, "don't understand
the structure" is a huge problem, and "continuous" may contribute to
the problem further.

I wonder if you let the "cloud" services to continuously sync your
repository, then go quiescent for a while and then start touching
the destination, it would be sufficient, though. The refs with
funny "2" suffixes and the like are the symptom of two sides
simultaneously mucking with the same file (e.g ".git/refs/main") and
the "cloud sync" could not decide what the right final outcome is,
right?

I also wonder if we add a way to transfer reflog entries, that imply
the object reachability, say "git push --with-reflog", over the
wire, it would be sufficient to do everything with Git. Before you
go home, you'd do

    git stash save --untracked && git stash apply
    git push --mirror --with-reflog --with-stash

to save away modified and untracked files to a stash entry [*], and
push all the refs with their reflog entries (including refs/stash
which normally gets refused because it has only two levels).

    Side note. If there were a variant of "git stash save" that
    only saves away without modifying the working tree and the
    index, I'd use that single command instead of "save and
    immediately restore by applying" kludge.

Then at the destination, you'd figure out what the current branch
was (the stash message should record the name of the branch), check
that branch out, and running "git stash pop" will give you pretty
much the same environment.
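
The `--with-reflog` and `--with-stash` options above are only being proposed in this message. As a rough approximation using commands that already exist, the "save, immediately restore, then push" idea could be sketched as below. The helper names `save_wip` and `restore_wip` and the branch name `wip-handoff` are hypothetical, and `git stash push --include-untracked` stands in for the `git stash save --untracked` spelling used above. The sketch carries only the newest stash entry: reflogs, older stash entries, and ignored files are not transferred. It also assumes the destination already has the same base commit checked out.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Git. REMOTE defaults to "origin".

# save_wip [REMOTE]: record tracked and untracked changes as a stash
# entry, restore them right away, and push the entry to a scratch branch.
save_wip() {
    remote=${1:-origin}

    # With a clean tree "git stash push" creates nothing, so stop early
    # rather than apply or push some older stash entry.
    if [ -z "$(git status --porcelain)" ]; then
        echo "nothing to hand off" >&2
        return 0
    fi

    git stash push --include-untracked --message "wip handoff" &&
    git stash apply --index &&
    # refs/stash cannot be pushed under its own name, so publish the
    # stash commit under a normal branch name instead.
    git push --force "$remote" refs/stash:refs/heads/wip-handoff
}

# restore_wip [REMOTE]: fetch the scratch branch and apply it as a stash.
restore_wip() {
    remote=${1:-origin}
    git fetch "$remote" wip-handoff &&
    git stash apply --index FETCH_HEAD
}
```

On the first machine one would run `save_wip`, and on the second, after checking out the same branch, `restore_wip`. The scratch branch can be deleted from the remote afterwards.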
On 2021-11-09 at 00:10:05, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Doesn't rsync work the same per-file fashion, and the only reason
> > why it is a better fit is because it is not continuous, not
> > attempting to "sync" while the repository is in use, until the user
> > explicitly says "OK, I am ready to go home, so let's stop working
> > here and send everything over to home with rsync"?
>
> OK, so not "per-file" but "continuous" is the root problem, and
> "cloud" would be a good word because all the popular ones share that
> "continuous" trait.
>
> This part of the proposed patch text may need rethinking a bit.
>
> > +or added files, broken refs, and a wide variety of other corruption. These
> > +services tend to sync file by file and don't understand the structure of a Git
> > +repository. This is especially bad if they sync the repository in the middle of
>
> That is, "file by file" is not a problem per-se, "don't understand
> the structure" is a huge problem, and "continuous" may contribute to
> the problem further.

"File by file" is not a problem per se, but it is a problem if you don't
snapshot the repository or it's in use. For example, if you used an LVM
snapshot on an in-use repository and then synced that, we'd guarantee
that the repository was fully consistent. Practically nobody does that,
though.

But to clarify: the goal of this type of software is to sync single,
independent documents. As such, these tools don't necessarily consider
that it's important to sync other files in the same directory at the
same time. You can end up with cases where some parts of your
repository get synced, then some other files in other directories, then
other parts of the repository, resulting in different states over
different times. There's just no guarantee about how they behave with
all of the files in a given repository because they aren't considered a
relevant unit. That's why I mention that they sync file by file. If
they did syncing directory by directory, we'd likely have a lot fewer
problems.

> I wonder if you let the "cloud" services to continuously sync your
> repository, then go quiescent for a while and then start touching
> the destination, it would be sufficient, though. The refs with
> funny "2" suffixes and the like are the symptom of two sides
> simultaneously mucking with the same file (e.g ".git/refs/main") and
> the "cloud sync" could not decide what the right final outcome is,
> right?

I think this is very risky. Yes, usually the duplicate files are caused
by competing changes on both sides, but having seen users with this
problem before, I'm not sure if we can safely guarantee this is the only
case.

> I also wonder if we add a way to transfer reflog entries, that imply
> the object reachability, say "git push --with-reflog", over the
> wire, it would be sufficient to do everything with Git.

Yes, please. I'd love to see this.
diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
index ae1b526565..e5dab89d6c 100644
--- a/Documentation/gitfaq.txt
+++ b/Documentation/gitfaq.txt
@@ -83,8 +83,8 @@ Windows would be the configuration `"C:\Program Files\Vim\gvim.exe" --nofork`,
 which quotes the filename with spaces and specifies the `--nofork` option to
 avoid backgrounding the process.
 
-Credentials
------------
+Credentials and Transfers
+-------------------------
 
 [[http-credentials]]
 How do I specify my credentials when pushing over HTTP?::
@@ -185,6 +185,49 @@ Then, you can adjust your push URL to use `git@example_author` or
 `git@example_committer` instead of `git@example.org` (e.g., `git remote set-url
 git@example_author:org1/project1.git`).
 
+[[sync-working-tree]]
+How do I sync a working tree across systems?::
+	First, decide whether you want to do this at all. Git usually works better
+	when you push or pull your work using the typical `git push` and `git fetch`
+	commands and isn't designed to share a working tree across systems. This is
+	potentially risky and in some cases can cause repository corruption or data
+	loss.
++
+Usually, doing so will cause `git status` to need to re-read every file in the
+working tree. Additionally, Git's security model does not permit sharing a
+working tree across untrusted users, so it is only safe to sync a working tree
+if it will only be used by a single user across all machines.
++
+It is important not to use a cloud syncing service to sync any portion of a Git
+repository, since this can cause corruption, such as missing objects, changed
+or added files, broken refs, and a wide variety of other corruption. These
+services tend to sync file by file and don't understand the structure of a Git
+repository. This is especially bad if they sync the repository in the middle of
+it being updated, since that is very likely to cause incomplete or partial
+updates and therefore data loss.
++
+Therefore, it's better to push your work to either the other system or a central
+server using the normal push and pull mechanism. However, this doesn't always
+preserve important data, like stashes, so some people prefer to share a working
+tree across systems.
++
+If you do this, the recommended approach is to use `rsync -a --delete-after`
+(ideally with an encrypted connection such as with `ssh`) on the root of
+repository. You should ensure several things when you do this:
++
+* There are no additional worktrees enabled for your repository.
+* You are not using a separate Git directory outside of your repository root.
+* You are comfortable with the destination directory being an exact copy of the
+  source directory, _deleting any data that is already there_.
+* The repository is in a quiescent state for the duration of the transfer (that
+  is, no operations of any sort are taking place on it, including background
+  operations like `git gc` and operations invoked by your editor).
++
+Be aware that even with these recommendations, syncing in this way has some risk
+since it bypasses Git's normal integrity checking for repositories, so having
+backups is advised. You may also with to do a `git fsck` to verify the
+integrity of your data on the destination system after syncing.
+
 Common Issues
 -------------
Users very commonly want to sync their working tree with uncommitted
changes across machines, often to carry across in-progress work or
stashes. Despite this not being a recommended approach, users want to
do it and are not dissuaded by suggestions not to, so let's recommend a
sensible technique.

The technique that many users are using is their preferred cloud syncing
service, which is a bad idea. Users have reported problems where they
end up with duplicate files that won't go away (with names like "file.c
2"), broken references, oddly named references that have date stamps
appended to them, missing objects, and general corruption and data loss.
That's because almost all of these tools sync file by file, which is a
great technique if your project is a single word processing document or
spreadsheet, but is utterly abysmal for Git repositories because they
don't necessarily snapshot the entire repository correctly. They also
tend to sync the files immediately instead of when the repository is
quiescent, so writing multiple files, as occurs during a commit or a gc,
can confuse the tools and lead to corruption.

We know that the old standby, rsync, is up to the task, provided that
the repository is quiescent, so let's suggest that and dissuade people
from using cloud syncing tools. Let's tell people about common things
they should be aware of before doing this and that this is still
potentially risky. Additionally, let's tell people that Git's security
model does not permit sharing working trees across users in case they
planned to do that. While we'd still prefer users didn't try to do
this, hopefully this will lead them in a safer direction.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/gitfaq.txt | 47 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)