Message ID | 20210820195558.44275-1-johannes@sipsolutions.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] multi-pack-index: allow operating without pack files | expand |
On 8/20/2021 3:55 PM, Johannes Berg wrote: > Technically, multi-pack-index doesn't need pack files to exist, > but add_packed_git() today checks whether it exists or not. Having a multi-pack-index is supposed to indicate that we have these objects in the objects/pack directory within the specified pack-files. I understand your goal to relax a condition of the multi-pack-index file, but it's triggered by a flag during write and that choice isn't persisted into the file. There is no way for a later Git process to understand that the multi-pack-index doesn't actually guarantee object existence. And in a completely other side: one would think that including a pack-file in the multi-pack-index would allow deleting the .idx file, but there are a few reasons why we do not (including interactions with third-party tools). So, I'm not necessarily on board with this change unless something is added to the multi-pack-index file (such as a new version 2 and an optional chunk understood by that version) that tells future Git processes that the .pack files might not exist. I'm still not sure what Git should do about that other than stop reading the multi-pack-index and ignore its contents. > -struct packed_git *add_packed_git(const char *path, size_t path_len, int local) > +struct packed_git *_add_packed_git(const char *path, size_t path_len, int local, > + int require_pack) The only obvious thing that I noticed in the code is that we typically use <function name>_1() as a way to create a static version that is called by the global version. > struct packed_git *add_packed_git(const char *path, size_t path_len, int local); > +struct packed_git *_add_packed_git(const char *path, size_t path_len, int local, > + int require_pack); ...oh, but that's not what you're doing. What you could do instead is convert the 'local' parameter into a 'flags' parameter (I think we have started to prefer 'enum's recently) and create MIDX_FLAG_LOCAL and MIDX_FLAG_PACKS_OPTIONAL flag values. That avoids multiple methods and minimizes the change to existing callers. Thanks, -Stolee
On Sunday, August 22, 2021 8:34:43 PM MDT Derrick Stolee wrote: > On 8/20/2021 3:55 PM, Johannes Berg wrote: > > Technically, multi-pack-index doesn't need pack files to exist, > > but add_packed_git() today checks whether it exists or not. > > Having a multi-pack-index is supposed to indicate that we have > these objects in the objects/pack directory within the specified > pack-files. Hmm, isn't it a normal supported use case for repacking to potentially delete packfiles which are in the MIDX (I'm specifically thinking about when someone runs git gc with an older git version which knows nothing about MIDX files)? -Martin
On Sun, Aug 22, 2021 at 08:34:43PM -0400, Derrick Stolee wrote: > On 8/20/2021 3:55 PM, Johannes Berg wrote: > > Technically, multi-pack-index doesn't need pack files to exist, > > but add_packed_git() today checks whether it exists or not. > > Having a multi-pack-index is supposed to indicate that we have > these objects in the objects/pack directory within the specified > pack-files. > > I understand your goal to relax a condition of the multi-pack-index > file, but it's triggered by a flag during write and that choice > isn't persisted into the file. There is no way for a later Git > process to understand that the multi-pack-index doesn't actually > guarantee object existence. We're going to run into problems much earlier than that, though: the MIDX needs to load information about objects from packs in order to break ties when multiple copies of the same object exist in multiple packs (according to that pack's mtime). So I'm not sure how we would even write a MIDX without physical packs on disk that we can open and stat, let along how we would teach Git to handle a situation where packs that did exist when writing a MIDX went away when we tried to read from the same MIDX later on. Thanks, Taylor
On Sun, 2021-08-22 at 19:11 -0600, Martin Fick wrote: > On Sunday, August 22, 2021 8:34:43 PM MDT Derrick Stolee wrote: > > On 8/20/2021 3:55 PM, Johannes Berg wrote: > > > Technically, multi-pack-index doesn't need pack files to exist, > > > but add_packed_git() today checks whether it exists or not. > > > > Having a multi-pack-index is supposed to indicate that we have > > these objects in the objects/pack directory within the specified > > pack-files. > > Hmm, isn't it a normal supported use case for repacking to potentially delete > packfiles which are in the MIDX (I'm specifically thinking about when someone > runs git gc with an older git version which knows nothing about MIDX files)? Yeah, the multi-pack-index contains all the pack names that are there (and then which one the object is in), and when git tries to use it, then it doesn't really seem to have any issues? You can "git repack -a -d" with an old version, or save the multi-pack- index temporarily elsewhere and copy it back after repack, and things work just fine, except of course "git multi-pack-index verify" will complain loudly. johannes
On Mon, 2021-08-23 at 00:05 -0400, Taylor Blau wrote: > > We're going to run into problems much earlier than that, though: the > MIDX needs to load information about objects from packs in order to > break ties when multiple copies of the same object exist in multiple > packs (according to that pack's mtime). Huh, I guess I never ran into a need - we make sure in bup that each object only exists once, that's kind of the point :) I guess we could break ties by "lower hash of the pack" instead of "mtime"? It doesn't really matter how they're broken, as long as it's consistent? Arguably, mtime is not a good measure anyway, since the same repo elsewhere would have different mtimes. > So I'm not sure how we would even write a MIDX without physical packs on > disk that we can open and stat, let along how we would teach Git to > handle a situation where packs that did exist when writing a MIDX went > away when we tried to read from the same MIDX later on. It handles that just fine on the read side. johannes
On Sun, 2021-08-22 at 20:34 -0400, Derrick Stolee wrote: > On 8/20/2021 3:55 PM, Johannes Berg wrote: > > Technically, multi-pack-index doesn't need pack files to exist, > > but add_packed_git() today checks whether it exists or not. > > Having a multi-pack-index is supposed to indicate that we have > these objects in the objects/pack directory within the specified > pack-files. Yeah, so, like I tried to explain in the patch email, this is (partially) for a non-git use case. The way you could think of it is that I'm trying to make a local index cache, so we know locally which objects we have, but the actual packs and their contents are stored in some slow/expensive (remote) storage system. With the local cache, I'd like to understand whether or not I have a given object (in particular to avoid storing it again if I see it), and also I'd like to know _where_ it is (which pack and where in the pack) so that I can access only parts, instead of downloading everything. The server isn't necessarily able to run something 'smart', it might just be S3 blob storage, but storing pack files, not individual objects. > I understand your goal to relax a condition of the multi-pack-index > file, but it's triggered by a flag during write and that choice > isn't persisted into the file. There is no way for a later Git > process to understand that the multi-pack-index doesn't actually > guarantee object existence. It does, since the pack files are recorded. I've tested this briefly, and it doesn't really seem to have any problems with it. While I understand the concern, I'd also envision that this tool command-line option is never set when this is used on normal git repositories, only when used in this special use case I tried explaining above. > And in a completely other side: one would think that including > a pack-file in the multi-pack-index would allow deleting the .idx > file, but there are a few reasons why we do not (including > interactions with third-party tools). Well, arguably you can always restore it from the pack file, so you can always delete it ;) Not sure git is set up for that though. Also, even old versions of git would probably choke since they don't read the multi-pack-index. > So, I'm not necessarily on board with this change unless > something is added to the multi-pack-index file (such as a new > version 2 and an optional chunk understood by that version) that > tells future Git processes that the .pack files might not exist. > I'm still not sure what Git should do about that other than stop > reading the multi-pack-index and ignore its contents. It seems to do that today, or at least try to find in the idx if it goes looking for an object and the multi-pack-index references a pack file that doesn't exist - and as Martin also pointed out, that really is necessary for this to work with older versions of git. > [snip comments on the code] Fair enough. I think we need to figure out the conceptual question though - whether or not this is something that's acceptable in git in the first place. Like I said, the intent isn't to break git repositories or to even use this on a git repository that's also normally being read by git (even though I don't see any problems with that, but my testing wasn't very deep). The intent (and the reason for this being [RFC]) is to see if git upstream would be amenable to such a change that lets other tools use git's index - especially multi-pack-index - machinery to operate on such a thing as the "index cache" I described above. In particular, in bup's [1] use case, we currently support our own "midx" format which is similar to multi-pack-index, but only contains *object existence*, not *object location*. Thus, there are two problems: 1. In a normal bup/git repository, when bup is used to write objects (it only ever writes packs), then the midx is also updated to allow bup itself to check for object existence without opening all the *.idx files, allowing it to deduplicate objects efficiently. In this case, bup always uses git to retrieve objects pack, and this leads to inefficiencies: git will not be able to use the midx files (and they don't contain object location anyway), but git will open all the *.idx files. If we ask it, it might maintain its own multi-pack-index. Obviously, in this mode, there's no reason to have this patch - we can just ask git to create the multi-pack-index and add code to bup to understand how to read it in order to check for object existence (for deduplication). 2. However, I'm working on adding encrypted repositories to bup, and there the objects are stored differently - still in something like a pack file, but it's encrypted, in a way that's done at the object level. In this mode, bup maintains an object index in a clear-text cache, i.e. all the *.idx files, and currently adds its own midx to be able to efficiently check for object existence. However, in this case, bup cannot use git to access objects, since there's no real git directory, only a folder containing the *.idx files, the data is "elsewhere". Since the *.idx file format is still the same (no reason to change it), with this patch we can use git to efficiently (in both developer time, I don't need to rewrite it, and runtime, since it's in C not python) ask git to write the multi-pack-index that then bup can use to do both: a) check for object existence during further backups while adding new objects, and b) look for object location while accessing the repository for reading. I completely understand that this is a non-git use case, and I'm entirely happy for you to tell me to get lost, but I figured it was such a small change I had to try, rather than rewriting the entire multi- pack-index maintenance code for bup (or importing git's there). Really for (2) I don't even care about using git's multi-pack-index, since git never uses it, but I wanted to have the same between (1) and (2), and if we want git - called on a bup/git repo - to have a multi- pack-index, then we need git's format, of course. The other hack I considered was to just create empty *.pack files for each *.idx file in the index cache because then git is (currently) happy to accept this, so it never really validates anything there even when the pack files "exist", but that felt even worse and like something git might break with just a little more validity checking. So basically I had choices between: A. implement multi-pack-index in bup, to cover both (1) and (2) B. implement a new midx format in bup, to cover just (2) C. call unmodified git in bup, to cover just (1) D. this patch to cover both (1) and (2) E. create dummy *.pack files for the *.idx files in the cache, so that git will accept it. Obviously, I can also do any combination of these, e.g. B & C, but this was the least amount of work ;) But really, it's up to you, I'm just asking. If you feel this doesn't fit into git's machinery (despite being so simple), I'll just do "something else", not sure which one I might do. Quite possibly A., if only because then I can write tests with "git multi-index-pack verify" :-) johannes [1] https://bup.github.io/, not to be confused with bupstash.io
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index ffd601bc17b4..23db70fbebc2 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -10,7 +10,7 @@ SYNOPSIS -------- [verse] 'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress] - [--preferred-pack=<pack>] <subcommand> + <subcommand> [<subcommand options>] DESCRIPTION ----------- @@ -40,6 +40,10 @@ write:: multiple packs contain the same object. If not given, ties are broken in favor of the pack with the lowest mtime. + + --no-require-packs:: + Don't require pack files to exist, useful only for + certain non-repository caches. -- verify:: diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 8ff0dee2ecbb..2c9293b20c49 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -7,7 +7,7 @@ #include "object-store.h" #define BUILTIN_MIDX_WRITE_USAGE \ - N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]") + N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>] [--no-require-packs]") #define BUILTIN_MIDX_VERIFY_USAGE \ N_("git multi-pack-index [<options>] verify") @@ -68,6 +68,9 @@ static int cmd_multi_pack_index_write(int argc, const char **argv) OPT_STRING(0, "preferred-pack", &opts.preferred_pack, N_("preferred-pack"), N_("pack for reuse when computing a multi-pack bitmap")), + OPT_BIT(0, "no-require-packs", &opts.flags, + N_("don't require pack files to exist"), + MIDX_DONT_REQUIRE_PACKS), OPT_END(), }; diff --git a/midx.c b/midx.c index 902e1a7a7d9d..98b3cb33201f 100644 --- a/midx.c +++ b/midx.c @@ -468,6 +468,7 @@ struct write_midx_context { uint32_t num_large_offsets; int preferred_pack_idx; + unsigned flags; }; static void add_pack_to_midx(const char *full_path, size_t full_path_len, @@ -482,9 +483,10 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); - ctx->info[ctx->nr].p = add_packed_git(full_path, - full_path_len, - 0); + ctx->info[ctx->nr].p = _add_packed_git(full_path, + full_path_len, + 0, + !(ctx->flags & MIDX_DONT_REQUIRE_PACKS)); if (!ctx->info[ctx->nr].p) { warning(_("failed to add packfile '%s'"), @@ -924,6 +926,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index * ctx.nr = 0; ctx.alloc = ctx.m ? ctx.m->num_packs : 16; ctx.info = NULL; + ctx.flags = flags; ALLOC_ARRAY(ctx.info, ctx.alloc); if (ctx.m) { diff --git a/midx.h b/midx.h index 8684cf0fefe8..aa6382d99386 100644 --- a/midx.h +++ b/midx.h @@ -41,6 +41,7 @@ struct multi_pack_index { #define MIDX_PROGRESS (1 << 0) #define MIDX_WRITE_REV_INDEX (1 << 1) +#define MIDX_DONT_REQUIRE_PACKS (1 << 2) char *get_midx_rev_filename(struct multi_pack_index *m); diff --git a/packfile.c b/packfile.c index 9ef6d9829280..dfe994205914 100644 --- a/packfile.c +++ b/packfile.c @@ -687,7 +687,8 @@ void unuse_pack(struct pack_window **w_cursor) } } -struct packed_git *add_packed_git(const char *path, size_t path_len, int local) +struct packed_git *_add_packed_git(const char *path, size_t path_len, int local, + int require_pack) { struct stat st; size_t alloc; @@ -717,7 +718,7 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local) p->pack_promisor = 1; xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack"); - if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) { + if (require_pack && (stat(p->pack_name, &st) || !S_ISREG(st.st_mode))) { free(p); return NULL; } @@ -734,6 +735,11 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local) return p; } +struct packed_git *add_packed_git(const char *path, size_t path_len, int local) +{ + return _add_packed_git(path, path_len, local, 1); +} + void install_packed_git(struct repository *r, struct packed_git *pack) { if (pack->pack_fd != -1) diff --git a/packfile.h b/packfile.h index 3ae117a8aef0..a921077a05ef 100644 --- a/packfile.h +++ b/packfile.h @@ -96,6 +96,8 @@ void close_object_store(struct raw_object_store *o); void unuse_pack(struct pack_window **); void clear_delta_base_cache(void); struct packed_git *add_packed_git(const char *path, size_t path_len, int local); +struct packed_git *_add_packed_git(const char *path, size_t path_len, int local, + int require_pack); /* * Unlink the .pack and associated extension files.
Technically, multi-pack-index doesn't need pack files to exist, but add_packed_git() today checks whether it exists or not. In bup, a git pack format based backup tool, we'd really like to take advantage of the multi-pack-index, since bup needs it to save new objects to the repository efficiently (to check if something already exists), and uses git to access the repo, so the multi-pack-index can make more efficient. Alternatively, bup has its own 'midx' format, of which multiple can exist in a repository, predating the multi-pack-index. All of this works well as long as the bup repository is just a normal git repository. However, I've been adding encrypted and encrypted remote repositories to bup, where the pack files are not local, similar to promisor remotes, but not really done in the same way. In this case, the local storage is only the idx files, no pack files (it's just a cache), and we access the pack files and objects within in different ways. Unfortunately, in this case we also cannot reuse bup's midx format very well: it only has information on which objects exists, not where to find them, and so reading from the repository requires reading all of the idx files, something that git's multi-pack-index solves. While we'll need to add read access to git's multi-pack-index to bup, having a call to 'git multi-pack-index' write it would be nice and save some duplication. However, in the case of the remote/encrypted repositories, git currently cannot do that as it requires the pack files to exist. Add a command-line option to be able to not require pack files to exist, to make that easier (rather than requiring writing some dummy pack files, git even accepts empty files.) Signed-off-by: Johannes Berg <johannes@sipsolutions.net> --- Documentation/git-multi-pack-index.txt | 6 +++++- builtin/multi-pack-index.c | 5 ++++- midx.c | 9 ++++++--- midx.h | 1 + packfile.c | 10 ++++++++-- packfile.h | 2 ++ 6 files changed, 26 insertions(+), 7 deletions(-)