Message ID | pull.1316.git.1659958159193.gitgitgadget@gmail.com (mailing list archive)
State      | New, archived
Series     | [RFC] list-objects-filter: introduce new filter sparse:buffer=<spec>
"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: ZheNing Hu <adlternative@gmail.com> > > Although we already had a `--filter=sparse:oid=<oid>` which > can used to clone a repository with limited objects which meet > filter rules in the file corresponding to the <oid> on the git > server. But it can only read filter rules which have been record > in the git server before. Was the reason why we have "we limit to an object we already have" restriction because we didn't want to blindly use a piece of uncontrolled arbigrary end-user data here? Just wondering.
Junio C Hamano <gitster@pobox.com> wrote on Tue, Aug 9, 2022 at 00:15:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > We already have `--filter=sparse:oid=<oid>`, which can be used to
> > clone a repository with only the objects that meet the filter rules
> > in the file corresponding to the <oid> on the git server. But it can
> > only read filter rules that have been recorded on the git server
> > beforehand.
>
> Was the reason why we have the "we limit to an object we already
> have" restriction that we didn't want to blindly use a piece of
> uncontrolled arbitrary end-user data here?  Just wondering.

* An end-user may not even have write access to the repository, so
  they cannot set up a filterspec file before git clone. What should
  they do then? (See the sketch after this mail for the workflow this
  implies today.)

* If thousands of different developers use the same git repo and each
  of them uses "--filter=sparse:oid" for a different partial clone,
  how many filterspec files would the repo managers have to set up
  first?

* Why not carefully check the "uncontrolled arbitrary end-user data"
  here instead? For example, add a config like
  "partialclone.sparsebufferlimit" to limit the transported data size,
  or check whether the filterspec file is legal. Or, if a git server
  doesn't trust its users, a config could ban this filter entirely;
  within some companies, users can basically be trusted.

* I'm sure it would be beneficial to let the filtering rules be
  configured by the user, because many people now have this need:
  download only a few files or directories of the repository.

* sparse-checkout + partial-clone is a good reference: we have
  ".git/info/sparse-checkout" to record what we actually want to check
  out to the work-tree, and git fetches the missing objects recorded
  there from the server. I know it fetches objects one by one by <oid>
  instead of by "path"... but in hindsight, its performance is
  extraordinarily bad as a result...

Anyway, this patch represents some of my complaints about the current
partial-clone feature, and I hope the community will move forward with
it.

Thanks.
ZheNing Hu
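A minimal sketch of the status quo behind the first bullet, assuming a
branch named "main" (URL and paths are illustrative): with sparse:oid,
the filter rules must already be an object on the server before anyone
can clone with them.

	# someone with write access records the rules first
	echo client/ >filterspec &&
	git add filterspec &&
	git commit -m "add filterspec" &&
	git push origin main

	# only then can a client reference that blob (by its OID, or by a
	# <rev>:<path> expression if the server can resolve it)
	git clone --filter=sparse:oid=main:filterspec https://example.com/repo.git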
On 8/8/2022 12:15 PM, Junio C Hamano wrote:
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: ZheNing Hu <adlternative@gmail.com>
>>
>> We already have `--filter=sparse:oid=<oid>`, which can be used to
>> clone a repository with only the objects that meet the filter rules
>> in the file corresponding to the <oid> on the git server. But it can
>> only read filter rules that have been recorded on the git server
>> beforehand.
>
> Was the reason why we have the "we limit to an object we already
> have" restriction that we didn't want to blindly use a piece of
> uncontrolled arbitrary end-user data here?  Just wondering.

One of the ideas here was to limit the opportunity of sending an
arbitrary set of data over the Git protocol and avoid exactly the
scenario you mention.

Another was that it is incredibly expensive to compute the set of
reachable objects within an arbitrary sparse-checkout definition,
since it requires walking trees (bitmaps do not help here). This is
why (to my knowledge) no Git hosting service currently supports this
mechanism at scale. At minimum, using the stored OID would allow the
host to keep track of these pre-defined sets and do some precomputing
of reachable data using bitmaps to keep clones and fetches reasonable
at all.

The other side of the issue is that we do not have a good solution
for how to change this filter in the future, in case the user wants
to expand their sparse-checkout definition and update their partial
clone filter.

There used to be a significant issue where a 'git checkout' would
fault in a lot of missing trees because the index needed to reference
the files outside of the sparse-checkout definition. Now that the
sparse index exists, this is less of an impediment, but it can still
cause some pain.

At this moment, I think path-scoped filters have a lot of problems
that need solving before they can be used effectively in the wild.
I would prefer that we solve those problems before making the feature
more complicated. That's a tall ask, since these problems do not have
simple solutions.

Thanks,
-Stolee
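To make the cost concrete, a hedged sketch of what a server
effectively has to run for one such request (the filterspec path is
illustrative): every tree on the object walk is opened and matched
against the pattern list, which pack bitmaps cannot short-circuit.

	# enumerate the objects for a sparse-filtered clone: a full tree walk
	git rev-list --objects --filter=sparse:oid=main:filterspec HEAD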
On Tue, Aug 09, 2022 at 09:37:09AM -0400, Derrick Stolee wrote:

> > Was the reason why we have the "we limit to an object we already
> > have" restriction that we didn't want to blindly use a piece of
> > uncontrolled arbitrary end-user data here?  Just wondering.
>
> One of the ideas here was to limit the opportunity of sending an
> arbitrary set of data over the Git protocol and avoid exactly the
> scenario you mention.

One other implication here is that the filter spec is sent inside of
a pkt-line. So the implementation here is limiting us to 64kb. That
may sound like a lot for simple specs, but I imagine in big repos
they can possibly get pretty complex.

That would be fixable with a protocol extension to take the data over
multiple pkt-lines. That said...

> At this moment, I think path-scoped filters have a lot of problems
> that need solving before they can be used effectively in the wild.
> I would prefer that we solve those problems before making the feature
> more complicated. That's a tall ask, since these problems do not have
> simple solutions.

...I agree with this. It is nice to put more power in the hands of
the clients, but we have to balance that with other issues like
server resource use. The approach so far has been to implement the
simplest and most efficient operations at the client-server level,
and then have the client build local features on top of that. So in
this case, probably requesting that _no_ trees are sent in the
initial clone, and then faulting them in as the client explores the
tree using its own local sparse definition. And I think that mostly
works now.

Though I admit I do not keep a close watch on the status of
partial-checkout features. I mostly always cared about it from the
server provider angle. ;)

-Peff
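A sketch of the pkt-line framing referred to above: each packet
carries a 4-hex-digit length prefix that counts itself, so a single
packet's payload tops out just under 64kB (the 'filter' argument shown
is illustrative).

	pkt_line () {
		# the length prefix counts its own four bytes plus the payload
		printf '%04x%s' $(( ${#1} + 4 )) "$1"
	}
	pkt_line "filter sparse:oid=main:filterspec"
	# -> 0025filter sparse:oid=main:filterspec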
Derrick Stolee <derrickstolee@github.com> wrote on Tue, Aug 9, 2022 at 21:37:
>
> On 8/8/2022 12:15 PM, Junio C Hamano wrote:
> > "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> >> From: ZheNing Hu <adlternative@gmail.com>
> >>
> >> We already have `--filter=sparse:oid=<oid>`, which can be used to
> >> clone a repository with only the objects that meet the filter rules
> >> in the file corresponding to the <oid> on the git server. But it can
> >> only read filter rules that have been recorded on the git server
> >> beforehand.
> >
> > Was the reason why we have the "we limit to an object we already
> > have" restriction that we didn't want to blindly use a piece of
> > uncontrolled arbitrary end-user data here?  Just wondering.
>
> One of the ideas here was to limit the opportunity of sending an
> arbitrary set of data over the Git protocol and avoid exactly the
> scenario you mention.
>
> Another was that it is incredibly expensive to compute the set of
> reachable objects within an arbitrary sparse-checkout definition,
> since it requires walking trees (bitmaps do not help here). This is
> why (to my knowledge) no Git hosting service currently supports this
> mechanism at scale. At minimum, using the stored OID would allow the
> host to keep track of these pre-defined sets and do some precomputing
> of reachable data using bitmaps to keep clones and fetches reasonable
> at all.

How about only allowing some simpler filter rules? E.g. with
https://github.com/derrickstolee/sparse-checkout-example:

User A can use --filter="sparse:buffer=client" to download the
client/ directory, and user B can use
--filter="sparse:buffer=service/list" to download only service/list.

	cat >filterspec <<-EOF &&
	web
	service
	EOF

User C can use --filter="sparse:buffer=`cat filterspec`" to download
web/ and service/.

	cat >filterspec <<-EOF &&
	service
	!service/list
	EOF

But user D cannot use --filter="sparse:buffer=`cat filterspec`" to
download service/ without service/list, because the '!' rule would not
be allowed.

I guess many users could benefit from this...

> The other side of the issue is that we do not have a good solution
> for how to change this filter in the future, in case the user wants
> to expand their sparse-checkout definition and update their partial
> clone filter.

I guess we don't really need to maintain this "partial clone filter";
we could even reuse the sparse-checkout rules after the first partial
clone. Maybe we should write the first partial-clone filter rules to
.git/info/sparse-checkout (only when --sparse is used in git clone?).
See the sketch after this mail.

> There used to be a significant issue where a 'git checkout' would
> fault in a lot of missing trees because the index needed to reference
> the files outside of the sparse-checkout definition. Now that the
> sparse index exists, this is less of an impediment, but it can still
> cause some pain.

Agree.

> At this moment, I think path-scoped filters have a lot of problems
> that need solving before they can be used effectively in the wild.
> I would prefer that we solve those problems before making the feature
> more complicated. That's a tall ask, since these problems do not have
> simple solutions.

Could you tell me where the problems are? I can start to deal with
them :)

> Thanks,
> -Stolee

Thanks.
ZheNing Hu
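A hypothetical sketch of that reuse idea (sparse:buffer exists only in
this RFC; URL and rules are illustrative): keep the clone-time rules
and record them as the sparse-checkout definition so both sides agree.

	cat >filterspec <<-EOF &&
	web
	service
	EOF
	git clone --sparse --filter="sparse:buffer=$(cat filterspec)" \
		https://example.com/repo.git repo &&
	# record the same rules locally, as proposed above
	cp filterspec repo/.git/info/sparse-checkout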
Jeff King <peff@peff.net> wrote on Thu, Aug 11, 2022 at 05:15:
>
> On Tue, Aug 09, 2022 at 09:37:09AM -0400, Derrick Stolee wrote:
>
> > > Was the reason why we have the "we limit to an object we already
> > > have" restriction that we didn't want to blindly use a piece of
> > > uncontrolled arbitrary end-user data here?  Just wondering.
> >
> > One of the ideas here was to limit the opportunity of sending an
> > arbitrary set of data over the Git protocol and avoid exactly the
> > scenario you mention.
>
> One other implication here is that the filter spec is sent inside of
> a pkt-line. So the implementation here is limiting us to 64kb. That
> may sound like a lot for simple specs, but I imagine in big repos
> they can possibly get pretty complex.
>
> That would be fixable with a protocol extension to take the data over
> multiple pkt-lines.

This sounds very scary: a filter rules file of more than 64KB... If
the filter really is that big, I think the server will be very slow to
parse it anyway.

> That said...
>
> > At this moment, I think path-scoped filters have a lot of problems
> > that need solving before they can be used effectively in the wild.
> > I would prefer that we solve those problems before making the feature
> > more complicated. That's a tall ask, since these problems do not have
> > simple solutions.
>
> ...I agree with this. It is nice to put more power in the hands of
> the clients, but we have to balance that with other issues like
> server resource use. The approach so far has been to implement the
> simplest and most efficient operations at the client-server level,
> and then have the client build local features on top of that. So in
> this case, probably requesting that _no_ trees are sent in the
> initial clone, and then faulting them in as the client explores the
> tree using its own local sparse definition. And I think that mostly
> works now.

Agree. But we have to fetch those blobs one by one after the partial
clone. Why not reduce some of that extra network overhead, if we can
get the blobs that are *most* needed in the first partial clone?

> Though I admit I do not keep a close watch on the status of
> partial-checkout features. I mostly always cared about it from the
> server provider angle. ;)
>
> -Peff

Thanks.
ZheNing Hu
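The client-driven flow Peff describes can be approximated today; a
sketch, assuming a tree-less clone (URL and directory are
illustrative):

	# take no trees up front, then fault them in as the local sparse
	# definition is explored at checkout time
	git clone --filter=tree:0 --no-checkout https://example.com/repo.git repo &&
	cd repo &&
	git sparse-checkout set client &&
	git checkout main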
On Fri, Aug 12, 2022 at 11:49:06PM +0800, ZheNing Hu wrote:

> > ...I agree with this. It is nice to put more power in the hands of
> > the clients, but we have to balance that with other issues like
> > server resource use. The approach so far has been to implement the
> > simplest and most efficient operations at the client-server level,
> > and then have the client build local features on top of that. So in
> > this case, probably requesting that _no_ trees are sent in the
> > initial clone, and then faulting them in as the client explores the
> > tree using its own local sparse definition. And I think that mostly
> > works now.
>
> Agree. But we have to fetch those blobs one by one after the partial
> clone. Why not reduce some of that extra network overhead, if we can
> get the blobs that are *most* needed in the first partial clone?

Ideally we wouldn't be getting them one-by-one, but rather batching
them. I'm not sure of the current state of the code. But we should at
least be able to batch by tree depth.

You're right that we'll never be able to be as efficient as a server
to whom we can say "I care about these paths", though.

-Peff
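One hedged sketch of such batching from the client side, assuming the
server allows fetching arbitrary reachable objects
(uploadpack.allowAnySHA1InWant, as the test setup in the patch below
enables): collect the missing OIDs locally, then request them in a
single fetch rather than one-by-one.

	# list the objects a partial clone is missing, then fetch them in one go
	git rev-list --objects --missing=print HEAD |
	sed -n 's/^?//p' >missing &&
	git fetch origin $(cat missing)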
Derrick Stolee <derrickstolee@github.com> wrote on Tue, Aug 9, 2022 at 21:37:
>
> On 8/8/2022 12:15 PM, Junio C Hamano wrote:
> > "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> >> From: ZheNing Hu <adlternative@gmail.com>
> >>
> >> We already have `--filter=sparse:oid=<oid>`, which can be used to
> >> clone a repository with only the objects that meet the filter rules
> >> in the file corresponding to the <oid> on the git server. But it can
> >> only read filter rules that have been recorded on the git server
> >> beforehand.
> >
> > Was the reason why we have the "we limit to an object we already
> > have" restriction that we didn't want to blindly use a piece of
> > uncontrolled arbitrary end-user data here?  Just wondering.
>
> One of the ideas here was to limit the opportunity of sending an
> arbitrary set of data over the Git protocol and avoid exactly the
> scenario you mention.

I find that sparse-checkout uses a "cone" mode to limit the set of
patterns it sends, which achieves a performance improvement. I don't
know if we can use this mode here? At a brief look, "cone" mode
ensures that every filter rule the user adds is a directory and does
not contain any of the special characters '!', '?', '*', '[', ']'.
(See the sketch after this mail.)

But if we transport such filter rules to the git server, the server
cannot check whether a rule names a directory, because that involves
paths in multiple commits. E.g. in 9e6f67, "test" can be a directory,
but in e5e154e, "test" can be a file... I don't know how to solve this
problem...

> Another was that it is incredibly expensive to compute the set of
> reachable objects within an arbitrary sparse-checkout definition,
> since it requires walking trees (bitmaps do not help here). This is
> why (to my knowledge) no Git hosting service currently supports this
> mechanism at scale. At minimum, using the stored OID would allow the
> host to keep track of these pre-defined sets and do some precomputing
> of reachable data using bitmaps to keep clones and fetches reasonable
> at all.
>
> The other side of the issue is that we do not have a good solution
> for how to change this filter in the future, in case the user wants
> to expand their sparse-checkout definition and update their partial
> clone filter.
>
> There used to be a significant issue where a 'git checkout' would
> fault in a lot of missing trees because the index needed to reference
> the files outside of the sparse-checkout definition. Now that the
> sparse index exists, this is less of an impediment, but it can still
> cause some pain.
>
> At this moment, I think path-scoped filters have a lot of problems
> that need solving before they can be used effectively in the wild.
> I would prefer that we solve those problems before making the feature
> more complicated. That's a tall ask, since these problems do not have
> simple solutions.

I have an idea: if we can make such path-scoped filters work, we can
combine sparse-checkout with them... Maybe one day users can run:

	git clone --sparse --filter="sparse:buffer=dir" xxx.git

to get a repo with the sparse-checkout already applied... Needless to
say, this is very tempting.

> Thanks,
> -Stolee

Thanks,
ZheNing Hu
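For reference, a sketch of the cone mode mentioned above as it exists
in sparse-checkout today: the user supplies only directory names, and
git expands them into a fixed, restricted pattern shape (directory
names follow the sparse-checkout-example repo).

	git sparse-checkout init --cone &&
	git sparse-checkout set client service/list &&
	cat .git/info/sparse-checkout
	# /*
	# !/*/
	# /client/
	# /service/
	# !/service/*/
	# /service/list/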
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 195e74eec63..76410cdafd6 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -913,6 +913,10 @@ specification contained in the blob (or blob-expression) '<blob-ish>'
 to omit blobs that would not be required for a sparse checkout on
 the requested refs.
 +
+The form '--filter=sparse:buffer=<spec>' uses a sparse-checkout
+specification contained in the buffer to omit blobs that would not
+be required for a sparse checkout on the requested refs.
++
 The form '--filter=tree:<depth>' omits all blobs and trees whose depth
 from the root tree is >= <depth> (minimum depth if an object is located
 at multiple depths in the commits traversed). <depth>=0 will not include
diff --git a/dir.c b/dir.c
index d7cfb08e441..5c438d2e730 100644
--- a/dir.c
+++ b/dir.c
@@ -1037,10 +1037,6 @@ static void invalidate_directory(struct untracked_cache *uc,
 		dir->dirs[i]->recurse = 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl);
-
 /* Flags for add_patterns() */
 #define PATTERN_NOFOLLOW (1<<0)
 
@@ -1123,7 +1119,7 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 	return 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
+int add_patterns_from_buffer(char *buf, size_t size,
 			const char *base, int baselen,
 			struct pattern_list *pl)
 {
diff --git a/dir.h b/dir.h
index 7bc862030cf..7dfc49cf83a 100644
--- a/dir.h
+++ b/dir.h
@@ -440,6 +440,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname);
 int add_patterns_from_blob_to_list(struct object_id *oid,
 				   const char *base, int baselen,
 				   struct pattern_list *pl);
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl);
 void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
 void add_pattern(const char *string, const char *base,
 		 int baselen, struct pattern_list *pl, int srcpos);
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index 4b25287886d..0afb066e795 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -29,6 +29,8 @@ const char *list_object_filter_config_name(enum list_objects_filter_choice c)
 		return "tree";
 	case LOFC_SPARSE_OID:
 		return "sparse:oid";
+	case LOFC_SPARSE_BUFFER:
+		return "sparse:buffer";
 	case LOFC_OBJECT_TYPE:
 		return "object:type";
 	case LOFC_COMBINE:
@@ -84,6 +86,11 @@ int gently_parse_list_objects_filter(
 		}
 		return 1;
 
+	} else if (skip_prefix(arg, "sparse:buffer=", &v0)) {
+		filter_options->spec_buffer = xstrdup(v0);
+		filter_options->choice = LOFC_SPARSE_BUFFER;
+		return 0;
+
 	} else if (skip_prefix(arg, "object:type=", &v0)) {
 		int type = type_from_string_gently(v0, strlen(v0), 1);
 		if (type < 0) {
@@ -338,6 +345,7 @@ void list_objects_filter_release(
 		return;
 	string_list_clear(&filter_options->filter_spec, /*free_util=*/0);
 	free(filter_options->sparse_oid_name);
+	free(filter_options->spec_buffer);
 	for (sub = 0; sub < filter_options->sub_nr; sub++)
 		list_objects_filter_release(&filter_options->sub[sub]);
 	free(filter_options->sub);
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index ffc02d77e76..21d9ed4ef5d 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -14,6 +14,7 @@ enum list_objects_filter_choice {
 	LOFC_BLOB_LIMIT,
 	LOFC_TREE_DEPTH,
 	LOFC_SPARSE_OID,
+	LOFC_SPARSE_BUFFER,
 	LOFC_OBJECT_TYPE,
 	LOFC_COMBINE,
 	LOFC__COUNT /* must be last */
@@ -58,6 +59,8 @@ struct list_objects_filter_options {
 	unsigned long tree_exclude_depth;
 	enum object_type object_type;
 
+	char *spec_buffer;
+
 	/* LOFC_COMBINE values */
 
 	/* This array contains all the subfilters which this filter combines. */
diff --git a/list-objects-filter.c b/list-objects-filter.c
index 1c1ee3d1bb1..17127c167d4 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -545,6 +545,32 @@ static void filter_sparse_oid__init(
 	filter->free_fn = filter_sparse_free;
 }
 
+static void filter_sparse_file_path__init(
+	struct list_objects_filter_options *filter_options,
+	struct filter *filter)
+{
+	struct filter_sparse_data *d = xcalloc(1, sizeof(*d));
+	struct strbuf strbuf = STRBUF_INIT;
+	char *buf;
+	size_t size;
+
+	strbuf_addf(&strbuf, "%s\n", filter_options->spec_buffer);
+	buf = strbuf_detach(&strbuf, &size);
+	if (add_patterns_from_buffer(buf, size, "", 0, &d->pl) < 0)
+		die(_("unable to parse sparse filter data: %s"),
+		    filter_options->spec_buffer);
+	strbuf_release(&strbuf);
+
+	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
+	d->array_frame[d->nr].default_match = 0; /* default to include */
+	d->array_frame[d->nr].child_prov_omit = 0;
+	d->nr++;
+
+	filter->filter_data = d;
+	filter->filter_object_fn = filter_sparse;
+	filter->free_fn = filter_sparse_free;
+}
+
 /*
  * A filter for list-objects to omit large blobs.
  * And to OPTIONALLY collect a list of the omitted OIDs.
@@ -766,6 +792,7 @@ static filter_init_fn s_filters[] = {
 	filter_blobs_limit__init,
 	filter_trees_depth__init,
 	filter_sparse_oid__init,
+	filter_sparse_file_path__init,
 	filter_object_type__init,
 	filter_combine__init,
 };
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 4a3778d04a8..14c435f2f4b 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -421,6 +421,9 @@ test_expect_success 'setup src repo for sparse filter' '
 	git -C sparse-src config --local uploadpack.allowanysha1inwant 1 &&
 	test_commit -C sparse-src one &&
 	test_commit -C sparse-src two &&
+	mkdir sparse-src/three &&
+	test_commit -C sparse-src/three four &&
+	test_commit -C sparse-src/three five &&
 	echo /one.t >sparse-src/only-one &&
 	git -C sparse-src add . &&
 	git -C sparse-src commit -m "add sparse checkout files"
@@ -451,6 +454,86 @@ test_expect_success 'partial clone with unresolvable sparse filter fails cleanly' '
 	test_i18ngrep "unable to parse sparse filter data in" err
 '
 
+test_expect_success 'partial clone with sparse:buffer filter with single file succeeds' '
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter=sparse:buffer=one.t \
+		  sparse-src dst.git &&
+	(
+		cd dst.git &&
+		git rev-list --objects --missing=print HEAD >out &&
+		grep "^$(git rev-parse HEAD:one.t)" out &&
+		grep "^?$(git rev-parse HEAD:two.t)" out &&
+		grep "^?$(git rev-parse HEAD:three/four.t)" out &&
+		grep "^?$(git rev-parse HEAD:three/five.t)" out
+	)
+'
+
+test_expect_success 'partial clone with sparse:buffer filter with single dir succeeds' '
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter=sparse:buffer=three \
+		  sparse-src dst.git &&
+	(
+		cd dst.git &&
+		git rev-list --objects --missing=print HEAD >out &&
+		grep "^?$(git rev-parse HEAD:one.t)" out &&
+		grep "^?$(git rev-parse HEAD:two.t)" out &&
+		grep "^$(git rev-parse HEAD:three/four.t)" out &&
+		grep "^$(git rev-parse HEAD:three/five.t)" out
+	)
+'
+
+test_expect_success 'partial clone with sparse:buffer filter with filterspec file succeeds' '
+	test_write_lines one.t three > filterspec &&
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter="sparse:buffer=`cat filterspec`" \
+		  sparse-src dst.git &&
+	(
+		cd dst.git &&
+		git rev-list --objects --missing=print HEAD >out &&
+		grep "^$(git rev-parse HEAD:one.t)" out &&
+		grep "^?$(git rev-parse HEAD:two.t)" out &&
+		grep "^$(git rev-parse HEAD:three/four.t)" out &&
+		grep "^$(git rev-parse HEAD:three/five.t)" out
+	)
+'
+
+test_expect_success 'partial clone with sparse:buffer filter with filterspec file and special character' '
+	cat >filterspec <<-EOF &&
+	three
+	!three/four.t
+	EOF
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter="sparse:buffer=`cat filterspec`" \
+		  sparse-src dst.git &&
+	(
+		cd dst.git &&
+		git rev-list --objects --missing=print HEAD >out &&
+		grep "^?$(git rev-parse HEAD:one.t)" out &&
+		grep "^?$(git rev-parse HEAD:two.t)" out &&
+		grep "^?$(git rev-parse HEAD:three/four.t)" out &&
+		grep "^$(git rev-parse HEAD:three/five.t)" out
+	)
+'
+
+test_expect_success 'partial clone with sparse:buffer filter with unknown filterspec' '
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter=sparse:buffer=unknown \
+		  sparse-src dst.git &&
+	(
+		cd dst.git &&
+		git rev-list --objects --missing=print HEAD >out &&
+		grep "^?$(git rev-parse HEAD:one.t)" out &&
+		grep "^?$(git rev-parse HEAD:two.t)" out &&
+		grep "^?$(git rev-parse HEAD:three/four.t)" out &&
+		grep "^?$(git rev-parse HEAD:three/five.t)" out
+	)
+'
+
 setup_triangle () {
 	rm -rf big-blob.txt server client promisor-remote &&