Message ID | ebbe8569dcc16f2ae235c4167be0e72a53982982.1611596534.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Sparse Index | expand |
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > We will mark an in-memory index_state as having sparse directory entries > with the sparse_index bit. These currently cannot exist, but we will add > a mechanism for collapsing a full index to a sparse one in a later > change. That will happen at write time, so we must first allow parsing > the format before writing it. > > Commands or methods that require a full index in order to operate can > call ensure_full_index() to expand that index in-memory. This requires > parsing trees using that index's repository. > > Sparse directory entries have a specific 'ce_mode' value. The macro > S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This > ce_mode is not possible with the existing index formats, so we don't > also verify all properties of a sparse-directory entry, which are: > > 1. ce->ce_mode == 01000755 This is a weird number. What's the reason for choosing it? It looks deceptively close to 0100755, normal executable files, but has the extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no file type. Since it's a directory, why not use S_IFDIR (040000)? (GITLINK does use the weird 0160000 value, but it happens to be S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a directory and a symlink") > 2. ce->flags & CE_SKIP_WORKTREE is true Makes sense. > 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) Is there a particular reason for this? I'm used to seeing names without the trailing slash, both in the index and in tree objects. I don't know enough to be for or against this idea; just curious at this point. > 4. ce->oid references a tree object. Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000? Also, as a bit of a side comment: There have been other requests in the past to support directory objects in the index. The only use I remember for them requested from others was to allow tracking empty directories. However, I've long wanted to introduce a new "blobtree" object to git, so that a user can "git add" some big binary file, but internally git splits the binary file up and stores it as multiple blobs within git plus a new "blobtree" object that references all the individual blobs with enough information about how to stitch all the blobs together to get the original file. I had a few different forms of "blobtree" things that I was interested in. I think brian once suggested some similar idea. > These are all semi-enforced in ensure_full_index() to some extent. Any > deviation will cause a warning at minimum or a failure in the worst > case. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > cache.h | 11 +++++- > read-cache.c | 9 +++++ > sparse-index.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++- > sparse-index.h | 1 + > 4 files changed, 113 insertions(+), 2 deletions(-) > > diff --git a/cache.h b/cache.h > index f9c7a603841..884046ca5b8 100644 > --- a/cache.h > +++ b/cache.h > @@ -204,6 +204,10 @@ struct cache_entry { > #error "CE_EXTENDED_FLAGS out of range" > #endif > > +#define CE_MODE_SPARSE_DIRECTORY 01000755 > +#define SPARSE_DIR_MODE 0100 Another magic value. Feels like the commit message should reference this one and why it was picked. Seems odd to me, and possibly problematic to re-use file permission bits that might collide with files recorded by really old versions of git. Maybe that's not a concern, though. > +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY) Should the special sauce apply to ce_flags rather than ce_mode? Thus, instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar to CE_SKIP_WORKTREE)? Or, alternatively, do we need a single special state here? Could we check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)? > + > /* Forward structure decls */ > struct pathspec; > struct child_process; > @@ -249,6 +253,8 @@ static inline unsigned int create_ce_mode(unsigned int mode) > { > if (S_ISLNK(mode)) > return S_IFLNK; > + if (mode == SPARSE_DIR_MODE) > + return CE_MODE_SPARSE_DIRECTORY; > if (S_ISDIR(mode) || S_ISGITLINK(mode)) > return S_IFGITLINK; > return S_IFREG | ce_permissions(mode); > @@ -319,7 +325,8 @@ struct index_state { > drop_cache_tree : 1, > updated_workdir : 1, > updated_skipworktree : 1, > - fsmonitor_has_run_once : 1; > + fsmonitor_has_run_once : 1, > + sparse_index : 1; > struct hashmap name_hash; > struct hashmap dir_hash; > struct object_id oid; > @@ -721,6 +728,8 @@ int read_index_from(struct index_state *, const char *path, > const char *gitdir); > int is_index_unborn(struct index_state *); > > +void ensure_full_index(struct index_state *istate); > + > /* For use with `write_locked_index()`. */ > #define COMMIT_LOCK (1 << 0) > #define SKIP_IF_UNCHANGED (1 << 1) > diff --git a/read-cache.c b/read-cache.c > index ecf6f689940..1097ecbf132 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -101,6 +101,9 @@ static const char *alternate_index_output; > > static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) > { > + if (S_ISSPARSEDIR(ce)) > + istate->sparse_index = 1; > + > istate->cache[nr] = ce; > add_name_hash(istate, ce); > } > @@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) > trace2_data_intmax("index", the_repository, "read/cache_nr", > istate->cache_nr); > > + if (!istate->repo) > + istate->repo = the_repository; > + prepare_repo_settings(istate->repo); > + if (istate->repo->settings.command_requires_full_index) > + ensure_full_index(istate); > + > return istate->cache_nr; > > unmap: > diff --git a/sparse-index.c b/sparse-index.c > index 82183ead563..1e70244dc13 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -1,8 +1,100 @@ > #include "cache.h" > #include "repository.h" > #include "sparse-index.h" > +#include "tree.h" > +#include "pathspec.h" > +#include "trace2.h" > + > +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) > +{ > + ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc); > + > + istate->cache[nr] = ce; > + add_name_hash(istate, ce); > +} > + > +static int add_path_to_index(const struct object_id *oid, > + struct strbuf *base, const char *path, > + unsigned int mode, int stage, void *context) > +{ > + struct index_state *istate = (struct index_state *)context; > + struct cache_entry *ce; > + size_t len = base->len; > + > + if (S_ISDIR(mode)) > + return READ_TREE_RECURSIVE; > + > + strbuf_addstr(base, path); > + > + ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0); > + ce->ce_flags |= CE_SKIP_WORKTREE; > + set_index_entry(istate, istate->cache_nr++, ce); > + > + strbuf_setlen(base, len); > + return 0; > +} > > void ensure_full_index(struct index_state *istate) > { > - /* intentionally left blank */ > + int i; > + struct index_state *full; > + > + if (!istate || !istate->sparse_index) > + return; > + > + if (!istate->repo) > + istate->repo = the_repository; > + > + trace2_region_enter("index", "ensure_full_index", istate->repo); > + > + /* initialize basics of new index */ > + full = xcalloc(1, sizeof(struct index_state)); > + memcpy(full, istate, sizeof(struct index_state)); > + > + /* then change the necessary things */ > + full->sparse_index = 0; > + full->cache_alloc = (3 * istate->cache_alloc) / 2; > + full->cache_nr = 0; > + ALLOC_ARRAY(full->cache, full->cache_alloc); > + > + for (i = 0; i < istate->cache_nr; i++) { > + struct cache_entry *ce = istate->cache[i]; > + struct tree *tree; > + struct pathspec ps; > + > + if (!S_ISSPARSEDIR(ce)) { > + set_index_entry(full, full->cache_nr++, ce); > + continue; > + } > + if (!(ce->ce_flags & CE_SKIP_WORKTREE)) > + warning(_("index entry is a directory, but not sparse (%08x)"), > + ce->ce_flags); > + > + /* recursively walk into cd->name */ > + tree = lookup_tree(istate->repo, &ce->oid); > + > + memset(&ps, 0, sizeof(ps)); > + ps.recursive = 1; > + ps.has_wildcard = 1; > + ps.max_depth = -1; > + > + read_tree_recursive(istate->repo, tree, > + ce->name, strlen(ce->name), > + 0, &ps, > + add_path_to_index, full); > + > + /* free directory entries. full entries are re-used */ > + discard_cache_entry(ce); > + } > + > + /* Copy back into original index. */ > + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); > + istate->sparse_index = 0; > + istate->cache = full->cache; Haven't you leaked the original istate->cache here? > + istate->cache_nr = full->cache_nr; > + istate->cache_alloc = full->cache_alloc; > + > + free(full); > + > + trace2_region_leave("index", "ensure_full_index", istate->repo); > } > diff --git a/sparse-index.h b/sparse-index.h > index 8dda92032e2..a2777dcac59 100644 > --- a/sparse-index.h > +++ b/sparse-index.h > @@ -3,5 +3,6 @@ > > struct index_state; > void ensure_full_index(struct index_state *istate); > +int convert_to_sparse(struct index_state *istate); Seems logically that you'd add one, but was this meant to be included in a later patch? > > #endif > \ No newline at end of file > -- > gitgitgadget >
On 1/26/2021 10:05 PM, Elijah Newren wrote: > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: ... >> Sparse directory entries have a specific 'ce_mode' value. The macro >> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This >> ce_mode is not possible with the existing index formats, so we don't >> also verify all properties of a sparse-directory entry, which are: >> >> 1. ce->ce_mode == 01000755 > > This is a weird number. What's the reason for choosing it? It looks > deceptively close to 0100755, normal executable files, but has the > extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no > file type. > > Since it's a directory, why not use S_IFDIR (040000)? > > (GITLINK does use the weird 0160000 value, but it happens to be > S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a > directory and a symlink") I forget how exactly I came up with these magic constants, but then completely forgot to think of them critically because I haven't had to look at them in a while. They _are_ important, especially because these values affect the file format itself. I'll think harder on this before submitting a series intended for merging. >> 2. ce->flags & CE_SKIP_WORKTREE is true > > Makes sense. > >> 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) > > Is there a particular reason for this? I'm used to seeing names > without the trailing slash, both in the index and in tree objects. I > don't know enough to be for or against this idea; just curious at this > point. It's yet another way to distinguish directories from files, but there are cases where we do string searches up to a prefix, and having these directory separators did help, IIRC. >> 4. ce->oid references a tree object. > > Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000? ... >> +#define CE_MODE_SPARSE_DIRECTORY 01000755 >> +#define SPARSE_DIR_MODE 0100 > > Another magic value. Feels like the commit message should reference > this one and why it was picked. Seems odd to me, and possibly > problematic to re-use file permission bits that might collide with > files recorded by really old versions of git. Maybe that's not a > concern, though. > >> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY) > > Should the special sauce apply to ce_flags rather than ce_mode? Thus, > instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro > (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar > to CE_SKIP_WORKTREE)? > > Or, alternatively, do we need a single special state here? Could we > check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)? The intention was that ce_mode be a unique value that could only be assigned to a directory entry, which would then by necessity be sparse. Checking both ce_mode and ce_flags seemed wasteful with the given assumptions ... >> + /* Copy back into original index. */ >> + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); >> + istate->sparse_index = 0; >> + istate->cache = full->cache; > > Haven't you leaked the original istate->cache here? Yes, seems so. Will fix. Thanks, -Stolee
On Wed, Jan 27, 2021 at 5:43 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 1/26/2021 10:05 PM, Elijah Newren wrote: > > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget > > <gitgitgadget@gmail.com> wrote: > ... > >> Sparse directory entries have a specific 'ce_mode' value. The macro > >> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This > >> ce_mode is not possible with the existing index formats, so we don't > >> also verify all properties of a sparse-directory entry, which are: > >> > >> 1. ce->ce_mode == 01000755 > > > > This is a weird number. What's the reason for choosing it? It looks > > deceptively close to 0100755, normal executable files, but has the > > extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no > > file type. > > > > Since it's a directory, why not use S_IFDIR (040000)? > > > > (GITLINK does use the weird 0160000 value, but it happens to be > > S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a > > directory and a symlink") > > I forget how exactly I came up with these magic constants, but then > completely forgot to think of them critically because I haven't had > to look at them in a while. They _are_ important, especially because > these values affect the file format itself. > > I'll think harder on this before submitting a series intended for > merging. > > >> 2. ce->flags & CE_SKIP_WORKTREE is true > > > > Makes sense. > > > >> 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) > > > > Is there a particular reason for this? I'm used to seeing names > > without the trailing slash, both in the index and in tree objects. I > > don't know enough to be for or against this idea; just curious at this > > point. > > It's yet another way to distinguish directories from files, but > there are cases where we do string searches up to a prefix, and > having these directory separators did help, IIRC. > > >> 4. ce->oid references a tree object. > > > > Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000? > > ... > > >> +#define CE_MODE_SPARSE_DIRECTORY 01000755 > >> +#define SPARSE_DIR_MODE 0100 > > > > Another magic value. Feels like the commit message should reference > > this one and why it was picked. Seems odd to me, and possibly > > problematic to re-use file permission bits that might collide with > > files recorded by really old versions of git. Maybe that's not a > > concern, though. > > > >> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY) > > > > Should the special sauce apply to ce_flags rather than ce_mode? Thus, > > instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro > > (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar > > to CE_SKIP_WORKTREE)? > > > > Or, alternatively, do we need a single special state here? Could we > > check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)? > > The intention was that ce_mode be a unique value that could only > be assigned to a directory entry, which would then by necessity be > sparse. Checking both ce_mode and ce_flags seemed wasteful with the > given assumptions 040000 is a unique value that could only be assigned to a directory entry. Since we have no other uses of directories within the index, you are right, we wouldn't need to check ce_skip_worktree(ce) as well; just a check for the 040000 mode would be enough. > ... > > >> + /* Copy back into original index. */ > >> + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); > >> + istate->sparse_index = 0; > >> + istate->cache = full->cache; > > > > Haven't you leaked the original istate->cache here? > > Yes, seems so. Will fix. > > Thanks, > -Stolee
Elijah Newren <newren@gmail.com> writes: >> 1. ce->ce_mode == 01000755 > > This is a weird number. What's the reason for choosing it? It looks > deceptively close to 0100755, normal executable files, but has the > extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no > file type. > > Since it's a directory, why not use S_IFDIR (040000)? > > (GITLINK does use the weird 0160000 value, but it happens to be > S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a > directory and a symlink") Yes, that combination of IFLNK/IFDIR was the reason why we use the value. I tend to think IFDIR is the best thing to use here.
diff --git a/cache.h b/cache.h index f9c7a603841..884046ca5b8 100644 --- a/cache.h +++ b/cache.h @@ -204,6 +204,10 @@ struct cache_entry { #error "CE_EXTENDED_FLAGS out of range" #endif +#define CE_MODE_SPARSE_DIRECTORY 01000755 +#define SPARSE_DIR_MODE 0100 +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY) + /* Forward structure decls */ struct pathspec; struct child_process; @@ -249,6 +253,8 @@ static inline unsigned int create_ce_mode(unsigned int mode) { if (S_ISLNK(mode)) return S_IFLNK; + if (mode == SPARSE_DIR_MODE) + return CE_MODE_SPARSE_DIRECTORY; if (S_ISDIR(mode) || S_ISGITLINK(mode)) return S_IFGITLINK; return S_IFREG | ce_permissions(mode); @@ -319,7 +325,8 @@ struct index_state { drop_cache_tree : 1, updated_workdir : 1, updated_skipworktree : 1, - fsmonitor_has_run_once : 1; + fsmonitor_has_run_once : 1, + sparse_index : 1; struct hashmap name_hash; struct hashmap dir_hash; struct object_id oid; @@ -721,6 +728,8 @@ int read_index_from(struct index_state *, const char *path, const char *gitdir); int is_index_unborn(struct index_state *); +void ensure_full_index(struct index_state *istate); + /* For use with `write_locked_index()`. */ #define COMMIT_LOCK (1 << 0) #define SKIP_IF_UNCHANGED (1 << 1) diff --git a/read-cache.c b/read-cache.c index ecf6f689940..1097ecbf132 100644 --- a/read-cache.c +++ b/read-cache.c @@ -101,6 +101,9 @@ static const char *alternate_index_output; static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) { + if (S_ISSPARSEDIR(ce)) + istate->sparse_index = 1; + istate->cache[nr] = ce; add_name_hash(istate, ce); } @@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) trace2_data_intmax("index", the_repository, "read/cache_nr", istate->cache_nr); + if (!istate->repo) + istate->repo = the_repository; + prepare_repo_settings(istate->repo); + if (istate->repo->settings.command_requires_full_index) + ensure_full_index(istate); + return istate->cache_nr; unmap: diff --git a/sparse-index.c b/sparse-index.c index 82183ead563..1e70244dc13 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -1,8 +1,100 @@ #include "cache.h" #include "repository.h" #include "sparse-index.h" +#include "tree.h" +#include "pathspec.h" +#include "trace2.h" + +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc); + + istate->cache[nr] = ce; + add_name_hash(istate, ce); +} + +static int add_path_to_index(const struct object_id *oid, + struct strbuf *base, const char *path, + unsigned int mode, int stage, void *context) +{ + struct index_state *istate = (struct index_state *)context; + struct cache_entry *ce; + size_t len = base->len; + + if (S_ISDIR(mode)) + return READ_TREE_RECURSIVE; + + strbuf_addstr(base, path); + + ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0); + ce->ce_flags |= CE_SKIP_WORKTREE; + set_index_entry(istate, istate->cache_nr++, ce); + + strbuf_setlen(base, len); + return 0; +} void ensure_full_index(struct index_state *istate) { - /* intentionally left blank */ + int i; + struct index_state *full; + + if (!istate || !istate->sparse_index) + return; + + if (!istate->repo) + istate->repo = the_repository; + + trace2_region_enter("index", "ensure_full_index", istate->repo); + + /* initialize basics of new index */ + full = xcalloc(1, sizeof(struct index_state)); + memcpy(full, istate, sizeof(struct index_state)); + + /* then change the necessary things */ + full->sparse_index = 0; + full->cache_alloc = (3 * istate->cache_alloc) / 2; + full->cache_nr = 0; + ALLOC_ARRAY(full->cache, full->cache_alloc); + + for (i = 0; i < istate->cache_nr; i++) { + struct cache_entry *ce = istate->cache[i]; + struct tree *tree; + struct pathspec ps; + + if (!S_ISSPARSEDIR(ce)) { + set_index_entry(full, full->cache_nr++, ce); + continue; + } + if (!(ce->ce_flags & CE_SKIP_WORKTREE)) + warning(_("index entry is a directory, but not sparse (%08x)"), + ce->ce_flags); + + /* recursively walk into cd->name */ + tree = lookup_tree(istate->repo, &ce->oid); + + memset(&ps, 0, sizeof(ps)); + ps.recursive = 1; + ps.has_wildcard = 1; + ps.max_depth = -1; + + read_tree_recursive(istate->repo, tree, + ce->name, strlen(ce->name), + 0, &ps, + add_path_to_index, full); + + /* free directory entries. full entries are re-used */ + discard_cache_entry(ce); + } + + /* Copy back into original index. */ + memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); + istate->sparse_index = 0; + istate->cache = full->cache; + istate->cache_nr = full->cache_nr; + istate->cache_alloc = full->cache_alloc; + + free(full); + + trace2_region_leave("index", "ensure_full_index", istate->repo); } diff --git a/sparse-index.h b/sparse-index.h index 8dda92032e2..a2777dcac59 100644 --- a/sparse-index.h +++ b/sparse-index.h @@ -3,5 +3,6 @@ struct index_state; void ensure_full_index(struct index_state *istate); +int convert_to_sparse(struct index_state *istate); #endif \ No newline at end of file