Message ID | 213f2c94b73f90fc758c2e3872804cf640cb2005.1643186507.git.gitgitgadget@gmail.com
---|---
State | New, archived
Series | scalar: implement the subcommand "diagnose"
On Wed, Jan 26, 2022 at 08:41:46AM +0000, Matthew John Cheetham via GitGitGadget wrote:
> +	while ((e = readdir(dir)) != NULL)
> +		if (!is_dot_or_dotdot(e->d_name) &&
> +		    e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
> +		    !hex_to_bytes(&c, e->d_name, 1)) {

What is this call to hex_to_bytes() for? I assume it's checking to make
sure the directory we're looking at is one of the shards of loose
objects.

Similar to my suggestion on the previous patch, I think that we could
get rid of this function entirely and replace it with a call to
for_each_loose_file_in_objdir(). We'll pay a little bit of extra cost
to parse out each loose object's OID, but it should be negligible since
we're not actually opening up each object.

> diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
> index b1745851e31..f2ec156d819 100755
> --- a/contrib/scalar/t/t9099-scalar.sh
> +++ b/contrib/scalar/t/t9099-scalar.sh
> @@ -77,6 +77,8 @@ test_expect_success UNZIP 'scalar diagnose' '
>  	unzip -p "$zip_path" diagnostics.log >out &&
>  	test_file_not_empty out &&
>  	unzip -p "$zip_path" packs-local.txt >out &&
> +	test_file_not_empty out &&

A more comprehensive test (here, and in the earlier instances, too)
might be useful beyond just "does this file exist in the archive".
Constructing an example repository where the number of loose objects is
known ahead of time, and then finding that number in the output of
objects-local.txt might be worthwhile to give us some extra confidence
that this is working as intended.

> +	unzip -p "$zip_path" objects-local.txt >out &&
>  	test_file_not_empty out
>  '

Thanks,
Taylor
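For concreteness, here is a minimal sketch of what the suggested replacement might look like, assuming Git's for_each_loose_file_in_objdir() API; the struct and function names (loose_counts, count_loose_object, flush_subdir, loose_objs_stats_via_helper) are invented for illustration and are not part of the patch:

/*
 * Illustrative sketch only (not part of the patch): counting loose
 * objects per shard via for_each_loose_file_in_objdir(). The helper
 * calls the object callback once per loose object and the subdir
 * callback once after finishing each existing "xx" shard directory.
 */
#include "cache.h"
#include "object-store.h"

struct loose_counts {
	int in_subdir;		/* objects seen in the current shard */
	int total;		/* objects seen overall */
	struct strbuf *buf;	/* report being built */
};

static int count_loose_object(const struct object_id *oid,
			      const char *path, void *data)
{
	struct loose_counts *c = data;
	c->in_subdir++;
	return 0;
}

static int flush_subdir(unsigned int nr, const char *path, void *data)
{
	struct loose_counts *c = data;

	/* emit "xx : N files" for the shard we just finished */
	strbuf_addf(c->buf, "%02x : %7d files\n", nr, c->in_subdir);
	c->total += c->in_subdir;
	c->in_subdir = 0;
	return 0;
}

static void loose_objs_stats_via_helper(struct strbuf *buf, const char *path)
{
	struct loose_counts c = { 0, 0, buf };

	strbuf_addf(buf, "Object directory stats for %s:\n", path);
	for_each_loose_file_in_objdir(path, count_loose_object,
				      NULL, flush_subdir, &c);
	strbuf_addf(buf, "Total: %d loose objects", c.total);
}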
On 1/26/2022 5:50 PM, Taylor Blau wrote:
> On Wed, Jan 26, 2022 at 08:41:46AM +0000, Matthew John Cheetham via GitGitGadget wrote:
>> +	while ((e = readdir(dir)) != NULL)
>> +		if (!is_dot_or_dotdot(e->d_name) &&
>> +		    e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
>> +		    !hex_to_bytes(&c, e->d_name, 1)) {
>
> What is this call to hex_to_bytes() for? I assume it's checking to make
> sure the directory we're looking at is one of the shards of loose
> objects.
>
> Similar to my suggestion on the previous patch, I think that we could
> get rid of this function entirely and replace it with a call to
> for_each_loose_file_in_objdir().

There is a possibility that there are files other than loose objects in
these directories, so summarizing those counts might be helpful
information.

For example: if somehow .git/objects/00/ was full of a bunch of
non-objects, it would still slow down Git commands that ask for a
short-sha starting with "00".

While this shouldn't be a normal case, the 'diagnose' command is built
to help us find these extremely odd scenarios because they _have_
happened before (typically because of a VFS for Git bug), which taught
us how to look for these situations.

Thanks,
-Stolee
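Even if the suggested helper were used, files in a shard that do not parse as loose objects would still be reported through its "cruft" callback, so the extra signal Stolee describes would not necessarily be lost. A rough sketch, again with invented names (shard_counts, note_object, note_cruft, summarize_objdir), not anything proposed in the series:

/*
 * Sketch only: for_each_loose_file_in_objdir() reports files whose
 * names do not look like loose-object names through its cruft
 * callback, so stray files in .git/objects/00/ etc. could still be
 * counted separately rather than silently dropped.
 */
struct shard_counts {
	int objects;
	int non_objects;
};

static int note_object(const struct object_id *oid, const char *path,
		       void *data)
{
	((struct shard_counts *)data)->objects++;
	return 0;
}

static int note_cruft(const char *basename, const char *path, void *data)
{
	/* e.g. a stray editor backup left behind in a shard directory */
	((struct shard_counts *)data)->non_objects++;
	return 0;
}

static void summarize_objdir(struct strbuf *buf, const char *path)
{
	struct shard_counts counts = { 0, 0 };

	for_each_loose_file_in_objdir(path, note_object, note_cruft,
				      NULL, &counts);
	strbuf_addf(buf, "loose objects: %d, other files: %d\n",
		    counts.objects, counts.non_objects);
}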
On Wed, Jan 26, 2022 at 3:37 PM Matthew John Cheetham via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Matthew John Cheetham <mjcheetham@outlook.com>
>
> When operating at the scale that Scalar wants to support, certain data
> shapes are more likely to cause undesirable performance issues, such as
> large numbers or large sizes of loose objects.

Makes sense.

> By including statistics about this, `scalar diagnose` now makes it
> easier to identify such scenarios.
>
> Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
> ---
>  contrib/scalar/scalar.c          | 60 ++++++++++++++++++++++++++++++++
>  contrib/scalar/t/t9099-scalar.sh |  2 ++
>  2 files changed, 62 insertions(+)
>
> diff --git a/contrib/scalar/scalar.c b/contrib/scalar/scalar.c
> index 690933ffdf3..c0ad4948215 100644
> --- a/contrib/scalar/scalar.c
> +++ b/contrib/scalar/scalar.c
> @@ -686,6 +686,60 @@ static void dir_file_stats(struct strbuf *buf, const char *path)
>  	closedir(dir);
>  }
>
> +static int count_files(char *path)
> +{
> +	DIR *dir = opendir(path);
> +	struct dirent *e;
> +	int count = 0;
> +
> +	if (!dir)
> +		return 0;
> +
> +	while ((e = readdir(dir)) != NULL)
> +		if (!is_dot_or_dotdot(e->d_name) && e->d_type == DT_REG)
> +			count++;
> +
> +	closedir(dir);
> +	return count;
> +}
> +
> +static void loose_objs_stats(struct strbuf *buf, const char *path)
> +{
> +	DIR *dir = opendir(path);
> +	struct dirent *e;
> +	int count;
> +	int total = 0;
> +	unsigned char c;
> +	struct strbuf count_path = STRBUF_INIT;
> +	size_t base_path_len;
> +
> +	if (!dir)
> +		return;
> +
> +	strbuf_addstr(buf, "Object directory stats for ");
> +	strbuf_add_absolute_path(buf, path);
> +	strbuf_addstr(buf, ":\n");
> +
> +	strbuf_add_absolute_path(&count_path, path);
> +	strbuf_addch(&count_path, '/');
> +	base_path_len = count_path.len;
> +
> +	while ((e = readdir(dir)) != NULL)
> +		if (!is_dot_or_dotdot(e->d_name) &&
> +		    e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
> +		    !hex_to_bytes(&c, e->d_name, 1)) {

You only recurse into directories, ignoring individual files.

> +			strbuf_setlen(&count_path, base_path_len);
> +			strbuf_addstr(&count_path, e->d_name);
> +			total += (count = count_files(count_path.buf));
> +			strbuf_addf(buf, "%s : %7d files\n", e->d_name, count);

This shows the number of files within a directory.

> +		}
> +
> +	strbuf_addf(buf, "Total: %d loose objects", total);

and this shows the total number of files across all the directories.

But the commit message suggested you also wanted to check for large
sizes of loose objects. Did that get ripped out at some point with the
commit message not being updated, or is it perhaps going to be included
later?
> +
> +	strbuf_release(&count_path);
> +	closedir(dir);
> +}
> +
>  static int cmd_diagnose(int argc, const char **argv)
>  {
>  	struct option options[] = {
> @@ -734,6 +788,12 @@ static int cmd_diagnose(int argc, const char **argv)
>  	if ((res = stage(tmp_dir.buf, &buf, "packs-local.txt")))
>  		goto diagnose_cleanup;
>
> +	strbuf_reset(&buf);
> +	loose_objs_stats(&buf, ".git/objects");
> +
> +	if ((res = stage(tmp_dir.buf, &buf, "objects-local.txt")))
> +		goto diagnose_cleanup;
> +
>  	if ((res = stage_directory(tmp_dir.buf, ".git", 0)) ||
>  	    (res = stage_directory(tmp_dir.buf, ".git/hooks", 0)) ||
>  	    (res = stage_directory(tmp_dir.buf, ".git/info", 0)) ||
> diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
> index b1745851e31..f2ec156d819 100755
> --- a/contrib/scalar/t/t9099-scalar.sh
> +++ b/contrib/scalar/t/t9099-scalar.sh
> @@ -77,6 +77,8 @@ test_expect_success UNZIP 'scalar diagnose' '
>  	unzip -p "$zip_path" diagnostics.log >out &&
>  	test_file_not_empty out &&
>  	unzip -p "$zip_path" packs-local.txt >out &&
> +	test_file_not_empty out &&
> +	unzip -p "$zip_path" objects-local.txt >out &&
>  	test_file_not_empty out
>  '
>
> --
> gitgitgadget
Hi Elijah,

On Thu, 27 Jan 2022, Elijah Newren wrote:

> On Wed, Jan 26, 2022 at 3:37 PM Matthew John Cheetham via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: Matthew John Cheetham <mjcheetham@outlook.com>
> >
> > When operating at the scale that Scalar wants to support, certain data
> > shapes are more likely to cause undesirable performance issues, such as
> > large numbers or large sizes of loose objects.
>
> Makes sense.
>
> > By including statistics about this, `scalar diagnose` now makes it
> > easier to identify such scenarios.
> >
> > Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
> > ---
> >  contrib/scalar/scalar.c          | 60 ++++++++++++++++++++++++++++++++
> >  contrib/scalar/t/t9099-scalar.sh |  2 ++
> >  2 files changed, 62 insertions(+)
> >
> > diff --git a/contrib/scalar/scalar.c b/contrib/scalar/scalar.c
> > index 690933ffdf3..c0ad4948215 100644
> > --- a/contrib/scalar/scalar.c
> > +++ b/contrib/scalar/scalar.c
> > @@ -686,6 +686,60 @@ static void dir_file_stats(struct strbuf *buf, const char *path)
> >  	closedir(dir);
> >  }
> >
> > +static int count_files(char *path)
> > +{
> > +	DIR *dir = opendir(path);
> > +	struct dirent *e;
> > +	int count = 0;
> > +
> > +	if (!dir)
> > +		return 0;
> > +
> > +	while ((e = readdir(dir)) != NULL)
> > +		if (!is_dot_or_dotdot(e->d_name) && e->d_type == DT_REG)
> > +			count++;
> > +
> > +	closedir(dir);
> > +	return count;
> > +}
> > +
> > +static void loose_objs_stats(struct strbuf *buf, const char *path)
> > +{
> > +	DIR *dir = opendir(path);
> > +	struct dirent *e;
> > +	int count;
> > +	int total = 0;
> > +	unsigned char c;
> > +	struct strbuf count_path = STRBUF_INIT;
> > +	size_t base_path_len;
> > +
> > +	if (!dir)
> > +		return;
> > +
> > +	strbuf_addstr(buf, "Object directory stats for ");
> > +	strbuf_add_absolute_path(buf, path);
> > +	strbuf_addstr(buf, ":\n");
> > +
> > +	strbuf_add_absolute_path(&count_path, path);
> > +	strbuf_addch(&count_path, '/');
> > +	base_path_len = count_path.len;
> > +
> > +	while ((e = readdir(dir)) != NULL)
> > +		if (!is_dot_or_dotdot(e->d_name) &&
> > +		    e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
> > +		    !hex_to_bytes(&c, e->d_name, 1)) {
>
> You only recurse into directories, ignoring individual files.
>
> > +			strbuf_setlen(&count_path, base_path_len);
> > +			strbuf_addstr(&count_path, e->d_name);
> > +			total += (count = count_files(count_path.buf));
> > +			strbuf_addf(buf, "%s : %7d files\n", e->d_name, count);
>
> This shows the number of files within a directory.
>
> > +		}
> > +
> > +	strbuf_addf(buf, "Total: %d loose objects", total);
>
> and this shows the total number of files across all the directories.
>
> But the commit message suggested you also wanted to check for large
> sizes of loose objects. Did that get ripped out at some point with
> the commit message not being updated, or is it perhaps going to be
> included later?

No, there was no plan to include this information later, as the original
.NET implementation of `scalar diagnose` did not provide that
information, either (which I take as a strong sign that we never needed
this type of information to help users, at least not up until this
point).

Besides, it would be kind of a difficult thing to say conclusively what
makes a loose file "big". Is it the zlib-compressed size on disk? Or the
unpacked size? Should there be a configurable threshold to determine
when an object is big? Should `core.bigFileThreshold` be co-opted for
this?
Together with the fact that there was no need for this information in
practice, it makes me doubt that we should add this type of
information.

I actually suspect that _iff_ information of that type would be
helpful, a more complete tool like git-sizer
(https://github.com/github/git-sizer/) would be needed, and I do not
really want to subsume git-sizer's functionality in `scalar diagnose`.

I rephrased the commit message.

Ciao,
Dscho

> > +
> > +	strbuf_release(&count_path);
> > +	closedir(dir);
> > +}
> > +
> >  static int cmd_diagnose(int argc, const char **argv)
> >  {
> >  	struct option options[] = {
> > @@ -734,6 +788,12 @@ static int cmd_diagnose(int argc, const char **argv)
> >  	if ((res = stage(tmp_dir.buf, &buf, "packs-local.txt")))
> >  		goto diagnose_cleanup;
> >
> > +	strbuf_reset(&buf);
> > +	loose_objs_stats(&buf, ".git/objects");
> > +
> > +	if ((res = stage(tmp_dir.buf, &buf, "objects-local.txt")))
> > +		goto diagnose_cleanup;
> > +
> >  	if ((res = stage_directory(tmp_dir.buf, ".git", 0)) ||
> >  	    (res = stage_directory(tmp_dir.buf, ".git/hooks", 0)) ||
> >  	    (res = stage_directory(tmp_dir.buf, ".git/info", 0)) ||
> > diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
> > index b1745851e31..f2ec156d819 100755
> > --- a/contrib/scalar/t/t9099-scalar.sh
> > +++ b/contrib/scalar/t/t9099-scalar.sh
> > @@ -77,6 +77,8 @@ test_expect_success UNZIP 'scalar diagnose' '
> >  	unzip -p "$zip_path" diagnostics.log >out &&
> >  	test_file_not_empty out &&
> >  	unzip -p "$zip_path" packs-local.txt >out &&
> > +	test_file_not_empty out &&
> > +	unzip -p "$zip_path" objects-local.txt >out &&
> >  	test_file_not_empty out
> > '
> >
> > --
> > gitgitgadget
>
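Purely to illustrate one of the interpretations of "big" raised above (the thread decided not to pursue this), a size check could compare the compressed on-disk size against core.bigFileThreshold; big_file_threshold is Git's existing global for that setting, while the callback name count_big_loose is invented for this sketch and assumes the same includes and walk as the earlier example:

/*
 * Sketch only; the series deliberately does not do this. Here "big"
 * is read as "zlib-compressed size on disk >= core.bigFileThreshold".
 */
static int count_big_loose(const struct object_id *oid, const char *path,
			   void *data)
{
	struct stat st;
	int *big = data;

	/* st_size is the compressed size of the loose object file */
	if (!lstat(path, &st) &&
	    (unsigned long)st.st_size >= big_file_threshold)
		(*big)++;
	return 0;
}

Such a counter could be driven from the same for_each_loose_file_in_objdir() walk as the sketches above, but as noted, a tool like git-sizer already covers this ground more completely.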
diff --git a/contrib/scalar/scalar.c b/contrib/scalar/scalar.c
index 690933ffdf3..c0ad4948215 100644
--- a/contrib/scalar/scalar.c
+++ b/contrib/scalar/scalar.c
@@ -686,6 +686,60 @@ static void dir_file_stats(struct strbuf *buf, const char *path)
 	closedir(dir);
 }
 
+static int count_files(char *path)
+{
+	DIR *dir = opendir(path);
+	struct dirent *e;
+	int count = 0;
+
+	if (!dir)
+		return 0;
+
+	while ((e = readdir(dir)) != NULL)
+		if (!is_dot_or_dotdot(e->d_name) && e->d_type == DT_REG)
+			count++;
+
+	closedir(dir);
+	return count;
+}
+
+static void loose_objs_stats(struct strbuf *buf, const char *path)
+{
+	DIR *dir = opendir(path);
+	struct dirent *e;
+	int count;
+	int total = 0;
+	unsigned char c;
+	struct strbuf count_path = STRBUF_INIT;
+	size_t base_path_len;
+
+	if (!dir)
+		return;
+
+	strbuf_addstr(buf, "Object directory stats for ");
+	strbuf_add_absolute_path(buf, path);
+	strbuf_addstr(buf, ":\n");
+
+	strbuf_add_absolute_path(&count_path, path);
+	strbuf_addch(&count_path, '/');
+	base_path_len = count_path.len;
+
+	while ((e = readdir(dir)) != NULL)
+		if (!is_dot_or_dotdot(e->d_name) &&
+		    e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
+		    !hex_to_bytes(&c, e->d_name, 1)) {
+			strbuf_setlen(&count_path, base_path_len);
+			strbuf_addstr(&count_path, e->d_name);
+			total += (count = count_files(count_path.buf));
+			strbuf_addf(buf, "%s : %7d files\n", e->d_name, count);
+		}
+
+	strbuf_addf(buf, "Total: %d loose objects", total);
+
+	strbuf_release(&count_path);
+	closedir(dir);
+}
+
 static int cmd_diagnose(int argc, const char **argv)
 {
 	struct option options[] = {
@@ -734,6 +788,12 @@ static int cmd_diagnose(int argc, const char **argv)
 	if ((res = stage(tmp_dir.buf, &buf, "packs-local.txt")))
 		goto diagnose_cleanup;
 
+	strbuf_reset(&buf);
+	loose_objs_stats(&buf, ".git/objects");
+
+	if ((res = stage(tmp_dir.buf, &buf, "objects-local.txt")))
+		goto diagnose_cleanup;
+
 	if ((res = stage_directory(tmp_dir.buf, ".git", 0)) ||
 	    (res = stage_directory(tmp_dir.buf, ".git/hooks", 0)) ||
 	    (res = stage_directory(tmp_dir.buf, ".git/info", 0)) ||
diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
index b1745851e31..f2ec156d819 100755
--- a/contrib/scalar/t/t9099-scalar.sh
+++ b/contrib/scalar/t/t9099-scalar.sh
@@ -77,6 +77,8 @@ test_expect_success UNZIP 'scalar diagnose' '
 	unzip -p "$zip_path" diagnostics.log >out &&
 	test_file_not_empty out &&
 	unzip -p "$zip_path" packs-local.txt >out &&
+	test_file_not_empty out &&
+	unzip -p "$zip_path" objects-local.txt >out &&
 	test_file_not_empty out
 '