| Message ID | 2b8ee726690861405f41adede5582b96749e98c5.1587677671.git.me@ttaylorr.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | commit-graph: handle file descriptor exhaustion |
On Thu, Apr 23 2020, Taylor Blau wrote:

> When writing a layered commit-graph, the commit-graph machinery uses
> 'commit_graph_filenames_after' and 'commit_graph_hash_after' to keep
> track of the layers in the chain that we are in the process of writing.
>
> When the number of commit-graph layers shrinks, we initialize all
> entries in the aforementioned arrays, because we know the structure of
> the new commit-graph chain immediately (since there are no new layers,
> there are no unknown hash values).
>
> But when the number of commit-graph layers grows (i.e., when
> 'num_commit_graphs_after > num_commit_graphs_before'), then we leave
> some entries in the filenames and hashes arrays uninitialized, because
> we will fill them in later as those values become available.
>
> For instance, we rely on 'write_commit_graph_file' to store the
> filename and hash of the last layer in the new chain, which is the one
> that it is responsible for writing. But it's possible that
> 'write_commit_graph_file' may fail, e.g., from file descriptor
> exhaustion. In this case it is possible that 'git_mkstemp_mode' will
> fail, and that function will return early *before* setting the values
> for the last commit-graph layer's filename and hash.
>
> This causes a number of unpleasant side-effects. For instance, trying
> to 'free()' each entry in 'ctx->commit_graph_filenames_after' (and
> similarly for the hashes array) causes us to 'free()' uninitialized
> memory, since the area is allocated with 'malloc()' and is therefore
> liable to contain garbage (which is left alone when
> 'write_commit_graph_file' returns early).
>
> This can manifest in other issues, like a general protection fault,
> and/or leaving a stray 'commit-graph-chain.lock' around after the
> process dies. (The reasoning for this is still a mystery to me, since
> we'd otherwise usually expect the kernel to run tempfile.c's 'atexit()'
> handlers in the case of a normal death...)
>
> To resolve this, initialize the memory with 'CALLOC_ARRAY' so that
> uninitialized entries are filled with zeros, and can thus be 'free()'d
> as a noop instead of causing a fault.
> [...]
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index e5d8d64170..0d1db31b0a 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -381,4 +381,17 @@ test_expect_success '--split=replace replaces the chain' '
> 	graph_read_expect 2
> '
>
> +test_expect_success ULIMIT_FILE_DESCRIPTORS 'handles file descriptor exhaustion' '
> +	git init ulimit &&
> +	(
> +		cd ulimit &&
> +		for i in $(test_seq 64)
> +		do
> +			test_commit $i &&
> +			test_might_fail run_with_limited_open_files git commit-graph write \
> +				--split=no-merge --reachable || return 1
> +		done
> +	)
> +'
> +
> test_done

This test started failing for me, and now I can't reproduce it anymore,
but I reproduced enough of it to think it was odd that it hadn't failed
before.

I.e. here we do an "ulimit -n 32" and then run a command, which makes a
lot of assumptions about how git is compiled, starts up etc., a lot of
which are outside of our control and up to the OS. It's not 32 files we
open, but 32 of everything. When I could reproduce this it was failing
on opening some libpcre file or other, so maybe I linked to one too many
libraries.

The one other test that uses this pattern seems like it could be
similarly affected, but I haven't had that fail: d415ad022d8
(update-ref: test handling large transactions properly, 2015-04-14)

Since I can't reproduce this anymore, maybe I'm reporting a
nothingburger. I just wonder if this can really work reliably in the
general case, and whether a reliable version of this pattern doesn't
need to be one/some of:

 1. Some GIT_TEST_* mode that sets the (unportable) ulimit itself in the
    code, after we reach some point. This is how e.g. web-based REPLs
    often work: load all your known libraries, forbid any file openings
    (or just N number) after that.

 2. Ditto, but have the GIT_TEST_* print to stderr if we reach some
    "checkpoint", and have the test only run under limit N if we can
    reach that point (i.e. binary search or brute force to find the
    exact N limit).

 3. Maybe we can assume this would work reliably in cases of a really
    high limit of N. I.e. the d415ad022d8 test doesn't do this, but my
    understanding of it is that we're trying to guard against having all
    loose refs opened at once. So if we create e.g. 2k refs and operate
    on them we can set the limit to "1999".

    That's still assuming the same things about ulimit that make/made
    this test flaky, but we can probably safely assume that just getting
    to "git <something>" being invoked won't open >1k files, but maybe
    not >32.
On Thu, Jun 24, 2021 at 11:51:09AM +0200, Ævar Arnfjörð Bjarmason wrote:

> I.e. here we do an "ulimit -n 32" and then run a command, which makes a
> lot of assumptions about how git is compiled, starts up etc., a lot of
> which are outside of our control and up to the OS. It's not 32 files we
> open, but 32 of everything. When I could reproduce this it was failing
> on opening some libpcre file or other, so maybe I linked to one too
> many libraries.
>
> The one other test that uses this pattern seems like it could be
> similarly affected, but I haven't had that fail: d415ad022d8
> (update-ref: test handling large transactions properly, 2015-04-14)
>
> Since I can't reproduce this anymore, maybe I'm reporting a
> nothingburger. I just wonder if this can really work reliably in the
> general case, and whether a reliable version of this pattern doesn't
> need to be one/some of:
>
>  1. Some GIT_TEST_* mode that sets the (unportable) ulimit itself in
>     the code, after we reach some point. This is how e.g. web-based
>     REPLs often work: load all your known libraries, forbid any file
>     openings (or just N number) after that.
>
>  2. Ditto, but have the GIT_TEST_* print to stderr if we reach some
>     "checkpoint", and have the test only run under limit N if we can
>     reach that point (i.e. binary search or brute force to find the
>     exact N limit).
>
>  3. Maybe we can assume this would work reliably in cases of a really
>     high limit of N. I.e. the d415ad022d8 test doesn't do this, but my
>     understanding of it is that we're trying to guard against having
>     all loose refs opened at once. So if we create e.g. 2k refs and
>     operate on them we can set the limit to "1999".
>
>     That's still assuming the same things about ulimit that make/made
>     this test flaky, but we can probably safely assume that just
>     getting to "git <something>" being invoked won't open >1k files,
>     but maybe not >32.

Yes, we could probably just set the "32" a bit higher. Something like
"128" may be more conservative. I'd be hesitant to go as high as
something like 1999; system defaults are often much lower than that
anyway, and may get rejected. We may have to adjust the tests to match
the idea of what's "a lot of descriptors". (Likewise, the "ulimit -s"
stuff has to make the same guess.)

The prereq also tries running with the lower ulimit, but it only runs
"true". Perhaps running something like "git version" would give us a
more realistic idea of the baseline for running a git command (in which
case the test would be auto-skipped if somehow 32 descriptors isn't
enough on some system). That's not foolproof, though (it might take 31
to run "git version", but 33 to run "update-ref" or something).

I'm willing to let it lie unless you have a current problem. This is all
known to be guesswork/heuristics, and the hope was that it simply
wouldn't come up. If it's causing even occasional pain we might need to
deal with it. But if it's an ephemeral thing that maybe went away, I'm
content to stick my fingers back in my ears. :)

-Peff
diff --git a/commit-graph.c b/commit-graph.c
index afde075962..b2d2fdfe3d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1602,8 +1602,8 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		free(old_graph_name);
 	}

-	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
-	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+	CALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	CALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);

 	for (i = 0; i < ctx->num_commit_graphs_after &&
 		    i < ctx->num_commit_graphs_before; i++)
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index e5d8d64170..0d1db31b0a 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -381,4 +381,17 @@ test_expect_success '--split=replace replaces the chain' '
 	graph_read_expect 2
 '

+test_expect_success ULIMIT_FILE_DESCRIPTORS 'handles file descriptor exhaustion' '
+	git init ulimit &&
+	(
+		cd ulimit &&
+		for i in $(test_seq 64)
+		do
+			test_commit $i &&
+			test_might_fail run_with_limited_open_files git commit-graph write \
+				--split=no-merge --reachable || return 1
+		done
+	)
+'
+
 test_done
When writing a layered commit-graph, the commit-graph machinery uses
'commit_graph_filenames_after' and 'commit_graph_hash_after' to keep
track of the layers in the chain that we are in the process of writing.

When the number of commit-graph layers shrinks, we initialize all
entries in the aforementioned arrays, because we know the structure of
the new commit-graph chain immediately (since there are no new layers,
there are no unknown hash values).

But when the number of commit-graph layers grows (i.e., when
'num_commit_graphs_after > num_commit_graphs_before'), then we leave
some entries in the filenames and hashes arrays uninitialized, because
we will fill them in later as those values become available.

For instance, we rely on 'write_commit_graph_file' to store the
filename and hash of the last layer in the new chain, which is the one
that it is responsible for writing. But it's possible that
'write_commit_graph_file' may fail, e.g., from file descriptor
exhaustion. In this case it is possible that 'git_mkstemp_mode' will
fail, and that function will return early *before* setting the values
for the last commit-graph layer's filename and hash.

This causes a number of unpleasant side-effects. For instance, trying to
'free()' each entry in 'ctx->commit_graph_filenames_after' (and
similarly for the hashes array) causes us to 'free()' uninitialized
memory, since the area is allocated with 'malloc()' and is therefore
liable to contain garbage (which is left alone when
'write_commit_graph_file' returns early).

This can manifest in other issues, like a general protection fault,
and/or leaving a stray 'commit-graph-chain.lock' around after the
process dies. (The reasoning for this is still a mystery to me, since
we'd otherwise usually expect the kernel to run tempfile.c's 'atexit()'
handlers in the case of a normal death...)
To resolve this, initialize the memory with 'CALLOC_ARRAY' so that
uninitialized entries are filled with zeros, and can thus be 'free()'d
as a noop instead of causing a fault.

Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c                |  4 ++--
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)