Message ID | 20190825041348.31835-1-mh@glandium.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fast-import: Reinitialize command_buf rather than detach it. | expand |
On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote: > command_buf.buf is also stored in cmd_hist, so instead of > strbuf_release, the current code uses strbuf_detach in order to > "leak" the buffer as far as the strbuf is concerned. > > However, strbuf_detach does more than "leak" the strbuf buffer: it > possibly reallocates it to ensure a terminating nul character. And when > that happens, what is already stored in cmd_hist may now be a free()d > buffer. > > In practice, though, command_buf.buf is most of the time a nul > terminated string, meaning command_buf.len < command_buf.alloc, and > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call), > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does > allocate a 1 byte buffer to store a nul character in it, which is then > leaked. I think this is stronger than just "most of the time". It's an invariant for strbufs to have a NUL, so the only case where detaching isn't a noop is the empty slopbuf case you mention. Splitting hairs, perhaps, but I think with that explanation, we could probably argue that this case will never come up: strbuf_getline will either have allocated a buffer or will have returned EOF. That said, I do think it's quite confusing and is worth fixing, both for readability and for future-proofing. But... > diff --git a/fast-import.c b/fast-import.c > index b44d6a467e..b1d07efe8c 100644 > --- a/fast-import.c > +++ b/fast-import.c > @@ -1763,7 +1763,9 @@ static int read_next_command(void) > } else { > struct recent_command *rc; > > - strbuf_detach(&command_buf, NULL); > + // command_buf is enther empty or also stored in cmd_hist, > + // reinitialize it. > + strbuf_init(&command_buf, 0); This whole thing is a sign that the code is Doing It Wrong. Whoever is taking ownership of the buffer should be calling strbuf_detach() at that point. It's a bit tricky, though, because we take ownership and then expect people still look at command_buf.buf. Which makes me concerned that there are other operations which might modify the buffer and have the same issue. I think it would be much easier to follow if we simply used the same command_buf over and over, and just copied into the history. The cost is about the same (we still alloc once per line, though we do an extra memcpy now). So I thought something like this would work (we don't need to convert those detaches into a strbuf_reset() calls because strbuf_getline does so automatically): diff --git a/fast-import.c b/fast-import.c index b44d6a467e..31207acd61 100644 --- a/fast-import.c +++ b/fast-import.c @@ -1763,7 +1763,6 @@ static int read_next_command(void) } else { struct recent_command *rc; - strbuf_detach(&command_buf, NULL); stdin_eof = strbuf_getline_lf(&command_buf, stdin); if (stdin_eof) return EOF; @@ -1784,7 +1783,7 @@ static int read_next_command(void) free(rc->buf); } - rc->buf = command_buf.buf; + rc->buf = xstrdup(command_buf.buf); rc->prev = cmd_tail; rc->next = cmd_hist.prev; rc->prev->next = rc; @@ -1833,7 +1832,6 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res) char *term = xstrdup(data); size_t term_len = command_buf.len - (data - command_buf.buf); - strbuf_detach(&command_buf, NULL); for (;;) { if (strbuf_getline_lf(&command_buf, stdin) == EOF) die("EOF in data (terminator '%s' not found)", term); But it doesn't! It turns out that there are other places in the code which assume they can call read_next_command() while hanging onto the existing buffer. Which only works in the old code because the history feature happens to hold on to the old pointer. If I do this on top, then all tests pass: diff --git a/fast-import.c b/fast-import.c index 31207acd61..1f9160b645 100644 --- a/fast-import.c +++ b/fast-import.c @@ -2586,7 +2586,7 @@ static void parse_new_commit(const char *arg) struct branch *b; char *author = NULL; char *committer = NULL; - const char *encoding = NULL; + char *encoding = NULL; struct hash_list *merge_list = NULL; unsigned int merge_count; unsigned char prev_fanout, new_fanout; @@ -2609,8 +2609,10 @@ static void parse_new_commit(const char *arg) } if (!committer) die("Expected committer but didn't get one"); - if (skip_prefix(command_buf.buf, "encoding ", &encoding)) + if (skip_prefix(command_buf.buf, "encoding ", &v)) { + encoding = xstrdup(v); read_next_command(); + } parse_data(&msg, 0, NULL); read_next_command(); parse_from(b); @@ -2684,6 +2686,7 @@ static void parse_new_commit(const char *arg) strbuf_addbuf(&new_data, &msg); free(author); free(committer); + free(encoding); if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark)) b->pack_id = pack_id; And I think this is actually a real bug in the current code! We keep a pointer to the encoding string, which survives because of the history. But that history is bounded, and we could have an indefinite number of changed files in the middle. If I modify t9300 like this: diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 141b7fa35e..d4bbe630d5 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -3314,6 +3314,11 @@ test_expect_success 'X: handling encoding' ' printf "Pi: \360\nCOMMIT\n" >>input && + for i in $(test_seq 200) + do + echo "M 644 $EMPTY_BLOB file-$i" + done >>input && + git fast-import <input && git cat-file -p encoding | grep $(printf "\360") && git log -1 --format=%B encoding | grep $(printf "\317\200") and run the test suite (using an unmodified git, without the earlier patches I showed) then the test fails, putting garbage into the encoding header (and when compiled with ASan, reports a use-after-free). So I think the way the string handling is currently done papers over such problems. You only see the problem if you have a hundred or more modified files, so it works _most_ of the time. That implies to me it's worth following a fix like the one I showed above. I am slightly concerned there are other similar cases to the "encoding" one lurking (and they _might_ not be bugs already, if there's a fixed number of reads between the pointer being saved and being used), but it seems worth the risk. -Peff
On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote: > On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote: > > > command_buf.buf is also stored in cmd_hist, so instead of > > strbuf_release, the current code uses strbuf_detach in order to > > "leak" the buffer as far as the strbuf is concerned. > > > > However, strbuf_detach does more than "leak" the strbuf buffer: it > > possibly reallocates it to ensure a terminating nul character. And when > > that happens, what is already stored in cmd_hist may now be a free()d > > buffer. > > > > In practice, though, command_buf.buf is most of the time a nul > > terminated string, meaning command_buf.len < command_buf.alloc, and > > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call), > > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does > > allocate a 1 byte buffer to store a nul character in it, which is then > > leaked. > > I think this is stronger than just "most of the time". It's an invariant > for strbufs to have a NUL, so the only case where detaching isn't a noop > is the empty slopbuf case you mention. If it's an invariant, why does detach actively tries to realloc and set the nul terminator as if it can happen in more cases than when using the slopbuf? > Splitting hairs, perhaps, but I think with that explanation, we could > probably argue that this case will never come up: strbuf_getline will > either have allocated a buffer or will have returned EOF. Note that the slopbuf case _does_ come up, and we always leak a 1 byte buffer. > That said, I do think it's quite confusing and is worth fixing, both for > readability and for future-proofing. But... (...) I do agree the way fast-import works between cmd_hist and command_buf is very brittle, as you've shown. I didn't feel like digging into it though. Thanks for having gone further than I did. Mike
On Sun, Aug 25, 2019 at 04:20:31PM +0900, Mike Hommey wrote: > > I think this is stronger than just "most of the time". It's an invariant > > for strbufs to have a NUL, so the only case where detaching isn't a noop > > is the empty slopbuf case you mention. > > If it's an invariant, why does detach actively tries to realloc and set > the nul terminator as if it can happen in more cases than when using the > slopbuf? It calls strbuf_grow() to handle the slopbuf case (we can't hand off the slopbuf, since the caller expects an allocated buffer). It just doesn't bother to distinguish that case itself, and lets strbuf_grow() handle it. I think it would be equally correct for strbuf_detach() to do: if (!sb->alloc) strbuf_grow(0); > > Splitting hairs, perhaps, but I think with that explanation, we could > > probably argue that this case will never come up: strbuf_getline will > > either have allocated a buffer or will have returned EOF. > > Note that the slopbuf case _does_ come up, and we always leak a 1 byte > buffer. Hmm, I suppose so, on the very first call before we've read anything (and likewise if parse_data() reset it then got an EOF, and we then tried to read another command). > I do agree the way fast-import works between cmd_hist and command_buf is > very brittle, as you've shown. I didn't feel like digging into it > though. Thanks for having gone further than I did. I'll see if I can shape my rambling into a patch. -Peff
Am 25.08.19 um 06:13 schrieb Mike Hommey: > command_buf.buf is also stored in cmd_hist, so instead of > strbuf_release, the current code uses strbuf_detach in order to > "leak" the buffer as far as the strbuf is concerned. Hmm. > However, strbuf_detach does more than "leak" the strbuf buffer: it > possibly reallocates it to ensure a terminating nul character. And when > that happens, what is already stored in cmd_hist may now be a free()d > buffer. > > In practice, though, command_buf.buf is most of the time a nul > terminated string, meaning command_buf.len < command_buf.alloc, and > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call), > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does > allocate a 1 byte buffer to store a nul character in it, which is then > leaked. And that's why a pointer to a strbuf buf is valid until the next strbuf_ function call. > > Since the code using strbuf_detach is assuming it does nothing to > command_buf.buf, it's overall safer to use strbuf_init, which has > the same practical effect in the usual case, and works appropriately > when command_buf is empty. > --- > fast-import.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/fast-import.c b/fast-import.c > index b44d6a467e..b1d07efe8c 100644 > --- a/fast-import.c > +++ b/fast-import.c > @@ -1763,7 +1763,9 @@ static int read_next_command(void) > } else { > struct recent_command *rc; > > - strbuf_detach(&command_buf, NULL); > + // command_buf is enther empty or also stored in cmd_hist, > + // reinitialize it. > + strbuf_init(&command_buf, 0); This is a no-op; strbuf_detach already re-initializes the strbuf. (And double-slash comments are avoided in Git code..) ((s/enther/either/)) > stdin_eof = strbuf_getline_lf(&command_buf, stdin); > if (stdin_eof) > return EOF; > @@ -1833,7 +1835,9 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res) > char *term = xstrdup(data); > size_t term_len = command_buf.len - (data - command_buf.buf); > > - strbuf_detach(&command_buf, NULL); > + // command_buf is enther empty or also stored in cmd_hist, > + // reinitialize it. > + strbuf_init(&command_buf, 0); Same here. > for (;;) { > if (strbuf_getline_lf(&command_buf, stdin) == EOF) > die("EOF in data (terminator '%s' not found)", term); > strbuf_detach() is handing over ownership of the buffer. Its return value should be stored; grabbing the pointer beforehand is naughty because it can get stale, as you noted. I doubt there is a good reason for ignoring the return value of strbuf_detach(), ever. René
diff --git a/fast-import.c b/fast-import.c index b44d6a467e..b1d07efe8c 100644 --- a/fast-import.c +++ b/fast-import.c @@ -1763,7 +1763,9 @@ static int read_next_command(void) } else { struct recent_command *rc; - strbuf_detach(&command_buf, NULL); + // command_buf is enther empty or also stored in cmd_hist, + // reinitialize it. + strbuf_init(&command_buf, 0); stdin_eof = strbuf_getline_lf(&command_buf, stdin); if (stdin_eof) return EOF; @@ -1833,7 +1835,9 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res) char *term = xstrdup(data); size_t term_len = command_buf.len - (data - command_buf.buf); - strbuf_detach(&command_buf, NULL); + // command_buf is enther empty or also stored in cmd_hist, + // reinitialize it. + strbuf_init(&command_buf, 0); for (;;) { if (strbuf_getline_lf(&command_buf, stdin) == EOF) die("EOF in data (terminator '%s' not found)", term);