Message ID | 20190110042551.915769-1-sandals@crustytoothpaste.net (mailing list archive) |
---|---|
Headers | show |
Series | tree-walk object_id refactor | expand |
On Thu, Jan 10, 2019 at 04:25:46AM +0000, brian m. carlson wrote: > There are a small number of places in our codebase where we cast a > buffer of unsigned char to a struct object_id pointer. When we have > GIT_MAX_RAWSZ set to 32 (because we have SHA-256), one of these places > (the buffer for tree objects) can lead to us copying too much data when > using SHA-1 as the hash, since there are only 20 bytes to read. > > This was not expected to be a problem before future code was introduced, > but due to a combination of series the issue became noticeable. > > This series introduces a refactor to avoid referencing the struct > object_id directly from a buffer and instead storing an additional > struct object_id (and an int) in struct name_entry and referring to > that. I think this is really the only safe and sane solution. We resisted it because of the cost of the extra copies (especially the update_tree_entry() one). But I don't know that anybody actually measured it. Do you have any performance numbers before/after this series? -Peff
On Thu, Jan 10, 2019 at 01:40:31AM -0500, Jeff King wrote: > On Thu, Jan 10, 2019 at 04:25:46AM +0000, brian m. carlson wrote: > > > There are a small number of places in our codebase where we cast a > > buffer of unsigned char to a struct object_id pointer. When we have > > GIT_MAX_RAWSZ set to 32 (because we have SHA-256), one of these places > > (the buffer for tree objects) can lead to us copying too much data when > > using SHA-1 as the hash, since there are only 20 bytes to read. > > > > This was not expected to be a problem before future code was introduced, > > but due to a combination of series the issue became noticeable. > > > > This series introduces a refactor to avoid referencing the struct > > object_id directly from a buffer and instead storing an additional > > struct object_id (and an int) in struct name_entry and referring to > > that. > > I think this is really the only safe and sane solution. We resisted it > because of the cost of the extra copies (especially the > update_tree_entry() one). But I don't know that anybody actually > measured it. Do you have any performance numbers before/after this > series? Unfortunately, I don't. I'm not really sure in what situations we hit this code path a lot, so I'm not sure what exactly we should performance test. If you have suggestions, I can set up some perf tests. I will say that I resisted writing this series for a long time, since I knew it was going to be a difficult slog to get everything working (and it was). If there had been a way to avoid it, I would have done it. However, writing this series led to about ten tests in my SHA-256 work suddenly passing where they hadn't before.
On Fri, Jan 11, 2019 at 12:17:50AM +0000, brian m. carlson wrote: > > I think this is really the only safe and sane solution. We resisted it > > because of the cost of the extra copies (especially the > > update_tree_entry() one). But I don't know that anybody actually > > measured it. Do you have any performance numbers before/after this > > series? > > Unfortunately, I don't. I'm not really sure in what situations we hit > this code path a lot, so I'm not sure what exactly we should performance > test. If you have suggestions, I can set up some perf tests. I think the interesting one here is the copy in update_tree_entry(), which is called for every entry of every tree we parse for a diff. So the command with the most noticeable impact, I think, would be something like: git log -- pathspec I couldn't demonstrate any measurable slowdown (I didn't compute the mean to see if it gets worse, but certainly any change is within the run-to-run noise). I guess that is competing with the cost to access the commit objects. Perhaps a pure tree diff would be a more precise test. Here's the most pathological case I could come up with: git init for i in $(seq 10000); do echo $i >$i; done git add . git commit -m base echo change >9999 git commit -am change time git diff-tree HEAD^ HEAD which should really be spending 99% of its time inflating and walking through the trees. And even there I couldn't measure anything. So I'd say it's probably fine. -Peff