Message ID | pull.1764.git.1721332546.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Fix background maintenance regression in Git 2.45.0 | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > Here is an issue I noticed while exploring issues with my local copy of a > large monorepo. I was intending to show some engineers how nice the objects > were maintained by background maintenance, but saw hundreds of small > pack-files that were up to two months old. This time matched when I upgraded > to the microsoft/git fork that included the 2.45.0 release of Git. I almost said "wow, perfect timing on the -rc1 day", but then realized that this is not a regression during _this_ cycle, but a cycle ago. You already Cc'ed Taylor, whom I would have asked for help if I were the one who found this issue, which is good. Will queue. Thanks.
On Thu, Jul 18, 2024 at 02:57:55PM -0700, Junio C Hamano wrote: > You already Cc'ed Taylor, whom I would have asked for help if I were > the one who found this issue, which is good. It looks like there is a small typo in my email address as "me@ttayllorr.com" instead of "me@ttaylorr.com", but I saw the message anyway ;-). Thanks, Taylor
On Thu, Jul 18, 2024 at 07:55:44PM +0000, Derrick Stolee via GitGitGadget wrote: > The issue is that 'git multi-pack-index repack' was taught to call 'git > pack-objects' with the new '--stdin-packs' option. However, this changes the > object selection algorithm. Instead of using the objects referenced by the > multi-pack-index, it compares pack-files using a list of "included" and > "excluded" pack-files. This loses some granularity of how the > multi-pack-index chooses among duplicate objects. Thank you for looking at this so carefully! Let me double check my own understanding. Suppose we have a MIDX with some pack 'P' and say, some commit object 'C' which appears in that pack. Let's also suppose we have another pack 'Q' in the same MIDX which also contains 'C', but the MIDX selected its copy from pack 'P'. If we want to combine 'P' with some other packs (excluding 'Q'), then the input to --stdin-packs will look something like: P.pack ^Q.pack ... And the resulting pack would not contain 'C', since we would reject it via: add_object_entry_from_pack() -> want_object_in_pack() -> want_found_object() -> has_object_kept_pack(). The final function there would find a copy of 'C' in 'Q', and since 'Q' is excluded, we would reject 'C' as unwanted. So the resulting pack would not contain 'C', and the MIDX would hold onto its copy from pack 'P', resulting in 'P' being both (a) in the set of packs to repack together, but also (b) non-expireable, since it has at least one object selected from it in the MIDX. > The end result is that some objects that would normally have been included > in the new pack-file are no longer included. The copy that the > multi-pack-index references is in the pack-file that was intended to be > repacked, so that pack-file cannot be expired in the next 'git > multi-pack-index expire' step and is included again in the batch of objects > to repack. I think this matches my own understanding, but let me know if I'm missing something. Assuming I'm thinking about this the same way you are, the fix (stop using --stdin-packss) makes sense to me. Thanks, Taylor
On 7/18/24 6:50 PM, Taylor Blau wrote: > On Thu, Jul 18, 2024 at 07:55:44PM +0000, Derrick Stolee via GitGitGadget wrote: >> The issue is that 'git multi-pack-index repack' was taught to call 'git >> pack-objects' with the new '--stdin-packs' option. However, this changes the >> object selection algorithm. Instead of using the objects referenced by the >> multi-pack-index, it compares pack-files using a list of "included" and >> "excluded" pack-files. This loses some granularity of how the >> multi-pack-index chooses among duplicate objects. > > Thank you for looking at this so carefully! Let me double check my own > understanding. > > Suppose we have a MIDX with some pack 'P' and say, some commit object > 'C' which appears in that pack. Let's also suppose we have another pack > 'Q' in the same MIDX which also contains 'C', but the MIDX selected its > copy from pack 'P'. > > If we want to combine 'P' with some other packs (excluding 'Q'), then > the input to --stdin-packs will look something like: > > P.pack > ^Q.pack > ... > > And the resulting pack would not contain 'C', since we would reject it > via: add_object_entry_from_pack() -> want_object_in_pack() -> > want_found_object() -> has_object_kept_pack(). The final function there > would find a copy of 'C' in 'Q', and since 'Q' is excluded, we would > reject 'C' as unwanted. > > So the resulting pack would not contain 'C', and the MIDX would hold > onto its copy from pack 'P', resulting in 'P' being both (a) in the set > of packs to repack together, but also (b) non-expireable, since it has > at least one object selected from it in the MIDX. > >> The end result is that some objects that would normally have been included >> in the new pack-file are no longer included. The copy that the >> multi-pack-index references is in the pack-file that was intended to be >> repacked, so that pack-file cannot be expired in the next 'git >> multi-pack-index expire' step and is included again in the batch of objects >> to repack. > > I think this matches my own understanding, but let me know if I'm > missing something. Assuming I'm thinking about this the same way you > are, the fix (stop using --stdin-packss) makes sense to me. Your interpretation matches mine. Thanks for the careful read. I think we can accomplish similar goals that match the reasoning for --stdin-packs (better deltas while also limiting the object walk to the repacked objects) with some changes to read_object_list_from_stdin(), but that's a more subtle kind of change. Taking this change as-is will cause a regression in the quality of delta choices, but recovers our ability to expire "repacked" pack-files in all cases. Thanks, -Stolee
On 7/18/24 6:38 PM, Taylor Blau wrote: > On Thu, Jul 18, 2024 at 02:57:55PM -0700, Junio C Hamano wrote: >> You already Cc'ed Taylor, whom I would have asked for help if I were >> the one who found this issue, which is good. > > It looks like there is a small typo in my email address as > "me@ttayllorr.com" instead of "me@ttaylorr.com", but I saw the message > anyway ;-). Whoops! Sorry about that. Repeated too many letters. I've fixed the GitGitGadget cover letter in case a v2 is required. Thanks, -Stolee
On 7/18/24 5:57 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> Here is an issue I noticed while exploring issues with my local copy of a >> large monorepo. I was intending to show some engineers how nice the objects >> were maintained by background maintenance, but saw hundreds of small >> pack-files that were up to two months old. This time matched when I upgraded >> to the microsoft/git fork that included the 2.45.0 release of Git. > > I almost said "wow, perfect timing on the -rc1 day", but then > realized that this is not a regression during _this_ cycle, but a > cycle ago. I almost waited until after the release, but I wanted to put the information out there just in case you were interested in taking it into 2.46.0 or were planning on a 2.45.3. Thanks, -Stolee
On Fri, Jul 19, 2024 at 09:21:51AM -0400, Derrick Stolee wrote: > On 7/18/24 6:50 PM, Taylor Blau wrote: > > > I think this matches my own understanding, but let me know if I'm > > missing something. Assuming I'm thinking about this the same way you > > are, the fix (stop using --stdin-packss) makes sense to me. > > Your interpretation matches mine. Thanks for the careful read. > > I think we can accomplish similar goals that match the reasoning for > --stdin-packs (better deltas while also limiting the object walk to the > repacked objects) with some changes to read_object_list_from_stdin(), > but that's a more subtle kind of change. FWIW, the main motivation for that change was to limit the amount of cross-process I/O that was necessary to generate the new pack. I figured that for relatively small amounts of packs which contain relatively large amounts of objects that it would be more efficient to write out the pack names than the object names. I was thinking a little bit about how we would alter the behavior of '--stdin-packs' to match what the 'multi-pack-index repack' caller needs. I agree that it is possible, and I doubly agree that it is subtle ;-). TBH, I think that the amount of I/O we're potentially saving is dwarfed by the amount of I/O and CPU time it takes to actually generate the new pack, so I doubt the effort to make such a subtle change would be all that worthwhile, though certainly an interesting exercise ;-). > Taking this change as-is will cause a regression in the quality of > delta choices, but recovers our ability to expire "repacked" pack-files > in all cases. Yeah, definitely. Thanks, Taylor
Derrick Stolee <stolee@gmail.com> writes: > On 7/18/24 5:57 PM, Junio C Hamano wrote: >> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: >> >>> Here is an issue I noticed while exploring issues with my local copy of a >>> large monorepo. I was intending to show some engineers how nice the objects >>> were maintained by background maintenance, but saw hundreds of small >>> pack-files that were up to two months old. This time matched when I upgraded >>> to the microsoft/git fork that included the 2.45.0 release of Git. >> I almost said "wow, perfect timing on the -rc1 day", but then >> realized that this is not a regression during _this_ cycle, but a >> cycle ago. > > I almost waited until after the release, but I wanted to put the > information out there just in case you were interested in taking it > into 2.46.0 or were planning on a 2.45.3. Yup, thanks but this is not exactly a repository breaking data corruption bug, and did not look ultra urgent. Especially if we want to pursue a solution that helps both expiring stale packs better (which is what you are restoring) and making better delta chain selection (which may be what you are losing) at the same time, such a change could become a source of data corruption bug, so I'd prefer to see it started early in a cycle, rather as a last-minute "let's fix this too". Thanks.
On 7/19/24 11:13 AM, Junio C Hamano wrote: > Derrick Stolee <stolee@gmail.com> writes: > >> On 7/18/24 5:57 PM, Junio C Hamano wrote: >>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: >>> >>>> Here is an issue I noticed while exploring issues with my local copy of a >>>> large monorepo. I was intending to show some engineers how nice the objects >>>> were maintained by background maintenance, but saw hundreds of small >>>> pack-files that were up to two months old. This time matched when I upgraded >>>> to the microsoft/git fork that included the 2.45.0 release of Git. >>> I almost said "wow, perfect timing on the -rc1 day", but then >>> realized that this is not a regression during _this_ cycle, but a >>> cycle ago. >> >> I almost waited until after the release, but I wanted to put the >> information out there just in case you were interested in taking it >> into 2.46.0 or were planning on a 2.45.3. > > Yup, thanks but this is not exactly a repository breaking data > corruption bug, and did not look ultra urgent. Especially if we > want to pursue a solution that helps both expiring stale packs > better (which is what you are restoring) and making better delta > chain selection (which may be what you are losing) at the same time, > such a change could become a source of data corruption bug, so I'd > prefer to see it started early in a cycle, rather as a last-minute > "let's fix this too". I agree with this assessment. The microsoft/git fork has taken the revert as users of that fork have a higher chance of being affected. Thanks, -Stolee