mbox series

[0/2] commit-graph/server-info: use tempfile.h in more places

Message ID cover.1717712358.git.me@ttaylorr.com (mailing list archive)
Headers show
Series commit-graph/server-info: use tempfile.h in more places | expand

Message

Taylor Blau June 6, 2024, 10:19 p.m. UTC
This pair of patches addresses two issues (in the commit-graph and
update-server-info areas, respectively), where temporary files are
created outside of the tempfile.h API and thus survive abnormal process
death.

The commit-graph one is more prevalent, and has been the cause of some
minor headaches (for e.g. automated tooling to detect repository cruft
at GitHub complaining about unknown tmp_graph_XXXXXX files left around).

The fixes in both instances are relatively straightforward conversions
to use the tempfile.h API.

Looking at the remaining uses of mkstemp(), the remaining class of
callers that don't use the tempfile.h API are for creating temporary
.idx, .rev files, and similar. My personal feeling is that we should
apply similar treatment there, since these files are generated based on
.pack data, and thus keeping around temporary copies is unnecessary when
they can be regenerated.

But I'd rather keep that part out of this series and clean up a couple
of more straightforward spots where possible.

Thanks in advance for your review!

Taylor Blau (2):
  commit-graph.c: remove temporary graph layers on exit
  server-info.c: remove temporary info files on exit

 commit-graph.c                | 19 ++++++++-----------
 server-info.c                 | 26 ++++++++++----------------
 t/t5324-split-commit-graph.sh | 26 +++++++++++++++++++++++++-
 3 files changed, 43 insertions(+), 28 deletions(-)


base-commit: 7b0defb3915eaa0bd118f0996e8c00b4eb2dc1ca

Comments

Junio C Hamano June 7, 2024, 3:40 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> This pair of patches addresses two issues (in the commit-graph and
> update-server-info areas, respectively), where temporary files are
> created outside of the tempfile.h API and thus survive abnormal process
> death.
>
> The commit-graph one is more prevalent, and has been the cause of some
> minor headaches (for e.g. automated tooling to detect repository cruft
> at GitHub complaining about unknown tmp_graph_XXXXXX files left around).

;-)  

I'd be very surprised if a stale "update-server-info" product causes
any harm, but we unfortunately cannot remove it until we fully
remove the suport for HTTP walkers.

> The fixes in both instances are relatively straightforward conversions
> to use the tempfile.h API.

> Looking at the remaining uses of mkstemp(), the remaining class of
> callers that don't use the tempfile.h API are for creating temporary
> .idx, .rev files, and similar. My personal feeling is that we should
> apply similar treatment there, since these files are generated based on
> .pack data, and thus keeping around temporary copies is unnecessary when
> they can be regenerated.

Absolutely.

I wouldn't be surprised by .idx as it and .pack are so old, dating
back to 2005, but anything newer like .rev I am mildly surprised
that we haven't done this conversion.
Jeff King June 8, 2024, 10:48 a.m. UTC | #2
On Thu, Jun 06, 2024 at 06:19:21PM -0400, Taylor Blau wrote:

> Looking at the remaining uses of mkstemp(), the remaining class of
> callers that don't use the tempfile.h API are for creating temporary
> .idx, .rev files, and similar. My personal feeling is that we should
> apply similar treatment there, since these files are generated based on
> .pack data, and thus keeping around temporary copies is unnecessary when
> they can be regenerated.

And actual loose object and pack files themselves, I think. I think it
was a deliberate choice long ago to keep those files around, since in
the worst case they could be used to recover actual repo content (e.g.,
a failed fetch will leave its tmp_pack_* file around, and you could
probably extract _some_ objects from it).

And the philosophy is that we'd leave them sitting around until gc ran
and found tmp_* in objects/, check their mtimes, and remove them then.

In practice, I don't think I have really seen anybody recover anything
from a temporary file. You're better off looking at whatever was feeding
the temporary file (if it was "git add", then look at the filesystem,
and if it was index-pack, look at the fetch/push source, etc).

And leaving them around is a minor nuisance, since degenerate cases
(like a script retrying a failed operation over and over) can let them
pile up.

And indeed, for push we don't keep the contents around anymore. Even
though we don't clean up the .pack files, we stick them in a quarantine
directory which does get cleaned up. This solved what used to be an
operational headache for GitHub, and I don't think anybody has ever
complained about not having their partial failed packfile available.

So I'd argue that we should just treat object/pack tempfiles like the
rest, and delete them if they don't make it all the way to the rename
step. If we really want to support debugging, we could perhaps provide
a run-time knob to leave them in place (and maybe even have it apply to
_all_ tempfiles).

But that is all way beyond your series, and I don't think there is any
urgent need to tackle it. IIRC, GitHub's fork does sprinkle some
register_tempfile() calls around the odb code, but I don't have access
to that code anymore.

-Peff
Elijah Newren June 14, 2024, 5:41 p.m. UTC | #3
On Sat, Jun 8, 2024 at 3:48 AM Jeff King <peff@peff.net> wrote:
>
> On Thu, Jun 06, 2024 at 06:19:21PM -0400, Taylor Blau wrote:
>
> > Looking at the remaining uses of mkstemp(), the remaining class of
> > callers that don't use the tempfile.h API are for creating temporary
> > .idx, .rev files, and similar. My personal feeling is that we should
> > apply similar treatment there, since these files are generated based on
> > .pack data, and thus keeping around temporary copies is unnecessary when
> > they can be regenerated.
>
> And actual loose object and pack files themselves, I think.
[...]
> So I'd argue that we should just treat object/pack tempfiles like the
> rest, and delete them if they don't make it all the way to the rename
> step. If we really want to support debugging, we could perhaps provide
> a run-time knob to leave them in place (and maybe even have it apply to
> _all_ tempfiles).
>
> But that is all way beyond your series, and I don't think there is any
> urgent need to tackle it.

Regardless, it provides more context around the exact questions I had
while reading the series.  Everything in the series looked fine to me,
but I wondered about packs and loose objects and why those are
different.  Anyway, I like your suggestions as a long term goal.
(Perhaps handling packs and loose objects with tempfiles could serve
as good microprojects?)
Taylor Blau June 14, 2024, 7:35 p.m. UTC | #4
On Sat, Jun 08, 2024 at 06:48:21AM -0400, Jeff King wrote:
> On Thu, Jun 06, 2024 at 06:19:21PM -0400, Taylor Blau wrote:
>
> > Looking at the remaining uses of mkstemp(), the remaining class of
> > callers that don't use the tempfile.h API are for creating temporary
> > .idx, .rev files, and similar. My personal feeling is that we should
> > apply similar treatment there, since these files are generated based on
> > .pack data, and thus keeping around temporary copies is unnecessary when
> > they can be regenerated.
>
> And actual loose object and pack files themselves, I think. I think it
> was a deliberate choice long ago to keep those files around, since in
> the worst case they could be used to recover actual repo content (e.g.,
> a failed fetch will leave its tmp_pack_* file around, and you could
> probably extract _some_ objects from it).

Heh, yes. Those were intentionally omitted from this list ;-).

I agree that having the content stick around in failed packfile and
loose object writes is useful as a last-resort recovery mechanism. I
suspect that it is often difficult or impossible to recover the contents
of an object/pack from a broken write, but I think it's better than the
alternative of just throwing it away up front.

> And the philosophy is that we'd leave them sitting around until gc ran
> and found tmp_* in objects/, check their mtimes, and remove them then.
>
> In practice, I don't think I have really seen anybody recover anything
> from a temporary file. You're better off looking at whatever was feeding
> the temporary file (if it was "git add", then look at the filesystem,
> and if it was index-pack, look at the fetch/push source, etc).

Yup.

> So I'd argue that we should just treat object/pack tempfiles like the
> rest, and delete them if they don't make it all the way to the rename
> step. If we really want to support debugging, we could perhaps provide
> a run-time knob to leave them in place (and maybe even have it apply to
> _all_ tempfiles).

I dunno... I don't disagree with what you're saying, but it does seem a
little scary to delete files containing data that we might have been
able to recover.

I think the current series ends at a reasonable stopping point... I'd
honestly have to think more about whether I agree with what you're
saying here or not ;-).

Thanks,
Taylor