[02/17] pack-mtimes: support reading .mtimes files

Message ID	7d4ae7bd3e28e2ec904abb37b6f26505e37531c5.1638224692.git.me@ttaylorr.com (mailing list archive)
State	New, archived
Headers
Return-Path: <git-owner@kernel.org>
Date: Mon, 29 Nov 2021 17:25:08 -0500
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, larsxschneider@gmail.com, peff@peff.net, tytso@mit.edu
Subject: [PATCH 02/17] pack-mtimes: support reading .mtimes files
Message-ID: <7d4ae7bd3e28e2ec904abb37b6f26505e37531c5.1638224692.git.me@ttaylorr.com>
References: <cover.1638224692.git.me@ttaylorr.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <cover.1638224692.git.me@ttaylorr.com>
Precedence: bulk

Series	cruft packs
[00/17] cruft packs
[01/17] Documentation/technical: add cruft-packs.txt
[02/17] pack-mtimes: support reading .mtimes files
[03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
[04/17] chunk-format.h: extract oid_version()
[05/17] pack-mtimes: support writing pack .mtimes files
[06/17] t/helper: add 'pack-mtimes' test-tool
[07/17] builtin/pack-objects.c: return from create_object_entry()
[08/17] builtin/pack-objects.c: --cruft without expiration
[09/17] reachable: add options to add_unseen_recent_objects_to_traversal
[10/17] reachable: report precise timestamps from objects in cruft packs
[11/17] builtin/pack-objects.c: --cruft with expiration
[12/17] builtin/repack.c: support generating a cruft pack
[13/17] builtin/repack.c: allow configuring cruft pack generation
[14/17] builtin/repack.c: use named flags for existing_packs
[15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
[16/17] builtin/gc.c: conditionally avoid pruning objects via loose
[17/17] sha1-file.c: don't freshen cruft packs

Commit Message

Taylor Blau Nov. 29, 2021, 10:25 p.m. UTC

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  22 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 139 ++++++++++++++++++++++++
 pack-mtimes.h                           |  16 +++
 packfile.c                              |  18 ++-
 packfile.h                              |   1 +
 8 files changed, 200 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

Comments

Derrick Stolee Dec. 2, 2021, 3:06 p.m. UTC | #1

On 11/29/2021 5:25 PM, Taylor Blau wrote:

> +== pack-*.mtimes files have the format:
> +
> +  - A 4-byte magic number '0x4d544d45' ('MTME').
> +
> +  - A 4-byte version identifier (= 1).
> +
> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).

I vaguely remember complaints about using a 1-byte identifier in
the commit-graph and multi-pack-index formats because the "standard"
way to refer to these hash functions was a magic number that had a
meaning in ASCII that helped human readers a bit. I cannot find an
example of such 4-byte identifiers, but perhaps brian (CC'd) could
remind us.

You are using a 4-byte identifier, but using the same values as
those 1-byte identifiers.

> +  - A table of mtimes (one per packed object, num_objects in total, each
> +    a 4-byte unsigned integer in network order), in the same order as
> +    objects appear in the index file (e.g., the first entry in the mtime
> +    table corresponds to the object with the lowest lexically-sorted
> +    oid). The mtimes count standard epoch seconds.

This paragraph seemed awkward. Here is a rephrasing that might be
less awkward:

 - A table of 4-byte unsigned integers in network order. The ith value
   is the modified time (mtime) of the ith object of the corresponding
   pack in lexicographic order. The mtime represents standard epoch
   seconds.

Storing these mtimes in 32-bits means we will hit the 2038 problem.
The commit-graph stores commit times with an extra two bits to extend
the lifetime by another hundred years or so.

Could we extend the lifetime of cruft packs by decreasing the granularity
here? Should 'mtime' store a number of _minutes_ instead of seconds? That
should be enough granularity for these purposes.

> +  - A trailer, containing a:
> +
> +    checksum of the corresponding packfile, and
> +
> +    a checksum of all of the above.

Could you specify the checksum as having length according to the
specified hash function?

> +All 4-byte numbers are in network order.
> +

Maybe this could be at the start of the format, since the file
version and hash function are both 4-byte numbers here and we
could remove the mention of network order from the mtime values.

> +static char *pack_mtimes_filename(struct packed_git *p)
> +{
> +	size_t len;
> +	if (!strip_suffix(p->pack_name, ".pack", &len))
> +		BUG("pack_name does not end in .pack");
> +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> +}

I see your NEEDSWORK here and you are probably referring to this:

static char *pack_revindex_filename(struct packed_git *p)
{
	size_t len;
	if (!strip_suffix(p->pack_name, ".pack", &len))
		BUG("pack_name does not end in .pack");
	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
}

and the implementation is identical except for the file extension
(which exists in the exts[] array in builtin/repack.c, but could
also be pulled out into a header somewhere).

I'm happy to delay any cleanup of these code clones until later,
if at all, because doing it right might mean moving more code
than we like. Such refactorings aren't worth it most of the time.

> +static int load_pack_mtimes_file(char *mtimes_file,
> +				 uint32_t num_objects,
> +				 const uint32_t **data_p, size_t *len_p)
> +{

> +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> +		ret = error(_("mtimes file %s is corrupt"), mtimes_file);

This message could be more informative: "mtimes file %s has the wrong size"?
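
For reference, the size check quoted above follows from the layout in the format description: a 12-byte header, one 4-byte entry per object, and a trailer of two checksums. A minimal sketch of that arithmetic, assuming hypothetical helper names (expected_mtimes_size() and mtimes_size_ok() are not part of the patch) and ignoring the overflow checking that st_mult() provides in the real code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header: magic number, version, and hash ID, 4 bytes each. */
#define MTIMES_HEADER_SIZE 12

/*
 * Hypothetical helper: expected size in bytes of a .mtimes file that
 * describes `num_objects` objects, where `rawsz` is the raw hash length
 * (20 for SHA-1, 32 for SHA-256). The trailer holds two checksums.
 */
size_t expected_mtimes_size(uint32_t num_objects, size_t rawsz)
{
	assert(rawsz == 20 || rawsz == 32);
	return MTIMES_HEADER_SIZE
		+ sizeof(uint32_t) * (size_t)num_objects
		+ 2 * rawsz;
}

/* Returns 1 if `file_size` matches the layout exactly, 0 otherwise. */
int mtimes_size_ok(size_t file_size, uint32_t num_objects, size_t rawsz)
{
	return file_size == expected_mtimes_size(num_objects, rawsz);
}
```

With SHA-1 and three objects this gives 12 + 12 + 40 = 64 bytes, so a "wrong size" message could report both the actual and the expected value.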

> +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
> +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
> +		goto cleanup;
> +	}

Interesting that you defined 'struct mtimes_header' before this
method, but don't use it here (in favor of moving a uint32_t
pointer). Perhaps you are avoiding pointing the struct at the
memory map, but you could also do this:

	struct mtimes_header header;

	header.signature = ntohl(hdr[0]);
	header.version = ntohl(hdr[1]);
	header.hash_id = ntohl(hdr[2]);

And then operate on the struct for your validation.

At the very least, 'struct mtimes_header' is defined but not
used in this patch. If you decide to not use it this way, then
maybe delay its definition.

> +
> +	if (ntohl(*++hdr) != 1) {
> +		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
> +			    mtimes_file, ntohl(*hdr));

Unlike the commit-graph, if we don't understand the version we
cannot simply ignore the data. error() is appropriate here.

> +int load_pack_mtimes(struct packed_git *p)
> +{
> +	char *mtimes_name = NULL;
> +	int ret = 0;
> +
> +	if (!p->is_cruft)
> +		return ret; /* not a cruft pack */

Interesting that this indicator is essentially "we have an mtimes
file for this pack", but it makes sense to include that check next
to the .keep and .promisor checks.

> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->mtimes_map)
> +		BUG("pack .mtimes file not loaded for %s", p->pack_name);
> +	if (p->num_objects <= pos)
> +		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
> +		    pos, p->num_objects);
> +
> +	return get_be32(p->mtimes_map + pos + 3);
> +}

A nice safe access method. Good.
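
To illustrate the lookup outside of Git, here is a rough standalone sketch that works on a plain byte buffer. The names read_be32() and nth_mtime() are stand-ins for get_be32() and nth_packed_mtime(), and assert() stands in for BUG(); it assumes `map` points at the first byte of the .mtimes file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The header is three 4-byte words: magic, version, hash ID. */
#define MTIMES_HEADER_WORDS 3

/* Read a 4-byte big-endian (network order) value from any alignment. */
uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical stand-in for nth_packed_mtime(): return the mtime of the
 * object at index position `pos`, skipping the header words.
 */
uint32_t nth_mtime(const unsigned char *map, uint32_t num_objects,
		   uint32_t pos)
{
	assert(map != NULL);
	assert(pos < num_objects); /* the patch calls BUG() here */
	return read_be32(map + 4 * ((size_t)pos + MTIMES_HEADER_WORDS));
}
```

The patch gets the same offset from pointer arithmetic on a `const uint32_t *`, which is why it adds 3 rather than 12.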

> -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
> +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};

(Speaking of that refactoring earlier, here is a second definition of
exts[] that would be valuable to unify.)

The hunks I did not comment on look good. Nice standard file format
stuff.

Thanks,
-Stolee

brian m. carlson Dec. 2, 2021, 10:32 p.m. UTC | #2

On 2021-12-02 at 15:06:07, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> 
> > +== pack-*.mtimes files have the format:
> > +
> > +  - A 4-byte magic number '0x4d544d45' ('MTME').
> > +
> > +  - A 4-byte version identifier (= 1).
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
> 
> I vaguely remember complaints about using a 1-byte identifier in
> the commit-graph and multi-pack-index formats because the "standard"
> way to refer to these hash functions was a magic number that had a
> meaning in ASCII that helped human readers a bit. I cannot find an
> example of such 4-byte identifiers, but perhaps brian (CC'd) could
> remind us.
> 
> You are using a 4-byte identifier, but using the same values as
> those 1-byte identifiers.

The preferred value is the_hash_algo->format_id.  For SHA-1, that's
"sha1", big-endian (0x73686131) and for SHA-256 it's "s256", big-endian
(0x73323536).

There's also hash_algo_by_id to turn the format ID into an index into
the hash_algos array, but you need to check for GIT_HASH_UNKNOWN (0)
first.

These will be used in index v3, which I haven't sent out patches for
yet.
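
As a quick check of the values above, the identifiers are just four ASCII characters packed most-significant byte first. A small sketch, where ascii_id() is a hypothetical helper and not a Git function:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack exactly four ASCII characters into a big-endian 32-bit identifier. */
uint32_t ascii_id(const char *s)
{
	assert(strlen(s) == 4);
	return ((uint32_t)(unsigned char)s[0] << 24) |
	       ((uint32_t)(unsigned char)s[1] << 16) |
	       ((uint32_t)(unsigned char)s[2] << 8) |
	       (uint32_t)(unsigned char)s[3];
}
```

ascii_id("sha1") yields 0x73686131 and ascii_id("s256") yields 0x73323536; the same construction gives the 'MTME' magic number 0x4d544d45 from the patch.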

Taylor Blau Dec. 3, 2021, 10:24 p.m. UTC | #3

On Thu, Dec 02, 2021 at 10:06:07AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
>
> > +== pack-*.mtimes files have the format:
> > +
> > +  - A 4-byte magic number '0x4d544d45' ('MTME').
> > +
> > +  - A 4-byte version identifier (= 1).
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>
> I vaguely remember complaints about using a 1-byte identifier in
> the commit-graph and multi-pack-index formats because the "standard"
> way to refer to these hash functions was a magic number that had a
> meaning in ASCII that helped human readers a bit. I cannot find an
> example of such 4-byte identifiers, but perhaps brian (CC'd) could
> remind us.
>
> You are using a 4-byte identifier, but using the same values as
> those 1-byte identifiers.

Yeah, I'm definitely borrowing from the commit-graph and multi-pack
index formats here. Though I believe we did the same thing for .rev
files, too (and checking with Documentation/technical/pack-format.txt
confirms as much).

I don't have a strong feeling about using the 4-byte identifier or not.
But making this field four bytes wide is very much intentional, since it
makes sure that all of our reads are aligned, which should yield much
better cache performance (assuming the page size is also a multiple of
four).

If others feel strongly, we could write the magic identifiers brian
points out downthread here instead. (It would be mildly inconvenient
for GitHub, which has many hundreds of thousands of these files lying
around with '1' as the identifier. But since the magic identifiers
don't collide with the values proposed here, GitHub's fork could easily
be taught to accept both on the reading side, but only write out the
special identifier.)
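
A reader that accepts both encodings might look roughly like the sketch below. This is only an illustration of the compatibility idea: the enum and both function names are hypothetical, and nothing like this is in the patch as posted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local enum; Git itself uses GIT_HASH_* constants. */
enum hash_kind { HASH_UNKNOWN = 0, HASH_SHA1 = 1, HASH_SHA256 = 2 };

#define FORMAT_ID_SHA1   0x73686131u /* "sha1" */
#define FORMAT_ID_SHA256 0x73323536u /* "s256" */

/* Accept both the legacy values (1, 2) and the ASCII format IDs. */
enum hash_kind hash_kind_from_field(uint32_t field)
{
	switch (field) {
	case 1:
	case FORMAT_ID_SHA1:
		return HASH_SHA1;
	case 2:
	case FORMAT_ID_SHA256:
		return HASH_SHA256;
	default:
		return HASH_UNKNOWN;
	}
}

/* On the writing side, emit only the ASCII format ID. */
uint32_t field_for_writing(enum hash_kind kind)
{
	assert(kind == HASH_SHA1 || kind == HASH_SHA256);
	return kind == HASH_SHA1 ? FORMAT_ID_SHA1 : FORMAT_ID_SHA256;
}
```

Because 1 and 2 can never equal either ASCII value, the two encodings can share one switch without ambiguity.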

> > +  - A table of mtimes (one per packed object, num_objects in total, each
> > +    a 4-byte unsigned integer in network order), in the same order as
> > +    objects appear in the index file (e.g., the first entry in the mtime
> > +    table corresponds to the object with the lowest lexically-sorted
> > +    oid). The mtimes count standard epoch seconds.
>
> This paragraph seemed awkward. Here is a rephrasing that might be
> less awkward:
>
>  - A table of 4-byte unsigned integers in network order. The ith value
>    is the modified time (mtime) of the ith object of the corresponding
>    pack in lexicographic order. The mtime represents standard epoch
>    seconds.

Thanks, this is clearer. I went with a blend of the two:

    - A table of 4-byte unsigned integers in network order. The ith
      value is the modification time (mtime) of the ith object in the
      corresponding pack by lexicographic (index) order. The mtimes
      count standard epoch seconds.

> Storing these mtimes in 32-bits means we will hit the 2038 problem.
> The commit-graph stores commit times with an extra two bits to extend
> the lifetime by another hundred years or so.
>
> Could we extend the lifetime of cruft packs by decreasing the granularity
> here? Should 'mtime' store a number of _minutes_ instead of seconds? That
> should be enough granularity for these purposes.

Perhaps, though it does add some complexity to the code that deals with
this format in exchange for some future-proofing. I'm open to it,
though.

>
> > +  - A trailer, containing a:
> > +
> > +    checksum of the corresponding packfile, and
> > +
> > +    a checksum of all of the above.
>
> Could you specify the checksum as having length according to the
> specified hash function?

Great suggestion, thanks.

> > +All 4-byte numbers are in network order.
> > +
>
> Maybe this could be at the start of the format, since the file
> version and hash function are both 4-byte numbers here and we
> could remove the mention of network order from the mtime values.

This is copy-and-pasted from the .rev section above, where I think I
added the "All 4-byte numbers are in network order" bit at the end in
response to a suggestion opposite yours ;).

Here I would probably rather stay consistent with the surrounding
sections.

> > +static char *pack_mtimes_filename(struct packed_git *p)
> > +{
> > +	size_t len;
> > +	if (!strip_suffix(p->pack_name, ".pack", &len))
> > +		BUG("pack_name does not end in .pack");
> > +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> > +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> > +}
>
> I see your NEEDSWORK here and you are probably referring to this:
>
> static char *pack_revindex_filename(struct packed_git *p)
> {
> 	size_t len;
> 	if (!strip_suffix(p->pack_name, ".pack", &len))
> 		BUG("pack_name does not end in .pack");
> 	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
> }
>
> and the implementation is identical except for the file extension
> (which exists in the exts[] array in builtin/repack.c, but could
> also be pulled out into a header somewhere).
>
> I'm happy to delay any cleanup of these code clones until later,
> if at all, because doing it right might mean moving more code
> than we like. Such refactorings aren't worth it most of the time.

Yeah, I think your thoughts matched my own when writing this. Which is
to say, I felt it prudent to call out that there is an opportunity to
DRY these two up, but I'm not convinced that such a clean up would be
worthwhile.
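
For the record, a shared helper could be as small as the following sketch. It uses plain libc rather than strip_suffix() and xstrfmt() so that it stands alone, and pack_sibling_name() is a hypothetical name; it returns NULL instead of calling BUG() when the name does not end in ".pack".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical shared helper: given "foo.pack" and an extension such as
 * ".rev" or ".mtimes", return a newly allocated "foo<ext>". Returns NULL
 * if `pack_name` does not end in ".pack" or if allocation fails. The
 * caller frees the result.
 */
char *pack_sibling_name(const char *pack_name, const char *ext)
{
	static const char suffix[] = ".pack";
	size_t slen = sizeof(suffix) - 1;
	size_t len, elen;
	char *out;

	assert(pack_name != NULL && ext != NULL);
	len = strlen(pack_name);
	elen = strlen(ext);

	if (len < slen || strcmp(pack_name + len - slen, suffix) != 0)
		return NULL;
	len -= slen; /* length of the name without ".pack" */

	out = malloc(len + elen + 1);
	if (!out)
		return NULL;
	memcpy(out, pack_name, len);
	memcpy(out + len, ext, elen + 1); /* copies the trailing NUL too */
	return out;
}
```

Both pack_revindex_filename() and pack_mtimes_filename() would then reduce to a one-line call with their respective extension.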

> > +static int load_pack_mtimes_file(char *mtimes_file,
> > +				 uint32_t num_objects,
> > +				 const uint32_t **data_p, size_t *len_p)
> > +{
>
> > +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> > +		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
>
> This message could be more informative: "mtimes file %s has the wrong size"?

Copy-and-pasting here again from the corresponding code for the .rev
file, which is why I didn't opt to change the message here. Probably
many of these checks could be extracted out and shared between the two
paths, but I don't think we should attempt it here.

> > +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
> > +
> > +	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
> > +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
> > +		goto cleanup;
> > +	}
>
> Interesting that you defined 'struct mtimes_header' before this
> method, but don't use it here (in favor of moving a uint32_t
> pointer). Perhaps you are avoiding pointing the struct at the
> memory map, but you could also do this:
>
> 	struct mtimes_header header;
>
> 	header.signature = ntohl(hdr[0]);
> 	header.version = ntohl(hdr[1]);
> 	header.hash_id = ntohl(hdr[2]);
>
> And then operate on the struct for your validation.
>
> At the very least, 'struct mtimes_header' is defined but not
> used in this patch. If you decide to not use it this way, then
> maybe delay its definition.

Yeah, not reading directly out of the struct is intentional, since the
compiler is free to insert padding between these members, which would
break any subsequent reads out of the struct.

But I like your idea to assign the fields manually, thanks!
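
Something along these lines, as a standalone sketch: each field is decoded from the raw bytes, so neither struct padding nor host byte order matters. parse_mtimes_header() and be32_at() are hypothetical names, the accepted hash IDs are the 1/2 values from this version of the patch, and a real implementation would report errors with error() rather than returning a bare -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
#define MTIMES_VERSION 1
#define MTIMES_HEADER_SIZE 12

struct mtimes_header {
	uint32_t signature;
	uint32_t version;
	uint32_t hash_id;
};

/* Decode one big-endian 32-bit value from a byte pointer. */
static uint32_t be32_at(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Fill `out` field by field from the mapped bytes, then validate.
 * Returns 0 on success and -1 on any validation failure.
 */
int parse_mtimes_header(const unsigned char *buf, size_t len,
			struct mtimes_header *out)
{
	assert(buf != NULL && out != NULL);
	if (len < MTIMES_HEADER_SIZE)
		return -1;

	out->signature = be32_at(buf);
	out->version = be32_at(buf + 4);
	out->hash_id = be32_at(buf + 8);

	if (out->signature != MTIMES_SIGNATURE)
		return -1;
	if (out->version != MTIMES_VERSION)
		return -1;
	if (out->hash_id != 1 && out->hash_id != 2)
		return -1;
	return 0;
}
```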

> > +int load_pack_mtimes(struct packed_git *p)
> > +{
> > +	char *mtimes_name = NULL;
> > +	int ret = 0;
> > +
> > +	if (!p->is_cruft)
> > +		return ret; /* not a cruft pack */
>
> Interesting that this indicator is essentially "we have an mtimes
> file for this pack", but it makes sense to include that check next
> to the .keep and .promisor checks.

I think I had originally called it "mtimes" but changed it to "cruft",
since it makes sense as a prefix similar to the others (that is, "keep
pack", "promisor pack", and "cruft pack", not "mtimes pack").

> The hunks I did not comment on look good. Nice standard file format
> stuff.

Thanks for your review!

Thanks,
Taylor

Taylor Blau Jan. 7, 2022, 7:41 p.m. UTC | #4

On Fri, Dec 03, 2021 at 05:24:03PM -0500, Taylor Blau wrote:
> On Thu, Dec 02, 2021 at 10:06:07AM -0500, Derrick Stolee wrote:
>     - A table of 4-byte unsigned integers in network order. The ith
>       value is the modification time (mtime) of the ith object in the
>       corresponding pack by lexicographic (index) order. The mtimes
>       count standard epoch seconds.
>
> > Storing these mtimes in 32-bits means we will hit the 2038 problem.
> > The commit-graph stores commit times with an extra two bits to extend
> > the lifetime by another hundred years or so.
> >
> > Could we extend the lifetime of cruft packs by decreasing the granularity
> > here? Should 'mtime' store a number of _minutes_ instead of seconds? That
> > should be enough granularity for these purposes.
>
> Perhaps, though it does add some complexity to the code that deals with
> this format in exchange for some future-proofing. I'm open to it,
> though.

I still have quite a bit of review from this topic sitting in my inbox.

But this had been lingering on my mind, and I realized I said something
incorrect. 32-bit mtimes won't cause us to run into the "2038" problem,
since these aren't signed values. So storing epoch seconds in a uint32_t
should get us into the year 2106.

If anybody is still using cruft packs by then, I'll call this project a
wild success ;-). So in the meantime, I don't think it makes sense to
reduce the granularity and/or use extra bits to store the timestamps.
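
The arithmetic behind those dates can be sketched as follows. This is an approximation that uses the average Gregorian year of 365.2425 days; approx_last_year() is a hypothetical helper, not something in Git.

```c
#include <assert.h>
#include <stdint.h>

/* Average Gregorian year: 365.2425 days * 86400 seconds. */
#define SECONDS_PER_YEAR 31556952ULL

/*
 * Approximate calendar year in which a counter that starts at the Unix
 * epoch reaches `max_value`, where each tick lasts `seconds_per_unit`
 * seconds.
 */
unsigned approx_last_year(uint64_t max_value, unsigned seconds_per_unit)
{
	assert(seconds_per_unit > 0);
	return 1970 + (unsigned)(max_value * seconds_per_unit /
				 SECONDS_PER_YEAR);
}
```

approx_last_year(INT32_MAX, 1) gives 2038 for a signed 32-bit counter, while approx_last_year(UINT32_MAX, 1) gives 2106 for the unsigned field used here. Counting minutes instead, approx_last_year(UINT32_MAX, 60), would push the limit out by several thousand years.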

Thanks,
Taylor

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 8d2f42f29e..61d8d960e7 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,28 @@  Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of mtimes (one per packed object, num_objects in total, each
+    a 4-byte unsigned integer in network order), in the same order as
+    objects appear in the index file (e.g., the first entry in the mtime
+    table corresponds to the object with the lowest lexically-sorted
+    oid). The mtimes count standard epoch seconds.
+
+  - A trailer, containing a:
+
+    checksum of the corresponding packfile, and
+
+    a checksum of all of the above.
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 12be39ac49..efd5e00717 100644
--- a/Makefile
+++ b/Makefile
@@ -949,6 +949,7 @@  LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index 0b2d1e5d82..acbb7b8c3b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -212,6 +212,7 @@  static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 952efb6a4b..d87481f101 100644
--- a/object-store.h
+++ b/object-store.h
@@ -89,12 +89,15 @@  struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..4c7c00fa67
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,139 @@ 
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+int pack_has_mtimes(struct packed_git *p)
+{
+	struct stat st;
+	char *fname = pack_mtimes_filename(p);
+
+	if (stat(fname, &st) < 0) {
+		if (errno == ENOENT)
+			return 0;
+		die_errno(_("could not stat %s"), fname);
+	}
+
+	free(fname);
+	return 1;
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (ntohl(*++hdr) != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, ntohl(*hdr));
+		goto cleanup;
+	}
+	hdr++;
+	if (!(ntohl(*hdr) == 1 || ntohl(*hdr) == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, ntohl(*hdr));
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+	if (ret)
+		goto cleanup;
+
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..ac4247bb5e
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,16 @@ 
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int pack_has_mtimes(struct packed_git *p);
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 89402cfc69..ae79ac644e 100644
--- a/packfile.c
+++ b/packfile.c
@@ -333,12 +333,21 @@  void close_pack_revindex(struct packed_git *p) {
 	p->revindex_data = NULL;
 }
 
+void close_pack_mtimes(struct packed_git *p) {
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -362,7 +371,7 @@  void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -717,6 +726,10 @@  struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -868,7 +881,8 @@  static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
diff --git a/packfile.h b/packfile.h
index 186146779d..32201d8af7 100644
--- a/packfile.h
+++ b/packfile.h
@@ -91,6 +91,7 @@  uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 void close_pack_windows(struct packed_git *);
 void close_pack_revindex(struct packed_git *);
+void close_pack_mtimes(struct packed_git *p);
 void close_pack(struct packed_git *);
 void close_object_store(struct raw_object_store *o);
 void unuse_pack(struct pack_window **);
