[1/2] midx: show progress during QSORT operation

Message ID	20250210074623.136599-2-ayu.chandekar@gmail.com (mailing list archive)
State	New
Headers	Received: from mail-pj1-f52.google.com (mail-pj1-f52.google.com [209.85.216.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8A1CC1BC064 for <git@vger.kernel.org>; Mon, 10 Feb 2025 07:46:59 +0000 (UTC)
	From: Ayush Chandekar <ayu.chandekar@gmail.com>
	To: git@vger.kernel.org
	Subject: [PATCH 1/2] midx: show progress during QSORT operation
	Date: Mon, 10 Feb 2025 13:16:22 +0530
	Message-ID: <20250210074623.136599-2-ayu.chandekar@gmail.com>
	In-Reply-To: <20250210074623.136599-1-ayu.chandekar@gmail.com>
	References: <20250210074623.136599-1-ayu.chandekar@gmail.com>
	Precedence: bulk
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
Series	[GSOC,RFC,0/2] midx: implement progress reporting for QSORT operation
	[1/2] midx: show progress during QSORT operation
	[2/2] t5319: add test for MIDX QSORT progress reporting

Commit Message

Ayush Chandekar Feb. 10, 2025, 7:46 a.m. UTC

Add progress reporting during the QSORT operation in multi-pack-index
verification. This helps users track the progress of large sorting
operations.

In previous versions, the progress would jump directly from 0% to 100%
without any intermediate updates.

Signed-off-by: Ayush Chandekar <ayu.chandekar@gmail.com>
---
 midx.c | 43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)

Comments

Junio C Hamano Feb. 10, 2025, 4:55 p.m. UTC | #1

Ayush Chandekar <ayu.chandekar@gmail.com> writes:

> Add progress reporting during the QSORT operation in multi-pack-index
> verification. This helps users track the progress of large sorting
> operations.

Hmph.  If the implementation is correct (which I cannot tell), this
needs to explain why it is a bit better than saying nothing.

> +/*
> + * Limit calls to display_progress() for performance reasons.
> + * The interval here was arbitrarily chosen.
> + */
> +#define SPARSE_PROGRESS_INTERVAL (1 << 12)
> +#define midx_display_sparse_progress(progress, n) \
> +	do { \
> +		uint64_t _n = (n); \
> +		if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0) \
> +			display_progress(progress, _n); \
> +	} while (0)
> +
>  struct pair_pos_vs_id
>  {
>  	uint32_t pos;
>  	uint32_t pack_int_id;
>  };
>  
> +static struct progress *sort_progress;
> +static uint64_t last_max_pos;
> +
>  static int compare_pair_pos_vs_id(const void *_a, const void *_b)
>  {
>  	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
>  	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;

This is a compar callback function used by the sorting machinery,
which is called QSORT but system-provided qsort() implementations
are not necessarily quick-sort [*].

> +	
> +	if (sort_progress) {
> +		uint64_t max_pos = (a->pos > b->pos) ? a->pos : b->pos;
> +		if (max_pos > last_max_pos) {
> +			last_max_pos = max_pos;
> +			midx_display_sparse_progress(sort_progress, last_max_pos);
> +		}
> +	}

So I do not quite understand the assumption this implementation of
the progress meter makes.  

The assumption seems to be that the element in the array with the
highest index MUST not be summoned for comparison until the very end
of the sorting process, but what guarantees that?  Even if we assume
that the qsort() implementation supplied by the system implements
the divide and conquer plain vanilla quicksort, it may divide the
array into half, and then sort the top half first before it sorts
the bottom half, and doing so recursively will give you the
comparison between elements near the end of the array with the
highest index fairly early in the process, no?

And the standard does not even specify what algorithm should
internally be used to implement qsort(3), which our QSORT() macro
eventually calls, so making any assumption on the order the elements
of the array are fed to the compar callback function sounds doubly a
fragile deal.

Thanks.
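
The objection above can be checked empirically. The following is a minimal sketch, not part of the patch: it sorts synthetic pairs with the system qsort() and records at which comparison the element with the highest `pos` is first handed to the callback. The struct layout mirrors `struct pair_pos_vs_id`; the names `probe_cmp` and `first_touch_of_last`, the ascending sort order, and the scrambling constant are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the shape of struct pair_pos_vs_id in midx.c. */
struct pair {
	uint32_t pos;
	uint32_t pack_int_id;
};

/* qsort() takes no context argument, so the probe state is file-scope. */
static uint32_t probe_last_pos;
static uint64_t probe_compares;
static uint64_t probe_first_touch;

static int probe_cmp(const void *a_, const void *b_)
{
	const struct pair *a = a_;
	const struct pair *b = b_;

	probe_compares++;
	if (!probe_first_touch &&
	    (a->pos == probe_last_pos || b->pos == probe_last_pos))
		probe_first_touch = probe_compares;

	/* Overflow-safe three-way comparison on the sort key. */
	return (a->pack_int_id > b->pack_int_id) -
	       (a->pack_int_id < b->pack_int_id);
}

/*
 * Sort n synthetic pairs. Returns the 1-based number of the comparison
 * in which pos == n - 1 first appeared; *total receives the total
 * number of comparisons qsort() made.
 */
static uint64_t first_touch_of_last(uint32_t n, uint64_t *total)
{
	struct pair *pairs = malloc(n * sizeof(*pairs));
	uint32_t i;

	assert(pairs && n >= 2);
	for (i = 0; i < n; i++) {
		pairs[i].pos = i;
		/* Deterministic scramble into 64 hypothetical pack ids. */
		pairs[i].pack_int_id = (i * 2654435761u) >> 26;
	}

	probe_last_pos = n - 1;
	probe_compares = 0;
	probe_first_touch = 0;
	qsort(pairs, n, sizeof(*pairs), probe_cmp);

	/* Sanity check: the result really is sorted by pack id. */
	for (i = 1; i < n; i++)
		assert(pairs[i - 1].pack_int_id <= pairs[i].pack_int_id);
	free(pairs);

	*total = probe_compares;
	return probe_first_touch;
}
```

Calling `first_touch_of_last(1 << 16, &total)` from a small main() and printing the ratio of the return value to `total` shows how far into the sort the last element was first compared. The result depends on the C library, which is exactly the point: if the ratio is small, a meter driven by the highest `pos` seen would reach its maximum long before the sort finishes.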

Ayush Chandekar Feb. 11, 2025, 12:23 p.m. UTC | #2

> Hmph.  If the implementation is correct (which I cannot tell), this
> needs to explain why it is a bit better than saying nothing.
While going through the code, I noticed the TODO comment: "Measure
QSORT() progress", and I thought it might be interesting to explore.
For big codebases, being stuck at zero would make it feel like there's
no progress happening and that is why putting a progress might be
better.

> >  static int compare_pair_pos_vs_id(const void *_a, const void *_b)
> >  {
> >       struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
> >       struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
>
> This is a compar callback function used by the sorting machinery,
> which is called QSORT but system-provided qsort() implementations
> are not necessarily quick-sort [*].
Oh.

Initially, I was unsure how to approach it, but I believed that
tracking the highest pos value seen in comparisons could give a rough
estimate of progress.
However, as you pointed out, this assumes that qsort() processes
elements in a structured way where the highest-indexed element isn't
compared until later in the sort.
I now see that this isn't a safe assumption. Since there's no guarantee
that progress would be reflected meaningfully, this approach isn't
good.

Let me know if you have any suggestions/comments:)

Thanks,
Ayush

Junio C Hamano Feb. 11, 2025, 4:29 p.m. UTC | #3

Ayush Chandekar <ayu.chandekar@gmail.com> writes:

>> Hmph.  If the implementation is correct (which I cannot tell), this
>> needs to explain why it is a bit better than saying nothing.
> While going through the code, I noticed the TODO comment: "Measure
> QSORT() progress", and I thought it might be interesting to explore.
> For big codebases, being stuck at zero would make it feel like there's
> no progress happening and that is why putting a progress might be
> better.

That much I already know---otherwise we would not have that comment
there ;-)

What I meant was that the proposed log message did not give readers
any hint how the implementation given in the patch is correct, the
assumption it makes on the behaviour of qsort(3), etc.

diff --git a/midx.c b/midx.c
index d91088efb8..69937f5ca8 100644
--- a/midx.c
+++ b/midx.c
@@ -14,6 +14,7 @@ 
 #include "pack-bitmap.h"
 #include "pack-revindex.h"
 
+
 int midx_checksum_valid(struct multi_pack_index *m);
 void clear_midx_files_ext(const char *object_dir, const char *ext,
 			  const char *keep_hash);
@@ -853,32 +854,43 @@  static void midx_report(const char *fmt, ...)
 	va_end(ap);
 }
 
+/*
+ * Limit calls to display_progress() for performance reasons.
+ * The interval here was arbitrarily chosen.
+ */
+#define SPARSE_PROGRESS_INTERVAL (1 << 12)
+#define midx_display_sparse_progress(progress, n) \
+	do { \
+		uint64_t _n = (n); \
+		if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0) \
+			display_progress(progress, _n); \
+	} while (0)
+
 struct pair_pos_vs_id
 {
 	uint32_t pos;
 	uint32_t pack_int_id;
 };
 
+static struct progress *sort_progress;
+static uint64_t last_max_pos;
+
 static int compare_pair_pos_vs_id(const void *_a, const void *_b)
 {
 	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
 	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
+	
+	if (sort_progress) {
+		uint64_t max_pos = (a->pos > b->pos) ? a->pos : b->pos;
+		if (max_pos > last_max_pos) {
+			last_max_pos = max_pos;
+			midx_display_sparse_progress(sort_progress, last_max_pos);
+		}
+	}
 
 	return b->pack_int_id - a->pack_int_id;
 }
 
-/*
- * Limit calls to display_progress() for performance reasons.
- * The interval here was arbitrarily chosen.
- */
-#define SPARSE_PROGRESS_INTERVAL (1 << 12)
-#define midx_display_sparse_progress(progress, n) \
-	do { \
-		uint64_t _n = (n); \
-		if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0) \
-			display_progress(progress, _n); \
-	} while (0)
-
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags)
 {
 	struct pair_pos_vs_id *pairs = NULL;
@@ -960,12 +972,15 @@  int verify_midx_file(struct repository *r, const char *object_dir, unsigned flag
 		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
 	}
 
-	if (flags & MIDX_PROGRESS)
+	if (flags & MIDX_PROGRESS) {
 		progress = start_sparse_progress(r,
 						 _("Sorting objects by packfile"),
 						 m->num_objects);
-	display_progress(progress, 0); /* TODO: Measure QSORT() progress */
+		last_max_pos = 0;
+		sort_progress = progress;
+	}
 	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
+	sort_progress = NULL;
 	stop_progress(&progress);
 
 	if (flags & MIDX_PROGRESS)
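
A further property of the patch is worth illustrating. `midx_display_sparse_progress()` forwards a value only when it is an exact multiple of `SPARSE_PROGRESS_INTERVAL`, which suits a counter that is incremented by one. In the patch, `last_max_pos` can jump by arbitrary amounts between calls, so it may step over most of those multiples. The sketch below assumes nothing about git beyond the macro's mask test; `count_updates` and its parameters are hypothetical names used only for this demonstration.

```c
#include <assert.h>
#include <stdint.h>

/* Same interval and mask test as the macro in midx.c. */
#define SPARSE_PROGRESS_INTERVAL (1 << 12)

/*
 * Count how many values in the sequence start, start + stride, ...
 * (below limit) would pass the mask test and reach display_progress().
 */
static unsigned count_updates(uint64_t start, uint64_t stride, uint64_t limit)
{
	unsigned shown = 0;
	uint64_t n;

	assert(stride > 0); /* a zero stride would never terminate */
	for (n = start; n < limit; n += stride)
		if ((n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0)
			shown++;
	return shown;
}
```

With a limit of `1 << 16`, `count_updates(0, 1, ...)` returns 16, one update per 4096 steps as intended. `count_updates(0, 3, ...)` returns 6, and `count_updates(1, 2, ...)` returns 0 because an odd value is never a multiple of 4096. A running maximum fed through the macro behaves like the irregular cases, so the meter may update far less often than the interval suggests.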
