Message ID | 8c83e92445b4131e9b8f8e2aa29b00717b257d13.1616015337.git.matheus.bernardino@usp.br (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Parallel Checkout (part 2) | expand |
On Wed, Mar 17, 2021 at 10:12 PM Matheus Tavares <matheus.bernardino@usp.br> wrote: > The above benchmarks show that parallel checkout is most effective on > repositories located on an SSD or over a distributed file system. For > local file systems on spinning disks, and/or older machines, the > parallelism does not always bring a good performance. For this reason, > the default value for checkout.workers is one, a.k.a. sequential > checkout. I wonder how many people are still using HDD, and how many will still use them in a few years. I think having 1 as the default value for checkout.workers might be good for a while for backward compatibility and stability, while people who are interested can test parallel checkout on different environments. But we might want, in a few releases, after some bugs, if any, have been fixed, to use a default, maybe 10, that will provide significant speedup for most people, and will justify the added complexity, especially as your numbers still show a small speedup for HDD when using 10. > To decide the default value for checkout.thresholdForParallelism, > another benchmark was executed in the "Local SSD" setup, where parallel > checkout showed to be beneficial. This time, we compared the runtime of > a `git checkout -f`, with and without parallelism, after randomly > removing an increasing number of files from the Linux working tree. The > "sequential fallback" column bellow corresponds to the executions where s/bellow/below/ > checkout.workers was 10 but checkout.thresholdForParallelism was equal > to the number of to-be-updated files plus one (so that we end up writing > sequentially). Each test case was sampled 15 times, and each sample had > a randomly different set of files removed. Here are the results: > > sequential fallback 10 workers speedup > 10 files 772.3 ms ± 12.6 ms 769.0 ms ± 13.6 ms 1.00 ± 0.02 > 20 files 780.5 ms ± 15.8 ms 775.2 ms ± 9.2 ms 1.01 ± 0.02 > 50 files 806.2 ms ± 13.8 ms 767.4 ms ± 8.5 ms 1.05 ± 0.02 > 100 files 833.7 ms ± 21.4 ms 750.5 ms ± 16.8 ms 1.11 ± 0.04 > 200 files 897.6 ms ± 30.9 ms 730.5 ms ± 14.7 ms 1.23 ± 0.05 > 500 files 1035.4 ms ± 48.0 ms 677.1 ms ± 22.3 ms 1.53 ± 0.09 > 1000 files 1244.6 ms ± 35.6 ms 654.0 ms ± 38.3 ms 1.90 ± 0.12 > 2000 files 1488.8 ms ± 53.4 ms 658.8 ms ± 23.8 ms 2.26 ± 0.12 > > From the above numbers, 100 files seems to be a reasonable default value > for the threshold setting. > > Note: Up to 1000 files, we observe a drop in the execution time of the > parallel code with an increase in the number of files. This is a rather > odd behavior, but it was observed in multiple repetitions. Above 1000 > files, the execution time increases according to the number of files, as > one would expect. > > About the test environments: Local SSD tests were executed on an > i7-7700HQ (4 cores with hyper-threading) running Manjaro Linux. Local > HDD tests were executed on an Intel(R) Xeon(R) E3-1230 (also 4 cores > with hyper-threading), HDD Seagate Barracuda 7200.14 SATA 3.1, running > Debian. NFS and EFS tests were executed on an Amazon EC2 c5n.xlarge > instance, with 4 vCPUs. The Linux NFS server was running on a m6g.large > instance with 2 vCPUSs and a 1 TB EBS GP2 volume. Before each timing, > the linux repository was removed (or checked out back to its previous > state), and `sync && sysctl vm.drop_caches=3` was executed. Thanks for all these tests and details! > Co-authored-by: Jeff Hostetler <jeffhost@microsoft.com> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> [...] > @@ -23,6 +25,19 @@ enum pc_status parallel_checkout_status(void) > return parallel_checkout.status; > } > > +#define DEFAULT_THRESHOLD_FOR_PARALLELISM 100 Using a "static const int" might be a bit better. > +void get_parallel_checkout_configs(int *num_workers, int *threshold) > +{ > + if (git_config_get_int("checkout.workers", num_workers)) > + *num_workers = 1; I think it's better when an important default value like this "1" is made more visible using a "static const int" or a "#define". > + else if (*num_workers < 1) > + *num_workers = online_cpus(); > + > + if (git_config_get_int("checkout.thresholdForParallelism", threshold)) > + *threshold = DEFAULT_THRESHOLD_FOR_PARALLELISM; > +} > + > void init_parallel_checkout(void) > { > if (parallel_checkout.status != PC_UNINITIALIZED) > @@ -554,11 +569,9 @@ static void write_items_sequentially(struct checkout *state) > write_pc_item(¶llel_checkout.items[i], state); > } > > -#define DEFAULT_NUM_WORKERS 2 As I say above, it doesn't look like a good idea to have removed this. > -int run_parallel_checkout(struct checkout *state) > +int run_parallel_checkout(struct checkout *state, int num_workers, int threshold) > { > - int ret, num_workers = DEFAULT_NUM_WORKERS; > + int ret; > > if (parallel_checkout.status != PC_ACCEPTING_ENTRIES) > BUG("cannot run parallel checkout: uninitialized or already running"); > @@ -568,7 +581,7 @@ int run_parallel_checkout(struct checkout *state) > if (parallel_checkout.nr < num_workers) > num_workers = parallel_checkout.nr; > > - if (num_workers <= 1) { > + if (num_workers <= 1 || parallel_checkout.nr < threshold) { Here we check the number of workers... > write_items_sequentially(state); > } else { > struct pc_worker *workers = setup_workers(state, num_workers); [...] > @@ -480,7 +483,8 @@ static int check_updates(struct unpack_trees_options *o, > } > } > stop_progress(&progress); > - errs |= run_parallel_checkout(&state); > + if (pc_workers > 1) ...but the number of workers was already checked here. > + errs |= run_parallel_checkout(&state, pc_workers, pc_threshold); > errs |= finish_delayed_checkout(&state, NULL); > git_attr_set_direction(GIT_ATTR_CHECKIN);
On Wed, Mar 31, 2021 at 1:33 AM Christian Couder <christian.couder@gmail.com> wrote: > > On Wed, Mar 17, 2021 at 10:12 PM Matheus Tavares > <matheus.bernardino@usp.br> wrote: > > > The above benchmarks show that parallel checkout is most effective on > > repositories located on an SSD or over a distributed file system. For > > local file systems on spinning disks, and/or older machines, the > > parallelism does not always bring a good performance. For this reason, > > the default value for checkout.workers is one, a.k.a. sequential > > checkout. > > I wonder how many people are still using HDD, and how many will still > use them in a few years. > > I think having 1 as the default value for checkout.workers might be > good for a while for backward compatibility and stability, while > people who are interested can test parallel checkout on different > environments. But we might want, in a few releases, after some bugs, > if any, have been fixed, to use a default, maybe 10, that will provide > significant speedup for most people, and will justify the added > complexity, especially as your numbers still show a small speedup for > HDD when using 10. Yeah, I agree that it would be nice to have a better default in the near future. Unfortunately, on some other HDD machines that I tested, parallel checkout was slower than the sequential version. So I think it may not be possible to enable parallelism by default now. Nevertheless, I was also experimenting with some ideas to auto-detect if parallelism is efficient in a given environment/repo and auto-enable it if so. One interesting possibility suggested by Ævar [1] was to implement this as a git maintenance task, which could periodically probe the system and tune the checkout settings appropriately. [1]: https://lore.kernel.org/git/87y2ixpvos.fsf@evledraar.gmail.com/ > > @@ -23,6 +25,19 @@ enum pc_status parallel_checkout_status(void) > > return parallel_checkout.status; > > } > > > > +#define DEFAULT_THRESHOLD_FOR_PARALLELISM 100 > > Using a "static const int" might be a bit better. Ok, I will change that. > > +void get_parallel_checkout_configs(int *num_workers, int *threshold) > > +{ > > + if (git_config_get_int("checkout.workers", num_workers)) > > + *num_workers = 1; > > I think it's better when an important default value like this "1" is > made more visible using a "static const int" or a "#define". Will do. > > @@ -568,7 +581,7 @@ int run_parallel_checkout(struct checkout *state) > > if (parallel_checkout.nr < num_workers) > > num_workers = parallel_checkout.nr; > > > > - if (num_workers <= 1) { > > + if (num_workers <= 1 || parallel_checkout.nr < threshold) { > > Here we check the number of workers... > > > write_items_sequentially(state); > > } else { > > struct pc_worker *workers = setup_workers(state, num_workers); > > [...] > > > @@ -480,7 +483,8 @@ static int check_updates(struct unpack_trees_options *o, > > } > > } > > stop_progress(&progress); > > - errs |= run_parallel_checkout(&state); > > + if (pc_workers > 1) > > + errs |= run_parallel_checkout(&state, pc_workers, pc_threshold); > ...but the number of workers was already checked here. The re-checking at run_parallel_checkout() is important because `num_workers` might actually become <= 1 after the above check at check_updates(). This happens when there aren't enough enqueued entries for 2+ workers, so we fall back to sequential checkout. It also kind of works as a safe-mechanism for the case where the run_parallel_checkout() caller forgot to check if the user actually wants parallelism before calling the function.
diff --git a/Documentation/config/checkout.txt b/Documentation/config/checkout.txt index 2cddf7b4b4..bfbca90f0e 100644 --- a/Documentation/config/checkout.txt +++ b/Documentation/config/checkout.txt @@ -21,3 +21,24 @@ checkout.guess:: Provides the default value for the `--guess` or `--no-guess` option in `git checkout` and `git switch`. See linkgit:git-switch[1] and linkgit:git-checkout[1]. + +checkout.workers:: + The number of parallel workers to use when updating the working tree. + The default is one, i.e. sequential execution. If set to a value less + than one, Git will use as many workers as the number of logical cores + available. This setting and `checkout.thresholdForParallelism` affect + all commands that perform checkout. E.g. checkout, clone, reset, + sparse-checkout, etc. ++ +Note: parallel checkout usually delivers better performance for repositories +located on SSDs or over NFS. For repositories on spinning disks and/or machines +with a small number of cores, the default sequential checkout often performs +better. The size and compression level of a repository might also influence how +well the parallel version performs. + +checkout.thresholdForParallelism:: + When running parallel checkout with a small number of files, the cost + of subprocess spawning and inter-process communication might outweigh + the parallelization gains. This setting allows to define the minimum + number of files for which parallel checkout should be attempted. The + default is 100. diff --git a/parallel-checkout.c b/parallel-checkout.c index df447aa3a6..92f3872653 100644 --- a/parallel-checkout.c +++ b/parallel-checkout.c @@ -1,9 +1,11 @@ #include "cache.h" +#include "config.h" #include "entry.h" #include "parallel-checkout.h" #include "pkt-line.h" #include "run-command.h" #include "streaming.h" +#include "thread-utils.h" struct pc_worker { struct child_process cp; @@ -23,6 +25,19 @@ enum pc_status parallel_checkout_status(void) return parallel_checkout.status; } +#define DEFAULT_THRESHOLD_FOR_PARALLELISM 100 + +void get_parallel_checkout_configs(int *num_workers, int *threshold) +{ + if (git_config_get_int("checkout.workers", num_workers)) + *num_workers = 1; + else if (*num_workers < 1) + *num_workers = online_cpus(); + + if (git_config_get_int("checkout.thresholdForParallelism", threshold)) + *threshold = DEFAULT_THRESHOLD_FOR_PARALLELISM; +} + void init_parallel_checkout(void) { if (parallel_checkout.status != PC_UNINITIALIZED) @@ -554,11 +569,9 @@ static void write_items_sequentially(struct checkout *state) write_pc_item(¶llel_checkout.items[i], state); } -#define DEFAULT_NUM_WORKERS 2 - -int run_parallel_checkout(struct checkout *state) +int run_parallel_checkout(struct checkout *state, int num_workers, int threshold) { - int ret, num_workers = DEFAULT_NUM_WORKERS; + int ret; if (parallel_checkout.status != PC_ACCEPTING_ENTRIES) BUG("cannot run parallel checkout: uninitialized or already running"); @@ -568,7 +581,7 @@ int run_parallel_checkout(struct checkout *state) if (parallel_checkout.nr < num_workers) num_workers = parallel_checkout.nr; - if (num_workers <= 1) { + if (num_workers <= 1 || parallel_checkout.nr < threshold) { write_items_sequentially(state); } else { struct pc_worker *workers = setup_workers(state, num_workers); diff --git a/parallel-checkout.h b/parallel-checkout.h index 35e5e69a96..26f61ed2ac 100644 --- a/parallel-checkout.h +++ b/parallel-checkout.h @@ -17,6 +17,7 @@ enum pc_status { }; enum pc_status parallel_checkout_status(void); +void get_parallel_checkout_configs(int *num_workers, int *threshold); /* * Put parallel checkout into the PC_ACCEPTING_ENTRIES state. Should be used @@ -31,8 +32,12 @@ void init_parallel_checkout(void); */ int enqueue_checkout(struct cache_entry *ce, struct conv_attrs *ca); -/* Write all the queued entries, returning 0 on success.*/ -int run_parallel_checkout(struct checkout *state); +/* + * Write all the queued entries, returning 0 on success. If the number of + * entries is smaller than the specified threshold, the operation is performed + * sequentially. + */ +int run_parallel_checkout(struct checkout *state, int num_workers, int threshold); /**************************************************************** * Interface with checkout--helper diff --git a/unpack-trees.c b/unpack-trees.c index b9548de96a..8bc5061487 100644 --- a/unpack-trees.c +++ b/unpack-trees.c @@ -399,7 +399,7 @@ static int check_updates(struct unpack_trees_options *o, int errs = 0; struct progress *progress; struct checkout state = CHECKOUT_INIT; - int i; + int i, pc_workers, pc_threshold; trace_performance_enter(); state.force = 1; @@ -465,8 +465,11 @@ static int check_updates(struct unpack_trees_options *o, oid_array_clear(&to_fetch); } + get_parallel_checkout_configs(&pc_workers, &pc_threshold); + enable_delayed_checkout(&state); - init_parallel_checkout(); + if (pc_workers > 1) + init_parallel_checkout(); for (i = 0; i < index->cache_nr; i++) { struct cache_entry *ce = index->cache[i]; @@ -480,7 +483,8 @@ static int check_updates(struct unpack_trees_options *o, } } stop_progress(&progress); - errs |= run_parallel_checkout(&state); + if (pc_workers > 1) + errs |= run_parallel_checkout(&state, pc_workers, pc_threshold); errs |= finish_delayed_checkout(&state, NULL); git_attr_set_direction(GIT_ATTR_CHECKIN);
Make parallel checkout configurable by introducing two new settings: checkout.workers and checkout.thresholdForParallelism. The first defines the number of workers (where one means sequential checkout), and the second defines the minimum number of entries to attempt parallel checkout. To decide the default value for checkout.workers, the parallel version was benchmarked during three operations in the linux repo, with cold cache: cloning v5.8, checking out v5.8 from v2.6.15 (checkout I) and checking out v5.8 from v5.7 (checkout II). The four tables below show the mean run times and standard deviations for 5 runs in: a local file system on SSD, a local file system on HDD, a Linux NFS server, and Amazon EFS (all on Linux). Each parallel checkout test was executed with the number of workers that brings the best overall results in that environment. Local SSD: Sequential 10 workers Speedup Clone 8.805 s ± 0.043 s 3.564 s ± 0.041 s 2.47 ± 0.03 Checkout I 9.678 s ± 0.057 s 4.486 s ± 0.050 s 2.16 ± 0.03 Checkout II 5.034 s ± 0.072 s 3.021 s ± 0.038 s 1.67 ± 0.03 Local HDD: Sequential 10 workers Speedup Clone 32.288 s ± 0.580 s 30.724 s ± 0.522 s 1.05 ± 0.03 Checkout I 54.172 s ± 7.119 s 54.429 s ± 6.738 s 1.00 ± 0.18 Checkout II 40.465 s ± 2.402 s 38.682 s ± 1.365 s 1.05 ± 0.07 Linux NFS server (v4.1, on EBS, single availability zone): Sequential 32 workers Speedup Clone 240.368 s ± 6.347 s 57.349 s ± 0.870 s 4.19 ± 0.13 Checkout I 242.862 s ± 2.215 s 58.700 s ± 0.904 s 4.14 ± 0.07 Checkout II 65.751 s ± 1.577 s 23.820 s ± 0.407 s 2.76 ± 0.08 EFS (v4.1, replicated over multiple availability zones): Sequential 32 workers Speedup Clone 922.321 s ± 2.274 s 210.453 s ± 3.412 s 4.38 ± 0.07 Checkout I 1011.300 s ± 7.346 s 297.828 s ± 0.964 s 3.40 ± 0.03 Checkout II 294.104 s ± 1.836 s 126.017 s ± 1.190 s 2.33 ± 0.03 The above benchmarks show that parallel checkout is most effective on repositories located on an SSD or over a distributed file system. For local file systems on spinning disks, and/or older machines, the parallelism does not always bring a good performance. For this reason, the default value for checkout.workers is one, a.k.a. sequential checkout. To decide the default value for checkout.thresholdForParallelism, another benchmark was executed in the "Local SSD" setup, where parallel checkout showed to be beneficial. This time, we compared the runtime of a `git checkout -f`, with and without parallelism, after randomly removing an increasing number of files from the Linux working tree. The "sequential fallback" column bellow corresponds to the executions where checkout.workers was 10 but checkout.thresholdForParallelism was equal to the number of to-be-updated files plus one (so that we end up writing sequentially). Each test case was sampled 15 times, and each sample had a randomly different set of files removed. Here are the results: sequential fallback 10 workers speedup 10 files 772.3 ms ± 12.6 ms 769.0 ms ± 13.6 ms 1.00 ± 0.02 20 files 780.5 ms ± 15.8 ms 775.2 ms ± 9.2 ms 1.01 ± 0.02 50 files 806.2 ms ± 13.8 ms 767.4 ms ± 8.5 ms 1.05 ± 0.02 100 files 833.7 ms ± 21.4 ms 750.5 ms ± 16.8 ms 1.11 ± 0.04 200 files 897.6 ms ± 30.9 ms 730.5 ms ± 14.7 ms 1.23 ± 0.05 500 files 1035.4 ms ± 48.0 ms 677.1 ms ± 22.3 ms 1.53 ± 0.09 1000 files 1244.6 ms ± 35.6 ms 654.0 ms ± 38.3 ms 1.90 ± 0.12 2000 files 1488.8 ms ± 53.4 ms 658.8 ms ± 23.8 ms 2.26 ± 0.12 From the above numbers, 100 files seems to be a reasonable default value for the threshold setting. Note: Up to 1000 files, we observe a drop in the execution time of the parallel code with an increase in the number of files. This is a rather odd behavior, but it was observed in multiple repetitions. Above 1000 files, the execution time increases according to the number of files, as one would expect. About the test environments: Local SSD tests were executed on an i7-7700HQ (4 cores with hyper-threading) running Manjaro Linux. Local HDD tests were executed on an Intel(R) Xeon(R) E3-1230 (also 4 cores with hyper-threading), HDD Seagate Barracuda 7200.14 SATA 3.1, running Debian. NFS and EFS tests were executed on an Amazon EC2 c5n.xlarge instance, with 4 vCPUs. The Linux NFS server was running on a m6g.large instance with 2 vCPUSs and a 1 TB EBS GP2 volume. Before each timing, the linux repository was removed (or checked out back to its previous state), and `sync && sysctl vm.drop_caches=3` was executed. Co-authored-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> --- Documentation/config/checkout.txt | 21 +++++++++++++++++++++ parallel-checkout.c | 23 ++++++++++++++++++----- parallel-checkout.h | 9 +++++++-- unpack-trees.c | 10 +++++++--- 4 files changed, 53 insertions(+), 10 deletions(-)