From patchwork Thu Jun 20 16:11:13 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13705794 Received: from mail-wm1-f53.google.com (mail-wm1-f53.google.com [209.85.128.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9F8BE1B14F0 for ; Thu, 20 Jun 2024 16:11:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.53 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899883; cv=none; b=ebywRND+IuPtWusfLTk3PfDskOYr5ldx9FMd8ZtUsRZSBMhagSGpZWKNQOCu1a1ZtHoPBYfV1Ae83IRqqf0BWhc0LLNN2Mml/6lHDKhemsjMHif6K+ew9GLVjvOY2o/WiJl33iAzOmoaxUEBK/EoxKcsP0jif2nbJKy0GQCDeUg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899883; c=relaxed/simple; bh=i1PM/Hf85bQSJIJGyLj6DGXx0mEXkW4barRLG4ILABY=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=DIULqL6aFMjul4s/Briy1cb/RctHPnYv9mFrKXGMwgzaDycD7Ws189C/KQzNgqnJTkGZJHWrVea3zzDYIinga9zgPWs+T+8ivtQzmKpChGEbCnwARPdf1LnnPO9mhycF7+0UI5Hmuun3HCOFfEbz4a7FO3DikLLrnNLf7MsrxNc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=YWCPpN29; arc=none smtp.client-ip=209.85.128.53 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="YWCPpN29" Received: by mail-wm1-f53.google.com with SMTP id 5b1f17b1804b1-421b9068274so10769025e9.1 for ; Thu, 20 Jun 2024 09:11:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1718899880; x=1719504680; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=r27ULTCEwvFugeoXKm1llVa/zbQIYdQH1k6MV9PCSo0=; b=YWCPpN29zZOZR5l6XjfHwVA2DBzMph4/nl9cjMd08DGEHlFwTo8Su6BKG6J3un/Ivo iOxGY4FwPWc6x6SPCgIQ9FaHD3TBKAMa5FperRR5ZMUCWWMCxgAsLOsQunX4vDuTj2+f kMQPFeMEXw3ff6TL4ESrOHgeSNqmsIFF8Shd3yxpBd6aNV0HymAX3nIGdwqr6yvTlVSN wfRTof9GTLKmF4W3zILTgKP5pWRHvmaPKHpSmhpqrUe7VtWjcX3Pqol8i+c/7RxUHoAZ /G7OWNNOYc7XKHaaxTkMah/tIwYmnvpjdIel7fs31huc27LS5PytVVOVtJxLJ+NzDQS8 C2TA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718899880; x=1719504680; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=r27ULTCEwvFugeoXKm1llVa/zbQIYdQH1k6MV9PCSo0=; b=igG6n738uwWMISW3T0qLXDP+rOrUvnw8VH8tEw2tSee+nhq5tDFHS0XHHzP3oRTwjG rVBvIKwAXizW9gOgrf2NHkJ0VIYzk5NyktKsTDbqDEo2s0xZRxm7VlKMbyTu7HbFAU0J GHHw7uIC0ktvq3cdPFIit8HbgA6x9P0Lb7p/xNoxxsolCDDgwnu0KJWSnDTZbz1vdwKM 7pRUHc6YUvmfBRi4DYyw1GgEMGTriHnqq1baYkaxlsWgFofUwzDYxWS6QBHY0lqxRvFO 9GwWDwwjKOdUbTl1nRxwa3lzaxD4SRbgDs4Z2A0BFvBLpSldnVZ6bZ3rytAPrap71j11 Js4A== X-Gm-Message-State: AOJu0Yx+tr2FNFmLDMVZiTSdpn8ViSXIWDDX8A8+fSLmc6lMxz75u8sy OTT8rw2Ug5lNPQkn2ufxy5jgPSy/rAGQIJAs7d5v7qNx+0FLYy8DJmHx/Q== X-Google-Smtp-Source: AGHT+IFjtxR+5O4wQHseuPR1IEF+bD8JJZm5Ce4FJmZ8lxSrP+7LZqA6za6OjTf5dBzfRpBxL0+Rzw== X-Received: by 2002:a05:600c:257:b0:421:8028:a507 with SMTP id 5b1f17b1804b1-424751842f1mr40971895e9.18.1718899879672; Thu, 20 Jun 2024 09:11:19 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-360750ad2c1sm20072261f8f.51.2024.06.20.09.11.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Jun 2024 09:11:19 -0700 (PDT) Message-Id: In-Reply-To: References: Date: Thu, 20 Jun 2024 16:11:13 +0000 Subject: [PATCH 1/5] sparse-index: refactor skip worktree retry logic Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree_from_present_files() method was introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to help cases where the sparse index is enabled but some paths outside of the sparse-checkout cone also exist on disk. This operation can be slow as it needs to check path existence in a way that is not stored in the collapsed index, so caching was introduced in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14). If users are having trouble with the performance of this operation and don't care about paths outside of the sparse-checkout cone, they can disable them using the sparse.expectFilesOutsideOfPatterns config option introduced in ecc7c8841d (repo_read_index: add config to expect files outside sparse patterns, 2022-02-25). Even with that caching, it was noticed that this could take a long time to execute. 89aaab11a3 (index: add trace2 region for clear skip worktree, 2022-11-03) introduced trace2 regions to measure this time. Further, the way the loop repeats itself was slightly confusing and prone to breakage, so a BUG() statement was added in 8c7abdc596 (index: raise a bug if the index is materialised more than once, 2022-11-03) to be sure that the second run of the loop does not hit any sparse trees. One thing that can be confusing about the current setup is that the trace2 regions nest and it is not clear that a second loop is running after the index is expanded. Here is an example of what the regions look like in a typical case: | region_enter | ... | label:clear_skip_worktree_from_present_files | region_enter | ... | ..label:update | region_leave | ... | ..label:update | region_enter | ... | ..label:ensure_full_index | region_enter | ... | ....label:update | region_leave | ... | ....label:update | region_leave | ... | ..label:ensure_full_index | data | ... | ..sparse_path_count:1 | data | ... | ..sparse_path_count_full:269538 | region_leave | ... | label:clear_skip_worktree_from_present_files One thing that is particularly difficult to understand about these regions is that most of the time is spent between the close of the ensure_full_index region and the reporting of the end data. This is because of the restart of the loop being within the same region as the first iteration of the loop. This change refactors the method into two separate methods that are traced separately. This will be more important later when we change other features of the methods, but for now the only functional change is the difference in the structure of the trace regions. After this change, the same telemetry section is split into three distinct chunks: | region_enter | ... | label:clear_skip_worktree_from_present_files_sparse | data | ... | ..sparse_path_count:1 | region_leave | ... | label:clear_skip_worktree_from_present_files_sparse | region_enter | ... | label:update | region_leave | ... | label:update | region_enter | ... | label:ensure_full_index | region_enter | ... | ..label:update | region_leave | ... | ..label:update | region_leave | ... | label:ensure_full_index | region_enter | ... | label:clear_skip_worktree_from_present_files_full | data | ... | ..full_path_count:269538 | region_leave | ... | label:clear_skip_worktree_from_present_files_full Here, we see the sparse loop terminating early with its first sparse path being a sparse directory containing a file. Then, that loop's region terminates before ensure_full_index begins (in this case, the cache-tree must also be computed). Then, _after_ the index is expanded, the full loop begins with its own region. Signed-off-by: Derrick Stolee --- sparse-index.c | 77 ++++++++++++++++++++++++++++++++++---------------- 1 file changed, 53 insertions(+), 24 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index e48e40cae71..e0457c87fff 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -486,49 +486,78 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, return 0; } -void clear_skip_worktree_from_present_files(struct index_state *istate) +static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate) { const char *last_dirname = NULL; size_t dir_len = 0; int dir_found = 1; - int i; - int path_count[2] = {0, 0}; - int restarted = 0; + int path_count = 0; + int to_restart = 0; - if (!core_apply_sparse_checkout || - sparse_expect_files_outside_of_patterns) - return; - - trace2_region_enter("index", "clear_skip_worktree_from_present_files", + trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); -restart: - for (i = 0; i < istate->cache_nr; i++) { + for (int i = 0; i < istate->cache_nr; i++) { struct cache_entry *ce = istate->cache[i]; if (ce_skip_worktree(ce)) { - path_count[restarted]++; + path_count++; if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { if (S_ISSPARSEDIR(ce->ce_mode)) { - if (restarted) - BUG("ensure-full-index did not fully flatten?"); - ensure_full_index(istate); - restarted = 1; - goto restart; + to_restart = 1; + break; } ce->ce_flags &= ~CE_SKIP_WORKTREE; } } } - if (path_count[0]) - trace2_data_intmax("index", istate->repo, - "sparse_path_count", path_count[0]); - if (restarted) - trace2_data_intmax("index", istate->repo, - "sparse_path_count_full", path_count[1]); - trace2_region_leave("index", "clear_skip_worktree_from_present_files", + trace2_data_intmax("index", istate->repo, + "sparse_path_count", path_count); + trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", + istate->repo); + return to_restart; +} + +static void clear_skip_worktree_from_present_files_full(struct index_state *istate) +{ + const char *last_dirname = NULL; + size_t dir_len = 0; + int dir_found = 1; + + int path_count = 0; + + trace2_region_enter("index", "clear_skip_worktree_from_present_files_full", istate->repo); + for (int i = 0; i < istate->cache_nr; i++) { + struct cache_entry *ce = istate->cache[i]; + + if (S_ISSPARSEDIR(ce->ce_mode)) + BUG("ensure-full-index did not fully flatten?"); + + if (ce_skip_worktree(ce)) { + path_count++; + if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) + ce->ce_flags &= ~CE_SKIP_WORKTREE; + } + } + + trace2_data_intmax("index", istate->repo, + "full_path_count", path_count); + trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", + istate->repo); +} + +void clear_skip_worktree_from_present_files(struct index_state *istate) +{ + if (!core_apply_sparse_checkout || + sparse_expect_files_outside_of_patterns) + return; + + if (clear_skip_worktree_from_present_files_sparse(istate)) { + ensure_full_index(istate); + clear_skip_worktree_from_present_files_full(istate); + } } /* From patchwork Thu Jun 20 16:11:14 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13705795 Received: from mail-wm1-f43.google.com (mail-wm1-f43.google.com [209.85.128.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3BE971B1504 for ; Thu, 20 Jun 2024 16:11:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.43 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899885; cv=none; b=FGRiWRVjfFw8v3sdCwLeJJRHJrrH3pszdePVp9Dxt8KSo1dm7z8KYMpwIwg+mxBTqsdSnREUrnXgMaM+vITb/7FQjvzkd2qYmx28hwSvqP+7kHVR7+FQ/E/QzbLV6N4UT2MH6M/QNvtqL8H55oNPfaFQvWjRS2WNvSs9/tycjyk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899885; c=relaxed/simple; bh=07NigRt+xJMRU0aPtLAADx+e6Ez/EWmQhryI9FrwenU=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=gbFLJ3a5wSi4R/WGVZpoxn8Y7KTEpxFKugBbuQrK+vOb761TTRvFtMjIIAOBjYiSMw3lFa0TFChAl7oOyznrm72tfL2ygDo7qcxnqIxcmMXvO2bohDMTlkwPYiMsx8C6QZ+58jEBuNMKvzPdujk1PKNoD+K0zozSf4tbVyMWbGo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=COo640Yt; arc=none smtp.client-ip=209.85.128.43 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="COo640Yt" Received: by mail-wm1-f43.google.com with SMTP id 5b1f17b1804b1-424720e73e0so10847325e9.0 for ; Thu, 20 Jun 2024 09:11:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1718899881; x=1719504681; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=FKX2JwFsPPiLvIMv/wmeUwRInuRMu/l+CQXu3Cj+z+o=; b=COo640YtdUt8qoZ2zzm0HcZAifOtzscL+NDD5UfttKmL9Ovurw+CESOtVjDmVCATXP KI1mD1bTww82fGHcibD6MKOIo8WzUdLbABF5Cq3dRC/8Nx1UiwhtzK58jrX0QeDEE0lK OKsIyEoqJmkHTyX+Y3HwloRrFyGkZ6hRO+dx/Evvey+5UgmZKApebxkjvq5LmRnzvrEc FP9Lq0M5uTuo414e/aBbic1lfFTg9miBSADU4kevLuRdby0VdBv35G8H4e/orOpEz+pu g6eKYTZZzOUTe1pNrDu5CqmNucSk4Ngf9i7Ditl+HriL3gKQK1XqLttP0U95/hLY/qUA lQ3Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718899881; x=1719504681; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=FKX2JwFsPPiLvIMv/wmeUwRInuRMu/l+CQXu3Cj+z+o=; b=QcD6kJwWdkdcBJHpItP4IHJ4jo4hW7Xpohtd0UPoFipY3yXkSRzw7JNIqrzbjx6IqP 0Q5gL+TqY760Ivsb9o2c55h/lQ5BXOAd4f9RSFNdOHrjSigeAftufDNgblXSDtccriV+ 8uZe9YXmPp2eMJmrhjXcfGbJSXbipE/x3lV9OEPBQ6KQiv3hl1TsPmyKQ7zY7qD2l7jl VTN/O0RqBQRhYXDJmqpOx6qjTkoVW4NxBhP092ZoiP0jx/oFulYrjDyWp2KgsmjLvDJV YFzNcI6+LVIqCTmi3m32uzL/v5sCRZKafs+3+pIsdgAvI8QPZSd4XmigW2sA7MDKSOaa 1hyA== X-Gm-Message-State: AOJu0YwLXR7dPmXRhVta/6fgbZROFSrzAfiPLgRuZ7p+/1Bk0MxbQueC 6e4v389rLcOv94rBvaDOdJ9ZFV4umaxFaKpnt+hY+y/DWrqFj0wO/6CXvA== X-Google-Smtp-Source: AGHT+IGxLPy9YTa6XApqDH037Wl9tCkjxWvcD9pqG4mT2Are4uJFDSI8sWNDgVdneegZ8wXxHYdMJA== X-Received: by 2002:a05:600c:3b87:b0:424:760d:75d3 with SMTP id 5b1f17b1804b1-424760d772cmr43243895e9.8.1718899880787; Thu, 20 Jun 2024 09:11:20 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-364dac4e70dsm1909605f8f.5.2024.06.20.09.11.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Jun 2024 09:11:20 -0700 (PDT) Message-Id: <7c3b545ee5ea3a0e6686594afe582fa1a19929f6.1718899877.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Thu, 20 Jun 2024 16:11:14 +0000 Subject: [PATCH 2/5] sparse-index: refactor path_found() Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee In advance of changing the behavior of path_found(), take all of the intermediate data values and group them into a single struct. This simplifies the method prototype as well as the initialization. Future changes can be made directly to the struct and method without changing the callers with this approach. Note that the clear_path_found_data() method is currently empty, as there is nothing to free. However, this will change in the future, so place the method and its callers for now. Signed-off-by: Derrick Stolee --- sparse-index.c | 45 +++++++++++++++++++++++++++++---------------- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index e0457c87fff..de6e727f5c1 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -439,8 +439,22 @@ void ensure_correct_sparsity(struct index_state *istate) ensure_full_index(istate); } -static int path_found(const char *path, const char **dirname, size_t *dir_len, - int *dir_found) +struct path_found_data { + const char *dirname; + size_t dir_len; + int dir_found; +}; + +#define PATH_FOUND_DATA_INIT { \ + .dir_found = 1 \ +} + +static void clear_path_found_data(struct path_found_data *data) +{ + return; +} + +static int path_found(const char *path, struct path_found_data *data) { struct stat st; char *newdir; @@ -450,7 +464,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, * If dirname corresponds to a directory that doesn't exist, and this * path starts with dirname, then path can't exist. */ - if (!*dir_found && !memcmp(path, *dirname, *dir_len)) + if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len)) return 0; /* @@ -472,15 +486,16 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, * If path starts with directory (which we already lstat'ed and found), * then no need to lstat parent directory again. */ - if (*dir_found && *dirname && memcmp(path, *dirname, *dir_len)) + if (data->dir_found && data->dirname && + memcmp(path, data->dirname, data->dir_len)) return 0; /* Free previous dirname, and cache path's dirname */ - *dirname = path; - *dir_len = newdir - path + 1; + data->dirname = path; + data->dir_len = newdir - path + 1; - tmp = xstrndup(path, *dir_len); - *dir_found = !lstat(tmp, &st); + tmp = xstrndup(path, data->dir_len); + data->dir_found = !lstat(tmp, &st); free(tmp); return 0; @@ -488,9 +503,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate) { - const char *last_dirname = NULL; - size_t dir_len = 0; - int dir_found = 1; + struct path_found_data data = PATH_FOUND_DATA_INIT; int path_count = 0; int to_restart = 0; @@ -502,7 +515,7 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist if (ce_skip_worktree(ce)) { path_count++; - if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { + if (path_found(ce->name, &data)) { if (S_ISSPARSEDIR(ce->ce_mode)) { to_restart = 1; break; @@ -516,14 +529,13 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist "sparse_path_count", path_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); + clear_path_found_data(&data); return to_restart; } static void clear_skip_worktree_from_present_files_full(struct index_state *istate) { - const char *last_dirname = NULL; - size_t dir_len = 0; - int dir_found = 1; + struct path_found_data data = PATH_FOUND_DATA_INIT; int path_count = 0; @@ -537,7 +549,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista if (ce_skip_worktree(ce)) { path_count++; - if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) + if (path_found(ce->name, &data)) ce->ce_flags &= ~CE_SKIP_WORKTREE; } } @@ -546,6 +558,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista "full_path_count", path_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", istate->repo); + clear_path_found_data(&data); } void clear_skip_worktree_from_present_files(struct index_state *istate) From patchwork Thu Jun 20 16:11:15 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13705796 Received: from mail-lj1-f171.google.com (mail-lj1-f171.google.com [209.85.208.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3DFBA1B29A9 for ; Thu, 20 Jun 2024 16:11:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899885; cv=none; b=Au+eV+nzLLBSkubSUU1Wtd8jnUx6xOFLybFIABM5AsHFMx8XL4sTyzgqD80ttJh6Jz0cpDU/4Df2IR2rAmH9sGXoQzPPCiZ5N0HHTxIdthgNzr6sOwj34gsPRl9u2ltw2C2xCiSMXr8Fv3kFE5rkkf+X3/3IyyS7M5xSyQQtOUM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899885; c=relaxed/simple; bh=mKI90b/YGFpefVFGNK/0fVqAD9pGjE9PJw08JNb6Gps=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=a47+9UEvep+2mzgOVPdTpvEg8M9PoK6r5daozt4ZKrwXddzJHSkxiA70bWTJdVHstU4Acl0A0UWb248aqOiZhSTdC1T2qn9+IGRyy8wyznZt+Iu6rnfscx59LBAyIbReQNDfRoisGFaVMzWwOVLqNS82nNv754yb0G9Ul7heup4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=lt8Nukdm; arc=none smtp.client-ip=209.85.208.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="lt8Nukdm" Received: by mail-lj1-f171.google.com with SMTP id 38308e7fff4ca-2ebe0a81dc8so12234491fa.2 for ; Thu, 20 Jun 2024 09:11:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1718899882; x=1719504682; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=xO1bF8/D+K6VmcsbNZerxp108u1S6w5egvhSyaxchUI=; b=lt8Nukdm/VyejH/nls+g12CT4K1AHNqdXskVd2zaKYhftBIg7v+MaDL9R/PVIPP0C6 oBnY/IMjRFfPb9nHe+7PaVLiYfWBYd9f+kgaOvQk3I3VehkQDYt3LlbKZ1n1vG2vCOOD XBy7bDALQdsExQSUg2b2XqTxCzffPL8r6juxS83KWCHUfqh1z5dSHURwqaivRfk3QcCb Gg4PrTdi2X30q5GHy+3WpJhEGkwFypNs+QjqGV0Ijm9J9H7W8VFm1zNqOxRJH7230Xly XrCVyt27SUxP0fWHqsGhjIkA0IBb1aFHKUrtGQOvkB9CJzI9KsDGuvNr7XVTDb2lKGQU 7Qkg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718899882; x=1719504682; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=xO1bF8/D+K6VmcsbNZerxp108u1S6w5egvhSyaxchUI=; b=ov3RPnTXix8Vtw7A9YS4v9w8C6BypFlZZ7EgA6+xKILsWxsXNYz7BQG99SNZQ9jPU7 Y7ePmOq5Bh2l6bfKt8C7HUMzKRdiUDmh8JUEqPI+SI3M4Netdltiseq8CkwKGksRWhOv N/UF7DBn8cs3eA+3c73vaxJo9qSylRmjXRl6UYM006u0s+f7hsOxA5see2z4v4t4SFzN aMVzkOVScouqv5srdo24Cimq8YCaV3dHne8HBQbRRZUfAR4AIDkwYgJB54kQcwLR2aMR rXeEcMOMMMRqtwjAD4WsMxcVkHdBPPd6cRjkU7C9k7qN6YrCGMH81J8t7lrFE5SHiOX0 sdMQ== X-Gm-Message-State: AOJu0YwDXpakOmLKN+o4cdxa5Fo2jQTisSQj7oflyd/77QjZynyJtUkP RALvZvWZPNlGjZ2LgasF/s2KX38V8u3XR+lEptM7O25y25IB8DiwTNL5zw== X-Google-Smtp-Source: AGHT+IG56n83uSNh5lOQBXb06vifRPGD7+uASq6D+3Vnt6BJtxTdRg8kld8WE5MGLGilqYECTCyoHQ== X-Received: by 2002:a2e:9612:0:b0:2eb:fda7:e366 with SMTP id 38308e7fff4ca-2ec3cee1278mr44798021fa.39.1718899881759; Thu, 20 Jun 2024 09:11:21 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4247d1f6da9sm30398715e9.44.2024.06.20.09.11.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Jun 2024 09:11:21 -0700 (PDT) Message-Id: <217594ffb103969c1a6debc07a6c7f72f6ee4749.1718899877.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Thu, 20 Jun 2024 16:11:15 +0000 Subject: [PATCH 3/5] sparse-index: use strbuf in path_found() Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The path_found() method previously reused strings from the cache entries the calling methods were using. This prevents string manipulation in place and causes some odd reallocation before the final lstat() call in the method. Refactor the method to use strbufs and copy the path into the strbuf, but also only the parent directory and not the whole path. This looks like extra copying when assigning the path to the strbuf, but we save an allocation by dropping the 'tmp' string, and we are "reusing" the copy from 'tmp' to put the data in the strbuf. Signed-off-by: Derrick Stolee --- sparse-index.c | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index de6e727f5c1..fec4f393360 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -440,31 +440,30 @@ void ensure_correct_sparsity(struct index_state *istate) } struct path_found_data { - const char *dirname; - size_t dir_len; + struct strbuf dir; int dir_found; }; #define PATH_FOUND_DATA_INIT { \ + .dir = STRBUF_INIT, \ .dir_found = 1 \ } static void clear_path_found_data(struct path_found_data *data) { - return; + strbuf_release(&data->dir); } static int path_found(const char *path, struct path_found_data *data) { struct stat st; char *newdir; - char *tmp; /* * If dirname corresponds to a directory that doesn't exist, and this * path starts with dirname, then path can't exist. */ - if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len)) + if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len)) return 0; /* @@ -486,17 +485,15 @@ static int path_found(const char *path, struct path_found_data *data) * If path starts with directory (which we already lstat'ed and found), * then no need to lstat parent directory again. */ - if (data->dir_found && data->dirname && - memcmp(path, data->dirname, data->dir_len)) + if (data->dir_found && data->dir.buf && + memcmp(path, data->dir.buf, data->dir.len)) return 0; /* Free previous dirname, and cache path's dirname */ - data->dirname = path; - data->dir_len = newdir - path + 1; + strbuf_reset(&data->dir); + strbuf_add(&data->dir, path, newdir - path + 1); - tmp = xstrndup(path, data->dir_len); - data->dir_found = !lstat(tmp, &st); - free(tmp); + data->dir_found = !lstat(data->dir.buf, &st); return 0; } From patchwork Thu Jun 20 16:11:16 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13705798 Received: from mail-wm1-f45.google.com (mail-wm1-f45.google.com [209.85.128.45]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9FDEE1B29C2 for ; Thu, 20 Jun 2024 16:11:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.45 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899888; cv=none; b=R/rkF/YCdURvbbqI5xNH1c9OjCtQk33sPBBxLE9aAcBUUEiBuim2dw65ayFop/oOyWzc2y4dlHZD/kENf9jazgF68nCIqpjDxcxGpONfZijmIdyoduq/jTDP36K4KIrZ8BKWjVc5jrGY1C3d6z9fjhEW1QVBr6lA6RfmmEQJu3k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899888; c=relaxed/simple; bh=v7pEIAOQacxrKcSmfYu3MMPjIoqKiKMlpNo3w47ds7I=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=kDWrt2eDp8F4K7GdZ+lvZonj9ZbHccpj2PZ4Zht6FRVXJjACzX1AGcGSKkYDfMd+LlktlIZijL4Luajp2RcbrepzuGNN/YeP16kfKkU8GN1Z5hN50ys2UqHIsd7mrJWz8EMHjAt6zBifTuGIUGOwmQ6ucu/q3rNgRmbQ5t21eOM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=dHsfW5ev; arc=none smtp.client-ip=209.85.128.45 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="dHsfW5ev" Received: by mail-wm1-f45.google.com with SMTP id 5b1f17b1804b1-42278f3aea4so10596765e9.1 for ; Thu, 20 Jun 2024 09:11:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1718899882; x=1719504682; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=KauGz/GdUXdGho5U7fUDm2FdMS4nlt8MegpX3RmXHxs=; b=dHsfW5evkM/gKe+MQSldyIFUuqm6ECFYNqLUd93Qj+gOuUaO0GB1bxhhcsd+HBbSZI PmOQ0JFFGHrUj2bSg9hg5W5Z2L2NyIABHX+DSd7jMgmcmvr3s1myvTO7IM4MH19X8nWd E3d+/S3bVLAFyEQpazH7HcO+75sIH+bj7dDhrEc1SmOyFaZ0fgO0fqiXaQIX4BwjJT+U 74eHT14sY8ZZoQcpWXnBGcGmJ/hxmTo7QYQ39D3INhHWcW9KUnMND+61Cn7X33TIXc1s HppVQT3P7An7iit4GSWIu/0IiEscilFH7TDG0kglITvSF9eoevy05MzaHvNDJvvoaFXO G6pA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718899882; x=1719504682; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=KauGz/GdUXdGho5U7fUDm2FdMS4nlt8MegpX3RmXHxs=; b=qhfZjtULVyas9NbedvjtC2iprNqw/+JU9rBAm8DCVpXrsKspfddvgPiWwDEmGL2Vbf D7YxIUT1kOb8gtZyfTCGg1+/bVukxPJuN7FLbJ7fyZPya6JrwSIF7RyWZuSCUIf4kRjZ TjYmTk0N/8ynPxYCS8lVxCSCQ+4rwnC7tq0q9EiM+20ep37RqTK1VbZUgstsjwN8DXAp HkKyh3iCUEa1FlGQLNGaTOKEUC4kpo/+3SLmcEhpmqiLUmb0FaUzvQdMp1eGHn7CCybK SikYfmwf2NugOR8kWq8TRB1Ko0NDxsMM5ZF8ppIJgC4UzURF8MzkMOdbdxcI9UNL8wsA RIzw== X-Gm-Message-State: AOJu0YxFhsBr9kaCtk6FlxA97YK4Lh7Q7OMhGnUR6YdtLOrVzitFcFL8 4WBEFSwSKrdhHNNvfs4zzJ8x+9CdwaSo4dlfIpku4o8Wk5UsxY2FsQel1g== X-Google-Smtp-Source: AGHT+IHYoIRNeLtHZRszvvtc75jZsOXs+4ixSoWqYlZW5+vL5nUtaSUFZbhfoY8Z7NAds2K+BQBXWg== X-Received: by 2002:a5d:698f:0:b0:35f:1d7a:c41c with SMTP id ffacd0b85a97d-36319a85dfemr4471260f8f.60.1718899882404; Thu, 20 Jun 2024 09:11:22 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-360940d992fsm11861245f8f.116.2024.06.20.09.11.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Jun 2024 09:11:22 -0700 (PDT) Message-Id: <88a3145e585169fde8cd7d43a435daa07eb82667.1718899877.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Thu, 20 Jun 2024 16:11:16 +0000 Subject: [PATCH 4/5] sparse-index: count lstat() calls Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree.. methods already report some statistics about how many cache entries are checked against path_found() due to having the skip-worktree bit set. However, due to path_found() performing some caching, this isn't the only information that would be helpful to report. Add a new lstat_count member to the path_found_data struct to count the number of times path_found() calls lstat(). This will be helpful to help explain performance problems in this method as well as to demonstrate future changes to the caching algorithm in a more concrete way than end-to-end timings. Signed-off-by: Derrick Stolee --- sparse-index.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/sparse-index.c b/sparse-index.c index fec4f393360..8577fa726b8 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -442,6 +442,7 @@ void ensure_correct_sparsity(struct index_state *istate) struct path_found_data { struct strbuf dir; int dir_found; + size_t lstat_count; }; #define PATH_FOUND_DATA_INIT { \ @@ -469,6 +470,7 @@ static int path_found(const char *path, struct path_found_data *data) /* * If path itself exists, return 1. */ + data->lstat_count++; if (!lstat(path, &st)) return 1; @@ -493,6 +495,7 @@ static int path_found(const char *path, struct path_found_data *data) strbuf_reset(&data->dir); strbuf_add(&data->dir, path, newdir - path + 1); + data->lstat_count++; data->dir_found = !lstat(data->dir.buf, &st); return 0; @@ -524,6 +527,8 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist trace2_data_intmax("index", istate->repo, "sparse_path_count", path_count); + trace2_data_intmax("index", istate->repo, + "sparse_lstat_count", data.lstat_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); clear_path_found_data(&data); @@ -553,6 +558,8 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista trace2_data_intmax("index", istate->repo, "full_path_count", path_count); + trace2_data_intmax("index", istate->repo, + "full_lstat_count", data.lstat_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", istate->repo); clear_path_found_data(&data); From patchwork Thu Jun 20 16:11:17 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13705797 Received: from mail-wm1-f42.google.com (mail-wm1-f42.google.com [209.85.128.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8C65B1B29B9 for ; Thu, 20 Jun 2024 16:11:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.42 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899887; cv=none; b=Z36p9lVcLIhNied9JLJcmwnk0U9lDorkfFUB7XLcVknKe6Y8cJVSL6KZBVIyc/EU4BuPozrSY46dQyDryucSP8hWhG601FsNj1+19k9tiqdNITlgnJNjI06hyNk6DRMy9+IhflnSLAQq+gCQWkI2wBAPFlpyAwh0fc9fQ/R10gg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718899887; c=relaxed/simple; bh=/SiHuettgMHg8Ks/9mWvhpAqHOKYKIYZrFACrCV40KI=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=fV/G7Oegh1xk0n6zG+sAqzoPTvSrEFEItLD9TU2PWOdrmHXIEnceoQrvp2j/WqGUrvbne+4C7ZcKLT1VJBhksRozedqzhWZOzqgeGqlmvfoDxkojmg2p4FBjl3weNUV90Dor/2u1juV7+YM6LW+BRir/SmDYbJVAXYiXozE5RLs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=XUr/UMm4; arc=none smtp.client-ip=209.85.128.42 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="XUr/UMm4" Received: by mail-wm1-f42.google.com with SMTP id 5b1f17b1804b1-42108856c33so13761305e9.1 for ; Thu, 20 Jun 2024 09:11:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1718899883; x=1719504683; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=xyEfFF4zOCU9gYDloKyPVM3FVbc/OWOLlCA9CrhoIEg=; b=XUr/UMm4TunYfPRw/AAp38zl7hUb+OlqJxSnr2UrsAIReFSjgPkCcXSPEGIW+6/aHp HDMaWzzcHOWCeTRQo1yPl3nrIFAHMy+OS9YbWwsK+f9FoppYc2LGlthkSeI+SZVhKDx5 miDG8d/Wme71io0d0Qp337zzVXQ35TiIZ9AkldZnhsLKu67v1uO4E8EHPypGSLJ0VsJ2 1KjqXhrywKa+Ilw9u+mLVxGfz4uBrOdNtiaC9f7EHON50xW2Do8tmHorZJIkZhjZSsCT vjscJS9sGYRWg6D1gwe3uYPq1WOim4eJI+PL6erjK8G2bcYqxtNxzy5u4cGi3i0cokj2 h9fg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718899883; x=1719504683; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=xyEfFF4zOCU9gYDloKyPVM3FVbc/OWOLlCA9CrhoIEg=; b=p3bZPhDwyl36yjtfFB4wi4+cgQBsceR9KUrKejMOpAAHtnHkVuSIgW2FuJufwYA6cg Y4I/tfVigructnl9LQ/7+Wy3TNbAKazWeXJnkqM+C/g4vrO5ZpX10BPVVd6raEj6v0/X uG7G38EXBgB2RSbF260ZCLc+ELfNF/WVN3I3Ow3rIqLxfMxeMfnXsbJQsElni15W1DB8 3m+zCjJnMKUuMlx79UZa0mPuMXJpFUVAUUjDnPqXeVwHZNqfXvKz2zdXWZKfPQ21GmEj /a7EQWhpFpLoVIkOuM2I7G2aLAXd0FGN4ADJaETcJc5L2gSNZ+mBQNsZ9OPweq1+sXOI PxQw== X-Gm-Message-State: AOJu0YyXBSXQyLwCBCYiOw9mI0b1cImF7bsL1oqV/kPeNuYEv5SebqkT Em2XsOOxd43XKdcq21sJJTh9A4Wr3M1z2MXcGiBZa64BXv+9sH74Zba16Q== X-Google-Smtp-Source: AGHT+IG41lJIfbZEv8KzF9+tuLzpZxZ1X9g2GftvS21dMIhRcr3NnJBiwfzXraGE7DzEGLz37i/MyQ== X-Received: by 2002:a05:600c:4506:b0:422:291:6b3e with SMTP id 5b1f17b1804b1-4246f56d2b6mr86963085e9.1.1718899883200; Thu, 20 Jun 2024 09:11:23 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-360917c264bsm12847882f8f.56.2024.06.20.09.11.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Jun 2024 09:11:22 -0700 (PDT) Message-Id: <2654fcb7142a606c5684c762ed28bb5e8d9b4712.1718899877.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Thu, 20 Jun 2024 16:11:17 +0000 Subject: [PATCH 5/5] sparse-index: improve lstat caching of sparse paths Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree_from_present_files() method was first introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to allow better interaction with the working directory in the presence of paths outside of the sparse-checkout cone. The initial implementation would lstat() every single sparse tree to see if it existed, and if one did, then the sparse index would expand and every sparse file would be checked. Since these lstat() calls were very expensive, this was improved in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14) by caching directories that do not exist. However, there are some inefficiencies in that caching mechanism. The caching mechanism stored only the parent directory as not existing, even if a higher parent directory also does not exist. This means that wasted lstat() calls would occur when the sparse files change immediate parent directories but within the same root directory that does not exist. To set up a scenario that triggers this code in an interesting way, we need a sparse-checkout in cone mode and a sparse index. To trigger the full index expansion and a call to the clear_skip_worktree_from_present_files_full() method, we need one of the sparse trees to actually exist on disk. The performance test script p2000-sparse-operations.sh takes the sample repository and copies its HEAD to several copies nested in directories of the form f/f/f where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior and also lead to some interesting cases for the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do not. This is difficult to notice when running performance tests using the Git repository (or a blow-up of the Git repository, as in p2000-sparse-operations.sh) because Git has a very shallow directory structure. This change reorganizes the caching algorithm to focus on storing both the deepest _existing_ directory and the next-level non-existing directory. By doing a little extra work on the first sparse file, we can short-circuit all of the sparse files that exist in that non-existing directory. When in a repository where the first sparse file is likely to have a much deeper path than the first non-existing directory, this can realize significant gains. The details of this algorithm require careful attention, so the new implementation of path_found() has detailed comments, including the use of a new max_common_dir_prefix() method that may be of independent interest. It's worth noting that this is not universally positive, since we are doing extra lstat() calls to establish the exact path to cache. In the blow-up of the Git repository, we can see that the lstat count _increases_ from 28 to 31. However, these numbers were already artificially low. Using an internal monorepo with over two million paths at HEAD and a typical sparse-checkout cone such that the index contains ~190,000 entries (including over two thousand sparse trees), I was able to measure these lstat counts when one sparse directory actually exists on disk: Sparse files in expanded index: 1,841,997 full_lstat_count (before): 173,259 full_lstat_count (after): 6,521 This resulted in this absolute time change, on a warm disk: Time in full loop (before): 2.527 s Time in full loop (after): 0.071 s (These times were calculated on a Windows machine, where lstat() is slower than a similar Linux machine.) Signed-off-by: Derrick Stolee --- sparse-index.c | 118 ++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 93 insertions(+), 25 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index 8577fa726b8..cccd8550dfe 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -440,14 +440,21 @@ void ensure_correct_sparsity(struct index_state *istate) } struct path_found_data { + /** + * The path stored in 'dir', if non-empty, corresponds to the most- + * recent path that we checked where: + * + * 1. The path should be a directory, according to the index. + * 2. The path does not exist. + * 3. The parent path _does_ exist. (This may be the root of the + * working directory.) + */ struct strbuf dir; - int dir_found; size_t lstat_count; }; #define PATH_FOUND_DATA_INIT { \ - .dir = STRBUF_INIT, \ - .dir_found = 1 \ + .dir = STRBUF_INIT \ } static void clear_path_found_data(struct path_found_data *data) @@ -455,50 +462,111 @@ static void clear_path_found_data(struct path_found_data *data) strbuf_release(&data->dir); } +/** + * Return the length of the largest common substring that ends in a + * slash ('/') to indicate the largest common parent directory. Returns + * zero if no common directory exists. + */ +static size_t max_common_dir_prefix(const char *path1, const char *path2) +{ + size_t common_prefix = 0; + for (size_t i = 0; path1[i] && path2[i]; i++) { + if (path1[i] != path2[i]) + break; + + /* + * If they agree at a directory separator, then add one + * to make sure it is included in the common prefix string. + */ + if (path1[i] == '/') + common_prefix = i + 1; + } + + return common_prefix; +} + static int path_found(const char *path, struct path_found_data *data) { struct stat st; - char *newdir; + size_t common_prefix; /* - * If dirname corresponds to a directory that doesn't exist, and this - * path starts with dirname, then path can't exist. + * If data->dir is non-empty, then it contains a path that doesn't + * exist, including an ending slash ('/'). If it is a prefix of 'path', + * then we can return 0. */ - if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len)) + if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len)) return 0; /* - * If path itself exists, return 1. + * Otherwise, we must check if the current path exists. If it does, then + * return 1. The cached directory will be skipped until we come across + * a missing path again. */ data->lstat_count++; if (!lstat(path, &st)) return 1; /* - * Otherwise, path does not exist so we'll return 0...but we'll first - * determine some info about its parent directory so we can avoid - * lstat calls for future cache entries. + * At this point, we know that 'path' doesn't exist, and we know that + * the parent directory of 'data->dir' does exist. Let's set 'data->dir' + * to be the top-most non-existing directory of 'path'. If the first + * parent of 'path' exists, then we will act ast though 'path' + * corresponds to a directory (by adding a slash). */ - newdir = strrchr(path, '/'); - if (!newdir) - return 0; /* Didn't find a parent dir; just return 0 now. */ + common_prefix = max_common_dir_prefix(path, data->dir.buf); /* - * If path starts with directory (which we already lstat'ed and found), - * then no need to lstat parent directory again. + * At this point, 'path' and 'data->dir' have a common existing parent + * directory given by path[0..common_prefix] (which could have length 0). + * We "grow" the data->dir buffer by checking for existing directories + * along 'path'. */ - if (data->dir_found && data->dir.buf && - memcmp(path, data->dir.buf, data->dir.len)) - return 0; - /* Free previous dirname, and cache path's dirname */ - strbuf_reset(&data->dir); - strbuf_add(&data->dir, path, newdir - path + 1); + strbuf_setlen(&data->dir, common_prefix); + while (1) { + /* Find the next directory in 'path'. */ + const char *next_slash = strchr(path + data->dir.len, '/'); - data->lstat_count++; - data->dir_found = !lstat(data->dir.buf, &st); + /* + * If there are no more slashes, then 'path' doesn't contain a + * non-existent _parent_ directory. Set 'data->dir' to be equal + * to 'path' plus an additional slash, so it can be used for + * caching in the future. The filename of 'path' is considered + * a non-existent directory. + * + * Note: if "{path}/" exists as a directory, then it will never + * appear as a prefix of other callers to this method, assuming + * the context from the clear_skip_worktree... methods. If this + * method is reused, then this must be reconsidered. + */ + if (!next_slash) { + strbuf_addstr(&data->dir, path + data->dir.len); + strbuf_addch(&data->dir, '/'); + break; + } - return 0; + /* + * Now that we have a slash, let's grow 'data->dir' to include + * this slash, then test if we should stop. + */ + strbuf_add(&data->dir, path + data->dir.len, + (next_slash - path) - data->dir.len + 1); + + /* If the path doesn't exist, then stop here. */ + data->lstat_count++; + if (lstat(data->dir.buf, &st)) + return 0; + } + + /* + * At this point, 'data->dir' is equal to 'path' plus a slash character, + * and the parent directory of 'path' definitely exists. Let's return + * the case of whether 'path' exists. + */ + + data->lstat_count++; + return !lstat(path, &st); } static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)