mbox series

[00/27] Sparse Index: API protections

Message ID pull.906.git.1615929435.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Sparse Index: API protections | expand

Message

Philippe Blain via GitGitGadget March 16, 2021, 9:16 p.m. UTC
Here is the second patch series submission coming out of the sparse-index
RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

This is based on v3 of the format series [2].

[2]
https://lore.kernel.org/git/pull.883.v3.git.1615912983.gitgitgadget@gmail.com/

The point of this series is to insert protections for the consumers of the
in-memory index. We mark certain regions of code as needing a full index, so
we call ensure_full_index() to expand a sparse index to a full one, if
necessary. These protections are inserted file-by-file in every loop over
all cache entries. Well, "most" loops, because some are going to be handled
in the very next series so I leave them out.

Many callers use index_name_pos() to find a path by name. In these cases, we
can check if that position resolves to a sparse directory instance. In those
cases, we just expand to a full index and run the search again.

The last few patches deal with the name-hash hashtable for doing O(1)
lookups.

These protections don't do much right now, since the previous series created
the_repository->settings.command_requires_full_index to guard all index
reads and writes to ensure the in-memory copy is full for commands that have
not been tested with the sparse index yet.

However, after this series is complete, we now have a straight-forward plan
for making commands "sparse aware" one-by-one:

 1. Disable settings.command_requires_full_index to allow an in-memory
    sparse-index.
 2. Run versions of that command under a debugger, breaking on
    ensure_full_index().
 3. Examine the call stack to determine the context of that expansion, then
    implement the proper behavior in those locations.
 4. Add tests to ensure we are checking this logic in the presence of sparse
    directory entries.

I will admit that mostly it is the writing of the test cases that takes the
most time in the conversions I've done so far.

Thanks, -Stolee

Derrick Stolee (27):
  *: remove 'const' qualifier for struct index_state
  read-cache: expand on query into sparse-directory entry
  sparse-index: API protection strategy
  cache: move ensure_full_index() to cache.h
  add: ensure full index
  checkout-index: ensure full index
  checkout: ensure full index
  commit: ensure full index
  difftool: ensure full index
  fsck: ensure full index
  grep: ensure full index
  ls-files: ensure full index
  merge-index: ensure full index
  rm: ensure full index
  sparse-checkout: ensure full index
  update-index: ensure full index
  diff-lib: ensure full index
  dir: ensure full index
  entry: ensure full index
  merge-ort: ensure full index
  merge-recursive: ensure full index
  pathspec: ensure full index
  read-cache: ensure full index
  resolve-undo: ensure full index
  revision: ensure full index
  sparse-index: expand_to_path()
  name-hash: use expand_to_path()

 Documentation/technical/sparse-index.txt | 32 ++++++++++-
 attr.c                                   | 14 ++---
 attr.h                                   |  4 +-
 builtin/add.c                            |  1 +
 builtin/checkout-index.c                 |  1 +
 builtin/checkout.c                       |  2 +
 builtin/commit.c                         |  2 +
 builtin/difftool.c                       |  2 +
 builtin/fsck.c                           |  1 +
 builtin/grep.c                           |  1 +
 builtin/ls-files.c                       | 12 ++--
 builtin/merge-index.c                    |  3 +
 builtin/rm.c                             |  1 +
 builtin/stash.c                          |  1 +
 builtin/update-index.c                   |  1 +
 cache.h                                  |  7 ++-
 convert.c                                | 26 ++++-----
 convert.h                                | 20 +++----
 diff-lib.c                               |  1 +
 dir.c                                    | 13 +++--
 dir.h                                    |  8 +--
 entry.c                                  |  1 +
 merge-ort.c                              |  1 +
 merge-recursive.c                        |  3 +-
 name-hash.c                              | 10 ++++
 pathspec.c                               |  7 ++-
 pathspec.h                               |  6 +-
 read-cache.c                             | 33 +++++++++--
 resolve-undo.c                           |  2 +
 revision.c                               |  1 +
 sparse-index.c                           | 72 ++++++++++++++++++++++++
 sparse-index.h                           | 14 ++++-
 submodule.c                              |  6 +-
 submodule.h                              |  6 +-
 34 files changed, 243 insertions(+), 72 deletions(-)


base-commit: 3db06ac46dd5c61e83d7fc4747615d616fdbbdda
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-906%2Fderrickstolee%2Fsparse-index%2Fprotections-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-906/derrickstolee/sparse-index/protections-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/906

Comments

Elijah Newren March 17, 2021, 6:03 p.m. UTC | #1
On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the second patch series submission coming out of the sparse-index
> RFC [1].
>
> [1]
> https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/
>
> This is based on v3 of the format series [2].
>
> [2]
> https://lore.kernel.org/git/pull.883.v3.git.1615912983.gitgitgadget@gmail.com/
>
> The point of this series is to insert protections for the consumers of the
> in-memory index. We mark certain regions of code as needing a full index, so
> we call ensure_full_index() to expand a sparse index to a full one, if
> necessary. These protections are inserted file-by-file in every loop over
> all cache entries. Well, "most" loops, because some are going to be handled
> in the very next series so I leave them out.
>
> Many callers use index_name_pos() to find a path by name. In these cases, we
> can check if that position resolves to a sparse directory instance. In those
> cases, we just expand to a full index and run the search again.
>
> The last few patches deal with the name-hash hashtable for doing O(1)
> lookups.
>
> These protections don't do much right now, since the previous series created
> the_repository->settings.command_requires_full_index to guard all index
> reads and writes to ensure the in-memory copy is full for commands that have
> not been tested with the sparse index yet.
>
> However, after this series is complete, we now have a straight-forward plan
> for making commands "sparse aware" one-by-one:
>
>  1. Disable settings.command_requires_full_index to allow an in-memory
>     sparse-index.
>  2. Run versions of that command under a debugger, breaking on
>     ensure_full_index().
>  3. Examine the call stack to determine the context of that expansion, then
>     implement the proper behavior in those locations.
>  4. Add tests to ensure we are checking this logic in the presence of sparse
>     directory entries.

I started reading the series, then noticed I didn't like the first few
additions of ensure_full_index().  The first was because I thought it
just wasn't needed as per a few lines of code later, but the more
important one that stuck out to me was another where if the
ensure_full_index() call is needed to avoid the code blowing up, then
the code has a good chance that it is doing something inherently wrong
in a sparse-checkout/sparse-index anyway.

So I guess that brings me to the question I asked in 07/27 -- is the
presence of ensure_full_index() a note that the code needs to be later
audited and tweaked to work with sparse-indexes?  If so, then good,
this series may make sense.  However, will that always be the case?
If we think some of these will be left in the code, is there a plan to
annotate each one that has been audited and determined that it needs
to stay?  If not, then each ensure_full_index() might or might not
have been audited for correctness and it becomes a pain to know what's
left to fix.  If the plan is that these are to be audited, and to be
marked if they are truly deemed necessary, then the series makes sense
to me.  If not, then I'm confused about something with the series and
need some help with the goals and plans.

If I'm confused about the goals and the plans, then my reviews will
probably be less than helpful, so I'll suspend reading the series
until I understand the plan a little better.
Elijah Newren March 18, 2021, 6:32 a.m. UTC | #2
On Wed, Mar 17, 2021 at 11:03 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > The point of this series is to insert protections for the consumers of the
> > in-memory index. We mark certain regions of code as needing a full index, so
> > we call ensure_full_index() to expand a sparse index to a full one, if
> > necessary. These protections are inserted file-by-file in every loop over
> > all cache entries. Well, "most" loops, because some are going to be handled
> > in the very next series so I leave them out.
...
> > However, after this series is complete, we now have a straight-forward plan
> > for making commands "sparse aware" one-by-one:
> >
> >  1. Disable settings.command_requires_full_index to allow an in-memory
> >     sparse-index.
> >  2. Run versions of that command under a debugger, breaking on
> >     ensure_full_index().
> >  3. Examine the call stack to determine the context of that expansion, then
> >     implement the proper behavior in those locations.
> >  4. Add tests to ensure we are checking this logic in the presence of sparse
> >     directory entries.
>
...
> If I'm confused about the goals and the plans, then my reviews will
> probably be less than helpful, so I'll suspend reading the series
> until I understand the plan a little better.

Thanks for patiently responding to all my other queries.  I read
through more of the series today.  I inserted comments on a couple
specific patches, but most of patches 6-25 are all along the same
theme.

I have some overall comments on those patches in 6-25 (none of which
need to be addressed in this series, but are meant as hopefully
helpful guides about future work):

add/rm: both of these were just above loops that had a
skip-the-SKIP_WORKTREE entries, at least once Matheus' series is
merged (https://lore.kernel.org/git/cover.1615588108.git.matheus.bernardino@usp.br/).
I think the upshot is just that these become easier to convert.

checkout/commit -- I think these follow the add/rm model and likewise
become easy to convert.

ls-files - I commented elsewhere in this thread about how I think it'd
make sense to have it list the entries in the index, as always.
Obviously, that'd give different output than a full index, because
there are different entries present in the index when using a
sparse-index vs. a full one.

However, all of these, as you say, can wait until later.


I also noted that patches 26 and 27 were based on ones from the RFC
series that I reviewed before, and I noticed you fixed an issue or two
I flagged there, but you also made some other changes and it's too
late at night for my brain to continue thinking and try to compare
them; I may try again tomorrow.

Other than patch 26 which I'm too tired to think through right now,
and the patches I specifically commented on, the rest of the series
looks good to me.