Message ID | c7a2a7efe6d532fc7fce1352b1dfce640cc9f2f6.1648616734.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects | expand |
"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > Batched fsync will fit into bulk-checkin by taking advantage of the > plug/unplug functionality to determine the appropriate time to fsync > and make newly-added objects available in the primary object database. > > * Rename 'state' variable to 'bulk_checkin_state', since we will later > be adding 'bulk_fsync_objdir'. This also makes the variable easier to > find in the debugger, since the name is more unique. > > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate > static variable. Doing this avoids resetting the variable in > finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we > seem to unintentionally disable the plugging functionality the first > time a new packfile must be created due to packfile size limits. While > disabling the plugging state only results in suboptimal behavior for > the current code, it would be fatal for the bulk-fsync functionality > later in this patch series. Paraphrasing to make sure I understand your reasoning here... In the "plug and then perform as many changes to the repository and finally unplug" flow, before or after this series, the "perform" step in the middle is unaware of which "bulk_checkin_state" instance is being used to keep track of what is done to optimize by deferring some operations until the "unplug" time. So bulk_checkin_state is not there to allow us to create multiple instances of it, pass them around to different sequences of "plug, perform, unplug". Each of its members is inherently a singleton, so in the extreme, we could turn these members into separate file-scope global variables if we wanted to. The "plugged" bit happens to be the only one getting ejected by this patch, because it is inconvenient to "clear" other members otherwise. Is that what is going on? If it is, I am mildly opposed to the flow of thought, from at least two reasons. It makes it hard for the next developer to decide if the new members they are adding should be in or out of the struct. More importantly, I think the call of finish_bulk_checkin() we make in deflate_to_pack() you found (and there may possibly other places that we do so; I didn't check) may not appear to be a bug in the original context, but it already is a bug. And when we change the semantics of plug-unplug to be more "transaction-like", it becomes a more serious bug, as you said. There is NO reason to end the ongoing transaction there inside the while() loop that tries to limit the size of the packfile being used. We may want to flush the "packfile part", which may have been almost synonymous to the entirety of bulk_checkin_state, but as you found out, the "plugged" bit is *outside* the "packfile part", and that makes it a bug to call finish_bulk_checkin() from there. We should add a new function, flush_bulk_checking_packfile(), to flush only the packfile part of the bulk_checkin_state without affecting other things---the "plugged" bit is the only one in the current code before this series, but it does not have to stay to be so. When you start plugging the loose ref transactions, you may find it handy (this is me handwaving) to have a list of refs that you may have to do something at "unplug" time kept in the struct, and you do not want deflate_to_pack() affecting the ongoing "plugged" ref operations by calling finish_bulk_checkin() and reinitializing that list, for example. And then we should examine existing calls to finish_bulk_checkin() and replace the ones that should not be finishing, i.e. the ones that wanted "flush" but called "finish".
On Wed, Mar 30, 2022 at 10:11 AM Junio C Hamano <gitster@pobox.com> wrote: > > "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > Batched fsync will fit into bulk-checkin by taking advantage of the > > plug/unplug functionality to determine the appropriate time to fsync > > and make newly-added objects available in the primary object database. > > > > * Rename 'state' variable to 'bulk_checkin_state', since we will later > > be adding 'bulk_fsync_objdir'. This also makes the variable easier to > > find in the debugger, since the name is more unique. > > > > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate > > static variable. Doing this avoids resetting the variable in > > finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we > > seem to unintentionally disable the plugging functionality the first > > time a new packfile must be created due to packfile size limits. While > > disabling the plugging state only results in suboptimal behavior for > > the current code, it would be fatal for the bulk-fsync functionality > > later in this patch series. > > Paraphrasing to make sure I understand your reasoning here... > > In the "plug and then perform as many changes to the repository and > finally unplug" flow, before or after this series, the "perform" > step in the middle is unaware of which "bulk_checkin_state" instance > is being used to keep track of what is done to optimize by deferring > some operations until the "unplug" time. So bulk_checkin_state is > not there to allow us to create multiple instances of it, pass them > around to different sequences of "plug, perform, unplug". Each of > its members is inherently a singleton, so in the extreme, we could > turn these members into separate file-scope global variables if we > wanted to. The "plugged" bit happens to be the only one getting > ejected by this patch, because it is inconvenient to "clear" other > members otherwise. > > Is that what is going on? > More or less. The current state is all about creating a single packfile for multiple large objects. That packfile is a singleton today (we could have an alternate implementation where there's a separate packfile per thread in the future, so it's not inherent to the API). We want to do this if the top-level caller is okay with the state being invisible until the "finish" call, and that is conveyed by the "plugged" flag. > If it is, I am mildly opposed to the flow of thought, from at least > two reasons. It makes it hard for the next developer to decide if > the new members they are adding should be in or out of the struct. > > More importantly, I think the call of finish_bulk_checkin() we make > in deflate_to_pack() you found (and there may possibly other places > that we do so; I didn't check) may not appear to be a bug in the > original context, but it already is a bug. And when we change the > semantics of plug-unplug to be more "transaction-like", it becomes a > more serious bug, as you said. > > There is NO reason to end the ongoing transaction there inside the > while() loop that tries to limit the size of the packfile being > used. We may want to flush the "packfile part", which may have been > almost synonymous to the entirety of bulk_checkin_state, but as you > found out, the "plugged" bit is *outside* the "packfile part", and > that makes it a bug to call finish_bulk_checkin() from there. > > We should add a new function, flush_bulk_checking_packfile(), to > flush only the packfile part of the bulk_checkin_state without > affecting other things---the "plugged" bit is the only one in the > current code before this series, but it does not have to stay to be > so I'm happy to rename the packfile related stuff to end with _packfile to make it clear that all of that state and functionality is related to batching of packfile additions. So from this patch: s/bulk_checkin_state/bulk_checkin_packfile and s/finish_bulk_checkin/finish_bulk_checkin_packfile. My new state will be bulk_fsync_* (as it is already). Any future ODB-related state can go here too (I'm imagining a future log-structured 'new objects pack' that we can append to for adding small objects, similar to the bulk_checkin_packfile but allowing appends from multiple git invocations). > When you start plugging the loose ref transactions, you may > find it handy (this is me handwaving) to have a list of refs that > you may have to do something at "unplug" time kept in the struct, > and you do not want deflate_to_pack() affecting the ongoing > "plugged" ref operations by calling finish_bulk_checkin() and > reinitializing that list, for example. > I don't believe ref transactions will go through this part of the infrastructure. Refs already have a good transaction system (that partly inspired this rebranding, after I saw how Peter implemented batch ref fsync). I expect this area will remain all about the ODB as a subsystem that can enlist in a larger repo->transaction. So a top-level Git command might initiate a repo transaction, which would internally initiate an ODB transaction, index transaction, and ref transaction. The repo transaction would support flushing each of the subtransactions with an optimal number of fsyncs. > And then we should examine existing calls to finish_bulk_checkin() > and replace the ones that should not be finishing, i.e. the ones > that wanted "flush" but called "finish". Sure. I can fix this, which will only change this file. The only case of "finishing" would be in unplug_bulk_checkin / end_odb_transaction. Thanks, Neeraj
Neeraj Singh <nksingh85@gmail.com> writes: >> We should add a new function, flush_bulk_checking_packfile(), to >> flush only the packfile part of the bulk_checkin_state without >> affecting other things---the "plugged" bit is the only one in the >> current code before this series, but it does not have to stay to be >> so > > I'm happy to rename the packfile related stuff to end with _packfile > to make it clear that all of that state and functionality is related > to batching of packfile additions. I do not care about names, though. If you took that I hinted any such change, sorry about that. _state is fine as-is. I do care about not ejecting plugged out of the structure and instead keeping them together, with proper way to flush the part that deflate_to_pack() wants to flush, instead of abusing the "finish". Thanks.
On Wed, Mar 30, 2022 at 1:24 PM Junio C Hamano <gitster@pobox.com> wrote: > > Neeraj Singh <nksingh85@gmail.com> writes: > > >> We should add a new function, flush_bulk_checking_packfile(), to > >> flush only the packfile part of the bulk_checkin_state without > >> affecting other things---the "plugged" bit is the only one in the > >> current code before this series, but it does not have to stay to be > >> so > > > > I'm happy to rename the packfile related stuff to end with _packfile > > to make it clear that all of that state and functionality is related > > to batching of packfile additions. > > I do not care about names, though. If you took that I hinted any > such change, sorry about that. _state is fine as-is. > > I do care about not ejecting plugged out of the structure and > instead keeping them together, with proper way to flush the part > that deflate_to_pack() wants to flush, instead of abusing the > "finish". > > Thanks. Just to understand your feedback better, is it a problem to separate the state of each separate "thing" under ODB transactions into separate file-scope global(s)? In this series I declared the fsync state as completely separate from the packfile state. That's why I was thinking of it as more of a naming problem, since the remaining state aside from the plugged boolean is entirely packfile related. My argument in favor of having separate file-scoped variables for each 'pluggable thing' would be that future implementations can evolve separately without authors first having to disentangle a single struct. Thanks, Neeraj
Neeraj Singh <nksingh85@gmail.com> writes: > Just to understand your feedback better, is it a problem to separate > the state of each separate "thing" under ODB transactions into > separate file-scope global(s)? In this series I declared the fsync > state as completely separate from the packfile state. That's why I > was thinking of it as more of a naming problem, since the remaining > state aside from the plugged boolean is entirely packfile related. Ahh, sorry, my mistake. I somehow thought that you would be making the existing "struct bulk_checkin_state" infrastructure to cover not just the object store but much more, perhaps because I partly mistook the motivation to rename the structure (thinking again about it, since "checkin" is the act of adding new objects to the object database from outside sources (either from the working tree using "git add" command, or from other sources using unpack-objects), the original name was already fine to signal that it was about the object database, and the need to rename it sounded like we were going to do much more than the object database behind my head). > My argument in favor of having separate file-scoped variables for each > 'pluggable thing' would be that future implementations can evolve > separately without authors first having to disentangle a single > struct. That is fine. Would the trigger to "plug" and "unplug" also be independent? As you said elsewhere, the part to harden refs can piggyback on the existing ref-transaction infrastructure. I do not know offhand what things other than loose objects that want "plug" and "unplug" semantics, but if there are, are we going to have type specific begin- and end-transaction? Thanks.
On Thu, Mar 31, 2022 at 10:50 AM Junio C Hamano <gitster@pobox.com> wrote: > > Neeraj Singh <nksingh85@gmail.com> writes: > > > Just to understand your feedback better, is it a problem to separate > > the state of each separate "thing" under ODB transactions into > > separate file-scope global(s)? In this series I declared the fsync > > state as completely separate from the packfile state. That's why I > > was thinking of it as more of a naming problem, since the remaining > > state aside from the plugged boolean is entirely packfile related. > > Ahh, sorry, my mistake. > > I somehow thought that you would be making the existing "struct > bulk_checkin_state" infrastructure to cover not just the object > store but much more, perhaps because I partly mistook the motivation > to rename the structure (thinking again about it, since "checkin" is > the act of adding new objects to the object database from outside > sources (either from the working tree using "git add" command, or > from other sources using unpack-objects), the original name was > already fine to signal that it was about the object database, and > the need to rename it sounded like we were going to do much more > than the object database behind my head). > > > My argument in favor of having separate file-scoped variables for each > > 'pluggable thing' would be that future implementations can evolve > > separately without authors first having to disentangle a single > > struct. > > That is fine. Would the trigger to "plug" and "unplug" also be > independent? As you said elsewhere, the part to harden refs can > piggyback on the existing ref-transaction infrastructure. I do not > know offhand what things other than loose objects that want "plug" > and "unplug" semantics, but if there are, are we going to have type > specific begin- and end-transaction? > With regards to bulk-checkin.h, I believe for simplicity of interface to the callers, there should be a single pair of APIs for plug or unplug of the entire ODB regardless of what optimizations happen under the covers. For eventual repo-wide transactions, there should be a single API to initiate a transaction and a single one to commit/abort the transaction at the end. We may still also want a flush API so that we can make the repo state consistent prior to executing hooks or doing something else where an outside process needs consistent repo state. Thanks, Neeraj
diff --git a/bulk-checkin.c b/bulk-checkin.c index 6d6c37171c9..577b135e39c 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -10,9 +10,9 @@ #include "packfile.h" #include "object-store.h" -static struct bulk_checkin_state { - unsigned plugged:1; +static int bulk_checkin_plugged; +static struct bulk_checkin_state { char *pack_tmp_name; struct hashfile *f; off_t offset; @@ -21,7 +21,7 @@ static struct bulk_checkin_state { struct pack_idx_entry **written; uint32_t alloc_written; uint32_t nr_written; -} state; +} bulk_checkin_state; static void finish_tmp_packfile(struct strbuf *basename, const char *pack_tmp_name, @@ -278,21 +278,23 @@ int index_bulk_checkin(struct object_id *oid, int fd, size_t size, enum object_type type, const char *path, unsigned flags) { - int status = deflate_to_pack(&state, oid, fd, size, type, + int status = deflate_to_pack(&bulk_checkin_state, oid, fd, size, type, path, flags); - if (!state.plugged) - finish_bulk_checkin(&state); + if (!bulk_checkin_plugged) + finish_bulk_checkin(&bulk_checkin_state); return status; } void plug_bulk_checkin(void) { - state.plugged = 1; + assert(!bulk_checkin_plugged); + bulk_checkin_plugged = 1; } void unplug_bulk_checkin(void) { - state.plugged = 0; - if (state.f) - finish_bulk_checkin(&state); + assert(bulk_checkin_plugged); + bulk_checkin_plugged = 0; + if (bulk_checkin_state.f) + finish_bulk_checkin(&bulk_checkin_state); }