Message ID | 20211122033220.32883-3-chiyutianyi@gmail.com (mailing list archive) |
---|---|
State | New, archived |
On 11/21/2021 10:32 PM, Han Xin wrote:
> From: Han Xin <hanxin.hx@alibaba-inc.com>
>
> When streaming a large blob object to "write_loose_object()", we have no
> chance to run "write_object_file_prepare()" to calculate the oid in
> advance. So we need to handle undetermined oid in function
> "write_loose_object()".
>
> In the original implementation, we know the oid and we can write the
> temporary file in the same directory as the final object, but for an
> object with an undetermined oid, we don't know the exact directory for
> the object, so we have to save the temporary file in ".git/objects/"
> directory instead.

My first reaction is to not write into .git/objects/ directly, but
instead make a .git/objects/tmp/ directory and write within that
directory. The idea is to prevent leaving stale files in the
.git/objects/ directory if the process terminates strangely (say,
a power outage or segfault).

If this was an interesting idea to pursue, it does leave a question:
should we clean up the tmp/ directory when it is empty? That would
require adding a check in finalize_object_file() that is probably
best left unchecked (the lstat() would add a cost per loose object
write that is probably too costly). I would rather leave an empty
tmp/ directory than add that cost per loose object write.

I suppose another way to do it would be to register the check as an
event at the end of the process, so we only check once, and that
only happens if we created a loose object with this streaming
method.

With all of these complications in mind, I think cleaning up the
stale tmp/ directory could (at the very least) be delayed to another
commit or patch series. Hopefully adding the directory is not too
much complication to add here.

> -	loose_object_path(the_repository, &filename, oid);
> +	if (is_null_oid(oid)) {
> +		/* When oid is not determined, save tmp file to odb path. */
> +		strbuf_reset(&filename);
> +		strbuf_addstr(&filename, the_repository->objects->odb->path);
> +		strbuf_addch(&filename, '/');

Here, instead of the strbuf_addch(), you could do

	strbuf_add(&filename, "/tmp/", 5);
	if (safe_create_leading_directories(filename.buf)) {
		error(_("failed to create '%s'"), filename.buf);
		strbuf_release(&filename);
		return -1;
	}

> +	} else {
> +		loose_object_path(the_repository, &filename, oid);
> +	}
>
> 	fd = create_tmpfile(&tmp_file, filename.buf);
> 	if (fd < 0) {
> @@ -1939,12 +1946,31 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
> 		die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid),
> 		    ret);
> 	the_hash_algo->final_oid_fn(&parano_oid, &c);
> -	if (!oideq(oid, &parano_oid))
> +	if (!is_null_oid(oid) && !oideq(oid, &parano_oid))
> 		die(_("confused by unstable object source data for %s"),
> 		    oid_to_hex(oid));
>
> 	close_loose_object(fd);
>
> +	if (is_null_oid(oid)) {
> +		int dirlen;
> +
> +		oidcpy((struct object_id *)oid, &parano_oid);
> +		loose_object_path(the_repository, &filename, oid);
> +
> +		/* We finally know the object path, and create the missing dir. */
> +		dirlen = directory_size(filename.buf);
> +		if (dirlen) {
> +			struct strbuf dir = STRBUF_INIT;
> +			strbuf_add(&dir, filename.buf, dirlen - 1);
> +			if (mkdir(dir.buf, 0777) && errno != EEXIST)
> +				return -1;
> +			if (adjust_shared_perm(dir.buf))
> +				return -1;
> +			strbuf_release(&dir);
> +		}
> +	}
> +

Upon first reading I was asking "where is the file rename?" but
it is part of finalize_object_file() which is called further down.

Thanks,
-Stolee
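The "check once at the end of the process" idea from the message above could look roughly like the following. This is only a sketch under stated assumptions: `schedule_tmp_dir_prune()`, `prune_empty_tmp_dir()` and the two static variables are hypothetical names, not part of Git, and plain `atexit()` stands in for whatever exit hook Git itself would use. It relies on the POSIX guarantee that `rmdir()` removes only empty directories, so no separate emptiness check is needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical state for this sketch; not Git's actual API. */
static char tmp_dir_to_prune[4096];
static int prune_registered;

/*
 * Runs once at process exit. rmdir() succeeds only on an empty
 * directory, so a tmp/ that still holds files is left alone.
 */
void prune_empty_tmp_dir(void)
{
	if (tmp_dir_to_prune[0])
		rmdir(tmp_dir_to_prune);
}

/*
 * Call after the first streamed loose-object write. Later calls are
 * no-ops, so the cost is paid once per process, not once per object.
 * Returns 0 on success, -1 if the path is too long or atexit() fails.
 */
int schedule_tmp_dir_prune(const char *path)
{
	size_t len;

	assert(path);
	if (prune_registered)
		return 0;
	len = strlen(path);
	if (len >= sizeof(tmp_dir_to_prune))
		return -1;
	if (atexit(prune_empty_tmp_dir))
		return -1;
	memcpy(tmp_dir_to_prune, path, len + 1);
	prune_registered = 1;
	return 0;
}
```

A writer that never takes the streaming path never calls `schedule_tmp_dir_prune()`, so the ordinary loose-object write path pays nothing extra.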
Derrick Stolee <stolee@gmail.com> writes:

> My first reaction is to not write into .git/objects/ directly, but
> instead make a .git/objects/tmp/ directory and write within that
> directory. The idea is to prevent leaving stale files in the
> .git/objects/ directory if the process terminates strangely (say,
> a power outage or segfault).

Even if we know the name of the object we are writing beforehand, I
do not think it is a good idea to open-write-close the final object
file. The approach we already use everywhere is to write into a
tmpfile/lockfile and rename it to the final name.

object-file.c::write_loose_object() uses create_tmpfile() to prepare
a temporary file whose name begins with "tmp_obj_", so that "gc" can
recognize stale ones and remove them.

> If this was an interesting idea to pursue, it does leave a question:
> should we clean up the tmp/ directory when it is empty? That would
> require adding a check in finalize_object_file() that is probably
> best left unchecked (the lstat() would add a cost per loose object
> write that is probably too costly). I would rather leave an empty
> tmp/ directory than add that cost per loose object write.

I am not sure why we want a new tmp/ directory.
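For readers unfamiliar with the write-to-tmpfile-then-rename pattern described above, here is a minimal standalone sketch of it. It is not Git's create_tmpfile() or finalize_object_file(): `write_via_tmpfile()` is a hypothetical helper that uses plain POSIX `mkstemp()` and `rename()`, and it only borrows the "tmp_obj_" prefix from the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Write `len` bytes to `final_path` by way of a temporary file created
 * in `dir`. Readers never see a half-written `final_path`: either the
 * rename happened or it did not. Returns 0 on success, -1 on failure.
 * Hypothetical helper for illustration only.
 */
int write_via_tmpfile(const char *dir, const char *final_path,
		      const void *data, size_t len)
{
	char tmp[4096];
	const char *p = data;
	int fd, n;

	assert(dir && final_path && (data || !len));

	n = snprintf(tmp, sizeof(tmp), "%s/tmp_obj_XXXXXX", dir);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	fd = mkstemp(tmp);	/* creates and opens a uniquely named file */
	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t w = write(fd, p, len);
		if (w <= 0) {	/* sketch: no EINTR retry */
			close(fd);
			unlink(tmp);
			return -1;
		}
		p += w;
		len -= (size_t)w;
	}

	if (close(fd) < 0 || rename(tmp, final_path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

If the process dies before the `rename()`, what remains is a file whose name starts with "tmp_obj_", which is exactly the kind of leftover the message above says "gc" can recognize.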
On 11/29/2021 3:44 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> My first reaction is to not write into .git/objects/ directly, but
>> instead make a .git/objects/tmp/ directory and write within that
>> directory. The idea is to prevent leaving stale files in the
>> .git/objects/ directory if the process terminates strangely (say,
>> a power outage or segfault).
>
> Even if we know the name of the object we are writing beforehand, I
> do not think it is a good idea to open-write-close the final object
> file. The approach we already use everywhere is to write into a
> tmpfile/lockfile and rename it to the final name.
>
> object-file.c::write_loose_object() uses create_tmpfile() to prepare
> a temporary file whose name begins with "tmp_obj_", so that "gc" can
> recognize stale ones and remove them.

The only difference is that the tmp_obj_* file would go into the
loose object directory corresponding to the first two hex characters
of the OID, but that no longer happens now.

>> If this was an interesting idea to pursue, it does leave a question:
>> should we clean up the tmp/ directory when it is empty? That would
>> require adding a check in finalize_object_file() that is probably
>> best left unchecked (the lstat() would add a cost per loose object
>> write that is probably too costly). I would rather leave an empty
>> tmp/ directory than add that cost per loose object write.
>
> I am not sure why we want a new tmp/ directory.

I'm just thinking of a case where this fails repeatedly: I would
rather have those failed tmp_obj_* files isolated in their own
directory. It's an extremely minor point, so I'm fine to drop the
recommendation.

Thanks,
-Stolee
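The "first two hex characters" layout mentioned above is why the temporary file's directory cannot be chosen until the OID is known. A small sketch of that mapping follows. `loose_path_for_hex()` is a hypothetical helper written for this note, not Git's loose_object_path(), which works on a strbuf and a repository pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<objdir>/<first two hex digits>/<remaining digits>" in `out`.
 * Returns 0 on success, -1 if `hex` is too short to split or `out` is
 * too small. Hypothetical helper for illustration only.
 */
int loose_path_for_hex(char *out, size_t outlen,
		       const char *objdir, const char *hex)
{
	int n;

	assert(out && objdir && hex);
	if (strlen(hex) < 3)
		return -1;
	n = snprintf(out, outlen, "%s/%.2s/%s", objdir, hex, hex + 2);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

With `objdir` set to ".git/objects" and the hex name `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the helper yields ".git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391". With a null OID there is no meaningful "e6" component to compute, which is the situation the patch has to work around.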
On Tue, Nov 30, 2021 at 6:18 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/29/2021 3:44 PM, Junio C Hamano wrote:
> > Derrick Stolee <stolee@gmail.com> writes:
> >
> >> My first reaction is to not write into .git/objects/ directly, but
> >> instead make a .git/objects/tmp/ directory and write within that
> >> directory. The idea is to prevent leaving stale files in the
> >> .git/objects/ directory if the process terminates strangely (say,
> >> a power outage or segfault).
> >
> > Even if we know the name of the object we are writing beforehand, I
> > do not think it is a good idea to open-write-close the final object
> > file. The approach we already use everywhere is to write into a
> > tmpfile/lockfile and rename it to the final name.
> >
> > object-file.c::write_loose_object() uses create_tmpfile() to prepare
> > a temporary file whose name begins with "tmp_obj_", so that "gc" can
> > recognize stale ones and remove them.
>
> The only difference is that the tmp_obj_* file would go into the
> loose object directory corresponding to the first two hex characters
> of the OID, but that no longer happens now.
>

In an early version of this patch, I did save the temporary object in
the two-hex-character directory of "null_oid", but that is also very
strange behavior. "gc" will indeed clean up these tmp_obj_* files, no
matter if they are in .git/objects/ or .git/objects/xx.

Thanks,
-Han Xin
diff --git a/object-file.c b/object-file.c
index 227f53a0de..78fd2a5d39 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1892,7 +1892,14 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 	const void *buf;
 	unsigned long len;
 
-	loose_object_path(the_repository, &filename, oid);
+	if (is_null_oid(oid)) {
+		/* When oid is not determined, save tmp file to odb path. */
+		strbuf_reset(&filename);
+		strbuf_addstr(&filename, the_repository->objects->odb->path);
+		strbuf_addch(&filename, '/');
+	} else {
+		loose_object_path(the_repository, &filename, oid);
+	}
 
 	fd = create_tmpfile(&tmp_file, filename.buf);
 	if (fd < 0) {
@@ -1939,12 +1946,31 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 		die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid),
 		    ret);
 	the_hash_algo->final_oid_fn(&parano_oid, &c);
-	if (!oideq(oid, &parano_oid))
+	if (!is_null_oid(oid) && !oideq(oid, &parano_oid))
 		die(_("confused by unstable object source data for %s"),
 		    oid_to_hex(oid));
 
 	close_loose_object(fd);
 
+	if (is_null_oid(oid)) {
+		int dirlen;
+
+		oidcpy((struct object_id *)oid, &parano_oid);
+		loose_object_path(the_repository, &filename, oid);
+
+		/* We finally know the object path, and create the missing dir. */
+		dirlen = directory_size(filename.buf);
+		if (dirlen) {
+			struct strbuf dir = STRBUF_INIT;
+			strbuf_add(&dir, filename.buf, dirlen - 1);
+			if (mkdir(dir.buf, 0777) && errno != EEXIST)
+				return -1;
+			if (adjust_shared_perm(dir.buf))
+				return -1;
+			strbuf_release(&dir);
+		}
+	}
+
 	if (mtime) {
 		struct utimbuf utb;
 		utb.actime = mtime;
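The second hunk creates the fan-out directory only after the OID is known, and treats EEXIST from mkdir() as success. The same step can be sketched outside Git as shown below. `ensure_parent_dir()` is a hypothetical stand-in: it uses `strrchr()` where the patch uses directory_size(), and it omits adjust_shared_perm(), which is Git-specific.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create the directory part of `path` (everything before the last '/').
 * A directory that already exists is not an error, mirroring the
 * "errno != EEXIST" test in the patch. Returns 0 on success, -1 on
 * failure. Hypothetical helper for illustration only.
 */
int ensure_parent_dir(const char *path)
{
	char dir[4096];
	const char *slash;
	size_t len;

	assert(path);
	slash = strrchr(path, '/');
	if (!slash || slash == path)
		return 0;	/* no directory component to create */

	len = (size_t)(slash - path);
	if (len >= sizeof(dir))
		return -1;
	memcpy(dir, path, len);
	dir[len] = '\0';

	if (mkdir(dir, 0777) && errno != EEXIST)
		return -1;
	return 0;
}
```

The sketch keeps the directory name in a stack buffer, so its early `return -1` paths have nothing to release. Only one directory level is created, which matches the patch: the object database directory itself is assumed to exist already.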