Message ID: pull.1804.git.1727862424713.gitgitgadget@gmail.com
State:      New
Series:     fsmonitor: fix hangs by delayed fs event listening
On Wed, Oct 02, 2024 at 09:47:04AM +0000, Koji Nakamaru via GitGitGadget wrote:

> From: Koji Nakamaru <koji.nakamaru@gree.net>
>
> The thread serving the client (ipc-thread) calls
> with_lock__wait_for_cookie() in which a cookie file is created.
> with_lock__wait_for_cookie() then waits for the event caused by the
> cookie file from the thread for fs events (fsevent-thread).
>
> However, in high load situations, the fsevent-thread may start actual
> fs event listening (triggered by FSEventStreamStart() for Darwin, for
> example) *after* the cookie file is created. In this case, the
> fsevent-thread cannot detect the cookie file and
> with_lock__wait_for_cookie() waits forever, so that the whole daemon
> hangs [1].

First off, thank you for looking into this. I _think_ what you have here
would work, but I had envisioned something a little different. So let me
first try to walk through your solution...

> diff --git a/builtin/fsmonitor--daemon.c b/builtin/fsmonitor--daemon.c
> index dce8a3b2482..ccff2cb8bed 100644
> --- a/builtin/fsmonitor--daemon.c
> +++ b/builtin/fsmonitor--daemon.c
> @@ -172,6 +172,9 @@ static enum fsmonitor_cookie_item_result with_lock__wait_for_cookie(
> 	trace_printf_key(&trace_fsmonitor, "cookie-wait: '%s' '%s'",
> 			 cookie->name, cookie_pathname.buf);
>
> +	while (fsmonitor_get_listen_error_code(state) == 0)
> +		sleep_millisec(50);
> +
> 	/*
> 	 * Create the cookie file on disk and then wait for a notification
> 	 * that the listener thread has seen it.

OK, so here we're going to basically busy-wait for the error code value
to become non-zero. That happens in a thread-safe way inside our helper
function, and the matching thread-safe set() is called here:

> diff --git a/compat/fsmonitor/fsm-listen-darwin.c b/compat/fsmonitor/fsm-listen-darwin.c
> index 2fc67442eb5..92373ce247f 100644
> --- a/compat/fsmonitor/fsm-listen-darwin.c
> +++ b/compat/fsmonitor/fsm-listen-darwin.c
> @@ -515,6 +515,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
> 		goto force_error_stop_without_loop;
> 	}
> 	data->stream_started = 1;
> +	fsmonitor_set_listen_error_code(state, 1);
>
> 	pthread_mutex_lock(&data->dq_lock);
> 	pthread_cond_wait(&data->dq_finished, &data->dq_lock);

which then releases all of the waiting clients to start working. They'd
similarly be released on errors such as this one:

> @@ -522,7 +523,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
>
> 	switch (data->shutdown_style) {
> 	case FORCE_ERROR_STOP:
> -		state->listen_error_code = -1;
> +		fsmonitor_set_listen_error_code(state, -1);
> 		/* fall thru */
> 	case FORCE_SHUTDOWN:
> 		ipc_server_stop_async(state->ipc_server_data);

There's some risk that we'd have a code path that fails to set the
listen code, in which case our clients might spin forever. But from a
cursory look, I think you've got all spots.

I think your patch has one small bug, which is that you don't
pthread_mutex_destroy() at the end of fsmonitor_run_daemon(). But other
than that I think it should work OK (I haven't tried it in practice yet;
I assume you did?).

My two small-ish complaints are:

  - busy-waiting feels a bit hacky. I think it's not too bad here
    because it only happens during startup (and even then only if we
    get a request) because after that the listen code will already be
    set to "1".

  - the check for the listen code is in the wait_for_cookie() function.
    So even after we've started up, we'll still keep taking a lock to
    check it for every such request. I guess it probably isn't much
    overhead in practice, though.
But what I had envisioned instead was teaching the ipc_server code to
let us control when it starts actually running. We can do that by
holding the existing work lock, and letting the callers (in this case,
the individual fsm backends) tell it when to start operating.

And then we are not introducing a new lock, but rather relying on the
existing lock-checks for getting work to do. So there's no busy-wait.
And after the startup sequence finishes, there are no additional lock
checks.

The patch below implements that. The only complication is that we have
to keep a "started" flag to avoid double-locking during an early
shutdown. Access to the "started" flag is not thread-safe, but I think
that's OK. Until we've started the clients, only the main thread would
do either a start or shutdown operation.

So I dunno. Both solutions have their own little warts, I suppose, and
I'm not sure if one is vastly better than the other.

---
diff --git a/compat/fsmonitor/fsm-listen-darwin.c b/compat/fsmonitor/fsm-listen-darwin.c
index 2fc67442eb..17d0b426e3 100644
--- a/compat/fsmonitor/fsm-listen-darwin.c
+++ b/compat/fsmonitor/fsm-listen-darwin.c
@@ -516,6 +516,12 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
 	}
 	data->stream_started = 1;

+	/*
+	 * Our fs event listener is now running, so it's safe to start
+	 * serving client requests.
+	 */
+	ipc_server_start(state->ipc_server_data);
+
 	pthread_mutex_lock(&data->dq_lock);
 	pthread_cond_wait(&data->dq_finished, &data->dq_lock);
 	pthread_mutex_unlock(&data->dq_lock);
diff --git a/compat/fsmonitor/fsm-listen-win32.c b/compat/fsmonitor/fsm-listen-win32.c
index 5a21dade7b..d9a02bc989 100644
--- a/compat/fsmonitor/fsm-listen-win32.c
+++ b/compat/fsmonitor/fsm-listen-win32.c
@@ -741,6 +741,8 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
 	    start_rdcw_watch(data->watch_gitdir) == -1)
 		goto force_error_stop;

+	ipc_server_start(state->ipc_server_data);
+
 	for (;;) {
 		dwWait = WaitForMultipleObjects(data->nr_listener_handles,
 						data->hListener,
diff --git a/compat/simple-ipc/ipc-shared.c b/compat/simple-ipc/ipc-shared.c
index cb176d966f..603b60403b 100644
--- a/compat/simple-ipc/ipc-shared.c
+++ b/compat/simple-ipc/ipc-shared.c
@@ -21,6 +21,7 @@ int ipc_server_run(const char *path, const struct ipc_server_opts *opts,
 	if (ret)
 		return ret;

+	ipc_server_start(server_data);
 	ret = ipc_server_await(server_data);

 	ipc_server_free(server_data);
diff --git a/compat/simple-ipc/ipc-unix-socket.c b/compat/simple-ipc/ipc-unix-socket.c
index 9b3f2cdf8c..9a2e93cff5 100644
--- a/compat/simple-ipc/ipc-unix-socket.c
+++ b/compat/simple-ipc/ipc-unix-socket.c
@@ -328,6 +328,7 @@ struct ipc_server_data {
 	int back_pos;
 	int front_pos;

+	int started;
 	int shutdown_requested;
 	int is_stopped;
 };
@@ -888,6 +889,12 @@ int ipc_server_run_async(struct ipc_server_data **returned_server_data,
 	server_data->accept_thread->fd_send_shutdown = sv[0];
 	server_data->accept_thread->fd_wait_shutdown = sv[1];

+	/*
+	 * Hold work-available mutex so that no work can start until
+	 * we unlock it.
+	 */
+	pthread_mutex_lock(&server_data->work_available_mutex);
+
 	if (pthread_create(&server_data->accept_thread->pthread_id, NULL,
 			   accept_thread_proc, server_data->accept_thread))
 		die_errno(_("could not start accept_thread '%s'"), path);
@@ -918,6 +925,15 @@ int ipc_server_run_async(struct ipc_server_data **returned_server_data,
 	return 0;
 }

+void ipc_server_start(struct ipc_server_data *server_data)
+{
+	if (!server_data || server_data->started)
+		return;
+
+	server_data->started = 1;
+	pthread_mutex_unlock(&server_data->work_available_mutex);
+}
+
 /*
  * Gently tell the IPC server treads to shutdown.
  * Can be run on any thread.
@@ -933,7 +949,9 @@ int ipc_server_stop_async(struct ipc_server_data *server_data)

 	trace2_region_enter("ipc-server", "server-stop-async", NULL);

-	pthread_mutex_lock(&server_data->work_available_mutex);
+	/* If we haven't started yet, we are already holding lock. */
+	if (!server_data->started)
+		pthread_mutex_lock(&server_data->work_available_mutex);

 	server_data->shutdown_requested = 1;

diff --git a/simple-ipc.h b/simple-ipc.h
index a849d9f841..764bf7a309 100644
--- a/simple-ipc.h
+++ b/simple-ipc.h
@@ -179,12 +179,21 @@ struct ipc_server_opts
  * When a client IPC message is received, the `application_cb` will be
  * called (possibly on a random thread) to handle the message and
  * optionally compose a reply message.
+ *
+ * This initializes all threads but no actual work will be done until
+ * ipc_server_start() is called.
  */
 int ipc_server_run_async(struct ipc_server_data **returned_server_data,
			 const char *path, const struct ipc_server_opts *opts,
			 ipc_server_application_cb *application_cb,
			 void *application_data);

+/*
+ * Let an async server start running. This needs to be called only once
+ * after initialization.
+ */
+void ipc_server_start(struct ipc_server_data *server_data);
+
 /*
  * Gently signal the IPC server pool to shutdown. No new client
  * connections will be accepted, but existing connections will be
On Mon, Oct 07, 2024 at 01:58:21AM -0400, Jeff King wrote:

> I think your patch has one small bug, which is that you don't
> pthread_mutex_destroy() at the end of fsmonitor_run_daemon(). But other
> than that I think it should work OK (I haven't tried it in practice yet;
> I assume you did?).

I just checked your patch in our CI, using the sleep(1) you suggested
earlier to more predictably lose the race. It does work reliably (and I
confirmed with some extra trace statements that it does spin on the
sleep_millisec() loop).

I had also previously checked my suggested solution. So I do think
either is a valid solution to the problem.

-Peff
On Mon, Oct 07, 2024 at 01:58:21AM -0400, Jeff King wrote:

> First off, thank you for looking into this. I _think_ what you have here
> would work, but I had envisioned something a little different. So let me
> first try to walk through your solution...

Thank you very much for looking through my patch in detail and providing
another approach. I agree that busy-waiting is not smart ;) I utilized
it to minimize code modification and not to worry about any new
deadlock. Your approach is more natural if the code is written from
scratch with the problem in mind.

> @@ -933,7 +949,9 @@ int ipc_server_stop_async(struct ipc_server_data *server_data)
>
> 	trace2_region_enter("ipc-server", "server-stop-async", NULL);
>
> -	pthread_mutex_lock(&server_data->work_available_mutex);
> +	/* If we haven't started yet, we are already holding lock. */
> +	if (!server_data->started)
> +		pthread_mutex_lock(&server_data->work_available_mutex);
>
> 	server_data->shutdown_requested = 1;

Is this condition inverted?

On Mon, Oct 7, 2024 at 3:08 PM Jeff King <peff@peff.net> wrote:

> I just checked your patch in our CI, using the sleep(1) you suggested
> earlier to more predictably lose the race. It does work reliably (and
> I confirmed with some extra trace statements that it does spin on the
> sleep_millisec() loop).
>
> I had also previously checked my suggested solution. So I do think
> either is a valid solution to the problem.

I also tested your approach on Windows with a few additions to
ipc-win32.c, and it worked correctly.

* define work_available_mutex and started in ipc_server_data.
* call pthread_mutex_lock(&server_data->work_available_mutex) before
  creating the server_thread_proc thread.
* define ipc_server_start()

Shall we adopt your approach as it is more natural? Can I ask you to
submit a new patch?

Koji Nakamaru
diff --git a/builtin/fsmonitor--daemon.c b/builtin/fsmonitor--daemon.c
index dce8a3b2482..ccff2cb8bed 100644
--- a/builtin/fsmonitor--daemon.c
+++ b/builtin/fsmonitor--daemon.c
@@ -172,6 +172,9 @@ static enum fsmonitor_cookie_item_result with_lock__wait_for_cookie(
 	trace_printf_key(&trace_fsmonitor, "cookie-wait: '%s' '%s'",
 			 cookie->name, cookie_pathname.buf);

+	while (fsmonitor_get_listen_error_code(state) == 0)
+		sleep_millisec(50);
+
 	/*
 	 * Create the cookie file on disk and then wait for a notification
 	 * that the listener thread has seen it.
@@ -606,6 +609,23 @@ void fsmonitor_force_resync(struct fsmonitor_daemon_state *state)
 	pthread_mutex_unlock(&state->main_lock);
 }

+int fsmonitor_get_listen_error_code(struct fsmonitor_daemon_state *state)
+{
+	int error_code;
+
+	pthread_mutex_lock(&state->listen_lock);
+	error_code = state->listen_error_code;
+	pthread_mutex_unlock(&state->listen_lock);
+	return error_code;
+}
+
+void fsmonitor_set_listen_error_code(struct fsmonitor_daemon_state *state, int error_code)
+{
+	pthread_mutex_lock(&state->listen_lock);
+	state->listen_error_code = error_code;
+	pthread_mutex_unlock(&state->listen_lock);
+}
+
 /*
  * Format an opaque token string to send to the client.
  */
@@ -1285,6 +1305,7 @@ static int fsmonitor_run_daemon(void)
 	hashmap_init(&state.cookies, cookies_cmp, NULL, 0);
 	pthread_mutex_init(&state.main_lock, NULL);
 	pthread_cond_init(&state.cookies_cond, NULL);
+	pthread_mutex_init(&state.listen_lock, NULL);
 	state.listen_error_code = 0;
 	state.health_error_code = 0;
 	state.current_token_data = fsmonitor_new_token_data();
diff --git a/compat/fsmonitor/fsm-listen-darwin.c b/compat/fsmonitor/fsm-listen-darwin.c
index 2fc67442eb5..92373ce247f 100644
--- a/compat/fsmonitor/fsm-listen-darwin.c
+++ b/compat/fsmonitor/fsm-listen-darwin.c
@@ -515,6 +515,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
 		goto force_error_stop_without_loop;
 	}
 	data->stream_started = 1;
+	fsmonitor_set_listen_error_code(state, 1);

 	pthread_mutex_lock(&data->dq_lock);
 	pthread_cond_wait(&data->dq_finished, &data->dq_lock);
@@ -522,7 +523,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)

 	switch (data->shutdown_style) {
 	case FORCE_ERROR_STOP:
-		state->listen_error_code = -1;
+		fsmonitor_set_listen_error_code(state, -1);
 		/* fall thru */
 	case FORCE_SHUTDOWN:
 		ipc_server_stop_async(state->ipc_server_data);
@@ -534,7 +535,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
 	return;

 force_error_stop_without_loop:
-	state->listen_error_code = -1;
+	fsmonitor_set_listen_error_code(state, -1);
 	ipc_server_stop_async(state->ipc_server_data);
 	return;
 }
diff --git a/compat/fsmonitor/fsm-listen-win32.c b/compat/fsmonitor/fsm-listen-win32.c
index 5a21dade7b8..609efd1d6a8 100644
--- a/compat/fsmonitor/fsm-listen-win32.c
+++ b/compat/fsmonitor/fsm-listen-win32.c
@@ -732,14 +732,13 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
 	DWORD dwWait;
 	int result;

-	state->listen_error_code = 0;
-
 	if (start_rdcw_watch(data->watch_worktree) == -1)
 		goto force_error_stop;

 	if (data->watch_gitdir &&
 	    start_rdcw_watch(data->watch_gitdir) == -1)
 		goto force_error_stop;

+	fsmonitor_set_listen_error_code(state, 1);
 	for (;;) {
 		dwWait = WaitForMultipleObjects(data->nr_listener_handles,
@@ -797,7 +796,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
 	}

 force_error_stop:
-	state->listen_error_code = -1;
+	fsmonitor_set_listen_error_code(state, -1);

 force_shutdown:
 	/*
diff --git a/fsmonitor--daemon.h b/fsmonitor--daemon.h
index 5cbbec8d940..77efc59b7bb 100644
--- a/fsmonitor--daemon.h
+++ b/fsmonitor--daemon.h
@@ -51,6 +51,7 @@ struct fsmonitor_daemon_state {
 	int cookie_seq;
 	struct hashmap cookies;

+	pthread_mutex_t listen_lock;
 	int listen_error_code;
 	int health_error_code;
 	struct fsm_listen_data *listen_data;
@@ -167,5 +168,8 @@ void fsmonitor_publish(struct fsmonitor_daemon_state *state,
  */
 void fsmonitor_force_resync(struct fsmonitor_daemon_state *state);

+int fsmonitor_get_listen_error_code(struct fsmonitor_daemon_state *state);
+void fsmonitor_set_listen_error_code(struct fsmonitor_daemon_state *state, int error_code);
+
 #endif /* HAVE_FSMONITOR_DAEMON_BACKEND */
 #endif /* FSMONITOR_DAEMON_H */