Message ID | 20200610171121.46910-6-inga.stotland@intel.com (mailing list archive)
---|---
State | Accepted
Series | Put safeguards around Leave & Attach calls
Hi Inga,

On 06/10, Inga Stotland wrote:
> Attach() method:
> If Attach method is called for a node that is being processed as a result
> of a Create, Import or Join method calls in progress, node attachment
> is not allowed and org.bluez.mesh.Error.Busy error is returned.

I don't think I understand how this is supposed to be used by the
application.

So far, we've implemented the API by calling Import() and then, as soon as
JoinComplete() call arrives, calling Attach().

> @@ -1654,6 +1665,12 @@ void node_attach(const char *app_root, const char *sender, uint64_t token,
> 		return;
> 	}
>
> +	/* Check if there is a pending request associated with this node */
> +	if (node->busy) {
> +		cb(user_data, MESH_ERROR_BUSY, NULL);
> +		return;
> +	}
> +
> 	/* Check if the node is already in use */
> 	if (node->owner) {
> 		l_warn("The node is already in use");

With this patch, this call sequence fails, because now we're supposed to
send a *reply* to JoinComplete first, and only then call Attach()?

I'm using a high-level API for D-Bus, so I don't really control when the
reply is sent, so at the moment the only way to implement this would be
by delaying Attach() by a fixed timeout - and I'm not comfortable with
that.

regards
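For reference, the application-side sequence described above is essentially the
pattern that the test-mesh hunk at the bottom of this patch removes: a
JoinComplete handler that calls Attach() before its own D-Bus reply has gone
out. A minimal sketch of that flow, assuming dbus-python with a GLib main loop;
the object path, bus wiring and attach() helper are illustrative only, not
taken from the test script:

import dbus
import dbus.service
import dbus.mainloop.glib

dbus.mainloop.glib.DBusGMainLoop(set_as_default=True)

MESH_SERVICE = 'org.bluez.mesh'
MESH_NETWORK_IFACE = 'org.bluez.mesh.Network1'
MESH_APPLICATION_IFACE = 'org.bluez.mesh.Application1'
APP_PATH = '/example/app'   # illustrative application root path

bus = dbus.SystemBus()

def attach(token):
	# Ask the daemon to attach this application to the node owning 'token'.
	# The call is asynchronous: the daemon queries the application's
	# ObjectManager before replying to Attach(), so a blocking call issued
	# from the application's own main loop could deadlock.
	net = dbus.Interface(bus.get_object(MESH_SERVICE, '/org/bluez/mesh'),
							MESH_NETWORK_IFACE)
	net.Attach(APP_PATH, dbus.UInt64(token),
			reply_handler=lambda *args: None,
			error_handler=lambda e: print('Attach failed:', e))

class Application(dbus.service.Object):
	# The daemon calls JoinComplete() once Create()/Import()/Join() has
	# produced a node; 'token' identifies that node.
	@dbus.service.method(MESH_APPLICATION_IFACE,
					in_signature='t', out_signature='')
	def JoinComplete(self, token):
		# Calling Attach() here - before this handler returns and the
		# JoinComplete reply goes out - is exactly the sequence that
		# the busy check rejects with org.bluez.mesh.Error.Busy.
		attach(token)

app = Application(bus, APP_PATH)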
Hi Michał,

On Mon, 2020-06-15 at 10:22 +0200, Michał Lowas-Rzechonek wrote:
> Hi Inga,
>
> On 06/10, Inga Stotland wrote:
> > Attach() method:
> > If Attach method is called for a node that is being processed as a result
> > of a Create, Import or Join method calls in progress, node attachment
> > is not allowed and org.bluez.mesh.Error.Busy error is returned.
>
> I don't think I understand how this is supposed to be used by the
> application.
>
> So far, we've implemented the API by calling Import() and then, as soon as
> JoinComplete() call arrives, calling Attach().
>
> > @@ -1654,6 +1665,12 @@ void node_attach(const char *app_root, const char *sender, uint64_t token,
> > 		return;
> > 	}
> >
> > +	/* Check if there is a pending request associated with this node */
> > +	if (node->busy) {
> > +		cb(user_data, MESH_ERROR_BUSY, NULL);
> > +		return;
> > +	}
> > +
> > 	/* Check if the node is already in use */
> > 	if (node->owner) {
> > 		l_warn("The node is already in use");
>
> With this patch, this call sequence fails, because now we're supposed to
> send a *reply* to JoinComplete first, and only then call Attach()?

A couple months ago, we made the decision (with your input, I believe) that
we needed to use JoinComplete on every node creation (Join, Import, Create),
to ensure that the App has the token and all the information the daemon
requires the App to have.

If the daemon does *not* get the successful return from JoinComplete, it
destroys the entire node... So the node isn't entirely complete until the
daemon receives the return. This creates an awkward state for the node...
Part of Attach() requires the daemon to make an ObjectManager request of the
app to get the latest composition settings, and to validate the App which is
attaching with that token.

It also creates the unfortunate situation revealed in one of your test-suite
cases where, if Leave() was called before returning the JoinComplete() call,
we seg-faulted.

Both Leave() and the situation of the *first* Attach() following JoinComplete
are "Once in a Node's Lifetime" situations. We decided that the safest course
of action here was to require the JoinComplete return prior to either Attach
or Leave... because the node needs finalizing.

>
> I'm using a high-level API for D-Bus, so I don't really control when the
> reply is sent, so at the moment the only way to implement this would be
> by delaying Attach() by a fixed timeout - and I'm not comfortable with
> that.

Yeah, I can see how this is now required...

In the mesh-cfgclient tool (which is also built on ELL) we accomplish this
by scheduling an idle_oneshot for the Attach. It could also be done by
issuing the l_dbus_send of the JoinComplete response within join_complete,
before calling the Attach send, then returning NULL as the return result of
join_complete (where the response would normally be sent).

>
> regards
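Brian's idle_oneshot workaround is specific to ELL, but the same deferral can
be approximated in a GLib-based Python application by replacing the
JoinComplete handler from the sketch above. GLib.idle_add is standard
PyGObject API; everything else here is illustrative:

from gi.repository import GLib

class Application(dbus.service.Object):
	@dbus.service.method(MESH_APPLICATION_IFACE,
					in_signature='t', out_signature='')
	def JoinComplete(self, token):
		def deferred_attach():
			attach(token)   # helper from the previous sketch
			return False    # one-shot: do not reschedule
		# Queue the Attach for the next idle pass of the main loop, so
		# it is only sent after this handler has returned.  Messages on
		# one D-Bus connection are delivered in order, so the
		# JoinComplete reply reaches the daemon before the Attach call.
		GLib.idle_add(deferred_attach)

This mirrors what mesh-cfgclient does with its idle oneshot: the Attach
message is only queued once the handler has returned, so the JoinComplete
reply is already ahead of it on the connection.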
Inga, Brian,

On 06/15, Gix, Brian wrote:
> > With this patch, this call sequence fails, because now we're supposed to
> > send a *reply* to JoinComplete first, and only then call Attach()?
>
> A couple months ago, we made the decision (with your input, I believe)
> that we needed to use JoinComplete on every node creation (Join,
> Import, Create), to ensure that the App has the token (...)

Yes, I remember that discussion. The rationale was the ability to catch
bugs in the application and get rid of created, but effectively unusable,
nodes.

> This creates (...) the unfortunate situation revealed in one of your
> test-suite cases where if Leave() was called before returning the
> JoinComplete() call, we Seg-Faulted.

Indeed, although I don't think it's necessary to introduce a "busy"
state...

In case of Leave(), this call removes the node anyway, so what I think
would suffice is to check whether the node still exists when the
JoinComplete reply arrives, to avoid freeing the node twice (causing
SEGV):

void node_finalize_new_node(struct mesh_node *node, struct mesh_io *io)
{
	if (!node)
		return;

	if (!l_queue_find(nodes, match_simple, L_UINT_TO_PTR(node)))
		return;

	// ...
}

This would allow the application to call Leave *before* sending a reply
to JoinComplete.

As for Attach(), I also think it should be legal to call it before
replying to JoinComplete. The worst thing that can happen is that the
application successfully attaches, then replies to JoinComplete with an
error. This would simply remove the node, and the application would be
promptly detached.

> > I'm using a high-level API for D-Bus, so I don't really control when the
> > reply is sent, so at the moment the only way to implement this would be
> > by delaying Attach() by a fixed timeout - and I'm not comfortable with
> > that.
>
> Yeah, I can see how this is now required...
>
> In the mesh-cfgclient tool (which is also built on ELL) we accomplish
> this by scheduling an idle_oneshot for the Attach.

Unfortunately, not all main loops have an API to schedule "idle" calls,
i.e. calls executed when the loop doesn't have anything better to do.

I know that both ELL and Glib do, but AFAIR Qt does not (it uses timers
with the timeout set to 0, if I'm not mistaken), and Python's asyncio
doesn't either.

I don't think requiring a specific sequence of dbus messages is a good
idea :(

regards
Hi Michal,

On Mon, 2020-06-15 at 22:32 +0200, michal.lowas-rzechonek@silvair.com wrote:
> Inga, Brian,
>
> On 06/15, Gix, Brian wrote:
> > > With this patch, this call sequence fails, because now we're supposed to
> > > send a *reply* to JoinComplete first, and only then call Attach()?
> >
> > A couple months ago, we made the decision (with your input, I believe)
> > that we needed to use JoinComplete on every node creation (Join,
> > Import, Create), to ensure that the App has the token (...)
>
> Yes, I remember that discussion. The rationale was the ability to catch
> bugs in the application and get rid of created, but effectively unusable,
> nodes.
>
> > This creates (...) the unfortunate situation revealed in one of your
> > test-suite cases where if Leave() was called before returning the
> > JoinComplete() call, we Seg-Faulted.
>
> Indeed, although I don't think it's necessary to introduce a "busy"
> state...
>
> In case of Leave(), this call removes the node anyway, so what I think
> would suffice is to check whether the node still exists when the
> JoinComplete reply arrives, to avoid freeing the node twice (causing
> SEGV):
>
> void node_finalize_new_node(struct mesh_node *node, struct mesh_io *io)
> {
> 	if (!node)
> 		return;
>
> 	if (!l_queue_find(nodes, match_simple, L_UINT_TO_PTR(node)))
> 		return;
>

I am afraid that this kind of check may lead to a race condition (rare, but
possible) when:
- an un-finalized node has been removed via Leave and
- the daemon is still waiting for the JoinComplete() reply and
- meanwhile another one has either been created or imported, reusing the
  memory allocation (entirely possible), and has been attached

So when the original JoinComplete returns, either via timeout/error/ok, we
may unintentionally remove the dbus resources of a new node that has been
validated and attached.

> 	// ...
> }
>
> This would allow the application to call Leave *before* sending a reply
> to JoinComplete.
>
> As for Attach(), I also think it should be legal to call it before
> replying to JoinComplete. The worst thing that can happen is that the
> application successfully attaches, then replies to JoinComplete with an
> error. This would simply remove the node, and the application would be
> promptly detached.
>

I guess we could introduce an internal timer inside the daemon to put Attach
on hold until JoinComplete is done. If JoinComplete returns an error, then
Attach won't go through and would return an error as well.

> > > I'm using a high-level API for D-Bus, so I don't really control when the
> > > reply is sent, so at the moment the only way to implement this would be
> > > by delaying Attach() by a fixed timeout - and I'm not comfortable with
> > > that.
> >
> > Yeah, I can see how this is now required...
> >
> > In the mesh-cfgclient tool (which is also built on ELL) we accomplish
> > this by scheduling an idle_oneshot for the Attach.
>
> Unfortunately, not all main loops have an API to schedule "idle" calls,
> i.e. calls executed when the loop doesn't have anything better to do.
>
> I know that both ELL and Glib do, but AFAIR Qt does not (it uses timers
> with the timeout set to 0, if I'm not mistaken), and Python's asyncio
> doesn't either.
>
> I don't think requiring a specific sequence of dbus messages is a good
> idea :(
>
> regards
Hi Michał, Inga,

On Mon, 2020-06-15 at 22:25 +0000, Stotland, Inga wrote:
> Hi Michal,
>
> On Mon, 2020-06-15 at 22:32 +0200, michal.lowas-rzechonek@silvair.com
> wrote:
> > Inga, Brian,
> >
> > On 06/15, Gix, Brian wrote:
> > > > With this patch, this call sequence fails, because now we're supposed to
> > > > send a *reply* to JoinComplete first, and only then call Attach()?
> > >
> > > A couple months ago, we made the decision (with your input, I believe)
> > > that we needed to use JoinComplete on every node creation (Join,
> > > Import, Create), to ensure that the App has the token (...)
> >
> > Yes, I remember that discussion. The rationale was the ability to catch
> > bugs in the application and get rid of created, but effectively unusable,
> > nodes.
> >
> > > This creates (...) the unfortunate situation revealed in one of your
> > > test-suite cases where if Leave() was called before returning the
> > > JoinComplete() call, we Seg-Faulted.
> >
> > Indeed, although I don't think it's necessary to introduce a "busy"
> > state...
> >
> > In case of Leave(), this call removes the node anyway, so what I think
> > would suffice is to check whether the node still exists when the
> > JoinComplete reply arrives, to avoid freeing the node twice (causing
> > SEGV):
> >
> > void node_finalize_new_node(struct mesh_node *node, struct mesh_io *io)
> > {
> > 	if (!node)
> > 		return;
> >
> > 	if (!l_queue_find(nodes, match_simple, L_UINT_TO_PTR(node)))
> > 		return;
> >
>
> I am afraid that this kind of check may lead to a race condition (rare, but
> possible) when:
> - an un-finalized node has been removed via Leave and
> - the daemon is still waiting for the JoinComplete() reply and
> - meanwhile another one has either been created or imported, reusing the
>   memory allocation (entirely possible), and has been attached
>
> So when the original JoinComplete returns, either via timeout/error/ok, we
> may unintentionally remove the dbus resources of a new node that has been
> validated and attached.
>
> > 	// ...
> > }
> >
> > This would allow the application to call Leave *before* sending a reply
> > to JoinComplete.
> >
> > As for Attach(), I also think it should be legal to call it before
> > replying to JoinComplete. The worst thing that can happen is that the
> > application successfully attaches, then replies to JoinComplete with an
> > error. This would simply remove the node, and the application would be
> > promptly detached.
> >
>
> I guess we could introduce an internal timer inside the daemon to put
> Attach on hold until JoinComplete is done. If JoinComplete returns an
> error, then Attach won't go through and would return an error as well.

So I have created a patch with this option, which I have sent to the
reflector. (See: [PATCH BlueZ] mesh: Add deferal of Attach() and Leave() if
busy)

It maintains the *busy* check to make sure we are in a safe state to perform
the Attach or Leave, and if not, we defer the call for 1 second... with the
understanding that *nobody* should be delaying the response to JoinComplete.

This handles just the simple "Out Of Expected Order" issue, but will still
respond to Attach and Leave with the Busy error if, when we reconsider the
call, the node has still not been completed.

I have tested this with no memory leaks, and verified that either Attach or
Leave can be called within the JoinComplete application handler.

> > > > I'm using a high-level API for D-Bus, so I don't really control when the
> > > > reply is sent, so at the moment the only way to implement this would be
> > > > by delaying Attach() by a fixed timeout - and I'm not comfortable with
> > > > that.
> > >
> > > Yeah, I can see how this is now required...
> > >
> > > In the mesh-cfgclient tool (which is also built on ELL) we accomplish
> > > this by scheduling an idle_oneshot for the Attach.
> >
> > Unfortunately, not all main loops have an API to schedule "idle" calls,
> > i.e. calls executed when the loop doesn't have anything better to do.
> >
> > I know that both ELL and Glib do, but AFAIR Qt does not (it uses timers
> > with the timeout set to 0, if I'm not mistaken), and Python's asyncio
> > doesn't either.
> >
> > I don't think requiring a specific sequence of dbus messages is a good
> > idea :(
> >
> > regards
diff --git a/doc/mesh-api.txt b/doc/mesh-api.txt
index e85f0bf52..4bef6174c 100644
--- a/doc/mesh-api.txt
+++ b/doc/mesh-api.txt
@@ -116,6 +116,7 @@ Methods:
 			org.bluez.mesh.Error.InvalidArguments
 			org.bluez.mesh.Error.NotFound,
 			org.bluez.mesh.Error.AlreadyExists,
+			org.bluez.mesh.Error.Busy,
 			org.bluez.mesh.Error.Failed
 
 	void Leave(uint64 token)
@@ -126,6 +127,8 @@ Methods:
 
 		PossibleErrors:
 			org.bluez.mesh.Error.InvalidArguments
+			org.bluez.mesh.Error.NotFound
+			org.bluez.mesh.Error.Busy
 
 	void CreateNetwork(object app_root, array{byte}[16] uuid)
 
diff --git a/mesh/mesh.c b/mesh/mesh.c
index a5935c216..c8767ee7a 100644
--- a/mesh/mesh.c
+++ b/mesh/mesh.c
@@ -655,13 +655,21 @@ static struct l_dbus_message *leave_call(struct l_dbus *dbus,
 						void *user_data)
 {
 	uint64_t token;
+	struct mesh_node *node;
 
 	l_debug("Leave");
 
 	if (!l_dbus_message_get_arguments(msg, "t", &token))
 		return dbus_error(msg, MESH_ERROR_INVALID_ARGS, NULL);
 
-	node_remove(node_find_by_token(token));
+	node = node_find_by_token(token);
+	if (!node)
+		return dbus_error(msg, MESH_ERROR_NOT_FOUND, NULL);
+
+	if (node_is_busy(node))
+		return dbus_error(msg, MESH_ERROR_BUSY, NULL);
+
+	node_remove(node);
 
 	return l_dbus_message_new_method_return(msg);
 }
diff --git a/mesh/node.c b/mesh/node.c
index 7ec06437b..567f2e6db 100644
--- a/mesh/node.c
+++ b/mesh/node.c
@@ -88,6 +88,7 @@ struct mesh_node {
 	char *storage_dir;
 	uint32_t disc_watch;
 	uint32_t seq_number;
+	bool busy;
 	bool provisioner;
 	uint16_t primary;
 	struct node_composition comp;
@@ -598,6 +599,11 @@ bool node_is_provisioner(struct mesh_node *node)
 	return node->provisioner;
 }
 
+bool node_is_busy(struct mesh_node *node)
+{
+	return node->busy;
+}
+
 void node_app_key_delete(struct mesh_node *node, uint16_t net_idx,
 							uint16_t app_idx)
 {
@@ -1352,6 +1358,8 @@ static bool add_local_node(struct mesh_node *node, uint16_t unicast, bool kr,
 	/* Initialize configuration server model */
 	cfgmod_server_init(node, PRIMARY_ELE_IDX);
 
+	node->busy = true;
+
 	return true;
 }
 
@@ -1459,6 +1467,9 @@ static void get_managed_objects_cb(struct l_dbus_message *msg, void *user_data)
 	struct keyring_net_key net_key;
 	uint8_t dev_key[16];
 
+	if (req->type == REQUEST_TYPE_ATTACH)
+		req->attach->busy = false;
+
 	if (!msg || l_dbus_message_is_error(msg)) {
 		l_error("Failed to get app's dbus objects");
 		goto fail;
@@ -1654,6 +1665,12 @@ void node_attach(const char *app_root, const char *sender, uint64_t token,
 		return;
 	}
 
+	/* Check if there is a pending request associated with this node */
+	if (node->busy) {
+		cb(user_data, MESH_ERROR_BUSY, NULL);
+		return;
+	}
+
 	/* Check if the node is already in use */
 	if (node->owner) {
 		l_warn("The node is already in use");
@@ -1674,6 +1691,8 @@ void node_attach(const char *app_root, const char *sender, uint64_t token,
 	req->attach = node;
 	req->type = REQUEST_TYPE_ATTACH;
 
+	node->busy = true;
+
 	send_managed_objects_request(sender, app_root, req);
 }
 
@@ -2347,6 +2366,8 @@ void node_finalize_new_node(struct mesh_node *node, struct mesh_io *io)
 	free_node_dbus_resources(node);
 	mesh_agent_remove(node->agent);
 
+	node->busy = false;
+
 	/* Register callback for the node's io */
 	attach_io(node, io);
 }
diff --git a/mesh/node.h b/mesh/node.h
index e26d410c8..b8b2b1b49 100644
--- a/mesh/node.h
+++ b/mesh/node.h
@@ -39,6 +39,7 @@ struct mesh_node *node_find_by_addr(uint16_t addr);
 struct mesh_node *node_find_by_uuid(uint8_t uuid[16]);
 struct mesh_node *node_find_by_token(uint64_t token);
 bool node_is_provisioner(struct mesh_node *node);
+bool node_is_busy(struct mesh_node *node);
 void node_app_key_delete(struct mesh_node *node, uint16_t net_idx,
 							uint16_t app_idx);
 uint16_t node_get_primary(struct mesh_node *node);
diff --git a/test/test-mesh b/test/test-mesh
index 38f0c0a74..7c8a25482 100755
--- a/test/test-mesh
+++ b/test/test-mesh
@@ -412,8 +412,6 @@ class Application(dbus.service.Object):
 			token = value
 			have_token = True
 
-		if attached == False:
-			attach(token)
 
 	@dbus.service.method(MESH_APPLICATION_IFACE,
 					in_signature="s", out_signature="")