[RFC,v4,30/49] multi-process: send heartbeat messages to remote

Message ID 14c33104778e77fcf2e7f0df2a1dd96fbcaf49d7.1571905346.git.jag.raman@oracle.com (mailing list archive)
State New, archived
Series Initial support of multi-process qemu

Commit Message

Jag Raman Oct. 24, 2019, 9:09 a.m. UTC
From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

In order to detect remote processes that are hung, the
proxy periodically sends heartbeat messages to confirm
that the remote process is alive.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 hw/proxy/qemu-proxy.c    | 101 +++++++++++++++++++++++++++++++++++++++++++++++
 include/io/mpqemu-link.h |   1 +
 2 files changed, 102 insertions(+)

Comments

Stefan Hajnoczi Nov. 11, 2019, 4:27 p.m. UTC | #1
On Thu, Oct 24, 2019 at 05:09:11AM -0400, Jagannathan Raman wrote:
> +static void broadcast_msg(MPQemuMsg *msg, bool need_reply)
> +{
> +    PCIProxyDev *entry;
> +    unsigned int pid;
> +    int wait;
> +
> +    QLIST_FOREACH(entry, &proxy_dev_list.devices, next) {
> +        if (need_reply) {
> +            wait = eventfd(0, EFD_NONBLOCK);
> +            msg->num_fds = 1;
> +            msg->fds[0] = wait;
> +        }
> +
> +        mpqemu_msg_send(entry->mpqemu_link, msg, entry->mpqemu_link->com);
> +        if (need_reply) {
> +            pid = (uint32_t)wait_for_remote(wait);

Sometimes QEMU really needs to wait for the remote process before it can
make progress.  I think this is not one of those cases though.

Since QEMU is event-driven it's problematic to invoke blocking system
calls.  The remote process might not respond for a significant amount of
time.  Other QEMU threads will be held up waiting for the QEMU global
mutex in the meantime (because we hold it!).

Please implement heartbeat/ping asynchronously.  The wait eventfd should
be read by an event loop fd handler instead.  That way QEMU can continue
with running the VM while waiting for the remote process.

This will also improve guest performance because there will be less
jitter (random latency because the event loop is held up waiting for
remote processes for short periods of time).
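
For illustration, the blocking wait could be replaced by registering the
eventfd with QEMU's main loop via qemu_set_fd_handler() (declared in
include/qemu/main-loop.h).  A minimal sketch, in which heartbeat_reply()
is a hypothetical callback name:

    /* Hypothetical callback: drain the eventfd once the remote replies. */
    static void heartbeat_reply(void *opaque)
    {
        int fd = GPOINTER_TO_INT(opaque);
        uint64_t val;

        if (read(fd, &val, sizeof(val)) != sizeof(val)) {
            return;             /* spurious wakeup; keep the handler */
        }

        /* Remote is alive: unregister the handler and clean up. */
        qemu_set_fd_handler(fd, NULL, NULL, NULL);
        close(fd);
    }

    /* In broadcast_msg(), replacing the blocking wait_for_remote(wait): */
    qemu_set_fd_handler(wait, heartbeat_reply, NULL, GINT_TO_POINTER(wait));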
Jag Raman Nov. 13, 2019, 4:01 p.m. UTC | #2
On 11/11/2019 11:27 AM, Stefan Hajnoczi wrote:
> On Thu, Oct 24, 2019 at 05:09:11AM -0400, Jagannathan Raman wrote:
>> +static void broadcast_msg(MPQemuMsg *msg, bool need_reply)
>> +{
>> +    PCIProxyDev *entry;
>> +    unsigned int pid;
>> +    int wait;
>> +
>> +    QLIST_FOREACH(entry, &proxy_dev_list.devices, next) {
>> +        if (need_reply) {
>> +            wait = eventfd(0, EFD_NONBLOCK);
>> +            msg->num_fds = 1;
>> +            msg->fds[0] = wait;
>> +        }
>> +
>> +        mpqemu_msg_send(entry->mpqemu_link, msg, entry->mpqemu_link->com);
>> +        if (need_reply) {
>> +            pid = (uint32_t)wait_for_remote(wait);
> 
> Sometimes QEMU really needs to wait for the remote process before it can
> make progress.  I think this is not one of those cases though.
> 
> Since QEMU is event-driven it's problematic to invoke blocking system
> calls.  The remote process might not respond for a significant amount of
> time.  Other QEMU threads will be held up waiting for the QEMU global
> mutex in the meantime (because we hold it!).

There are places where we wait synchronously for the remote process.
However, these synchronous waits carry a timeout to prevent the hang
situation you described above.

We will add error recovery in the future. That is, we will respawn
the remote process if QEMU times out waiting for it.
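
A timed wait on the eventfd can be sketched with poll(2); the helper
below is illustrative only, not the patch's actual wait_for_remote()
implementation:

    #include <poll.h>

    /*
     * Hypothetical helper: wait up to timeout_ms for the remote's reply
     * on the eventfd, returning 0 on timeout.
     */
    static uint64_t wait_with_timeout(int efd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = efd, .events = POLLIN };
        uint64_t val = 0;

        if (poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN)) {
            read(efd, &val, sizeof(val));
        }

        return val;
    }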

> 
> Please implement heartbeat/ping asynchronously.  The wait eventfd should
> be read by an event loop fd handler instead.  That way QEMU can continue
> with running the VM while waiting for the remote process.

In the current implementation, the heartbeat/ping is asynchronous.
start_heartbeat_timer() sets up a timer to perform the ping.

Thanks!
--
Jag

> 
> This will also improve guest performance because there will be less
> jitter (random latency because the event loop is held up waiting for
> remote processes for short periods of time).
>
Stefan Hajnoczi Nov. 21, 2019, 12:19 p.m. UTC | #3
On Wed, Nov 13, 2019 at 11:01:07AM -0500, Jag Raman wrote:
> 
> 
> On 11/11/2019 11:27 AM, Stefan Hajnoczi wrote:
> > On Thu, Oct 24, 2019 at 05:09:11AM -0400, Jagannathan Raman wrote:
> > > +static void broadcast_msg(MPQemuMsg *msg, bool need_reply)
> > > +{
> > > +    PCIProxyDev *entry;
> > > +    unsigned int pid;
> > > +    int wait;
> > > +
> > > +    QLIST_FOREACH(entry, &proxy_dev_list.devices, next) {
> > > +        if (need_reply) {
> > > +            wait = eventfd(0, EFD_NONBLOCK);
> > > +            msg->num_fds = 1;
> > > +            msg->fds[0] = wait;
> > > +        }
> > > +
> > > +        mpqemu_msg_send(entry->mpqemu_link, msg, entry->mpqemu_link->com);
> > > +        if (need_reply) {
> > > +            pid = (uint32_t)wait_for_remote(wait);
> > 
> > Sometimes QEMU really needs to wait for the remote process before it can
> > make progress.  I think this is not one of those cases though.
> > 
> > Since QEMU is event-driven it's problematic to invoke blocking system
> > calls.  The remote process might not respond for a significant amount of
> > time.  Other QEMU threads will be held up waiting for the QEMU global
> > mutex in the meantime (because we hold it!).
> 
> There are places where we wait synchronously for the remote process.
> However, these synchronous waits carry a timeout to prevent the hang
> situation you described above.
> 
> We will add error recovery in the future. That is, we will respawn
> the remote process if QEMU times out waiting for it.

Even with a timeout, the event loop is still blocked in the meantime.
That means timers will be delayed by a large amount, the monitor will be
unresponsive, etc.

> > 
> > Please implement heartbeat/ping asynchronously.  The wait eventfd should
> > be read by an event loop fd handler instead.  That way QEMU can continue
> > with running the VM while waiting for the remote process.
> 
> In the current implementation, the heartbeat/ping is asynchronous.
> start_heartbeat_timer() sets up a timer to perform the ping.

The heartbeat/ping is synchronous because broadcast_msg() blocks.

Stefan
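
A non-blocking design can still detect hung remotes by letting the
heartbeat timer double as the timeout check: mark a reply as pending
when the ping is sent, clear the flag from the eventfd read handler,
and treat a still-pending flag on the next tick as a hung remote.
A sketch, where hb_pending and hb_wait are hypothetical additions to
PCIProxyDev:

    static void heartbeat_reply(void *opaque)
    {
        PCIProxyDev *dev = opaque;
        uint64_t val;

        if (read(dev->hb_wait, &val, sizeof(val)) == sizeof(val)) {
            dev->hb_pending = false;  /* answered before the next tick */
        }
        qemu_set_fd_handler(dev->hb_wait, NULL, NULL, NULL);
        close(dev->hb_wait);
    }

    static void remote_ping(void *opaque)
    {
        PCIProxyDev *entry;

        QLIST_FOREACH(entry, &proxy_dev_list.devices, next) {
            if (entry->hb_pending) {
                /* No reply since the last tick: treat remote as hung. */
                error_report("remote pid %d is unresponsive",
                             entry->remote_pid);
                continue;
            }
            entry->hb_pending = true;
            /* ... create entry->hb_wait and send PROXY_PING as before ... */
            qemu_set_fd_handler(entry->hb_wait, heartbeat_reply, NULL, entry);
        }

        timer_mod(hb_timer,
                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + NOP_INTERVAL);
    }

This keeps each ping round entirely on the event loop: the global mutex
is never held across a wait, and a hung remote shows up as a pending
flag on the next tick instead of stalling the VM.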

Patch

diff --git a/hw/proxy/qemu-proxy.c b/hw/proxy/qemu-proxy.c
index fc1c731..691b991 100644
--- a/hw/proxy/qemu-proxy.c
+++ b/hw/proxy/qemu-proxy.c
@@ -53,14 +53,96 @@ 
 #include "hw/boards.h"
 #include "include/qemu/log.h"
 
+static QEMUTimer *hb_timer;
 static void pci_proxy_dev_realize(PCIDevice *dev, Error **errp);
 static void setup_irqfd(PCIProxyDev *dev);
+static void pci_dev_exit(PCIDevice *dev);
+static void start_heartbeat_timer(void);
+static void stop_heartbeat_timer(void);
+static void childsig_handler(int sig, siginfo_t *siginfo, void *ctx);
+static void broadcast_msg(MPQemuMsg *msg, bool need_reply);
+
+static void childsig_handler(int sig, siginfo_t *siginfo, void *ctx)
+{
+    /* TODO: Add proper handler. */
+    printf("Child (pid %d) is dead? Signal is %d, Exit code is %d.\n",
+           siginfo->si_pid, siginfo->si_signo, siginfo->si_code);
+}
+
+static void broadcast_msg(MPQemuMsg *msg, bool need_reply)
+{
+    PCIProxyDev *entry;
+    unsigned int pid;
+    int wait;
+
+    QLIST_FOREACH(entry, &proxy_dev_list.devices, next) {
+        if (need_reply) {
+            wait = eventfd(0, EFD_NONBLOCK);
+            msg->num_fds = 1;
+            msg->fds[0] = wait;
+        }
+
+        mpqemu_msg_send(entry->mpqemu_link, msg, entry->mpqemu_link->com);
+        if (need_reply) {
+            pid = (uint32_t)wait_for_remote(wait);
+            close(wait);
+            /* TODO: Add proper handling. */
+            if (pid) {
+                need_reply = 0;
+            }
+        }
+    }
+}
+
+#define NOP_INTERVAL 1000000
+
+static void remote_ping(void *opaque)
+{
+    MPQemuMsg msg;
+
+    memset(&msg, 0, sizeof(MPQemuMsg));
+
+    msg.num_fds = 0;
+    msg.cmd = PROXY_PING;
+    msg.bytestream = 0;
+    msg.size = 0;
+
+    broadcast_msg(&msg, true);
+    timer_mod(hb_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + NOP_INTERVAL);
+
+}
+
+static void start_heartbeat_timer(void)
+{
+    hb_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                            remote_ping,
+                                            &proxy_dev_list);
+    timer_mod(hb_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + NOP_INTERVAL);
+
+}
+
+static void stop_heartbeat_timer(void)
+{
+    timer_del(hb_timer);
+    timer_free(hb_timer);
+}
+
+static void set_sigchld_handler(void)
+{
+    struct sigaction sa_sigterm;
+    memset(&sa_sigterm, 0, sizeof(sa_sigterm));
+    sa_sigterm.sa_sigaction = childsig_handler;
+    sa_sigterm.sa_flags = SA_SIGINFO | SA_NOCLDWAIT | SA_NOCLDSTOP;
+    sigaction(SIGCHLD, &sa_sigterm, NULL);
+}
 
 static void proxy_ready(PCIDevice *dev)
 {
     PCIProxyDev *pdev = PCI_PROXY_DEV(dev);
 
     setup_irqfd(pdev);
+    set_sigchld_handler();
+    start_heartbeat_timer();
 }
 
 static void set_remote_opts(PCIDevice *dev, QDict *qdict, unsigned int cmd)
@@ -259,6 +341,7 @@  static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
     PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
 
     k->realize = pci_proxy_dev_realize;
+    k->exit = pci_dev_exit;
     k->config_read = pci_proxy_read_config;
     k->config_write = pci_proxy_write_config;
 }
@@ -397,6 +480,24 @@  static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
     dev->proxy_ready = proxy_ready;
 }
 
+static void pci_dev_exit(PCIDevice *pdev)
+{
+    PCIProxyDev *entry, *sentry;
+    PCIProxyDev *dev = PCI_PROXY_DEV(pdev);
+
+    stop_heartbeat_timer();
+
+    QLIST_FOREACH_SAFE(entry, &proxy_dev_list.devices, next, sentry) {
+        if (entry->remote_pid == dev->remote_pid) {
+            QLIST_REMOVE(entry, next);
+        }
+    }
+
+    if (!QLIST_EMPTY(&proxy_dev_list.devices)) {
+        start_heartbeat_timer();
+    }
+}
+
 static void send_bar_access_msg(PCIProxyDev *dev, MemoryRegion *mr,
                                 bool write, hwaddr addr, uint64_t *val,
                                 unsigned size, bool memory)
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index 3145b0e..16a913b 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -72,6 +72,7 @@  typedef enum {
     DRIVE_OPTS,
     DEVICE_ADD,
     DEVICE_DEL,
+    PROXY_PING,
     MAX,
 } mpqemu_cmd_t;