
posix-timers: RCU conversion

Message ID 1300777760.2837.38.camel@edumazet-laptop (mailing list archive)
State New, archived

Commit Message

Eric Dumazet March 22, 2011, 7:09 a.m. UTC
None

Comments

Ben Nagy April 4, 2011, 3:30 a.m. UTC | #1
On Tue, Mar 22, 2011 at 4:20 PM, Avi Kivity <avi@redhat.com> wrote:
[...]
> Looks like the posix-timer issue is completely gone, to be replaced by the
> load balancer.
>
> Copying peterz.

Hi all,

I feel bad about such a big cc list, but I don't know who can be left out :/

Still got the performance issue with the load balancer.

Additionally, when trying to test with 48 VMs under load, they
consistently all lock up after a few minutes, which we traced as far
as ocfs2 dying. We're investigating it as an ocfs2 issue, and we'll
try and replicate on ext4 / nfs, but I just thought it might be
somehow related.

Apr  3 18:27:35 eax kernel: [16029.399507] ------------[ cut here ]------------
Apr  3 18:27:35 eax kernel: [16029.401408] kernel BUG at
/home/fuzzadmin/src/natty/source/fs/jbd2/journal.c:1610!
Apr  3 18:27:35 eax kernel: [16029.404541] invalid opcode: 0000 [#1] SMP
Apr  3 18:27:35 eax kernel: [16029.406289] last sysfs file:
/sys/devices/system/cpu/cpu47/cache/index2/shared_cpu_map
Apr  3 18:27:35 eax kernel: [16029.409453] CPU 36
Apr  3 18:27:35 eax kernel: [16029.409453] Modules linked in: ocfs2
quota_tree ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntr
ack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter
ip_tables x_tables bridge stp w83627ehf hwmon_vid ocfs2_dlmfs
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue c
onfigfs ipmi_si ipmi_msghandler ib_srp scsi_transport_srp scsi_tgt
vesafb ib_ipoib ib_iser ib_umad iscsi_tcp rdma_ucm libiscsi_tcp
rdma_cm libiscsi ib_cm iw_cm scsi_transport_iscsi ib_addr ib
_sa ib_uverbs mlx4_ib ib_mad ib_core vhost_net amd64_edac_mod psmouse
edac_core sp5100_tco ghes joydev serio_raw hed edac_mce_amd k10temp
kvm_amd i2c_piix4 kvm lp parport usb_storage usbhid h
id uas igb ahci pata_atiixp libahci dca mlx4_core
Apr  3 18:27:35 eax kernel: [16029.409453]
Apr  3 18:27:35 eax kernel: [16029.409453] Pid: 1746, comm: ocfs2cmt
Not tainted 2.6.38-7-server #39 Supermicro H8QG6/H8QG6
Apr  3 18:27:35 eax kernel: [16029.409453] RIP:
0010:[<ffffffff8124905a>]  [<ffffffff8124905a>]
jbd2_journal_flush+0x17a/0x190
Apr  3 18:27:35 eax kernel: [16029.409453] RSP: 0018:ffff880404be3dc0
EFLAGS: 00010286
Apr  3 18:27:35 eax kernel: [16029.409453] RAX: 0000000000000029 RBX:
ffff88181593c000 RCX: 000000000000001e
Apr  3 18:27:35 eax kernel: [16029.409453] RDX: 00000000fffffffb RSI:
ffff880404be3cd0 RDI: ffff88181593c024
Apr  3 18:27:35 eax kernel: [16029.409453] RBP: ffff880404be3df0 R08:
ffff880404be2000 R09: 0000000000000000
Apr  3 18:27:35 eax kernel: [16029.409453] R10: 0000000000000000 R11:
0000000000000001 R12: 0000000000000e5d
Apr  3 18:27:35 eax kernel: [16029.409453] R13: ffff88181593c39c R14:
ffff88181593c024 R15: 0000000000000000
Apr  3 18:27:35 eax kernel: [16029.409453] FS:  00007f73affb7700(0000)
GS:ffff881627c00000(0000) knlGS:0000000000000000
Apr  3 18:27:35 eax kernel: [16029.736049] CS:  0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Apr  3 18:27:35 eax kernel: [16029.736049] CR2: 0000000002f72000 CR3:
00000002f0d31000 CR4: 00000000000006e0
Apr  3 18:27:35 eax kernel: [16029.736049] DR0: 00000000000000a0 DR1:
0000000000000000 DR2: 0000000000000003
Apr  3 18:27:35 eax kernel: [16029.736049] DR3: 00000000000000b0 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Apr  3 18:27:35 eax kernel: [16029.736049] Process ocfs2cmt (pid:
1746, threadinfo ffff880404be2000, task ffff8803f72396e0)
Apr  3 18:27:35 eax kernel: [16029.736049] Stack:
Apr  3 18:27:35 eax kernel: [16029.736049]  0000000000000001
ffff8806155eb800 ffff8806155eb838 ffff8803f72396e0
Apr  3 18:27:35 eax kernel: [16029.736049]  ffff8806160cb000
ffff8806160cb160 ffff880404be3e40 ffffffffa042af12
Apr  3 18:27:35 eax kernel: [16029.736049]  0000000000000000
ffff8806155eb860 0000000000000286 ffff8806155eb828
Apr  3 18:27:35 eax kernel: [16029.736049] Call Trace:
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042af12>]
ocfs2_commit_cache+0xc2/0x330 [ocfs2]
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042b1e1>]
ocfs2_commit_thread+0x61/0x210 [ocfs2]
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087870>] ?
autoremove_wake_function+0x0/0x40
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042b180>] ?
ocfs2_commit_thread+0x0/0x210 [ocfs2]
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087126>]
kthread+0x96/0xa0
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff8100cde4>]
kernel_thread_helper+0x4/0x10
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087090>] ?
kthread+0x0/0xa0
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff8100cde0>] ?
kernel_thread_helper+0x0/0x10
Apr  3 18:27:35 eax kernel: [16029.736049] Code: c0 5b 41 5c 41 5d 41
5e 41 5f c9 c3 0f 1f 44 00 00 4c 8b 63 58 4d 85 e4 0f 85 d2 fe ff ff
f0 81 43 24 00 00 00 01 e9 da fe ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b
66 66 66 2e 0f 1f 84 00 00 00 00
Apr  3 18:27:35 eax kernel: [16029.736049] RIP  [<ffffffff8124905a>]
jbd2_journal_flush+0x17a/0x190
Apr  3 18:27:35 eax kernel: [16031.320082] ------------[ cut here ]------------

Thanks for any feedback...

Cheers,

ben
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity April 4, 2011, 7:18 a.m. UTC | #2
On 04/04/2011 06:30 AM, Ben Nagy wrote:
> On Tue, Mar 22, 2011 at 4:20 PM, Avi Kivity<avi@redhat.com>  wrote:
> [...]
> >  Looks like the posix-timer issue is completely gone, to be replaced by the
> >  load balancer.
> >
> >  Copying peterz.
>
> Hi all,
>
> I feel bad about such a big cc list, but I don't know who can be left out :/
>
> Still got the performance issue with the load balancer.
>
> Additionally, when trying to test with 48 VMs under load, they
> consistently all lock up after a few minutes, which we traced as far
> as ocfs2 dying. We're investigating it as an ocfs2 issue, and we'll
> try and replicate on ext4 / nfs, but I just thought it might be
> somehow related.
>
> [...]

Please report this to the ocfs2 developers.
Peter Zijlstra April 5, 2011, 7:49 a.m. UTC | #3
On Tue, 2011-03-22 at 12:35 +0200, Avi Kivity wrote:
> Looks like the posix-timer issue is completely gone, to be replaced by 
> the load balancer. 

-ENOINFO, no kernel version, no setup, no workload, no nothing.
Avi Kivity April 5, 2011, 8:16 a.m. UTC | #4
On 04/05/2011 10:49 AM, Peter Zijlstra wrote:
> On Tue, 2011-03-22 at 12:35 +0200, Avi Kivity wrote:
> >  Looks like the posix-timer issue is completely gone, to be replaced by
> >  the load balancer.
>
> -ENOINFO, no kernel version, no setup, no workload, no nothing.

http://www.spinics.net/lists/kvm/msg51526.html

plus patch in

http://www.spinics.net/lists/kvm/msg51595.html
Peter Zijlstra April 5, 2011, 8:48 a.m. UTC | #5
On Tue, 2011-03-22 at 12:35 +0200, Avi Kivity wrote:
>> Here's top with 96 idle guests running:

On some hacked up 2.6.38 kernel...

> > Start of perf report -g
> >      55.26%            kvm  [kernel.kallsyms]     [k] __ticket_spin_lock
> >                        |
> >                        --- __ticket_spin_lock
> >                           |
> >                           |--94.68%-- _raw_spin_lock
> >                           |          |
> >                           |          |--97.55%-- double_rq_lock
> >                           |          |          load_balance
> >                           |          |          idle_balance
> >                           |          |          schedule
> >                           |          |          |
> >                           |          |          |--60.56%-- schedule_hrtimeout_range_clock
> >                           |          |          |          schedule_hrtimeout_range
> >                           |          |          |          poll_schedule_timeout
> >                           |          |          |          do_select
> >                           |          |          |          core_sys_select
> >                           |          |          |          sys_select
> >                           |          |          |          system_call_fastpath

Looks like your workload and idle balancing don't much like each-other.

What I think is happening is that all your 'idle' qemu thingies keep
waking up frequently and because you've got like twice the number of
qemu instances as you've got cpus there's a fair chance you'll have a
cpu with a pending task while another one goes idle.

(Why does qemu keep waking if it's idle? broken NOHZ?)

So idle balancing is called when the cpu goes idle (context switch to
the idle thread) and tries to steal a pending task from another cpu,
clearly it keeps finding these tasks otherwise it wouldn't try to take
that lock.

Mike, you built in some idle balance throttle logic, but that seems
defeated here (possibly because it keeps finding pending tasks to
migrate? still need morning juice).


Avi Kivity April 5, 2011, 8:56 a.m. UTC | #6
On 04/05/2011 11:48 AM, Peter Zijlstra wrote:
> What I think is happening is that all your 'idle' qemu thingies keep
> waking up frequently and because you've got like twice the number of
> qemu instances as you've got cpus there's a fair chance you'll have a
> cpu with a pending task while another one goes idle.

Those qemus have two threads each (one vcpu thread, one io thread).

> (Why does qemu keep waking if it's idle? broken NOHZ?)

Could be waking up due to guest wakeups, or qemu internal wakeups 
(display refresh) or due to guest timer sources which are masked away in 
the guest (if that's the case we should optimize it away).
Mike Galbraith April 5, 2011, 8:56 a.m. UTC | #7
On Tue, 2011-04-05 at 10:48 +0200, Peter Zijlstra wrote:
> [...]
>
> Mike, you built in some idle balance throttle logic, but that seems
> defeated here (possibly because it keeps finding pending tasks to
> migrate? still need morning juice).

Hm.  Maybe someone set sysctl_sched_migration_cost to 0, or maybe the
wakeups just aren't arriving that frequently; that's all that pops to
mind atm.

	-Mike


Peter Zijlstra April 5, 2011, 9:03 a.m. UTC | #8
On Tue, 2011-04-05 at 11:56 +0300, Avi Kivity wrote:
> 
> Could be waking up due to guest wakeups, or qemu internal wakeups 
> (display refresh) or due to guest timer sources which are masked away in 
> the guest (if that's the case we should optimize it away).

Right, so I guess we're all clutching at straws here :-)

Ben, how usable is that system when it's in that state? Could you run a
function trace or a trace with all kvm and sched trace-events enabled?


Avi Kivity April 5, 2011, 9:08 a.m. UTC | #9
On 04/05/2011 12:03 PM, Peter Zijlstra wrote:
> On Tue, 2011-04-05 at 11:56 +0300, Avi Kivity wrote:
> >
> >  Could be waking up due to guest wakeups, or qemu internal wakeups
> >  (display refresh) or due to guest timer sources which are masked away in
> >  the guest (if that's the case we should optimize it away).
>
> Right, so I guess we're all clutching at straws here :-)
>
> Ben, how usable is that system when it's in that state? Could you run a
> function trace or a trace with all kvm and sched trace-events enabled?

Ben, to do that, look up http://www.linux-kvm.org/page/Tracing; the 
command for combined sched/kvm tracing is

# trace-cmd record -b 20000 -e kvm -e sched
Ben Nagy April 5, 2011, 9:50 a.m. UTC | #10
On Tue, Apr 5, 2011 at 2:48 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-04-05 at 11:56 +0300, Avi Kivity wrote:
>>
>> Could be waking up due to guest wakeups, or qemu internal wakeups
>> (display refresh) or due to guest timer sources which are masked away in
>> the guest (if that's the case we should optimize it away).
>
> Right, so I guess we're all clutching at straws here :-)
>
> Ben, how usable is that system when it's in that state? Could you run a
> function trace or a trace with all kvm and sched trace-events enabled?

I'm just rebuilding the storage on the network to work around an ocfs2
kernel oops (trying nfs/rdma) so I can't test anything just at the
moment.

I ran some tests under load with the local ext4 ssd, and, weirdly,
everything looked to be just fine - the huge bulk of the system time
was in svm_vcpu_run, which is as it should be I guess, but that was
with only 60 loaded guests.

I'll be able to repeat the same workload test tomorrow, and I'll see
how the perf top stuff looks. I should also be able to repeat the '96
idle guests' test and see if it's the same - if so we'll do that
tracing. My kernel's a moving target at the moment, sorry, we're
tracking the natty git (with Eric's rcu patch merged in).

Thanks, all,

ben

Patch

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d51243a..5dc27ca 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -81,6 +81,7 @@  struct k_itimer {
 			unsigned long expires;
 		} mmtimer;
 	} it;
+	struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4c01249..acb9be9 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -491,6 +491,13 @@  static struct k_itimer * alloc_posix_timer(void)
 	return tmr;
 }
 
+static void k_itimer_rcu_free(struct rcu_head *head)
+{
+	struct k_itimer *tmr = container_of(head, struct k_itimer, rcu);
+
+	kmem_cache_free(posix_timers_cache, tmr);
+}
+
 #define IT_ID_SET	1
 #define IT_ID_NOT_SET	0
 static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
@@ -503,7 +510,7 @@  static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	kmem_cache_free(posix_timers_cache, tmr);
+	call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
@@ -631,22 +638,18 @@  out:
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
 	struct k_itimer *timr;
-	/*
-	 * Watch out here.  We do a irqsave on the idr_lock and pass the
-	 * flags part over to the timer lock.  Must not let interrupts in
-	 * while we are moving the lock.
-	 */
-	spin_lock_irqsave(&idr_lock, *flags);
+
+	rcu_read_lock();
 	timr = idr_find(&posix_timers_id, (int)timer_id);
 	if (timr) {
-		spin_lock(&timr->it_lock);
+		spin_lock_irqsave(&timr->it_lock, *flags);
 		if (timr->it_signal == current->signal) {
-			spin_unlock(&idr_lock);
+			rcu_read_unlock();
 			return timr;
 		}
-		spin_unlock(&timr->it_lock);
+		spin_unlock_irqrestore(&timr->it_lock, *flags);
 	}
-	spin_unlock_irqrestore(&idr_lock, *flags);
+	rcu_read_unlock();
 
 	return NULL;
 }