| Message ID | 20230213163351.30704-1-minipli@grsecurity.net (mailing list archive) |
|---|---|
| State | New, archived |

On Mon, Feb 13, 2023, Mathias Krause wrote:
> Relayout members of struct kvm_vcpu and embedded structs to reduce its memory footprint. Not that it makes sense from a memory usage point of view (given how few of such objects get allocated), but this series manages to make it consume two cachelines less, which should provide a micro-architectural net win. However, I wasn't able to see a noticeable difference running benchmarks within a guest VM -- the VMEXIT costs are likely still high enough to mask any gains.

...

> Below is the high level pahole(1) diff. Most significant is the overall size change from 6688 to 6560 bytes, i.e. -128 bytes.

While part of me wishes KVM were more careful about struct layouts, IMO fiddling with per vCPU or per VM structures isn't worth the ongoing maintenance cost.

Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land) crosses a meaningful boundary, e.g. drops the size from an order-3 to order-2 allocation, the memory savings are negligible in the grand scheme. Assuming the kernel is even capable of perfectly packing vCPU allocations, saving even a few hundred bytes per vCPU is uninteresting unless the vCPU count gets reaaally high, and at that point the host likely has hundreds of GiB of memory, i.e. saving a few KiB is again uninteresting.

And as you observed, imperfect struct layouts are highly unlikely to have a measurable impact on performance. The types of operations that are involved in a world switch are just too costly for the layout to matter much. I do like to shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably more performant, doesn't require ongoing maintenance, and/or also improves the code quality.

I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry a sub-optimal layout and the change is arguably warranted even without the change in size. Ditto for kvm_pmu; logically, I think it makes sense to have the version at the very top.

But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. just isn't worth the cost in the long term unless there's a truly egregious field (or fields).
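
As a rough illustration of the "meaningful boundary" point (not code from the series; the kvm_vcpu sizes from the cover letter are used as a stand-in for the full vcpu_vmx/vcpu_svm allocation, and the power-of-two rounding is a simplification of the real slab/page-order behavior):

    #include <stdio.h>

    /* Round a size up to the next power of two, roughly what a power-of-two
     * size class or page-order based backing allocation would do.  This is a
     * simplification of the kernel's slab behavior, for illustration only. */
    static unsigned long bucket(unsigned long size)
    {
        unsigned long b = 1;

        while (b < size)
            b <<= 1;
        return b;
    }

    int main(void)
    {
        /* Sizes taken from the cover letter: struct kvm_vcpu before/after. */
        unsigned long before = 6688, after = 6560;

        printf("before: %lu bytes -> %lu byte bucket\n", before, bucket(before));
        printf("after:  %lu bytes -> %lu byte bucket\n", after, bucket(after));
        /* Both land in the same 8192 byte bucket: the 128 saved bytes only
         * matter once a size crosses a power-of-two (or page-order) boundary. */
        return 0;
    }
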
On 13.02.23 18:05, Sean Christopherson wrote:
> On Mon, Feb 13, 2023, Mathias Krause wrote:
>> Relayout members of struct kvm_vcpu and embedded structs to reduce its memory footprint. Not that it makes sense from a memory usage point of view (given how few of such objects get allocated), but this series manages to make it consume two cachelines less, which should provide a micro-architectural net win. However, I wasn't able to see a noticeable difference running benchmarks within a guest VM -- the VMEXIT costs are likely still high enough to mask any gains.
>
> ...
>
>> Below is the high level pahole(1) diff. Most significant is the overall size change from 6688 to 6560 bytes, i.e. -128 bytes.
>
> While part of me wishes KVM were more careful about struct layouts, IMO fiddling with per vCPU or per VM structures isn't worth the ongoing maintenance cost.
>
> Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land) crosses a meaningful boundary, e.g. drops the size from an order-3 to order-2 allocation, the memory savings are negligible in the grand scheme. Assuming the kernel is even capable of perfectly packing vCPU allocations, saving even a few hundred bytes per vCPU is uninteresting unless the vCPU count gets reaaally high, and at that point the host likely has hundreds of GiB of memory, i.e. saving a few KiB is again uninteresting.

Fully agree! That's why I said this change makes no sense from a memory usage point of view. The overall memory savings are not visible at all, given that the slab allocator isn't able to put more vCPU objects into a given slab page.

However, I still remain confident that this makes sense from a uarch point of view. Touching fewer cache lines should be a win -- even if I'm unable to measure it. By preserving more cachelines during a VMEXIT, guests should be able to resume their work faster (assuming they still need these cachelines).

> And as you observed, imperfect struct layouts are highly unlikely to have a measurable impact on performance. The types of operations that are involved in a world switch are just too costly for the layout to matter much. I do like to shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably more performant, doesn't require ongoing maintenance, and/or also improves the code quality.

Any pointers to measure the "more performant" aspect? I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's already counting cycles, but the numbers are too unstable, even if I pin the test to a given CPU, disable turbo mode, SMT, use the performance cpu governor, etc.

> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry a sub-optimal layout and the change is arguably warranted even without the change in size. Ditto for kvm_pmu; logically, I think it makes sense to have the version at the very top.

Yeah, was exactly thinking the same when modifying kvm_pmu.

> But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. just isn't worth the cost in the long term unless there's a truly egregious field (or fields).

Heh, just found this gem in vcpu_vmx:

  struct vcpu_vmx {
  [...]
          union vmx_exit_reason exit_reason;

          /* XXX 44 bytes hole, try to pack */

          /* --- cacheline 123 boundary (7872 bytes) --- */
          struct pi_desc pi_desc __attribute__((__aligned__(64)));
  [...]

So there are, in fact, some bigger holes left.
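
For reference, a minimal, self-contained illustration of where such a hole comes from (made-up field names, not the real vcpu_vmx layout): a 64-byte-aligned member forces padding up to the next cacheline boundary, so a small field sitting just past a boundary leaves a large gap.

    #include <stdio.h>
    #include <stddef.h>

    /* Stand-in for a cacheline-aligned member like pi_desc. */
    struct aligned_member {
        unsigned char data[64];
    } __attribute__((__aligned__(64)));

    struct demo {
        unsigned int exit_reason;   /* offset 0, size 4            */
        /* 60 bytes of padding: 'pi' must start on a 64-byte boundary */
        struct aligned_member pi;   /* offset 64                   */
    };

    int main(void)
    {
        printf("offsetof(pi) = %zu, sizeof(struct demo) = %zu\n",
               offsetof(struct demo, pi), sizeof(struct demo));
        /* Prints 64 and 128: reclaiming the hole means moving other small
         * members into the 60 padding bytes, or placing the aligned member
         * first. */
        return 0;
    }
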
Would be nice if pahole had a --density flag that would output some ASCII art, visualizing which bytes of a struct are allocated by real members and which ones are pure padding.
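
Something close to that can be approximated by post-processing pahole's regular output, since member lines carry their offset and size in a trailing comment (visible in the excerpt above and in the diff at the end). A rough, hypothetical sketch (not an existing tool):

    #include <stdio.h>
    #include <string.h>

    #define MAX_BYTES (64 * 1024)

    /* Feed this the output of "pahole -C <struct> <object>" on stdin.  It marks
     * every byte covered by a member and prints one character per byte, one row
     * per cacheline: '#' for member data, '.' for padding/holes. */
    int main(void)
    {
        static char used[MAX_BYTES];
        char line[512];
        int off, size, end = 0;

        while (fgets(line, sizeof(line), stdin)) {
            char *c = strstr(line, "/*");

            /* Member lines have two numbers (offset, size) right after the
             * comment opener; hole and cacheline annotations don't and are
             * skipped.  Bitfields and holes hidden inside nested aggregates
             * aren't handled. */
            if (!c || sscanf(c + 2, "%d %d", &off, &size) != 2)
                continue;
            if (off < 0 || size <= 0 || off + size > MAX_BYTES)
                continue;
            memset(used + off, 1, size);
            if (off + size > end)
                end = off + size;
        }

        for (int row = 0; row < end; row += 64) {
            printf("%5d  ", row);
            for (int i = row; i < row + 64 && i < end; i++)
                putchar(used[i] ? '#' : '.');
            putchar('\n');
        }
        return 0;
    }

Feeding it pahole output for vcpu_vmx would render the 44-byte hole above as a run of dots.
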
On Tue, Feb 14, 2023, Mathias Krause wrote:
> On 13.02.23 18:05, Sean Christopherson wrote:
> However, I still remain confident that this makes sense from a uarch point of view. Touching fewer cache lines should be a win -- even if I'm unable to measure it. By preserving more cachelines during a VMEXIT, guests should be able to resume their work faster (assuming they still need these cachelines).

Yes, but I'm skeptical that compacting struct kvm_vcpu will actually result in fewer cache lines being touched, at least not in a persistent, maintainable way. While every VM-Exit accesses a lot of state, it's most definitely still a small percentage of kvm_vcpu. And for the VM-Exits that really benefit from optimized handling, a.k.a. the fastpath exits, that percentage is even lower.

On x86, kvm_vcpu is actually comprised of three "major" structs: kvm_vcpu, kvm_vcpu_arch, and vcpu_{vmx,svm}. The majority of fields accessed on every VM-Exit are in the vendor part, vcpu_{vmx,svm}, but there are still high-touch fields in the common structures, e.g. registers in kvm_vcpu_arch and things like requests in kvm_vcpu.

Naively compacting any of the three (well, four) structures might result in fewer cache lines being touched, but mostly by coincidence. E.g. because compacting part of kvm_vcpu_arch happened to better align vcpu_vmx, not because of the compaction itself.

If we really wanted to optimize cache line usage, our best bet would be to identify the fields that are accessed in the fastpath run loop and then pack those fields into as few cache lines as possible. And to do that in a maintainable way, we'd probably need something like ASI[*] to allow us to detect changes that result in the fastpath handling accessing fields that aren't in the cache-optimized region.

I'm not necessarily opposed to such aggressive optimization, but the ROI is likely very, very low. For optimized workloads, there simply aren't very many VM-Exits, e.g. the majority of exits on a modern CPU are due to timer ticks. And even those will hopefully be eliminated in the not-too-distant future, e.g. by having hardware virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that allows for a tickless host.

https://lore.kernel.org/all/20220223052223.1202152-1-junaids@google.com

> > And as you observed, imperfect struct layouts are highly unlikely to have a measurable impact on performance. The types of operations that are involved in a world switch are just too costly for the layout to matter much. I do like to shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably more performant, doesn't require ongoing maintenance, and/or also improves the code quality.
>
> Any pointers to measure the "more performant" aspect?

TL;DR: not really.

What I've done in the past is run a very tight loop in the guest, and then measure latency from the host by hacking KVM. Measuring from the guest works, e.g. we have a variety of selftests that do exactly that, but when trying to shave cycles in the VM-Exit path, it doesn't take many host IRQs arriving at the "wrong" time to skew the measurement. My quick-and-dirty solution has always been to just hack KVM to measure latency with IRQs disabled, but a more maintainable approach would be to add smarts somewhere to sanitize the results, e.g. to throw away outliers where the guest likely got interrupted.

I believe we've talked about adding a selftest to measure fastpath latency, e.g. by writing MSR_IA32_TSC_DEADLINE in a tight loop.

However, that's not going to be useful in this case since you are interested in measuring the impact of reducing the host's L1D footprint. If the guest isn't cache-constrained, reducing the host's cache footprint isn't going to impact performance since there's no contention.

Running a micro benchmark in the guest that aims to measure cache performance might work, but presumably those are all userspace tests, i.e. you'd end up measuring the impact of the guest kernel too. And they wouldn't consistently trigger VM-Exits, so it would be difficult to prove the validity of the results.

I suppose you could write a dedicated selftest or port a micro benchmark to run as a selftest (or KVM-unit-test)?

> I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's already counting cycles, but the numbers are too unstable, even if I pin the test to a given CPU, disable turbo mode, SMT, use the performance cpu governor, etc.

Heh, you might have picked quite possibly the worst way to measure VM-Exit performance :-)

The guest code in that test that's measuring latency runs at L2. For VMREADs and VMWRITEs that are "passed-through" all the way to L2, no VM-Exit will occur (the access will be handled purely in ucode). And for accesses that do cause a VM-Exit, I'm pretty sure they all result in a nested VM-Exit, which is a _very_ heavy path (~10k cycles). Even if the exit is handled by KVM (in L0), it's still a relatively slow, heavy path.

> > I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry a sub-optimal layout and the change is arguably warranted even without the change in size. Ditto for kvm_pmu; logically, I think it makes sense to have the version at the very top.
>
> Yeah, was exactly thinking the same when modifying kvm_pmu.
>
> > But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. just isn't worth the cost in the long term unless there's a truly egregious field (or fields).
>
> Heh, just found this gem in vcpu_vmx:
>
>   struct vcpu_vmx {
>   [...]
>           union vmx_exit_reason exit_reason;
>
>           /* XXX 44 bytes hole, try to pack */
>
>           /* --- cacheline 123 boundary (7872 bytes) --- */
>           struct pi_desc pi_desc __attribute__((__aligned__(64)));
>   [...]
>
> So there are, in fact, some bigger holes left.

Ya. Again, I'm definitely ok cleaning up the truly heinous warts and/or doing a targeted, deliberate refactor of structures. What I don't want to do is shuffle fields around purely to save a few bytes here and there.
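
As a rough sketch of the "tight loop plus throw away outliers" approach described above (hypothetical user-space code meant to run inside a guest, not an existing KVM selftest; CPUID stands in for an operation that is known to exit, and rdtsc serialization is ignored for brevity):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>      /* __rdtsc() */

    #define SAMPLES 100000

    /* Stand-in for an operation that triggers a VM-Exit when run in a guest,
     * e.g. CPUID or an MSR write that the hypervisor intercepts. */
    static inline void exiting_op(void)
    {
        unsigned int eax = 0, ebx, ecx = 0, edx;

        __asm__ volatile("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    }

    int main(void)
    {
        static uint64_t samples[SAMPLES];
        uint64_t best = UINT64_MAX, sum = 0;
        int kept = 0;

        for (int i = 0; i < SAMPLES; i++) {
            uint64_t t0 = __rdtsc();

            exiting_op();
            samples[i] = __rdtsc() - t0;
            if (samples[i] < best)
                best = samples[i];
        }

        /* Crude outlier rejection: drop samples more than 10% above the best
         * one, on the assumption that those were hit by host IRQs or other
         * noise. */
        for (int i = 0; i < SAMPLES; i++) {
            if (samples[i] <= best + best / 10) {
                sum += samples[i];
                kept++;
            }
        }
        printf("min %llu cycles, filtered avg %llu cycles (%d samples kept)\n",
               (unsigned long long)best,
               (unsigned long long)(kept ? sum / kept : 0), kept);
        return 0;
    }
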
On 16.02.23 18:32, Sean Christopherson wrote:
> On Tue, Feb 14, 2023, Mathias Krause wrote:
>> On 13.02.23 18:05, Sean Christopherson wrote:
>> However, I still remain confident that this makes sense from a uarch point of view. Touching fewer cache lines should be a win -- even if I'm unable to measure it. By preserving more cachelines during a VMEXIT, guests should be able to resume their work faster (assuming they still need these cachelines).
>
> Yes, but I'm skeptical that compacting struct kvm_vcpu will actually result in fewer cache lines being touched, at least not in a persistent, maintainable way. While every VM-Exit accesses a lot of state, it's most definitely still a small percentage of kvm_vcpu. And for the VM-Exits that really benefit from optimized handling, a.k.a. the fastpath exits, that percentage is even lower.

Yeah, that's probably true.

> On x86, kvm_vcpu is actually comprised of three "major" structs: kvm_vcpu, kvm_vcpu_arch, and vcpu_{vmx,svm}. The majority of fields accessed on every VM-Exit are in the vendor part, vcpu_{vmx,svm}, but there are still high-touch fields in the common structures, e.g. registers in kvm_vcpu_arch and things like requests in kvm_vcpu.
>
> Naively compacting any of the three (well, four) structures might result in fewer cache lines being touched, but mostly by coincidence. E.g. because compacting part of kvm_vcpu_arch happened to better align vcpu_vmx, not because of the compaction itself.

Fortunately, kvm_vcpu is embedded as the first member in vcpu_{vmx,svm}, so all three share a common "header". Optimizations done for kvm_vcpu will therefore benefit the vendor-specific structures too. However, you're right that this will implicitly change the layout for the remainder of vcpu_{vmx,svm} and might even have a negative impact on cacheline usage. But, as my changes chop off exactly 128 bytes from kvm_vcpu, that's not the case here. Still, I can see that this is "coincidence" and fragile in the long run.

> If we really wanted to optimize cache line usage, our best bet would be to identify the fields that are accessed in the fastpath run loop and then pack those fields into as few cache lines as possible. And to do that in a maintainable way, we'd probably need something like ASI[*] to allow us to detect changes that result in the fastpath handling accessing fields that aren't in the cache-optimized region.
>
> I'm not necessarily opposed to such aggressive optimization, but the ROI is likely very, very low. For optimized workloads, there simply aren't very many VM-Exits, e.g. the majority of exits on a modern CPU are due to timer ticks. And even those will hopefully be eliminated in the not-too-distant future, e.g. by having hardware virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that allows for a tickless host.

Well, for guests running grsecurity kernels, there's also the CR0.WP toggling triggering VMEXITs, which happens a lot! -- at least until something along the lines of [1] gets merged *hint ;)*

[1] https://lore.kernel.org/all/20230201194604.11135-1-minipli@grsecurity.net/

> https://lore.kernel.org/all/20220223052223.1202152-1-junaids@google.com

Heh, that RFC is from February last year and it looks like it stalled at that point. But I guess you only meant patch 44 anyway, which splits up kvm_vcpu_arch: https://lore.kernel.org/all/20220223052223.1202152-45-junaids@google.com/.
It does that for other purposes, though, which might conflict with the performance aspect I'm mostly after here. Anyways, I got your point. If we care about cacheline footprint, we should do a more radical change and group hot members together instead of simply shrinking the structs involved.

>>> And as you observed, imperfect struct layouts are highly unlikely to have a measurable impact on performance. The types of operations that are involved in a world switch are just too costly for the layout to matter much. I do like to shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably more performant, doesn't require ongoing maintenance, and/or also improves the code quality.
>>
>> Any pointers to measure the "more performant" aspect?
>
> TL;DR: not really.
>
> What I've done in the past is run a very tight loop in the guest, and then measure latency from the host by hacking KVM. Measuring from the guest works, e.g. we have a variety of selftests that do exactly that, but when trying to shave cycles in the VM-Exit path, it doesn't take many host IRQs arriving at the "wrong" time to skew the measurement. My quick-and-dirty solution has always been to just hack KVM to measure latency with IRQs disabled, but a more maintainable approach would be to add smarts somewhere to sanitize the results, e.g. to throw away outliers where the guest likely got interrupted.
>
> I believe we've talked about adding a selftest to measure fastpath latency, e.g. by writing MSR_IA32_TSC_DEADLINE in a tight loop.
>
> However, that's not going to be useful in this case since you are interested in measuring the impact of reducing the host's L1D footprint. If the guest isn't cache-constrained, reducing the host's cache footprint isn't going to impact performance since there's no contention.

Yeah, it's hard to find a test case measuring the gains. I looked into running Linux userland workloads initially, but saw no real impact, as the stddev was already too high. But, as you pointed out, a micro-benchmark is of no use either, so it's all hand-waving only. :D

> Running a micro benchmark in the guest that aims to measure cache performance might work, but presumably those are all userspace tests, i.e. you'd end up measuring the impact of the guest kernel too. And they wouldn't consistently trigger VM-Exits, so it would be difficult to prove the validity of the results.

Yep. It's all just gut feeling, unfortunately.

> I suppose you could write a dedicated selftest or port a micro benchmark to run as a selftest (or KVM-unit-test)?
>
>> I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's already counting cycles, but the numbers are too unstable, even if I pin the test to a given CPU, disable turbo mode, SMT, use the performance cpu governor, etc.
>
> Heh, you might have picked quite possibly the worst way to measure VM-Exit performance :-)
>
> The guest code in that test that's measuring latency runs at L2. For VMREADs and VMWRITEs that are "passed-through" all the way to L2, no VM-Exit will occur (the access will be handled purely in ucode). And for accesses that do cause a VM-Exit, I'm pretty sure they all result in a nested VM-Exit, which is a _very_ heavy path (~10k cycles). Even if the exit is handled by KVM (in L0), it's still a relatively slow, heavy path.

I see. I'll have a look at the selftests and see if I can repurpose one of them. But, as you noted, a microbenchmark might not be what I'm after.
It's more about identifying the usage patterns of hot VMEXIT paths and optimizing those.

>>> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry a sub-optimal layout and the change is arguably warranted even without the change in size. Ditto for kvm_pmu; logically, I think it makes sense to have the version at the very top.
>>
>> Yeah, was exactly thinking the same when modifying kvm_pmu.
>>
>>> But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. just isn't worth the cost in the long term unless there's a truly egregious field (or fields).
>>
>> Heh, just found this gem in vcpu_vmx:
>>
>>   struct vcpu_vmx {
>>   [...]
>>           union vmx_exit_reason exit_reason;
>>
>>           /* XXX 44 bytes hole, try to pack */
>>
>>           /* --- cacheline 123 boundary (7872 bytes) --- */
>>           struct pi_desc pi_desc __attribute__((__aligned__(64)));
>>   [...]
>>
>> So there are, in fact, some bigger holes left.
>
> Ya. Again, I'm definitely ok cleaning up the truly heinous warts and/or doing a targeted, deliberate refactor of structures. What I don't want to do is shuffle fields around purely to save a few bytes here and there.

Got it. I'll back out the reshuffling ones and only keep the ones for kvm_pmu and kvm_mmu_memory_cache, as these are more like straight cleanups.

Thanks, Mathias
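
As a purely hypothetical sketch of the "group the hot members" direction mentioned above (made-up field names and sizes, not a proposal for the real structs): the fields the exit fastpath touches are packed into a leading, cacheline-aligned section, with a build-time assert documenting the intent.

    #include <stddef.h>
    #include <stdio.h>

    #define CACHELINE 64

    /* Everything the hot exit path touches, packed together and aligned so it
     * occupies as few cachelines as possible. */
    struct vcpu_hot {
        unsigned long requests;
        unsigned int mode;
        unsigned int exit_reason;
        unsigned long regs[16];
    } __attribute__((__aligned__(CACHELINE)));

    struct vcpu_demo {
        struct vcpu_hot hot;    /* touched on (almost) every exit       */
        char stats[1024];       /* cold: rarely touched in the fastpath */
        char config[2048];      /* cold: setup-time only                */
    };

    /* A build-time check documents (and enforces) the layout intent, which is
     * exactly what ad-hoc field shuffling lacks. */
    _Static_assert(offsetof(struct vcpu_demo, stats) % CACHELINE == 0,
                   "hot section must end on a cacheline boundary");

    int main(void)
    {
        printf("hot section: %zu bytes, whole struct: %zu bytes\n",
               sizeof(struct vcpu_hot), sizeof(struct vcpu_demo));
        return 0;
    }
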
On Fri, Feb 17, 2023, Mathias Krause wrote:
> On 16.02.23 18:32, Sean Christopherson wrote:
> > I'm not necessarily opposed to such aggressive optimization, but the ROI is likely very, very low. For optimized workloads, there simply aren't very many VM-Exits, e.g. the majority of exits on a modern CPU are due to timer ticks. And even those will hopefully be eliminated in the not-too-distant future, e.g. by having hardware virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that allows for a tickless host.
>
> Well, for guests running grsecurity kernels, there's also the CR0.WP toggling triggering VMEXITs, which happens a lot! -- at least until something along the lines of [1] gets merged *hint ;)*

Ha! It's high on my todo list for 6.4, catching up on other stuff at the moment.

That series is also _exactly_ why the ROI for aggressive cache line optimization is low. The better long-term answer is almost always to avoid the VM-Exit in the first place, or failing that, to handle the exit in a fastpath. Sometimes it takes a few years, e.g. to get necessary hardware support, but x86 virtualization is fast approaching the point where anything remotely performance critical is handled entirely within the guest.

--- kvm_vcpu.before	2023-02-13 14:13:49.919952154 +0100
+++ kvm_vcpu.after	2023-02-13 14:13:53.559952140 +0100
@@ -6,78 +6,60 @@
 	int vcpu_idx;                            /*    40     4 */
 	int ____srcu_idx;                        /*    44     4 */
 	int mode;                                /*    48     4 */
-
-	/* XXX 4 bytes hole, try to pack */
-
+	unsigned int guest_debug;                /*    52     4 */
 	u64 requests;                            /*    56     8 */
 	/* --- cacheline 1 boundary (64 bytes) --- */
-	long unsigned int guest_debug;           /*    64     8 */
-	struct mutex mutex;                      /*    72    32 */
-	struct kvm_run * run;                    /*   104     8 */
-	struct rcuwait wait;                     /*   112     8 */
-	struct pid * pid;                        /*   120     8 */
+	struct mutex mutex;                      /*    64    32 */
+	struct kvm_run * run;                    /*    96     8 */
+	struct rcuwait wait;                     /*   104     8 */
+	struct pid * pid;                        /*   112     8 */
+	sigset_t sigset;                         /*   120     8 */
 	/* --- cacheline 2 boundary (128 bytes) --- */
 	int sigset_active;                       /*   128     4 */
-
-	/* XXX 4 bytes hole, try to pack */
-
-	sigset_t sigset;                         /*   136     8 */
-	unsigned int halt_poll_ns;               /*   144     4 */
-	bool valid_wakeup;                       /*   148     1 */
+	unsigned int halt_poll_ns;               /*   132     4 */
+	bool valid_wakeup;                       /*   136     1 */

 	/* XXX 3 bytes hole, try to pack */

-	int mmio_needed;                         /*   152     4 */
-	int mmio_read_completed;                 /*   156     4 */
-	int mmio_is_write;                       /*   160     4 */
-	int mmio_cur_fragment;                   /*   164     4 */
-	int mmio_nr_fragments;                   /*   168     4 */
-
-	/* XXX 4 bytes hole, try to pack */
-
-	struct kvm_mmio_fragment mmio_fragments[2]; /*   176    48 */
-	/* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
+	int mmio_needed;                         /*   140     4 */
+	int mmio_read_completed;                 /*   144     4 */
+	int mmio_is_write;                       /*   148     4 */
+	int mmio_cur_fragment;                   /*   152     4 */
+	int mmio_nr_fragments;                   /*   156     4 */
+	struct kvm_mmio_fragment mmio_fragments[2]; /*   160    48 */
+	/* --- cacheline 3 boundary (192 bytes) was 16 bytes ago --- */
 	struct {
-		u32 queued;                      /*   224     4 */
-
-		/* XXX 4 bytes hole, try to pack */
-
-		struct list_head queue;          /*   232    16 */
-		struct list_head done;           /*   248    16 */
-		/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
-		spinlock_t lock;                 /*   264     4 */
-	} async_pf;                              /*   224    48 */
-
-	/* XXX last struct has 4 bytes of padding */
-
+		struct list_head queue;          /*   208    16 */
+		struct list_head done;           /*   224    16 */
+		spinlock_t lock;                 /*   240     4 */
+		u32 queued;                      /*   244     4 */
+	} async_pf;                              /*   208    40 */
 	struct {
-		bool in_spin_loop;               /*   272     1 */
-		bool dy_eligible;                /*   273     1 */
-	} spin_loop;                             /*   272     2 */
-	bool preempted;                          /*   274     1 */
-	bool ready;                              /*   275     1 */
+		bool in_spin_loop;               /*   248     1 */
+		bool dy_eligible;                /*   249     1 */
+	} spin_loop;                             /*   248     2 */
+	bool preempted;                          /*   250     1 */
+	bool ready;                              /*   251     1 */

 	/* XXX 4 bytes hole, try to pack */

-	struct kvm_vcpu_arch arch __attribute__((__aligned__(8))); /*   280  5208 */
-
-	/* XXX last struct has 6 bytes of padding */
-
-	/* --- cacheline 85 boundary (5440 bytes) was 48 bytes ago --- */
-	struct kvm_vcpu_stat stat;               /*  5488  1104 */
-	/* --- cacheline 103 boundary (6592 bytes) --- */
-	char stats_id[48];                       /*  6592    48 */
-	struct kvm_dirty_ring dirty_ring;        /*  6640    32 */
+	/* --- cacheline 4 boundary (256 bytes) --- */
+	struct kvm_vcpu_arch arch __attribute__((__aligned__(8))); /*   256  5104 */
+	/* --- cacheline 83 boundary (5312 bytes) was 48 bytes ago --- */
+	struct kvm_vcpu_stat stat;               /*  5360  1104 */
+	/* --- cacheline 101 boundary (6464 bytes) --- */
+	char stats_id[48];                       /*  6464    48 */
+	struct kvm_dirty_ring dirty_ring;        /*  6512    32 */

 	/* XXX last struct has 4 bytes of padding */

-	/* --- cacheline 104 boundary (6656 bytes) was 16 bytes ago --- */
-	struct kvm_memory_slot * last_used_slot; /*  6672     8 */
-	u64 last_used_slot_gen;                  /*  6680     8 */
-
-	/* size: 6688, cachelines: 105, members: 33 */
-	/* sum members: 6669, holes: 5, sum holes: 19 */
-	/* paddings: 3, sum paddings: 14 */
+	/* --- cacheline 102 boundary (6528 bytes) was 16 bytes ago --- */
+	struct kvm_memory_slot * last_used_slot; /*  6544     8 */
+	u64 last_used_slot_gen;                  /*  6552     8 */
+
+	/* size: 6560, cachelines: 103, members: 33 */
+	/* sum members: 6553, holes: 2, sum holes: 7 */
+	/* paddings: 1, sum paddings: 4 */
 	/* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
 	/* last cacheline: 32 bytes */
 } __attribute__((__aligned__(8)));