Message ID | 20200522125214.31348-1-kirill.shutemov@linux.intel.com (mailing list archive) |
---|---|
Series | KVM protected memory extension |
On Fri, May 22, 2020 at 03:51:58PM +0300, Kirill A. Shutemov wrote:
> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.

CC people who worked on the related patchsets.

> == What does this set mitigate? ==
>
> - Host kernel "accidental" access to guest data (think speculation)
>
> - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
> - Host userspace access to guest data (compromised qemu)
>
> == What does this set NOT mitigate? ==
>
> - Full host kernel compromise. Kernel will just map the pages again.
>
> - Hardware attacks
>
>
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
>
> We are looking for high-level feedback on the concept. Some open
> questions:
>
> - This protects from some kernel and host userspace read-only attacks,
>   but does not place the host kernel outside the trust boundary. Is it
>   still valuable?
>
> - Can this approach be used to avoid cache-coherency problems with
>   hardware encryption schemes that repurpose physical bits?
>
> - The guest kernel must be modified for this to work. Is that a deal
>   breaker, especially for public clouds?
>
> - Are the costs of removing pages from the direct map too high to be
>   feasible?
>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it. This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless. But, this teaches us something very useful:
> neither the kernel nor userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest's page table while this approach uses a hypercall.
>
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes local TLB. I think it's a reasonable compromise between security and
> performance.
>
> Zapping the PTE would bring the page back to the direct mapping after clearing.
> At least for now, we don't remove file-backed pages from the direct mapping.
> File-backed pages could be accessed via read/write syscalls. It adds
> complexity.
>
> Occasionally, host kernel has to access guest memory that was not made
> shared by the guest. For instance, it happens for instruction emulation.
> Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with
> a kmap_atomic()-style mechanism and only then copy the data.
>
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.7-rc6 plus this patch:
>
> https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
>
> == Open Issues ==
>
> Unmapping the pages from direct mapping brings a few issues that have
> not been rectified yet:
>
> - Touching direct mapping leads to fragmentation. We need to be able to
>   recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>   It has to be fixed and tested properly
>
> - Page migration and KSM are not supported yet.
>
> - Live migration of a guest would require a new flow. Not sure yet what it
>   would look like.
>
> - The feature interferes with NUMA balancing. Not sure yet if it's
>   possible to make them work together.
>
> - Guests have no mechanism to ensure that even a well-behaving host has
>   unmapped its private data. With SEV, for instance, the guest only has
>   to trust the hardware to encrypt a page after the C bit is set in a
>   guest PTE. A mechanism for a guest to query the host mapping state, or
>   to constantly assert the intent for a page to be Private would be
>   valuable.
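
The PROT_NONE trick and the "map, copy, unmap" helpers described in the cover letter can be loosely illustrated from userspace. The sketch below is only an analogy, not code from the patchset: it uses plain mprotect() on an anonymous page, and the names probe_read(), protected_page_create() and copy_from_protected() are hypothetical. On a stock kernel a stray access to a PROT_NONE page raises SIGSEGV rather than the SIGBUS the series describes, so the probe catches both signals.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;

static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(fault_jmp, 1);
}

/* Returns 1 if a one-byte read of addr succeeds, 0 if it faults. */
int probe_read(const volatile char *addr)
{
    struct sigaction sa, old_segv, old_bus;
    int ok;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    if (sigsetjmp(fault_jmp, 1) == 0) {
        char c = *addr;     /* volatile read: really touches the page */
        (void)c;
        ok = 1;
    } else {
        ok = 0;             /* the fault handler jumped back here */
    }

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return ok;
}

/* Allocate one page holding `secret`, then drop all access to it. */
char *protected_page_create(const char *secret, size_t *size)
{
    long psz = sysconf(_SC_PAGESIZE);
    char *p;

    assert(psz > 0 && strlen(secret) < (size_t)psz);
    p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    strcpy(p, secret);
    if (mprotect(p, (size_t)psz, PROT_NONE) != 0) {
        munmap(p, (size_t)psz);
        return NULL;
    }
    *size = (size_t)psz;
    return p;
}

/* The one "blessed" accessor: open a read window, copy, close it again. */
int copy_from_protected(void *dst, char *page, size_t size,
                        size_t off, size_t len)
{
    if (off > size || len > size - off)
        return -1;
    if (mprotect(page, size, PROT_READ) != 0)
        return -1;
    memcpy(dst, page + off, len);
    return mprotect(page, size, PROT_NONE) == 0 ? 0 : -1;
}
```

In this model a direct read through probe_read() fails both before and after a copy, while copy_from_protected() succeeds. That mirrors the intent that only code paths that know about the protection can reach the data; the real series does this in the kernel through GUP with FOLL_KVM and a temporary kernel mapping, not through mprotect().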
On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.
>
>
> == What does this set mitigate? ==
>
> - Host kernel "accidental" access to guest data (think speculation)

Just to clarify: This is any host kernel memory info-leak vulnerability. Not
just speculative execution memory info-leaks. Also architectural ones.

In addition, note that removing guest data from host kernel VA space also
makes guest<->host memory exploits more difficult.
E.g. the guest cannot use an already available memory buffer in kernel VA
space for ROP or for placing valuable guest-controlled code/data in general.

>
> - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
> - Host userspace access to guest data (compromised qemu)

I don't quite understand what is the benefit of preventing userspace VMM
access to guest data while the host kernel can still access it.

QEMU is more easily compromised than the host kernel because its
guest<->host attack surface is larger (E.g. Various device emulation).
But this compromise comes from the guest itself. Not other guests. In
contrast to the host kernel attack surface, where an info-leak can
be exploited from one guest to leak another guest's data.

>
> == What does this set NOT mitigate? ==
>
> - Full host kernel compromise. Kernel will just map the pages again.
>
> - Hardware attacks
>
>
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
>
> We are looking for high-level feedback on the concept. Some open
> questions:
>
> - This protects from some kernel and host userspace read-only attacks,
>   but does not place the host kernel outside the trust boundary. Is it
>   still valuable?

I don't currently see a good argument for preventing host userspace access
to guest data while host kernel can still access it.
But there is definitely a strong benefit of mitigating kernel info-leaks
exploitable from one guest to leak another guest's data.

>
> - Can this approach be used to avoid cache-coherency problems with
>   hardware encryption schemes that repurpose physical bits?
>
> - The guest kernel must be modified for this to work. Is that a deal
>   breaker, especially for public clouds?
>
> - Are the costs of removing pages from the direct map too high to be
>   feasible?

If I remember correctly, this perf cost was too high when considering XPFO
(eXclusive Page Frame Ownership) patch-series.
This created two major perf costs:
1) Removing pages from direct-map prevented direct-map from simply being
entirely mapped as 1GB huge-pages.
2) Frequent allocation/free of userspace pages resulted in frequent TLB
invalidations.

Having said that, (1) can be mitigated in case guest data is completely
allocated from 1GB hugetlbfs to guarantee it will not
create smaller holes in direct-map. And (2) is not relevant for QEMU/KVM
use-case.

This makes me wonder:
XPFO patch-series, applied to the context of QEMU/KVM, seems to provide
exactly the functionality of this patch-series,
with the exception of the additional "feature" of preventing guest data from
also being accessible to host userspace VMM.
i.e. XPFO will unmap guest pages from host kernel direct-map while still
keeping them mapped in host userspace VMM page-tables.

If I understand correctly, this "feature" is what brings most of the extra
complexity of this patch-series compared to XPFO.
It requires guest modification to explicitly specify to host which pages can
be accessed by userspace VMM, it requires
changes to add new VM_KVM_PROTECTED VMA flag & FOLL_KVM for GUP, and it
creates issues with Live-Migration support.

So if there is no strong convincing argument for the motivation to prevent
userspace VMM access to guest data *while host kernel
can still access guest data*, I don't see a good reason for using this
approach.

Furthermore, I would like to point out that just unmapping guest data from
kernel direct-map is not sufficient to prevent all
guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
is because host kernel VA space has other regions
which contain guest sensitive data. For example, KVM per-vCPU struct (which
holds vCPU state) is allocated on slab and therefore
still leakable.

I recommend you have a look at my (and Alexandre Chartre's) KVM Forum 2019
talk on KVM ASI which provides extensive background
on the various attempts done by the community for mitigating host kernel
memory info-leaks exploitable by guest to leak other guests' data:
https://static.sched.com/hosted_files/kvmforum2019/34/KVM%20Forum%202019%20KVM%20ASI.pdf

>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it. This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless. But, this teaches us something very useful:
> neither the kernel nor userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest's page table while this approach uses a hypercall.
>
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes local TLB. I think it's a reasonable compromise between security and
> performance.
>
> Zapping the PTE would bring the page back to the direct mapping after clearing.
> At least for now, we don't remove file-backed pages from the direct mapping.
> File-backed pages could be accessed via read/write syscalls. It adds
> complexity.
>
> Occasionally, host kernel has to access guest memory that was not made
> shared by the guest. For instance, it happens for instruction emulation.
> Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with
> a kmap_atomic()-style mechanism and only then copy the data.
>
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.7-rc6 plus this patch:
>
> https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
>
> == Open Issues ==
>
> Unmapping the pages from direct mapping brings a few issues that have
> not been rectified yet:
>
> - Touching direct mapping leads to fragmentation. We need to be able to
>   recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>   It has to be fixed and tested properly

As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
will lead to holes in kernel direct-map which force it to not be mapped
anymore as a series of 1GB huge-pages.
This has a non-trivial performance cost. Thus, I am not sure addressing this
use-case is valuable.

>
> - Page migration and KSM are not supported yet.
>
> - Live migration of a guest would require a new flow. Not sure yet what it
>   would look like.

Note that Live-Migration issue is a result of not making guest data
accessible to host userspace VMM.

-Liran

>
> - The feature interferes with NUMA balancing. Not sure yet if it's
>   possible to make them work together.
>
> - Guests have no mechanism to ensure that even a well-behaving host has
>   unmapped its private data. With SEV, for instance, the guest only has
>   to trust the hardware to encrypt a page after the C bit is set in a
>   guest PTE. A mechanism for a guest to query the host mapping state, or
>   to constantly assert the intent for a page to be Private would be
>   valuable.
On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>
> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> > == Background / Problem ==
> >
> > There are a number of hardware features (MKTME, SEV) which protect guest
> > memory from some unauthorized host access. The patchset proposes a purely
> > software feature that mitigates some of the same host-side read-only
> > attacks.
> >
> >
> > == What does this set mitigate? ==
> >
> > - Host kernel "accidental" access to guest data (think speculation)
>
> Just to clarify: This is any host kernel memory info-leak vulnerability. Not
> just speculative execution memory info-leaks. Also architectural ones.
>
> In addition, note that removing guest data from host kernel VA space also
> makes guest<->host memory exploits more difficult.
> E.g. the guest cannot use an already available memory buffer in kernel VA
> space for ROP or for placing valuable guest-controlled code/data in general.
>
> >
> > - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> >
> > - Host userspace access to guest data (compromised qemu)
>
> I don't quite understand what is the benefit of preventing userspace VMM
> access to guest data while the host kernel can still access it.

Let me clarify: the guest memory mapped into host userspace is accessible
to neither the host kernel nor userspace. The host still has a way to access
it via a new interface: GUP(FOLL_KVM). The GUP will give you a struct page
that the kernel has to map (temporarily) if it needs to access the data. So
only blessed codepaths would know how to deal with the memory.

It can help prevent some host->guest attacks on a compromised host.
For example, if a VM has successfully attacked the host, it cannot attack
other VMs as easily.

It would also help to protect against guest->host attacks by removing one
more place where the guest's data is mapped on the host.

> QEMU is more easily compromised than the host kernel because its
> guest<->host attack surface is larger (E.g. Various device emulation).
> But this compromise comes from the guest itself. Not other guests. In
> contrast to the host kernel attack surface, where an info-leak can
> be exploited from one guest to leak another guest's data.

Consider the case when an unprivileged guest user exploits a bug in QEMU
device emulation to gain access to data it cannot normally access
within the guest. With the feature it would be able to see only other shared
regions of guest memory, such as DMA and IO buffers, but not the rest.

> >
> > == What does this set NOT mitigate? ==
> >
> > - Full host kernel compromise. Kernel will just map the pages again.
> >
> > - Hardware attacks
> >
> >
> > The patchset is RFC-quality: it works but has known issues that must be
> > addressed before it can be considered for applying.
> >
> > We are looking for high-level feedback on the concept. Some open
> > questions:
> >
> > - This protects from some kernel and host userspace read-only attacks,
> >   but does not place the host kernel outside the trust boundary. Is it
> >   still valuable?
> I don't currently see a good argument for preventing host userspace access
> to guest data while host kernel can still access it.
> But there is definitely a strong benefit of mitigating kernel info-leaks
> exploitable from one guest to leak another guest's data.
> >
> > - Can this approach be used to avoid cache-coherency problems with
> >   hardware encryption schemes that repurpose physical bits?
> >
> > - The guest kernel must be modified for this to work. Is that a deal
> >   breaker, especially for public clouds?
> >
> > - Are the costs of removing pages from the direct map too high to be
> >   feasible?
>
> If I remember correctly, this perf cost was too high when considering XPFO
> (eXclusive Page Frame Ownership) patch-series.
> This created two major perf costs:
> 1) Removing pages from direct-map prevented direct-map from simply being
> entirely mapped as 1GB huge-pages.
> 2) Frequent allocation/free of userspace pages resulted in frequent TLB
> invalidations.
>
> Having said that, (1) can be mitigated in case guest data is completely
> allocated from 1GB hugetlbfs to guarantee it will not
> create smaller holes in direct-map. And (2) is not relevant for QEMU/KVM
> use-case.

I'm too invested in THP to give it up to the ugly hugetlbfs. I think we can
do better :)

> This makes me wonder:
> XPFO patch-series, applied to the context of QEMU/KVM, seems to provide
> exactly the functionality of this patch-series,
> with the exception of the additional "feature" of preventing guest data from
> also being accessible to host userspace VMM.
> i.e. XPFO will unmap guest pages from host kernel direct-map while still
> keeping them mapped in host userspace VMM page-tables.
>
> If I understand correctly, this "feature" is what brings most of the extra
> complexity of this patch-series compared to XPFO.
> It requires guest modification to explicitly specify to host which pages can
> be accessed by userspace VMM, it requires
> changes to add new VM_KVM_PROTECTED VMA flag & FOLL_KVM for GUP, and it
> creates issues with Live-Migration support.
>
> So if there is no strong convincing argument for the motivation to prevent
> userspace VMM access to guest data *while host kernel
> can still access guest data*, I don't see a good reason for using this
> approach.

Well, I disagree with you here. See the few points above.

> Furthermore, I would like to point out that just unmapping guest data from
> kernel direct-map is not sufficient to prevent all
> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
> is because host kernel VA space has other regions
> which contain guest sensitive data. For example, KVM per-vCPU struct (which
> holds vCPU state) is allocated on slab and therefore
> still leakable.
>
> I recommend you have a look at my (and Alexandre Chartre's) KVM Forum 2019
> talk on KVM ASI which provides extensive background
> on the various attempts done by the community for mitigating host kernel
> memory info-leaks exploitable by guest to leak other guests' data:
> https://static.sched.com/hosted_files/kvmforum2019/34/KVM%20Forum%202019%20KVM%20ASI.pdf

Thanks, I'll read up on it.

> > == Series Overview ==
> >
> > The hardware features protect guest data by encrypting it and then
> > ensuring that only the right guest can decrypt it. This has the
> > side-effect of making the kernel direct map and userspace mapping
> > (QEMU et al) useless. But, this teaches us something very useful:
> > neither the kernel nor userspace mappings are really necessary for normal
> > guest operations.
> >
> > Instead of using encryption, this series simply unmaps the memory. One
> > advantage compared to allowing access to ciphertext is that it allows bad
> > accesses to be caught instead of simply reading garbage.
> >
> > Protection from physical attacks needs to be provided by some other means.
> > On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> > mitigation against physical attacks, such as DIMM interposers sniffing
> > memory bus traffic.
> >
> > The patchset modifies both host and guest kernel. The guest OS must enable
> > the feature via hypercall and mark any memory range that has to be shared
> > with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> > bit in the guest's page table while this approach uses a hypercall.
> >
> > For removing the userspace mapping, use a trick similar to what NUMA
> > balancing does: convert memory that belongs to KVM memory slots to
> > PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> > the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> > The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> > VMA must be treated in a special way in the GUP and fault paths. The flag
> > allows GUP to return the page even though it is mapped with PROT_NONE, but
> > only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> > to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> > would result in -EFAULT.
> >
> > Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> > the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> > flushes local TLB. I think it's a reasonable compromise between security and
> > performance.
> >
> > Zapping the PTE would bring the page back to the direct mapping after clearing.
> > At least for now, we don't remove file-backed pages from the direct mapping.
> > File-backed pages could be accessed via read/write syscalls. It adds
> > complexity.
> >
> > Occasionally, host kernel has to access guest memory that was not made
> > shared by the guest. For instance, it happens for instruction emulation.
> > Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> > now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> > helpers acquire the page via GUP, map it into kernel address space with
> > a kmap_atomic()-style mechanism and only then copy the data.
> >
> > For some instruction emulation copying is not good enough: cmpxchg
> > emulation has to have direct access to the guest memory. __kvm_map_gfn()
> > is modified to accommodate the case.
> >
> > The patchset is on top of v5.7-rc6 plus this patch:
> >
> > https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
> >
> > == Open Issues ==
> >
> > Unmapping the pages from direct mapping brings a few issues that have
> > not been rectified yet:
> >
> > - Touching direct mapping leads to fragmentation. We need to be able to
> >   recover from it. I have a buggy patch that aims at recovering 2M/1G page.
> >   It has to be fixed and tested properly
> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
> will lead to holes in kernel direct-map which force it to not be mapped
> anymore as a series of 1GB huge-pages.
> This has a non-trivial performance cost. Thus, I am not sure addressing this
> use-case is valuable.

Here's the buggy patch I've referred to:

http://lore.kernel.org/r/20200416213229.19174-1-kirill.shutemov@linux.intel.com

I plan to get it working right.

> >
> > - Page migration and KSM are not supported yet.
> >
> > - Live migration of a guest would require a new flow. Not sure yet what it
> >   would look like.
>
> Note that Live-Migration issue is a result of not making guest data
> accessible to host userspace VMM.

Yes, I understand.
On 25/05/2020 17:46, Kirill A. Shutemov wrote:
> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>>> == Background / Problem ==
>>>
>>> There are a number of hardware features (MKTME, SEV) which protect guest
>>> memory from some unauthorized host access. The patchset proposes a purely
>>> software feature that mitigates some of the same host-side read-only
>>> attacks.
>>>
>>>
>>> == What does this set mitigate? ==
>>>
>>> - Host kernel "accidental" access to guest data (think speculation)
>> Just to clarify: This is any host kernel memory info-leak vulnerability. Not
>> just speculative execution memory info-leaks. Also architectural ones.
>>
>> In addition, note that removing guest data from host kernel VA space also
>> makes guest<->host memory exploits more difficult.
>> E.g. the guest cannot use an already available memory buffer in kernel VA
>> space for ROP or for placing valuable guest-controlled code/data in general.
>>
>>> - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>>>
>>> - Host userspace access to guest data (compromised qemu)
>> I don't quite understand what is the benefit of preventing userspace VMM
>> access to guest data while the host kernel can still access it.
> Let me clarify: the guest memory mapped into host userspace is accessible
> to neither the host kernel nor userspace. The host still has a way to access
> it via a new interface: GUP(FOLL_KVM). The GUP will give you a struct page
> that the kernel has to map (temporarily) if it needs to access the data. So
> only blessed codepaths would know how to deal with the memory.

Yes, I understood that. I meant explicit host kernel access.

>
> It can help prevent some host->guest attacks on a compromised host.
> For example, if a VM has successfully attacked the host, it cannot attack
> other VMs as easily.

We have mechanisms to sandbox the userspace VMM process for that.

You need to be more specific about the attack scenario you are trying to
address here that is not covered by existing mechanisms. i.e. Be crystal
clear on the extra value of the feature of not exposing guest data to the
userspace VMM.

>
> It would also help to protect against guest->host attacks by removing one
> more place where the guest's data is mapped on the host.

Because the guest has an explicit interface to request which guest pages can
be mapped in the userspace VMM, the value of this is very small.

The guest already has the ability to map guest-controlled code/data in the
userspace VMM, either via this interface or via forcing the userspace VMM to
create various objects during device emulation handling. The only extra
property this patch-series provides is that only a small portion of guest
pages will be mapped to host userspace instead of all of it, resulting in
smaller regions for exploits that require guessing a virtual address. But:
(a) Userspace VMM device emulation may still allow the guest to spray the
userspace heap with objects containing guest-controlled data.
(b) How is the userspace VMM supposed to limit which guest pages should not
be mapped to the userspace VMM even though the guest has explicitly requested
them to be mapped? (E.g. because they are valid DMA sources/targets for
virtual devices or because it's a vGPU frame-buffer.)

>> QEMU is more easily compromised than the host kernel because its
>> guest<->host attack surface is larger (E.g. Various device emulation).
>> But this compromise comes from the guest itself. Not other guests. In
>> contrast to the host kernel attack surface, where an info-leak can
>> be exploited from one guest to leak another guest's data.
> Consider the case when an unprivileged guest user exploits a bug in QEMU
> device emulation to gain access to data it cannot normally access
> within the guest. With the feature it would be able to see only other shared
> regions of guest memory, such as DMA and IO buffers, but not the rest.

This is a scenario where an unprivileged guest userspace has direct access
to a virtual device and is able to exploit a bug in device emulation
handling such that it will allow it to compromise the security *inside* the
guest. i.e. Leak guest kernel data or other guest userspace processes' data.

That's true. Good point. This is a very important missing argument from the
cover-letter.

Now the trade-off considered here is crystal clear:
Are the extra complication and perf cost introduced by the mechanism of this
patch-series worth it to protect against the scenario of a userspace VMM
vulnerability that may be accessible to an unprivileged guest userspace
process to leak other *in-guest* data that is not otherwise accessible to
that process?

-Liran
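
The trade-off above turns on which guest pages a compromised userspace VMM can still read. A minimal model of that policy, assuming a 64-page guest and a per-page "shared" bit set by the guest, might look as follows. Every name here (guest_model, model_share_range(), model_vmm_can_read()) is hypothetical and stands in for the hypercall-based marking the cover letter describes; it is not the patchset's interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_PAGES 64u  /* hypothetical guest size, in pages */

struct guest_model {
    uint64_t shared;        /* bit n set: guest page n is shared with the host */
};

/* Models the guest marking pages [first, first + count) as shared. */
int model_share_range(struct guest_model *g, unsigned first, unsigned count)
{
    assert(g);
    if (first >= MODEL_NR_PAGES || count > MODEL_NR_PAGES - first)
        return -EINVAL;
    for (unsigned i = first; i < first + count; i++)
        g->shared |= UINT64_C(1) << i;
    return 0;
}

/* What a (possibly compromised) userspace VMM can read in this model. */
bool model_vmm_can_read(const struct guest_model *g, unsigned page)
{
    assert(g);
    return page < MODEL_NR_PAGES && ((g->shared >> page) & 1);
}
```

With pages 4 and 5 shared as, say, a DMA buffer, model_vmm_can_read() is true only for those two pages. That is the whole of the extra protection being debated: an in-guest attacker who gains control of the VMM sees the shared regions and nothing else.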
On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>
> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>
> Furthermore, I would like to point out that just unmapping guest data from
> kernel direct-map is not sufficient to prevent all
> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
> is because host kernel VA space has other regions
> which contain guest sensitive data. For example, KVM per-vCPU struct (which
> holds vCPU state) is allocated on slab and therefore
> still leakable.

Objects allocated from slab use the direct map; vmalloc() is another story.

> > - Touching direct mapping leads to fragmentation. We need to be able to
> >   recover from it. I have a buggy patch that aims at recovering 2M/1G page.
> >   It has to be fixed and tested properly
>
> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
> will lead to holes in kernel direct-map which force it to not be mapped
> anymore as a series of 1GB huge-pages.
> This has a non-trivial performance cost. Thus, I am not sure addressing this
> use-case is valuable.

Out of curiosity, do we actually have some numbers for the "non-trivial
performance cost"? For instance for KVM usecase?
On 26/05/2020 9:17, Mike Rapoport wrote:
> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>>
>> Furthermore, I would like to point out that just unmapping guest data from
>> kernel direct-map is not sufficient to prevent all
>> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
>> is because host kernel VA space has other regions
>> which contain guest sensitive data. For example, KVM per-vCPU struct (which
>> holds vCPU state) is allocated on slab and therefore
>> still leakable.
> Objects allocated from slab use the direct map; vmalloc() is another story.

It doesn't matter. This patch series, like XPFO, only removes guest memory
pages from the direct-map. Not things such as KVM per-vCPU structs.
That's why Julian & Marius (AWS) created the "Process local kernel VA
region" patch-series, which declares a single PGD entry, mapping a
kernelspace region, to have a different PFN in different tasks.
For more information, see the KVM Forum talk slides I gave in my previous
reply and the related AWS patch-series:
https://patchwork.kernel.org/cover/10990403/

>
>>> - Touching direct mapping leads to fragmentation. We need to be able to
>>>   recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>>>   It has to be fixed and tested properly
>> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
>> will lead to holes in kernel direct-map which force it to not be mapped
>> anymore as a series of 1GB huge-pages.
>> This has a non-trivial performance cost. Thus, I am not sure addressing this
>> use-case is valuable.
> Out of curiosity, do we actually have some numbers for the "non-trivial
> performance cost"? For instance for KVM usecase?
>

Dig into XPFO mailing-list discussions to find out...
I just remember that this was one of the main concerns regarding XPFO.

-Liran
On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote: > > On 26/05/2020 9:17, Mike Rapoport wrote: > > On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote: > > > On 22/05/2020 15:51, Kirill A. Shutemov wrote: > > > > > Out of curiosity, do we actually have some numbers for the "non-trivial > > performance cost"? For instance for KVM usecase? > > > Dig into XPFO mailing-list discussions to find out... > I just remember that this was one of the main concerns regarding XPFO. The XPFO benchmarks measure total XPFO cost, and a huge share of it comes from TLB shootdowns. It's not exactly a measurement of the impact of direct map fragmentation on a workload running inside a virtual machine. > -Liran
On 5/26/20 4:38 AM, Mike Rapoport wrote: > On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote: >> On 26/05/2020 9:17, Mike Rapoport wrote: >>> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote: >>>> On 22/05/2020 15:51, Kirill A. Shutemov wrote: >>>> >>> Out of curiosity, do we actually have some numbers for the "non-trivial >>> performance cost"? For instance for KVM usecase? >>> >> Dig into XPFO mailing-list discussions to find out... >> I just remember that this was one of the main concerns regarding XPFO. > The XPFO benchmarks measure total XPFO cost, and a huge share of it comes > from TLB shootdowns. Yes, TLB shootdown when pages transition between owners is huge. The XPFO folks did a lot of work to try to optimize some of this overhead away. But, it's still a concern. The concern with XPFO was that it could affect *all* application page allocation. This approach cheats a bit and only goes after guest VM pages. It's significantly more work to allocate a page and map it into a guest than it is to, for instance, allocate an anonymous user page. That means that the *additional* overhead of things like this for guest memory matters a lot less. > It's not exactly a measurement of the impact of direct map > fragmentation on a workload running inside a virtual machine. While the VM *itself* is running, there is zero overhead. The host direct map is not used at *all*. The guest and host TLB entries share the same space in the TLB so there could be some increased pressure on the TLB, but that's a really secondary effect. It would also only occur if the guest exits and the host runs and starts evicting TLB entries. The other effects I could think of would be when the guest exits and the host is doing some work for the guest, like emulation or something. The host would see worse TLB behavior because the host is using the (fragmented) direct map. But, both of those things require VMEXITs. The more exits, the more overhead you _might_ observe. 
What I've been hearing from KVM folks is that exits are getting more and more rare and the hardware designers are working hard to minimize them. That's especially good news because it means that even if the situation isn't perfect, it's only bound to get *better* over time, not worse.
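As a rough illustration of the fragmentation concern discussed in this thread, the sketch below counts how many leaf mappings cover a 1 GiB region of the direct map before and after a single 4 KiB page is unmapped from it. It assumes x86-64 page sizes (4 KiB, 2 MiB, 1 GiB) and that all unmapped pages fall inside the same 2 MiB range; the helper names are made up for this example and nothing here is taken from the patchset.

```python
# Hypothetical helper names; x86-64 page sizes are an assumption.
KIB = 1024
PAGE_4K = 4 * KIB
PAGE_2M = 2 * KIB * KIB
PAGE_1G = KIB * KIB * KIB


def entries_to_map(region_bytes, page_bytes):
    """Number of leaf entries needed to map a region with one page size."""
    assert region_bytes % page_bytes == 0
    return region_bytes // page_bytes


def leaf_entries_after_hole(holes_4k=1):
    """Leaf entries left for one 1 GiB mapping after unmapping 4 KiB pages.

    Best case: all holes sit in the same 2 MiB range. The 1 GiB entry is
    split into 512 x 2 MiB entries, and one of those is split into
    512 x 4 KiB entries, minus the unmapped pages.
    """
    assert 1 <= holes_4k <= 512
    remaining_2m = PAGE_1G // PAGE_2M - 1          # untouched 2 MiB entries
    remaining_4k = PAGE_2M // PAGE_4K - holes_4k   # still-mapped 4 KiB entries
    return remaining_2m + remaining_4k


if __name__ == "__main__":
    print(entries_to_map(PAGE_1G, PAGE_1G))   # 1
    print(entries_to_map(PAGE_1G, PAGE_2M))   # 512
    print(entries_to_map(PAGE_1G, PAGE_4K))   # 262144
    print(leaf_entries_after_hole(1))         # 1022
```

In this model a single 4 KiB hole turns one 1 GiB mapping into about a thousand smaller ones. Whether that produces measurable overhead for a KVM host is exactly the number nobody in the thread has yet.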
On Wed, May 27, 2020 at 08:45:33AM -0700, Dave Hansen wrote: > On 5/26/20 4:38 AM, Mike Rapoport wrote: > > On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote: > >> On 26/05/2020 9:17, Mike Rapoport wrote: > >>> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote: > >>>> On 22/05/2020 15:51, Kirill A. Shutemov wrote: > >>>> > >>> Out of curiosity, do we actually have some numbers for the "non-trivial > >>> performance cost"? For instance for KVM usecase? > >>> > >> Dig into XPFO mailing-list discussions to find out... > >> I just remember that this was one of the main concerns regarding XPFO. > > > > The XPFO benchmarks measure total XPFO cost, and a huge share of it comes > > from TLB shootdowns. > > Yes, TLB shootdown when pages transition between owners is huge. The > XPFO folks did a lot of work to try to optimize some of this overhead > away. But, it's still a concern. > > The concern with XPFO was that it could affect *all* application page > allocation. This approach cheats a bit and only goes after guest VM > pages. It's significantly more work to allocate a page and map it into > a guest than it is to, for instance, allocate an anonymous user page. > That means that the *additional* overhead of things like this for guest > memory matters a lot less. > > > It's not exactly a measurement of the impact of direct map > > fragmentation on a workload running inside a virtual machine. > > While the VM *itself* is running, there is zero overhead. The host > direct map is not used at *all*. The guest and host TLB entries share > the same space in the TLB so there could be some increased pressure on > the TLB, but that's a really secondary effect. It would also only occur > if the guest exits and the host runs and starts evicting TLB entries. > > The other effects I could think of would be when the guest exits and the > host is doing some work for the guest, like emulation or something. 
The > host would see worse TLB behavior because the host is using the > (fragmented) direct map. > > But, both of those things require VMEXITs. The more exits, the more > overhead you _might_ observe. What I've been hearing from KVM folks is > that exits are getting more and more rare and the hardware designers are > working hard to minimize them. Right, when the guest stays in guest mode, there is no overhead. But guests still exit sometimes, and I was wondering if anybody had measured the difference in overhead with different page sizes used for the host's direct map. My guesstimate is that the overhead will not differ much for most workloads. Still, it would be interesting to *know* what it is. > That's especially good news because it means that even if the > situation > isn't perfect, it's only bound to get *better* over time, not worse. Processors have been aggressively improving performance for decades, and see where we are now because of it ;-)
Hi Kirill, Thanks for this. On Fri, 22 May 2020 15:51:58 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote: > == Background / Problem == > > There are a number of hardware features (MKTME, SEV) which protect guest > memory from some unauthorized host access. The patchset proposes a purely > software feature that mitigates some of the same host-side read-only > attacks. > > > == What does this set mitigate? == > > - Host kernel ”accidental” access to guest data (think speculation) > > - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len)) > > - Host userspace access to guest data (compromised qemu) > > == What does this set NOT mitigate? == > > - Full host kernel compromise. Kernel will just map the pages again. > > - Hardware attacks Just as a heads up, we (the Android kernel team) are currently involved in something pretty similar for KVM/arm64 in order to bring some level of confidentiality to guests. The main idea is to de-privilege the host kernel by wrapping it in its own nested set of page tables which allows us to remove memory allocated to guests on a per-page basis. The core hypervisor runs more or less independently at its own privilege level. It still is KVM though, as we don't intend to reinvent the wheel. Will has written a much more lingo-heavy description here: https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/ This works for one of the virtualization modes that arm64 can use (what we call non-VHE, or nVHE for short). The other mode (VHE), is much more similar to what happens on other architectures, where the kernel and the hypervisor are one single entity. In this case, we cannot use the same trick with nested page tables, and have to rely on something that would very much look like what you're proposing. Note that the two modes of the architecture would benefit from this work anyway, as I'd like the host to know that we've pulled memory from under its feet. 
Since you have done most of the initial work, I intend to give it a go on arm64 shortly and see what sticks. Thanks, M.
+Jun On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote: > Hi Kirill, > > Thanks for this. > > On Fri, 22 May 2020 15:51:58 +0300 > "Kirill A. Shutemov" <kirill@shutemov.name> wrote: > > > == Background / Problem == > > > > There are a number of hardware features (MKTME, SEV) which protect guest > > memory from some unauthorized host access. The patchset proposes a purely > > software feature that mitigates some of the same host-side read-only > > attacks. > > > > > > == What does this set mitigate? == > > > > - Host kernel ”accidental” access to guest data (think speculation) > > > > - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len)) > > > > - Host userspace access to guest data (compromised qemu) > > > > == What does this set NOT mitigate? == > > > > - Full host kernel compromise. Kernel will just map the pages again. > > > > - Hardware attacks > > Just as a heads up, we (the Android kernel team) are currently > involved in something pretty similar for KVM/arm64 in order to bring > some level of confidentiality to guests. > > The main idea is to de-privilege the host kernel by wrapping it in its > own nested set of page tables which allows us to remove memory > allocated to guests on a per-page basis. The core hypervisor runs more > or less independently at its own privilege level. It still is KVM > though, as we don't intend to reinvent the wheel. > > Will has written a much more lingo-heavy description here: > https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/ Pardon my arm64 ignorance... IIUC, in this mode, the host kernel runs at EL1? And to switch to a guest it has to bounce through EL2, which is KVM, or at least a chunk of KVM? I assume the EL1->EL2->EL1 switch is done by trapping an exception of some form? If all of the above are "yes", does KVM already have the necessary logic to perform the EL1->EL2->EL1 switches, or is that being added as part of the de-privileging effort? 
> This works for one of the virtualization modes that arm64 can use (what > we call non-VHE, or nVHE for short). The other mode (VHE), is much more > similar to what happens on other architectures, where the kernel and > the hypervisor are one single entity. In this case, we cannot use the > same trick with nested page tables, and have to rely on something that > would very much look like what you're proposing. > > Note that the two modes of the architecture would benefit from this > work anyway, as I'd like the host to know that we've pulled memory > from under its feet. Since you have done most of the initial work, I > intend to give it a go on arm64 shortly and see what sticks.
Hi Sean, On 2020-06-04 16:48, Sean Christopherson wrote: > +Jun > > On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote: >> Hi Kirill, >> >> Thanks for this. >> >> On Fri, 22 May 2020 15:51:58 +0300 >> "Kirill A. Shutemov" <kirill@shutemov.name> wrote: >> >> > == Background / Problem == >> > >> > There are a number of hardware features (MKTME, SEV) which protect guest >> > memory from some unauthorized host access. The patchset proposes a purely >> > software feature that mitigates some of the same host-side read-only >> > attacks. >> > >> > >> > == What does this set mitigate? == >> > >> > - Host kernel ”accidental” access to guest data (think speculation) >> > >> > - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len)) >> > >> > - Host userspace access to guest data (compromised qemu) >> > >> > == What does this set NOT mitigate? == >> > >> > - Full host kernel compromise. Kernel will just map the pages again. >> > >> > - Hardware attacks >> >> Just as a heads up, we (the Android kernel team) are currently >> involved in something pretty similar for KVM/arm64 in order to bring >> some level of confidentiality to guests. >> >> The main idea is to de-privilege the host kernel by wrapping it in its >> own nested set of page tables which allows us to remove memory >> allocated to guests on a per-page basis. The core hypervisor runs more >> or less independently at its own privilege level. It still is KVM >> though, as we don't intend to reinvent the wheel. >> >> Will has written a much more lingo-heavy description here: >> https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/ > > Pardon my arm64 ignorance... > > IIUC, in this mode, the host kernel runs at EL1? And to switch to a > guest > it has to bounce through EL2, which is KVM, or at least a chunk of KVM? > I assume the EL1->EL2->EL1 switch is done by trapping an exception of > some > form? 
> > If all of the above are "yes", does KVM already have the necessary > logic to > perform the EL1->EL2->EL1 switches, or is that being added as part of > the > de-privileging effort? KVM already handles the EL1->EL2->EL1 madness, meaning that from an exception level perspective, the host kernel is already a guest. It's just that this guest can directly change the hypervisor's text, its page tables, and muck with about everything else. De-privileging the memory access to non host EL1 memory is where the ongoing effort is. M.
Hi Sean, On Thu, Jun 04, 2020 at 08:48:35AM -0700, Sean Christopherson wrote: > On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote: > > On Fri, 22 May 2020 15:51:58 +0300 > > "Kirill A. Shutemov" <kirill@shutemov.name> wrote: > > > > > == Background / Problem == > > > > > > There are a number of hardware features (MKTME, SEV) which protect guest > > > memory from some unauthorized host access. The patchset proposes a purely > > > software feature that mitigates some of the same host-side read-only > > > attacks. > > > > > > > > > == What does this set mitigate? == > > > > > > - Host kernel ”accidental” access to guest data (think speculation) > > > > > > - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len)) > > > > > > - Host userspace access to guest data (compromised qemu) > > > > > > == What does this set NOT mitigate? == > > > > > > - Full host kernel compromise. Kernel will just map the pages again. > > > > > > - Hardware attacks > > > > Just as a heads up, we (the Android kernel team) are currently > > involved in something pretty similar for KVM/arm64 in order to bring > > some level of confidentiality to guests. > > > > The main idea is to de-privilege the host kernel by wrapping it in its > > own nested set of page tables which allows us to remove memory > > allocated to guests on a per-page basis. The core hypervisor runs more > > or less independently at its own privilege level. It still is KVM > > though, as we don't intend to reinvent the wheel. > > > > Will has written a much more lingo-heavy description here: > > https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/ > > Pardon my arm64 ignorance... No, not at all! > IIUC, in this mode, the host kernel runs at EL1? And to switch to a guest > it has to bounce through EL2, which is KVM, or at least a chunk of KVM? > I assume the EL1->EL2->EL1 switch is done by trapping an exception of some > form? 
Yes, and this is actually the way that KVM works on some Arm CPUs today, as the original virtualisation extensions in the Armv8 architecture do not make it possible to run the kernel directly at EL2 (for example, there is only one page-table base register). This was later addressed in the architecture by the "Virtualisation Host Extensions (VHE)", and so KVM supports both options. With non-VHE today, there is a small amount of "world switch" code at EL2 which is installed by the host kernel and provides a way to transition between the host and the guest. If the host needs to do something at EL2 (e.g. privileged TLB invalidation), then it makes a hypercall (HVC instruction) via the kvm_call_hyp() macro (and this ends up just being a function call for VHE). > If all of the above are "yes", does KVM already have the necessary logic to > perform the EL1->EL2->EL1 switches, or is that being added as part of the > de-privileging effort? The logic is there as part of the non-VHE support code, but it's not great from a security angle. For example, the guest stage-2 page-tables are still allocated by the host, the host has complete access to guest and hypervisor memory (including hypervisor text) and things like kvm_call_hyp() are a bit of an open door. We're working on making the EL2 code more self contained, so that after the host has initialised KVM, it can shut the door and the hypervisor can install a stage-2 translation over the host, which limits its access to hypervisor and guest memory. There will clearly be IOMMU work as well to prevent DMA attacks. Will
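To make the direction described above a little more concrete, here is a toy model in plain Python (not kernel code) of a hypervisor that keeps a stage-2 view of host memory and drops pages handed to a guest from it. Every class and method name is invented for illustration, and the "seal" step is only a stand-in for the host "shutting the door" after initialising KVM; the actual KVM/arm64 design may look quite different.

```python
class ToyHypervisor:
    """Toy model: EL2 tracks which physical pages the host may access."""

    def __init__(self, total_pages):
        # Initially the host can reach every page (identity-style view).
        self.host_stage2 = set(range(total_pages))
        self.guest_pages = {}   # guest id -> set of page frame numbers
        self.sealed = False     # True once the host has "shut the door"

    def seal(self):
        """After KVM init, host accesses are checked against stage-2."""
        self.sealed = True

    def donate_to_guest(self, guest_id, pfn):
        """Move a page from the host to a guest; the host view loses it."""
        if pfn not in self.host_stage2:
            raise PermissionError(f"pfn {pfn} is not owned by the host")
        self.host_stage2.remove(pfn)
        self.guest_pages.setdefault(guest_id, set()).add(pfn)

    def host_access(self, pfn):
        """Model a host memory access; faults if sealed and not mapped."""
        if self.sealed and pfn not in self.host_stage2:
            raise PermissionError(f"stage-2 fault: host touched pfn {pfn}")
        return True


if __name__ == "__main__":
    hyp = ToyHypervisor(total_pages=8)
    hyp.donate_to_guest("vm0", 3)
    print(hyp.host_access(3))   # before sealing, nothing is enforced
    hyp.seal()
    try:
        hyp.host_access(3)
    except PermissionError as exc:
        print(exc)
```

The unsealed case mirrors the current non-VHE situation described above, where the host still has complete access to guest and hypervisor memory. Device DMA is deliberately left out of the model, which is the gap the IOMMU work is meant to cover.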
> > On Jun 4, 2020, at 9:35 AM, Will Deacon <will@kernel.org> wrote: > > Hi Sean, > > On Thu, Jun 04, 2020 at 08:48:35AM -0700, Sean Christopherson wrote: >> On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote: >>> On Fri, 22 May 2020 15:51:58 +0300 >>> "Kirill A. Shutemov" <kirill@shutemov.name> wrote: >>> >>>> == Background / Problem == >>>> >>>> There are a number of hardware features (MKTME, SEV) which protect guest >>>> memory from some unauthorized host access. The patchset proposes a purely >>>> software feature that mitigates some of the same host-side read-only >>>> attacks. >>>> >>>> >>>> == What does this set mitigate? == >>>> >>>> - Host kernel ”accidental” access to guest data (think speculation) >>>> >>>> - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len)) >>>> >>>> - Host userspace access to guest data (compromised qemu) >>>> >>>> == What does this set NOT mitigate? == >>>> >>>> - Full host kernel compromise. Kernel will just map the pages again. >>>> >>>> - Hardware attacks >>> >>> Just as a heads up, we (the Android kernel team) are currently >>> involved in something pretty similar for KVM/arm64 in order to bring >>> some level of confidentiality to guests. >>> >>> The main idea is to de-privilege the host kernel by wrapping it in its >>> own nested set of page tables which allows us to remove memory >>> allocated to guests on a per-page basis. The core hypervisor runs more >>> or less independently at its own privilege level. It still is KVM >>> though, as we don't intend to reinvent the wheel. >>> >>> Will has written a much more lingo-heavy description here: >>> https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/ >> We (Intel virtualization team) are also working on a similar thing, prototyping to meet such requirements, i.e. "some level of confidentiality to guests". 
Linux/KVM is the host, and Kirill’s patches are helpful when removing the mappings from the host to achieve memory isolation of a guest. But, it’s not easy to prove there are no other mappings. To raise the level of security, our idea is to de-privilege the host kernel just to enforce memory isolation using EPT (Extended Page Table) that virtualizes guest (the host kernel in this case) physical memory; almost everything is passthrough. And the EPT for the host kernel excludes the memory for the guest(s) that have confidential info. So, the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though). When control enters KVM, we go back to privileged (hypervisor or root) mode, and it works as it does today. Once a VM exit happens, we will stay in root mode as long as the exit can be handled within KVM. If we need to depend on the host kernel, we de-privilege the host kernel (i.e. VM enter). Yes, it sounds ugly. There are cleaner (but more expensive) approaches, and we are collecting data at this point. For example, we could run the host kernel (like Xen dom0) on top of a thin(?) hypervisor that consists of KVM and minimally configured Linux. > >> IIUC, in this mode, the host kernel runs at EL1? And to switch to a guest >> it has to bounce through EL2, which is KVM, or at least a chunk of KVM? >> I assume the EL1->EL2->EL1 switch is done by trapping an exception of some >> form? > > Yes, and this is actually the way that KVM works on some Arm CPUs today, > as the original virtualisation extensions in the Armv8 architecture do > not make it possible to run the kernel directly at EL2 (for example, there > is only one page-table base register). This was later addressed in the > architecture by the "Virtualisation Host Extensions (VHE)", and so KVM > supports both options. 
> > With non-VHE today, there is a small amount of "world switch" code at > EL2 which is installed by the host kernel and provides a way to transition > between the host and the guest. If the host needs to do something at EL2 > (e.g. privileged TLB invalidation), then it makes a hypercall (HVC instruction) > via the kvm_call_hyp() macro (and this ends up just being a function call > for VHE). > >> If all of the above are "yes", does KVM already have the necessary logic to >> perform the EL1->EL2->EL1 switches, or is that being added as part of the >> de-privileging effort? > > The logic is there as part of the non-VHE support code, but it's not great > from a security angle. For example, the guest stage-2 page-tables are still > allocated by the host, the host has complete access to guest and hypervisor > memory (including hypervisor text) and things like kvm_call_hyp() are a bit > of an open door. We're working on making the EL2 code more self contained, > so that after the host has initialised KVM, it can shut the door and the > hypervisor can install a stage-2 translation over the host, which limits its > access to hypervisor and guest memory. There will clearly be IOMMU work as > well to prevent DMA attacks. Sounds interesting. --- Jun Intel Open Source Technology Center
On Thu, Jun 4, 2020 at 12:09 PM Nakajima, Jun <jun.nakajima@intel.com> wrote: > We (Intel virtualization team) are also working on a similar thing, prototyping to meet such requirements, i.e. "some level of confidentiality to guests". Linux/KVM is the host, and Kirill’s patches are helpful when removing the mappings from the host to achieve memory isolation of a guest. But, it’s not easy to prove there are no other mappings. > > To raise the level of security, our idea is to de-privilege the host kernel just to enforce memory isolation using EPT (Extended Page Table) that virtualizes guest (the host kernel in this case) physical memory; almost everything is passthrough. And the EPT for the host kernel excludes the memory for the guest(s) that have confidential info. So, the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though). You're Intel. Can't you just change the CPUID intercept from required to optional? It seems like this should be in the realm of a small microcode patch.
> > On Jun 4, 2020, at 2:03 PM, Jim Mattson <jmattson@google.com> wrote: > > On Thu, Jun 4, 2020 at 12:09 PM Nakajima, Jun <jun.nakajima@intel.com> wrote: > >> We (Intel virtualization team) are also working on a similar thing, prototyping to meet such requirements, i.e. "some level of confidentiality to guests". Linux/KVM is the host, and Kirill’s patches are helpful when removing the mappings from the host to achieve memory isolation of a guest. But, it’s not easy to prove there are no other mappings. >> >> To raise the level of security, our idea is to de-privilege the host kernel just to enforce memory isolation using EPT (Extended Page Table) that virtualizes guest (the host kernel in this case) physical memory; almost everything is passthrough. And the EPT for the host kernel excludes the memory for the guest(s) that have confidential info. So, the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though). > > You're Intel. Can't you just change the CPUID intercept from required > to optional? It seems like this should be in the realm of a small > microcode patch. We’ll take a look. Probably it would be helpful even for the bare-metal kernel (e.g. debugging). Thanks for the suggestion. --- Jun Intel Open Source Technology Center