
[v2,02/12] KVM: x86: tag the instructions which are used to write page table

Message ID 4E37DA73.7010908@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong Aug. 2, 2011, 11:07 a.m. UTC
The idea is from Avi:
| tag instructions that are typically used to modify the page tables, and drop
| shadow if any other instruction is used
| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg, and
| cmpxchg8b

This patch tags those instructions; in a later patch in this series, the shadow
page is dropped if it is written by any other instruction

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_emulate.h |    1 +
 arch/x86/kvm/emulate.c             |    8 ++++++++
 arch/x86/kvm/x86.c                 |    1 +
 arch/x86/kvm/x86.h                 |    5 +++++
 4 files changed, 15 insertions(+), 0 deletions(-)

Comments

Avi Kivity Aug. 2, 2011, 12:20 p.m. UTC | #1
On 08/02/2011 02:07 PM, Xiao Guangrong wrote:
> The idea is from Avi:
> | tag instructions that are typically used to modify the page tables, and drop
> | shadow if any other instruction is used
> | The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg, and
> | cmpxchg8b
>
> This patch is used to tag the instructions and in the later path, shadow page
> is dropped if it is written by other instructions

We already have a mechanism for this, the decode tables.  Why not add a 
new flag, PageTable, and set it on all relevant instructions?

Note we don't need to actually emulate, just decode, since page_fault 
can tell us whether a write failed due to page tables or mmio.
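A minimal standalone sketch of the decode-table approach Avi suggests, as an alternative to tagging each em_*() handler by hand. All names here (PageTable, struct opcode, insn_is_page_table_write) are illustrative stand-ins modeled on the emulator's flag style, not the real KVM code:

```c
#include <assert.h>
#include <stdbool.h>

#define Lock      (1 << 0)   /* existing-style decode flag */
#define PageTable (1 << 1)   /* hypothetical new flag: typical PTE writer */

struct opcode {
	unsigned long flags;
	const char *mnemonic;
};

/* A tiny excerpt of a decode table with the flag set on the
 * instructions Avi listed (mov, cmpxchg, bts, ...). */
static const struct opcode table[] = {
	{ PageTable,        "mov"     },
	{ PageTable | Lock, "cmpxchg" },
	{ PageTable | Lock, "bts"     },
	{ 0,                "add"     },  /* not a typical PTE writer */
};

/* After decode, one flag test replaces the per-handler tag calls. */
static bool insn_is_page_table_write(const struct opcode *op)
{
	return op->flags & PageTable;
}
```

The advantage over the patch below is that the information lives in one table rather than being scattered across the emulation handlers.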
Xiao Guangrong Aug. 3, 2011, 6:02 a.m. UTC | #2
On 08/02/2011 08:20 PM, Avi Kivity wrote:
> On 08/02/2011 02:07 PM, Xiao Guangrong wrote:
>> The idea is from Avi:
>> | tag instructions that are typically used to modify the page tables, and drop
>> | shadow if any other instruction is used
>> | The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg, and
>> | cmpxchg8b
>>
>> This patch is used to tag the instructions and in the later path, shadow page
>> is dropped if it is written by other instructions
> 
> We already have a mechanism for this, the decode tables.  Why not add a new flag, PageTable, and set it on all relevant instructions?
> 

OK, will do it in the next version.

> Note we don't need to actually emulate, just decode, since page_fault can tell us whether a write failed due to page tables or mmio.
> 

This is an interesting feature. If it happens, I will just drop the shadow pages
and retry these instructions directly.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity Aug. 3, 2011, 8:09 a.m. UTC | #3
On 08/03/2011 09:02 AM, Xiao Guangrong wrote:
> >  Note we don't need to actually emulate, just decode, since page_fault can tell us whether a write failed due to page tables or mmio.
> >
>
> This is a interesting feature. If it happens, i will just drop the shadow pages
> and retry these instructions directly.

Note it's a little dangerous.  If the guest uses a non-page-table-modifying 
instruction on the PDE that points to the instruction itself, then we 
will unmap the instruction and go into an infinite loop.

Maybe it's better to emulate if we can't find a fix for that.

One way would be to emulate every 20 instructions; this breaks us out of 
the loop but reduces costly emulations to 5%.
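The "emulate every 20th" escape hatch could be sketched as below; the counter and threshold are illustrative, not real KVM fields:

```c
#include <assert.h>
#include <stdbool.h>

#define EMULATE_EVERY_N 20

struct vcpu_stub {
	unsigned int pf_retry_count;   /* hypothetical per-vcpu counter */
};

/* Retry (the fast path) most of the time, but fully emulate every
 * 20th write-protect fault so a guest stuck faulting on the same
 * instruction cannot loop forever. */
static bool should_emulate(struct vcpu_stub *vcpu)
{
	return ++vcpu->pf_retry_count % EMULATE_EVERY_N == 0;
}
```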
Xiao Guangrong Aug. 3, 2011, 9:24 a.m. UTC | #4
Hi Avi,

On 08/03/2011 04:09 PM, Avi Kivity wrote:
> On 08/03/2011 09:02 AM, Xiao Guangrong wrote:
>> >  Note we don't need to actually emulate, just decode, since page_fault can tell us whether a write failed due to page tables or mmio.
>> >
>>
>> This is a interesting feature. If it happens, i will just drop the shadow pages
>> and retry these instructions directly.
> 
> Note it's a little dangerous.  If the guest uses a non-page-table modifying instruction on the PDE that points to the instruction, then we will unmap the instruction and go to an infinite loop.
> 

Yes, it is.

> Maybe it's better to emulate if we can't find a fix for that.
> 
> One way would be to emulate every 20 instructions; this breaks us out of the loop but reduces costly emulations to 5%.
> 

After thinking about this more, maybe this optimization is not a good idea, since:
- it is a little complex
- it only applies to instruction emulation caused by #PF
- it does not improve things much:
  if we emulate the instruction, we need to:
  - decode the instruction
  - emulate it
  - zap the shadow pages
  After that we can return to the guest, and the guest can run the next instruction

  if we retry the instruction, we need to:
  - decode the instruction
  - zap the shadow pages
  then return to the guest and retry the instruction; however, we will get a page
  fault again (since the mapping is still read-only), so we take another VM-exit
  and need to:
  # trigger page fault
  - handle the page fault and change the mapping to writable
  - retry the instruction
  Only now can the guest run the next instruction

So I do not think the new way is better; what is your opinion?

Avi Kivity Aug. 3, 2011, 9:25 a.m. UTC | #5
On 08/03/2011 12:24 PM, Xiao Guangrong wrote:
> >  Maybe it's better to emulate if we can't find a fix for that.
> >
> >  One way would be to emulate every 20 instructions; this breaks us out of the loop but reduces costly emulations to 5%.
> >
>
> After much thought about this, may be this optimization is not good since:
> - it is little complex
> - this optimization is only applied to the instruction emulation caused by #PF
> - it does not improve too much:
>    if we emulate the instruction, we need to do:
>    - decode instruction
>    - emulate it
>    - zap shadow pages
>    And do this, it can return to the guest, the guest can run the next instruction
>
>    if we retry the instruction, we need to do:
>    - decode instruction
>    - zap shadow pages
>    then return to the guest and retry the instruction, however, we will get page fault
>    again(since the mapping is still read-only), so we will get another VM-exit and need
>    to do:
>    # trigger page fault
>    - handle the page fault and change the mapping to writable
>    - retry the instruction
>    until now, the guest can run the next instruction
>
> So, i do not think the new way is better, your opinion?

We can change to writeable and zap in the same exit, no?  Basically call 
page_fault() again after zapping.
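Avi's refinement removes the extra VM-exit Xiao describes: zap and then immediately re-run the fault handler in the same exit, so the mapping is already writable when the guest retries. A standalone sketch, with all names (zap_shadow_pages, page_fault, the mapping stub) standing in for the real KVM functions:

```c
#include <assert.h>
#include <stdbool.h>

struct mapping_stub {
	bool shadowed;   /* write-protected shadow page still present */
	bool writable;   /* guest mapping allows writes */
};

static void zap_shadow_pages(struct mapping_stub *m)
{
	m->shadowed = false;   /* drop the write-protected shadow */
}

static void page_fault(struct mapping_stub *m)
{
	m->writable = true;    /* rebuild the mapping as writable */
}

/* Retry path: zap, then call the fault handler again before returning,
 * so the guest's retried write succeeds without another VM-exit. */
static void handle_pt_write_fault(struct mapping_stub *m)
{
	zap_shadow_pages(m);
	page_fault(m);
}
```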
Avi Kivity Aug. 3, 2011, 9:40 a.m. UTC | #6
On 08/03/2011 12:41 PM, Xiao Guangrong wrote:
> OK. :-)
>
> Um, how about cache the last eip when we do this optimization, if we meet the same eip, we
> can break out the potential infinite loop?

Yes, that can work.
Xiao Guangrong Aug. 3, 2011, 9:41 a.m. UTC | #7
On 08/03/2011 05:25 PM, Avi Kivity wrote:
> On 08/03/2011 12:24 PM, Xiao Guangrong wrote:
>> >  Maybe it's better to emulate if we can't find a fix for that.
>> >
>> >  One way would be to emulate every 20 instructions; this breaks us out of the loop but reduces costly emulations to 5%.
>> >
>>
>> After much thought about this, may be this optimization is not good since:
>> - it is little complex
>> - this optimization is only applied to the instruction emulation caused by #PF
>> - it does not improve too much:
>>    if we emulate the instruction, we need to do:
>>    - decode instruction
>>    - emulate it
>>    - zap shadow pages
>>    And do this, it can return to the guest, the guest can run the next instruction
>>
>>    if we retry the instruction, we need to do:
>>    - decode instruction
>>    - zap shadow pages
>>    then return to the guest and retry the instruction, however, we will get page fault
>>    again(since the mapping is still read-only), so we will get another VM-exit and need
>>    to do:
>>    # trigger page fault
>>    - handle the page fault and change the mapping to writable
>>    - retry the instruction
>>    until now, the guest can run the next instruction
>>
>> So, i do not think the new way is better, your opinion?
> 
> We can change to writeable and zap in the same exit, no?  Basically call page_fault() again after zapping.
> 

OK. :-)

Um, how about caching the last eip when we do this optimization? If we meet the
same eip again, we can break out of the potential infinite loop.
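Xiao's loop-breaker could be sketched as below: remember the eip of the last fault handled by the fast retry path, and fall back to full emulation if the very next fault comes from the same eip. The field name last_retry_eip is illustrative:

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu_stub {
	unsigned long last_retry_eip;   /* hypothetical per-vcpu cache */
};

/* Returns true if the fast retry path may be used for this fault. */
static bool may_retry(struct vcpu_stub *vcpu, unsigned long eip)
{
	if (eip == vcpu->last_retry_eip)
		return false;   /* same eip twice in a row: likely a loop */
	vcpu->last_retry_eip = eip;
	return true;
}
```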

Patch

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 6040d11..049a6f5 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -244,6 +244,7 @@  struct x86_emulate_ctxt {
 	bool guest_mode; /* guest running a nested guest */
 	bool perm_ok; /* do not check permissions if true */
 	bool only_vendor_specific_insn;
+	bool page_table_written_insn;
 
 	bool have_exception;
 	struct x86_exception exception;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0453c07..3c027ac 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2516,6 +2516,7 @@  static int em_add(struct x86_emulate_ctxt *ctxt)
 static int em_or(struct x86_emulate_ctxt *ctxt)
 {
 	emulate_2op_SrcV("or", ctxt->src, ctxt->dst, ctxt->eflags);
+	tag_page_table_written_insn(ctxt);
 	return X86EMUL_CONTINUE;
 }
 
@@ -2534,6 +2535,7 @@  static int em_sbb(struct x86_emulate_ctxt *ctxt)
 static int em_and(struct x86_emulate_ctxt *ctxt)
 {
 	emulate_2op_SrcV("and", ctxt->src, ctxt->dst, ctxt->eflags);
+	tag_page_table_written_insn(ctxt);
 	return X86EMUL_CONTINUE;
 }
 
@@ -2572,6 +2574,7 @@  static int em_xchg(struct x86_emulate_ctxt *ctxt)
 	/* Write back the memory destination with implicit LOCK prefix. */
 	ctxt->dst.val = ctxt->src.orig_val;
 	ctxt->lock_prefix = 1;
+	tag_page_table_written_insn(ctxt);
 	return X86EMUL_CONTINUE;
 }
 
@@ -2610,6 +2613,7 @@  static int em_rdtsc(struct x86_emulate_ctxt *ctxt)
 static int em_mov(struct x86_emulate_ctxt *ctxt)
 {
 	ctxt->dst.val = ctxt->src.val;
+	tag_page_table_written_insn(ctxt);
 	return X86EMUL_CONTINUE;
 }
 
@@ -4135,6 +4139,7 @@  twobyte_insn:
 		break;
 	case 0xab:
 	      bts:		/* bts */
+		tag_page_table_written_insn(ctxt);
 		emulate_2op_SrcV_nobyte("bts", ctxt->src, ctxt->dst, ctxt->eflags);
 		break;
 	case 0xac: /* shrd imm8, r, r/m */
@@ -4148,6 +4153,7 @@  twobyte_insn:
 		 * Save real source value, then compare EAX against
 		 * destination.
 		 */
+		tag_page_table_written_insn(ctxt);
 		ctxt->src.orig_val = ctxt->src.val;
 		ctxt->src.val = ctxt->regs[VCPU_REGS_RAX];
 		emulate_2op_SrcV("cmp", ctxt->src, ctxt->dst, ctxt->eflags);
@@ -4192,6 +4198,7 @@  twobyte_insn:
 		break;
 	case 0xbb:
 	      btc:		/* btc */
+		tag_page_table_written_insn(ctxt);
 		emulate_2op_SrcV_nobyte("btc", ctxt->src, ctxt->dst, ctxt->eflags);
 		break;
 	case 0xbc: {		/* bsf */
@@ -4235,6 +4242,7 @@  twobyte_insn:
 							(u64) ctxt->src.val;
 		break;
 	case 0xc7:		/* Grp9 (cmpxchg8b) */
+		tag_page_table_written_insn(ctxt);
 		rc = em_grp9(ctxt);
 		break;
 	default:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ea8f9f0..cf6fb29 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4831,6 +4831,7 @@  int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 		ctxt->interruptibility = 0;
 		ctxt->have_exception = false;
 		ctxt->perm_ok = false;
+		ctxt->page_table_written_insn = false;
 
 		ctxt->only_vendor_specific_insn
 			= emulation_type & EMULTYPE_TRAP_UD;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d36fe23..b6e868f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -111,6 +111,11 @@  static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
 	return false;
 }
 
+static inline void tag_page_table_written_insn(struct x86_emulate_ctxt *ctxt)
+{
+	ctxt->page_table_written_insn = true;
+}
+
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);