diff mbox series

[v3,22/33] KVM: PPC: Book3S HV: Handle page fault for a nested guest

Message ID 1538479892-14835-23-git-send-email-paulus@ozlabs.org (mailing list archive)
State New, archived
Headers show
Series KVM: PPC: Book3S HV: Nested HV virtualization | expand

Commit Message

Paul Mackerras Oct. 2, 2018, 11:31 a.m. UTC
From: Suraj Jitindar Singh <sjitindarsingh@gmail.com>

Consider a normal (L1) guest running under the main hypervisor (L0),
and then a nested guest (L2) running under the L1 guest which is acting
as a nested hypervisor. L0 has page tables to map the address space for
L1 providing the translation from L1 real address -> L0 real address;

	L1
	|
	| (L1 -> L0)
	|
	----> L0

There are also page tables in L1 used to map the address space for L2
providing the translation from L2 real address -> L1 read address. Since
the hardware can only walk a single level of page table, we need to
maintain in L0 a "shadow_pgtable" for L2 which provides the translation
from L2 real address -> L0 real address. Which looks like;

	L2				L2
	|				|
	| (L2 -> L1)			|
	|				|
	----> L1			| (L2 -> L0)
	      |				|
	      | (L1 -> L0)		|
	      |				|
	      ----> L0			--------> L0

When a page fault occurs while running a nested (L2) guest we need to
insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
do this we need to:

1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
   provide a page fault to L1 if this mapping doesn't exist.
2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
   or try to insert a pte for that mapping if it doesn't exist.
3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable

Once this mapping exists we can take rc faults when hardware is unable
to automatically set the reference and change bits in the pte. On these
we need to:

1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
   the fault down to L1.
2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
   host page.
3. Set the rc bits in the L2 -> L0 pte.

As we reuse a large number of functions in book3s_64_mmu_radix.c for
this we also needed to refactor a number of these functions to take
an lpid parameter so that the correct lpid is used for tlb invalidations.
The functionality however has remained the same.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
---
 .../powerpc/include/asm/book3s/64/tlbflush-radix.h |   1 +
 arch/powerpc/include/asm/kvm_book3s.h              |  19 ++
 arch/powerpc/include/asm/kvm_book3s_64.h           |   4 +
 arch/powerpc/include/asm/kvm_host.h                |   2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c             | 196 +++++++-----
 arch/powerpc/kvm/book3s_hv_nested.c                | 334 ++++++++++++++++++++-
 arch/powerpc/mm/tlb-radix.c                        |   9 +
 7 files changed, 478 insertions(+), 87 deletions(-)

Comments

David Gibson Oct. 3, 2018, 5:39 a.m. UTC | #1
On Tue, Oct 02, 2018 at 09:31:21PM +1000, Paul Mackerras wrote:
> From: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> 
> Consider a normal (L1) guest running under the main hypervisor (L0),
> and then a nested guest (L2) running under the L1 guest which is acting
> as a nested hypervisor. L0 has page tables to map the address space for
> L1 providing the translation from L1 real address -> L0 real address;
> 
> 	L1
> 	|
> 	| (L1 -> L0)
> 	|
> 	----> L0
> 
> There are also page tables in L1 used to map the address space for L2
> providing the translation from L2 real address -> L1 read address. Since
> the hardware can only walk a single level of page table, we need to
> maintain in L0 a "shadow_pgtable" for L2 which provides the translation
> from L2 real address -> L0 real address. Which looks like;
> 
> 	L2				L2
> 	|				|
> 	| (L2 -> L1)			|
> 	|				|
> 	----> L1			| (L2 -> L0)
> 	      |				|
> 	      | (L1 -> L0)		|
> 	      |				|
> 	      ----> L0			--------> L0
> 
> When a page fault occurs while running a nested (L2) guest we need to
> insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
> do this we need to:
> 
> 1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
>    provide a page fault to L1 if this mapping doesn't exist.
> 2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
>    or try to insert a pte for that mapping if it doesn't exist.
> 3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
> 
> Once this mapping exists we can take rc faults when hardware is unable
> to automatically set the reference and change bits in the pte. On these
> we need to:
> 
> 1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
>    the fault down to L1.
> 2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
>    host page.
> 3. Set the rc bits in the L2 -> L0 pte.
> 
> As we reuse a large number of functions in book3s_64_mmu_radix.c for
> this we also needed to refactor a number of these functions to take
> an lpid parameter so that the correct lpid is used for tlb invalidations.
> The functionality however has remained the same.
> 
> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

Some comments below, but no showstoppers, so,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  .../powerpc/include/asm/book3s/64/tlbflush-radix.h |   1 +
>  arch/powerpc/include/asm/kvm_book3s.h              |  19 ++
>  arch/powerpc/include/asm/kvm_book3s_64.h           |   4 +
>  arch/powerpc/include/asm/kvm_host.h                |   2 +
>  arch/powerpc/kvm/book3s_64_mmu_radix.c             | 196 +++++++-----
>  arch/powerpc/kvm/book3s_hv_nested.c                | 334 ++++++++++++++++++++-
>  arch/powerpc/mm/tlb-radix.c                        |   9 +
>  7 files changed, 478 insertions(+), 87 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> index 1154a6d..671316f 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> @@ -53,6 +53,7 @@ extern void radix__flush_tlb_lpid_page(unsigned int lpid,
>  					unsigned long addr,
>  					unsigned long page_size);
>  extern void radix__flush_pwc_lpid(unsigned int lpid);
> +extern void radix__flush_tlb_lpid(unsigned int lpid);
>  extern void radix__local_flush_tlb_lpid(unsigned int lpid);
>  extern void radix__local_flush_tlb_lpid_guest(unsigned int lpid);
>  
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index 0a97446..d983778 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -188,17 +188,34 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, unsigned long hc);
>  extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
>  			struct kvm_vcpu *vcpu,
>  			unsigned long ea, unsigned long dsisr);
> +extern int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
> +				      struct kvmppc_pte *gpte, u64 root,
> +				      u64 *pte_ret_p);
>  extern int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
>  			struct kvmppc_pte *gpte, u64 table,
>  			int table_index, u64 *pte_ret_p);
>  extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>  			struct kvmppc_pte *gpte, bool data, bool iswrite);
> +extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
> +				    bool writing, unsigned long gpa,
> +				    unsigned int lpid);
> +extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
> +				unsigned long gpa,
> +				struct kvm_memory_slot *memslot,
> +				bool writing, bool kvm_ro,
> +				pte_t *inserted_pte, unsigned int *levelp);
>  extern int kvmppc_init_vm_radix(struct kvm *kvm);
>  extern void kvmppc_free_radix(struct kvm *kvm);
> +extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
> +				      unsigned int lpid);
>  extern int kvmppc_radix_init(void);
>  extern void kvmppc_radix_exit(void);
>  extern int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  			unsigned long gfn);
> +extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
> +			     unsigned long gpa, unsigned int shift,
> +			     struct kvm_memory_slot *memslot,
> +			     unsigned int lpid);
>  extern int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  			unsigned long gfn);
>  extern int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
> @@ -289,6 +306,8 @@ void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
>  long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
>  int kvmhv_emulate_priv(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  			unsigned int instr);
> +int kvmhv_handle_nested_trap(struct kvm_run *run, struct kvm_vcpu *vcpu,
> +			     struct task_struct *tsk);
>  
>  void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
>  
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 6d67b6a..5496152 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -549,6 +549,10 @@ static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
>  }
>  #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
>  
> +extern int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
> +			     unsigned long gpa, unsigned int level,
> +			     unsigned long mmu_seq, unsigned int lpid);
> +
>  #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
>  
>  #endif /* __ASM_KVM_BOOK3S_64_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index ceb9f20..64c4807 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -367,7 +367,9 @@ struct kvmppc_pte {
>  	bool may_write		: 1;
>  	bool may_execute	: 1;
>  	unsigned long wimg;
> +	unsigned long rc;
>  	u8 page_size;		/* MMU_PAGE_xxx */
> +	u16 page_shift;

It's a bit ugly that this has both page_size and page_shift, which is
redundant information AFAICT.  Also, why does page_shift need to be
u16 - given that 2^255 bytes is much more than our supported address
space, let alone a plausible page size.

>  };
>  
>  struct kvmppc_mmu {
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index bd06a95..ee6f493 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -29,43 +29,16 @@
>   */
>  static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
>  
> -/*
> - * Used to walk a partition or process table radix tree in guest memory
> - * Note: We exploit the fact that a partition table and a process
> - * table have the same layout, a partition-scoped page table and a
> - * process-scoped page table have the same layout, and the 2nd
> - * doubleword of a partition table entry has the same layout as
> - * the PTCR register.
> - */
> -int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
> -				     struct kvmppc_pte *gpte, u64 table,
> -				     int table_index, u64 *pte_ret_p)
> +int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
> +			       struct kvmppc_pte *gpte, u64 root,
> +			       u64 *pte_ret_p)
>  {
>  	struct kvm *kvm = vcpu->kvm;
>  	int ret, level, ps;
> -	unsigned long ptbl, root;
> -	unsigned long rts, bits, offset;
> -	unsigned long size, index;
> -	struct prtb_entry entry;
> +	unsigned long rts, bits, offset, index;
>  	u64 pte, base, gpa;
>  	__be64 rpte;
>  
> -	if ((table & PRTS_MASK) > 24)
> -		return -EINVAL;
> -	size = 1ul << ((table & PRTS_MASK) + 12);
> -
> -	/* Is the table big enough to contain this entry? */
> -	if ((table_index * sizeof(entry)) >= size)
> -		return -EINVAL;
> -
> -	/* Read the table to find the root of the radix tree */
> -	ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
> -	ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
> -	if (ret)
> -		return ret;
> -
> -	/* Root is stored in the first double word */
> -	root = be64_to_cpu(entry.prtb0);

This refactoring somewhat obscures the changes directly relevant to
the nested guest handling.  Ideally it would be nice to fold some of
this into the earlier reworkings.

>  	rts = ((root & RTS1_MASK) >> (RTS1_SHIFT - 3)) |
>  		((root & RTS2_MASK) >> RTS2_SHIFT);
>  	bits = root & RPDS_MASK;
> @@ -79,6 +52,7 @@ int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
>  
>  	/* Walk each level of the radix tree */
>  	for (level = 3; level >= 0; --level) {
> +		u64 addr;
>  		/* Check a valid size */
>  		if (level && bits != p9_supported_radix_bits[level])
>  			return -EINVAL;
> @@ -90,10 +64,13 @@ int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
>  		if (base & ((1UL << (bits + 3)) - 1))
>  			return -EINVAL;
>  		/* Read the entry from guest memory */
> -		ret = kvm_read_guest(kvm, base + (index * sizeof(rpte)),
> -				     &rpte, sizeof(rpte));
> -		if (ret)
> +		addr = base + (index * sizeof(rpte));
> +		ret = kvm_read_guest(kvm, addr, &rpte, sizeof(rpte));
> +		if (ret) {
> +			if (pte_ret_p)
> +				*pte_ret_p = addr;
>  			return ret;
> +		}
>  		pte = __be64_to_cpu(rpte);
>  		if (!(pte & _PAGE_PRESENT))
>  			return -ENOENT;
> @@ -119,6 +96,7 @@ int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
>  		if (offset == mmu_psize_defs[ps].shift)
>  			break;
>  	gpte->page_size = ps;
> +	gpte->page_shift = offset;
>  
>  	gpte->eaddr = eaddr;
>  	gpte->raddr = gpa;
> @@ -128,12 +106,51 @@ int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
>  	gpte->may_write = !!(pte & _PAGE_WRITE);
>  	gpte->may_execute = !!(pte & _PAGE_EXEC);
>  
> +	gpte->rc = pte & (_PAGE_ACCESSED | _PAGE_DIRTY);
> +
>  	if (pte_ret_p)
>  		*pte_ret_p = pte;
>  
>  	return 0;
>  }
>  
> +/*
> + * Used to walk a partition or process table radix tree in guest memory
> + * Note: We exploit the fact that a partition table and a process
> + * table have the same layout, a partition-scoped page table and a
> + * process-scoped page table have the same layout, and the 2nd
> + * doubleword of a partition table entry has the same layout as
> + * the PTCR register.
> + */
> +int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
> +				     struct kvmppc_pte *gpte, u64 table,
> +				     int table_index, u64 *pte_ret_p)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	int ret;
> +	unsigned long size, ptbl, root;
> +	struct prtb_entry entry;
> +
> +	if ((table & PRTS_MASK) > 24)
> +		return -EINVAL;
> +	size = 1ul << ((table & PRTS_MASK) + 12);
> +
> +	/* Is the table big enough to contain this entry? */
> +	if ((table_index * sizeof(entry)) >= size)
> +		return -EINVAL;
> +
> +	/* Read the table to find the root of the radix tree */
> +	ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
> +	ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
> +	if (ret)
> +		return ret;
> +
> +	/* Root is stored in the first double word */
> +	root = be64_to_cpu(entry.prtb0);
> +
> +	return kvmppc_mmu_walk_radix_tree(vcpu, eaddr, gpte, root, pte_ret_p);
> +}
> +
>  int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>  			   struct kvmppc_pte *gpte, bool data, bool iswrite)
>  {
> @@ -181,7 +198,7 @@ int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>  }
>  
>  static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
> -				    unsigned int pshift)
> +				    unsigned int pshift, unsigned int lpid)
>  {
>  	unsigned long psize = PAGE_SIZE;
>  
> @@ -189,12 +206,12 @@ static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
>  		psize = 1UL << pshift;
>  
>  	addr &= ~(psize - 1);
> -	radix__flush_tlb_lpid_page(kvm->arch.lpid, addr, psize);
> +	radix__flush_tlb_lpid_page(lpid, addr, psize);
>  }
>  
> -static void kvmppc_radix_flush_pwc(struct kvm *kvm)
> +static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned int lpid)
>  {
> -	radix__flush_pwc_lpid(kvm->arch.lpid);
> +	radix__flush_pwc_lpid(lpid);
>  }
>  
>  static unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
> @@ -239,16 +256,17 @@ static void kvmppc_pmd_free(pmd_t *pmdp)
>  	kmem_cache_free(kvm_pmd_cache, pmdp);
>  }
>  
> -static void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
> -			     unsigned long gpa, unsigned int shift,
> -			     struct kvm_memory_slot *memslot)
> +void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
> +		      unsigned long gpa, unsigned int shift,
> +		      struct kvm_memory_slot *memslot,
> +		      unsigned int lpid)
>  
>  {
>  	unsigned long old;
>  
>  	old = kvmppc_radix_update_pte(kvm, pte, ~0UL, 0, gpa, shift);
> -	kvmppc_radix_tlbie_page(kvm, gpa, shift);
> -	if (old & _PAGE_DIRTY) {
> +	kvmppc_radix_tlbie_page(kvm, gpa, shift, lpid);
> +	if ((old & _PAGE_DIRTY) && (lpid == kvm->arch.lpid)) {
>  		unsigned long gfn = gpa >> PAGE_SHIFT;
>  		unsigned long page_size = PAGE_SIZE;
>  
> @@ -271,7 +289,8 @@ static void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
>   * and emit a warning if encountered, but there may already be data
>   * corruption due to the unexpected mappings.
>   */
> -static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t *pte, bool full)
> +static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t *pte, bool full,
> +				  unsigned int lpid)
>  {
>  	if (full) {
>  		memset(pte, 0, sizeof(long) << PTE_INDEX_SIZE);
> @@ -285,14 +304,15 @@ static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t *pte, bool full)
>  			WARN_ON_ONCE(1);
>  			kvmppc_unmap_pte(kvm, p,
>  					 pte_pfn(*p) << PAGE_SHIFT,
> -					 PAGE_SHIFT, NULL);
> +					 PAGE_SHIFT, NULL, lpid);
>  		}
>  	}
>  
>  	kvmppc_pte_free(pte);
>  }
>  
> -static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full)
> +static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full,
> +				  unsigned int lpid)
>  {
>  	unsigned long im;
>  	pmd_t *p = pmd;
> @@ -307,20 +327,21 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full)
>  				WARN_ON_ONCE(1);
>  				kvmppc_unmap_pte(kvm, (pte_t *)p,
>  					 pte_pfn(*(pte_t *)p) << PAGE_SHIFT,
> -					 PMD_SHIFT, NULL);
> +					 PMD_SHIFT, NULL, lpid);
>  			}
>  		} else {
>  			pte_t *pte;
>  
>  			pte = pte_offset_map(p, 0);
> -			kvmppc_unmap_free_pte(kvm, pte, full);
> +			kvmppc_unmap_free_pte(kvm, pte, full, lpid);
>  			pmd_clear(p);
>  		}
>  	}
>  	kvmppc_pmd_free(pmd);
>  }
>  
> -static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t *pud)
> +static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t *pud,
> +				  unsigned int lpid)
>  {
>  	unsigned long iu;
>  	pud_t *p = pud;
> @@ -334,36 +355,42 @@ static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t *pud)
>  			pmd_t *pmd;
>  
>  			pmd = pmd_offset(p, 0);
> -			kvmppc_unmap_free_pmd(kvm, pmd, true);
> +			kvmppc_unmap_free_pmd(kvm, pmd, true, lpid);
>  			pud_clear(p);
>  		}
>  	}
>  	pud_free(kvm->mm, pud);
>  }
>  
> -void kvmppc_free_radix(struct kvm *kvm)
> +void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd, unsigned int lpid)
>  {
>  	unsigned long ig;
> -	pgd_t *pgd;
>  
> -	if (!kvm->arch.pgtable)
> +	if (!pgd)

Not sure if this test is worth it, since most of the callers have to
test for a NULL pgtable for other reasons anyway.

>  		return;
> -	pgd = kvm->arch.pgtable;
>  	for (ig = 0; ig < PTRS_PER_PGD; ++ig, ++pgd) {
>  		pud_t *pud;
>  
>  		if (!pgd_present(*pgd))
>  			continue;
>  		pud = pud_offset(pgd, 0);
> -		kvmppc_unmap_free_pud(kvm, pud);
> +		kvmppc_unmap_free_pud(kvm, pud, lpid);
>  		pgd_clear(pgd);
>  	}
> -	pgd_free(kvm->mm, kvm->arch.pgtable);
> -	kvm->arch.pgtable = NULL;
> +}
> +
> +void kvmppc_free_radix(struct kvm *kvm)
> +{
> +	if (kvm->arch.pgtable) {
> +		kvmppc_free_pgtable_radix(kvm, kvm->arch.pgtable,
> +					  kvm->arch.lpid);
> +		pgd_free(kvm->mm, kvm->arch.pgtable);
> +		kvm->arch.pgtable = NULL;
> +	}
>  }
>  
>  static void kvmppc_unmap_free_pmd_entry_table(struct kvm *kvm, pmd_t *pmd,
> -					      unsigned long gpa)
> +					unsigned long gpa, unsigned int lpid)
>  {
>  	pte_t *pte = pte_offset_kernel(pmd, 0);
>  
> @@ -373,13 +400,13 @@ static void kvmppc_unmap_free_pmd_entry_table(struct kvm *kvm, pmd_t *pmd,
>  	 * flushing the PWC again.
>  	 */
>  	pmd_clear(pmd);
> -	kvmppc_radix_flush_pwc(kvm);
> +	kvmppc_radix_flush_pwc(kvm, lpid);
>  
> -	kvmppc_unmap_free_pte(kvm, pte, false);
> +	kvmppc_unmap_free_pte(kvm, pte, false, lpid);
>  }
>  
>  static void kvmppc_unmap_free_pud_entry_table(struct kvm *kvm, pud_t *pud,
> -					unsigned long gpa)
> +					unsigned long gpa, unsigned int lpid)
>  {
>  	pmd_t *pmd = pmd_offset(pud, 0);
>  
> @@ -389,9 +416,9 @@ static void kvmppc_unmap_free_pud_entry_table(struct kvm *kvm, pud_t *pud,
>  	 * so can be freed without flushing the PWC again.
>  	 */
>  	pud_clear(pud);
> -	kvmppc_radix_flush_pwc(kvm);
> +	kvmppc_radix_flush_pwc(kvm, lpid);
>  
> -	kvmppc_unmap_free_pmd(kvm, pmd, false);
> +	kvmppc_unmap_free_pmd(kvm, pmd, false, lpid);
>  }
>  
>  /*
> @@ -403,9 +430,9 @@ static void kvmppc_unmap_free_pud_entry_table(struct kvm *kvm, pud_t *pud,
>   */
>  #define PTE_BITS_MUST_MATCH (~(_PAGE_WRITE | _PAGE_DIRTY | _PAGE_ACCESSED))
>  
> -static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
> -			     unsigned long gpa, unsigned int level,
> -			     unsigned long mmu_seq)
> +int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
> +		      unsigned long gpa, unsigned int level,
> +		      unsigned long mmu_seq, unsigned int lpid)
>  {
>  	pgd_t *pgd;
>  	pud_t *pud, *new_pud = NULL;
> @@ -458,7 +485,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
>  			WARN_ON_ONCE((pud_val(*pud) ^ pte_val(pte)) &
>  							PTE_BITS_MUST_MATCH);
>  			kvmppc_radix_update_pte(kvm, (pte_t *)pud,
> -					      0, pte_val(pte), hgpa, PUD_SHIFT);
> +					0, pte_val(pte), hgpa, PUD_SHIFT);

This is an unconnected whitespace change, AFAICT.

>  			ret = 0;
>  			goto out_unlock;
>  		}
> @@ -471,7 +498,8 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
>  			goto out_unlock;
>  		}
>  		/* Valid 1GB page here already, remove it */
> -		kvmppc_unmap_pte(kvm, (pte_t *)pud, hgpa, PUD_SHIFT, NULL);
> +		kvmppc_unmap_pte(kvm, (pte_t *)pud, hgpa, PUD_SHIFT, NULL,
> +				 lpid);
>  	}
>  	if (level == 2) {
>  		if (!pud_none(*pud)) {
> @@ -480,7 +508,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
>  			 * install a large page, so remove and free the page
>  			 * table page.
>  			 */
> -			kvmppc_unmap_free_pud_entry_table(kvm, pud, gpa);
> +			kvmppc_unmap_free_pud_entry_table(kvm, pud, gpa, lpid);
>  		}
>  		kvmppc_radix_set_pte_at(kvm, gpa, (pte_t *)pud, pte);
>  		ret = 0;
> @@ -506,7 +534,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
>  			WARN_ON_ONCE((pmd_val(*pmd) ^ pte_val(pte)) &
>  							PTE_BITS_MUST_MATCH);
>  			kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd),
> -					      0, pte_val(pte), lgpa, PMD_SHIFT);
> +					0, pte_val(pte), lgpa, PMD_SHIFT);
>  			ret = 0;
>  			goto out_unlock;
>  		}
> @@ -520,7 +548,8 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
>  			goto out_unlock;
>  		}
>  		/* Valid 2MB page here already, remove it */
> -		kvmppc_unmap_pte(kvm, pmdp_ptep(pmd), lgpa, PMD_SHIFT, NULL);
> +		kvmppc_unmap_pte(kvm, pmdp_ptep(pmd), lgpa, PMD_SHIFT, NULL,
> +				 lpid);
>  	}
>  	if (level == 1) {
>  		if (!pmd_none(*pmd)) {
> @@ -529,7 +558,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
>  			 * install a large page, so remove and free the page
>  			 * table page.
>  			 */
> -			kvmppc_unmap_free_pmd_entry_table(kvm, pmd, gpa);
> +			kvmppc_unmap_free_pmd_entry_table(kvm, pmd, gpa, lpid);
>  		}
>  		kvmppc_radix_set_pte_at(kvm, gpa, pmdp_ptep(pmd), pte);
>  		ret = 0;
> @@ -569,8 +598,8 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
>  	return ret;
>  }
>  
> -static bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
> -				    bool writing, unsigned long gpa)
> +bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing,
> +			     unsigned long gpa, unsigned int lpid)
>  {
>  	unsigned long pgflags;
>  	unsigned int shift;
> @@ -597,11 +626,11 @@ static bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
>  	return false;
>  }
>  
> -static int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
> -				unsigned long gpa,
> -				struct kvm_memory_slot *memslot,
> -				bool writing, bool kvm_ro,
> -				pte_t *inserted_pte, unsigned int *levelp)
> +int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
> +				   unsigned long gpa,
> +				   struct kvm_memory_slot *memslot,
> +				   bool writing, bool kvm_ro,
> +				   pte_t *inserted_pte, unsigned int *levelp)
>  {
>  	struct kvm *kvm = vcpu->kvm;
>  	struct page *page = NULL;
> @@ -683,7 +712,7 @@ static int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
>  
>  	/* Allocate space in the tree and write the PTE */
>  	ret = kvmppc_create_pte(kvm, kvm->arch.pgtable, pte, gpa, level,
> -				mmu_seq);
> +				mmu_seq, kvm->arch.lpid);
>  	if (inserted_pte)
>  		*inserted_pte = pte;
>  	if (levelp)
> @@ -758,7 +787,7 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  	if (dsisr & DSISR_SET_RC) {
>  		spin_lock(&kvm->mmu_lock);
>  		if (kvmppc_hv_handle_set_rc(kvm, kvm->arch.pgtable,
> -					    writing, gpa))
> +					    writing, gpa, kvm->arch.lpid))
>  			dsisr &= ~DSISR_SET_RC;
>  		spin_unlock(&kvm->mmu_lock);
>  
> @@ -786,7 +815,8 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  
>  	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
>  	if (ptep && pte_present(*ptep))
> -		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot);
> +		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
> +				 kvm->arch.lpid);
>  	return 0;				
>  }
>  
> @@ -841,7 +871,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
>  			ret = 1 << (shift - PAGE_SHIFT);
>  		kvmppc_radix_update_pte(kvm, ptep, _PAGE_DIRTY, 0,
>  					gpa, shift);
> -		kvmppc_radix_tlbie_page(kvm, gpa, shift);
> +		kvmppc_radix_tlbie_page(kvm, gpa, shift, kvm->arch.lpid);
>  	}
>  	return ret;
>  }
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index f8f9fab..c160df3 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -12,10 +12,13 @@
>  #include <linux/kvm_host.h>
>  
>  #include <asm/kvm_ppc.h>
> +#include <asm/kvm_book3s.h>
>  #include <asm/mmu.h>
>  #include <asm/pgtable.h>
>  #include <asm/pgalloc.h>
> +#include <asm/pte-walk.h>
>  #include <asm/disassemble.h>
> +#include <asm/reg.h>
>  
>  static struct patb_entry *pseries_partition_tb;
>  
> @@ -393,10 +396,20 @@ struct kvm_nested_guest *kvmhv_alloc_nested(struct kvm *kvm, unsigned int lpid)
>   */
>  static void kvmhv_release_nested(struct kvm_nested_guest *gp)
>  {
> +	struct kvm *kvm = gp->l1_host;
> +
> +	if (gp->shadow_pgtable) {
> +		/*
> +		 * No vcpu is using this struct and no call to
> +		 * kvmhv_remove_nest_rmap can find this struct,
> +		 * so we don't need to hold kvm->mmu_lock.
> +		 */
> +		kvmppc_free_pgtable_radix(kvm, gp->shadow_pgtable,
> +					  gp->shadow_lpid);
> +		pgd_free(kvm->mm, gp->shadow_pgtable);
> +	}
>  	kvmhv_set_ptbl_entry(gp->shadow_lpid, 0, 0);
>  	kvmppc_free_lpid(gp->shadow_lpid);
> -	if (gp->shadow_pgtable)
> -		pgd_free(gp->l1_host->mm, gp->shadow_pgtable);
>  	kfree(gp);
>  }
>  
> @@ -453,6 +466,12 @@ void kvmhv_release_all_nested(struct kvm *kvm)
>  /* caller must hold gp->tlb_lock */
>  void kvmhv_flush_nested(struct kvm_nested_guest *gp)
>  {
> +	struct kvm *kvm = gp->l1_host;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	kvmppc_free_pgtable_radix(kvm, gp->shadow_pgtable, gp->shadow_lpid);
> +	spin_unlock(&kvm->mmu_lock);
> +	radix__flush_tlb_lpid(gp->shadow_lpid);
>  	kvmhv_update_ptbl_cache(gp);
>  	if (gp->l1_gr_to_hr == 0)
>  		kvmhv_remove_nested(gp);
> @@ -512,9 +531,28 @@ void kvmhv_put_nested(struct kvm_nested_guest *gp)
>  		kvmhv_release_nested(gp);
>  }
>  
> -long kvmhv_nested_page_fault(struct kvm_vcpu *vcpu)
> +static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
> +					struct kvm_nested_guest *gp,
> +					long gpa, int *shift_ret)
>  {
> -	return RESUME_HOST;
> +	struct kvm *kvm = vcpu->kvm;
> +	bool ret = false;
> +	pte_t *ptep;
> +	int shift;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
> +	if (!shift)
> +		shift = PAGE_SHIFT;
> +	if (ptep && pte_present(*ptep)) {
> +		kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid);
> +		ret = true;
> +	}
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	if (shift_ret)
> +		*shift_ret = shift;
> +	return ret;
>  }
>  
>  static int kvmhv_emulate_priv_mtspr(struct kvm_run *run, struct kvm_vcpu *vcpu,
> @@ -570,3 +608,291 @@ int kvmhv_emulate_priv(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  
>  	return rc;
>  }
> +
> +/* Used to convert a nested guest real address to a L1 guest real address */
> +static int kvmhv_translate_addr_nested(struct kvm_vcpu *vcpu,
> +				       struct kvm_nested_guest *gp,
> +				       unsigned long n_gpa, unsigned long dsisr,
> +				       struct kvmppc_pte *gpte_p)
> +{
> +	u64 fault_addr, flags = dsisr & DSISR_ISSTORE;
> +	int ret;
> +
> +	ret = kvmppc_mmu_walk_radix_tree(vcpu, n_gpa, gpte_p, gp->l1_gr_to_hr,
> +					 &fault_addr);
> +
> +	if (ret) {
> +		/* We didn't find a pte */
> +		if (ret == -EINVAL) {
> +			/* Unsupported mmu config */
> +			flags |= DSISR_UNSUPP_MMU;
> +		} else if (ret == -ENOENT) {
> +			/* No translation found */
> +			flags |= DSISR_NOHPTE;
> +		} else if (ret == -EFAULT) {
> +			/* Couldn't access L1 real address */
> +			flags |= DSISR_PRTABLE_FAULT;
> +			vcpu->arch.fault_gpa = fault_addr;
> +		} else {
> +			/* Unknown error */
> +			return ret;
> +		}
> +		goto resume_host;

This is effectively forwarding the fault to L1, yes?  In which case a
different name might be better than the ambiguous "resume_host".

> +	} else {
> +		/* We found a pte -> check permissions */
> +		if (dsisr & DSISR_ISSTORE) {
> +			/* Can we write? */
> +			if (!gpte_p->may_write) {
> +				flags |= DSISR_PROTFAULT;
> +				goto resume_host;
> +			}
> +		} else if (vcpu->arch.trap == BOOK3S_INTERRUPT_H_INST_STORAGE) {
> +			/* Can we execute? */
> +			if (!gpte_p->may_execute) {
> +				flags |= SRR1_ISI_N_OR_G;
> +				goto resume_host;
> +			}
> +		} else {
> +			/* Can we read? */
> +			if (!gpte_p->may_read && !gpte_p->may_write) {
> +				flags |= DSISR_PROTFAULT;
> +				goto resume_host;
> +			}
> +		}
> +	}
> +
> +	return 0;
> +
> +resume_host:
> +	vcpu->arch.fault_dsisr = flags;
> +	if (vcpu->arch.trap == BOOK3S_INTERRUPT_H_INST_STORAGE) {
> +		vcpu->arch.shregs.msr &= ~0x783f0000ul;
> +		vcpu->arch.shregs.msr |= flags;
> +	}
> +	return RESUME_HOST;
> +}
> +
> +static long kvmhv_handle_nested_set_rc(struct kvm_vcpu *vcpu,
> +				       struct kvm_nested_guest *gp,
> +				       unsigned long n_gpa,
> +				       struct kvmppc_pte gpte,
> +				       unsigned long dsisr)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	bool writing = !!(dsisr & DSISR_ISSTORE);
> +	u64 pgflags;
> +	bool ret;
> +
> +	/* Are the rc bits set in the L1 partition scoped pte? */
> +	pgflags = _PAGE_ACCESSED;
> +	if (writing)
> +		pgflags |= _PAGE_DIRTY;
> +	if (pgflags & ~gpte.rc)
> +		return RESUME_HOST;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	/* Set the rc bit in the pte of our (L0) pgtable for the L1 guest */
> +	ret = kvmppc_hv_handle_set_rc(kvm, kvm->arch.pgtable, writing,
> +				     gpte.raddr, kvm->arch.lpid);
> +	spin_unlock(&kvm->mmu_lock);
> +	if (!ret)
> +		return -EINVAL;
> +
> +	/* Set the rc bit in the pte of the shadow_pgtable for the nest guest */
> +	ret = kvmppc_hv_handle_set_rc(kvm, gp->shadow_pgtable, writing, n_gpa,
> +				      gp->shadow_lpid);
> +	if (!ret)
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static inline int kvmppc_radix_level_to_shift(int level)
> +{
> +	switch (level) {
> +	case 2:
> +		return PUD_SHIFT;
> +	case 1:
> +		return PMD_SHIFT;
> +	default:
> +		return PAGE_SHIFT;
> +	}
> +}
> +
> +static inline int kvmppc_radix_shift_to_level(int shift)
> +{
> +	if (shift == PUD_SHIFT)
> +		return 2;
> +	if (shift == PMD_SHIFT)
> +		return 1;
> +	if (shift == PAGE_SHIFT)
> +		return 0;
> +	WARN_ON_ONCE(1);
> +	return 0;
> +}
> +
> +/* called with gp->tlb_lock held */
> +static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
> +					  struct kvm_nested_guest *gp)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvm_memory_slot *memslot;
> +	struct kvmppc_pte gpte;
> +	pte_t pte, *pte_p;
> +	unsigned long mmu_seq;
> +	unsigned long dsisr = vcpu->arch.fault_dsisr;
> +	unsigned long ea = vcpu->arch.fault_dar;
> +	unsigned long n_gpa, gpa, gfn, perm = 0UL;
> +	unsigned int shift, l1_shift, level;
> +	bool writing = !!(dsisr & DSISR_ISSTORE);
> +	bool kvm_ro = false;
> +	long int ret;
> +
> +	if (!gp->l1_gr_to_hr) {
> +		kvmhv_update_ptbl_cache(gp);
> +		if (!gp->l1_gr_to_hr)
> +			return RESUME_HOST;
> +	}
> +
> +	/* Convert the nested guest real address into a L1 guest real address */
> +
> +	n_gpa = vcpu->arch.fault_gpa & ~0xF000000000000FFFULL;
> +	if (!(dsisr & DSISR_PRTABLE_FAULT))
> +		n_gpa |= ea & 0xFFF;
> +	ret = kvmhv_translate_addr_nested(vcpu, gp, n_gpa, dsisr, &gpte);
> +
> +	/*
> +	 * If the hardware found a translation but we don't now have a usable
> +	 * translation in the l1 partition-scoped tree, remove the shadow pte
> +	 * and let the guest retry.
> +	 */
> +	if (ret == RESUME_HOST &&
> +	    (dsisr & (DSISR_PROTFAULT | DSISR_BADACCESS | DSISR_NOEXEC_OR_G |
> +		      DSISR_BAD_COPYPASTE)))
> +		goto inval;
> +	if (ret)
> +		return ret;
> +
> +	/* Failed to set the reference/change bits */
> +	if (dsisr & DSISR_SET_RC) {
> +		ret = kvmhv_handle_nested_set_rc(vcpu, gp, n_gpa, gpte, dsisr);
> +		if (ret == RESUME_HOST)
> +			return ret;
> +		if (ret)
> +			goto inval;
> +		dsisr &= ~DSISR_SET_RC;
> +		if (!(dsisr & (DSISR_BAD_FAULT_64S | DSISR_NOHPTE |
> +			       DSISR_PROTFAULT)))
> +			return RESUME_GUEST;
> +	}
> +
> +	/*
> +	 * We took an HISI or HDSI while we were running a nested guest which
> +	 * means we have no partition scoped translation for that. This means
> +	 * we need to insert a pte for the mapping into our shadow_pgtable.
> +	 */
> +
> +	l1_shift = gpte.page_shift;
> +	if (l1_shift < PAGE_SHIFT) {
> +		/* We don't support l1 using a page size smaller than our own */
> +		pr_err("KVM: L1 guest page shift (%d) less than our own (%d)\n",
> +			l1_shift, PAGE_SHIFT);
> +		return -EINVAL;
> +	}
> +	gpa = gpte.raddr;
> +	gfn = gpa >> PAGE_SHIFT;
> +
> +	/* 1. Get the corresponding host memslot */
> +
> +	memslot = gfn_to_memslot(kvm, gfn);
> +	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
> +		if (dsisr & (DSISR_PRTABLE_FAULT | DSISR_BADACCESS)) {
> +			/* unusual error -> reflect to the guest as a DSI */
> +			kvmppc_core_queue_data_storage(vcpu, ea, dsisr);
> +			return RESUME_GUEST;
> +		}
> +		/* passthrough of emulated MMIO case... */
> +		pr_err("emulated MMIO passthrough?\n");
> +		return -EINVAL;
> +	}
> +	if (memslot->flags & KVM_MEM_READONLY) {
> +		if (writing) {
> +			/* Give the guest a DSI */
> +			kvmppc_core_queue_data_storage(vcpu, ea,
> +					DSISR_ISSTORE | DSISR_PROTFAULT);
> +			return RESUME_GUEST;
> +		}
> +		kvm_ro = true;
> +	}
> +
> +	/* 2. Find the host pte for this L1 guest real address */
> +
> +	/* Used to check for invalidations in progress */
> +	mmu_seq = kvm->mmu_notifier_seq;
> +	smp_rmb();
> +
> +	/* See if can find translation in our partition scoped tables for L1 */
> +	pte = __pte(0);
> +	spin_lock(&kvm->mmu_lock);
> +	pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
> +	if (!shift)
> +		shift = PAGE_SHIFT;
> +	if (pte_p)
> +		pte = *pte_p;
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) {
> +		/* No suitable pte found -> try to insert a mapping */
> +		ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot,
> +					writing, kvm_ro, &pte, &level);
> +		if (ret == -EAGAIN)
> +			return RESUME_GUEST;
> +		else if (ret)
> +			return ret;
> +		shift = kvmppc_radix_level_to_shift(level);
> +	}
> +
> +	/* 3. Compute the pte we need to insert for nest_gpa -> host r_addr */
> +
> +	/* The permissions is the combination of the host and l1 guest ptes */
> +	perm |= gpte.may_read ? 0UL : _PAGE_READ;
> +	perm |= gpte.may_write ? 0UL : _PAGE_WRITE;
> +	perm |= gpte.may_execute ? 0UL : _PAGE_EXEC;
> +	pte = __pte(pte_val(pte) & ~perm);
> +
> +	/* What size pte can we insert? */
> +	if (shift > l1_shift) {
> +		u64 mask;
> +		unsigned int actual_shift = PAGE_SHIFT;
> +		if (PMD_SHIFT < l1_shift)
> +			actual_shift = PMD_SHIFT;
> +		mask = (1UL << shift) - (1UL << actual_shift);
> +		pte = __pte(pte_val(pte) | (gpa & mask));
> +		shift = actual_shift;
> +	}
> +	level = kvmppc_radix_shift_to_level(shift);
> +	n_gpa &= ~((1UL << shift) - 1);
> +
> +	/* 4. Insert the pte into our shadow_pgtable */
> +
> +	ret = kvmppc_create_pte(kvm, gp->shadow_pgtable, pte, n_gpa, level,
> +				mmu_seq, gp->shadow_lpid);
> +	if (ret == -EAGAIN)
> +		ret = RESUME_GUEST;	/* Let the guest try again */
> +
> +	return ret;
> +
> + inval:
> +	kvmhv_invalidate_shadow_pte(vcpu, gp, n_gpa, NULL);
> +	return RESUME_GUEST;
> +}
> +
> +long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_nested_guest *gp = vcpu->arch.nested;
> +	long int ret;
> +
> +	mutex_lock(&gp->tlb_lock);
> +	ret = __kvmhv_nested_page_fault(vcpu, gp);
> +	mutex_unlock(&gp->tlb_lock);
> +	return ret;
> +}
> diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
> index fef3e1e..4c4dfc4 100644
> --- a/arch/powerpc/mm/tlb-radix.c
> +++ b/arch/powerpc/mm/tlb-radix.c
> @@ -833,6 +833,15 @@ EXPORT_SYMBOL_GPL(radix__flush_pwc_lpid);
>  /*
>   * Flush partition scoped translations from LPID (=LPIDR)
>   */
> +void radix__flush_tlb_lpid(unsigned int lpid)
> +{
> +	_tlbie_lpid(lpid, RIC_FLUSH_ALL);
> +}
> +EXPORT_SYMBOL_GPL(radix__flush_tlb_lpid);
> +
> +/*
> + * Flush partition scoped translations from LPID (=LPIDR)
> + */
>  void radix__local_flush_tlb_lpid(unsigned int lpid)
>  {
>  	_tlbiel_lpid(lpid, RIC_FLUSH_ALL);
David Gibson Oct. 3, 2018, 5:42 a.m. UTC | #2
On Wed, Oct 03, 2018 at 03:39:13PM +1000, David Gibson wrote:
> On Tue, Oct 02, 2018 at 09:31:21PM +1000, Paul Mackerras wrote:
> > From: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> > 
> > Consider a normal (L1) guest running under the main hypervisor (L0),
> > and then a nested guest (L2) running under the L1 guest which is acting
> > as a nested hypervisor. L0 has page tables to map the address space for
> > L1 providing the translation from L1 real address -> L0 real address;
> > 
> > 	L1
> > 	|
> > 	| (L1 -> L0)
> > 	|
> > 	----> L0
> > 
> > There are also page tables in L1 used to map the address space for L2
> > providing the translation from L2 real address -> L1 read address. Since
> > the hardware can only walk a single level of page table, we need to
> > maintain in L0 a "shadow_pgtable" for L2 which provides the translation
> > from L2 real address -> L0 real address. Which looks like;
> > 
> > 	L2				L2
> > 	|				|
> > 	| (L2 -> L1)			|
> > 	|				|
> > 	----> L1			| (L2 -> L0)
> > 	      |				|
> > 	      | (L1 -> L0)		|
> > 	      |				|
> > 	      ----> L0			--------> L0
> > 
> > When a page fault occurs while running a nested (L2) guest we need to
> > insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
> > do this we need to:
> > 
> > 1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
> >    provide a page fault to L1 if this mapping doesn't exist.
> > 2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
> >    or try to insert a pte for that mapping if it doesn't exist.
> > 3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
> > 
> > Once this mapping exists we can take rc faults when hardware is unable
> > to automatically set the reference and change bits in the pte. On these
> > we need to:
> > 
> > 1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
> >    the fault down to L1.
> > 2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
> >    host page.
> > 3. Set the rc bits in the L2 -> L0 pte.
> > 
> > As we reuse a large number of functions in book3s_64_mmu_radix.c for
> > this we also needed to refactor a number of these functions to take
> > an lpid parameter so that the correct lpid is used for tlb invalidations.
> > The functionality however has remained the same.
> > 
> > Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> > Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> 
> Some comments below, but no showstoppers, so,
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

One more, again not a showstopper:

> > @@ -393,10 +396,20 @@ struct kvm_nested_guest *kvmhv_alloc_nested(struct kvm *kvm, unsigned int lpid)
> >   */
> >  static void kvmhv_release_nested(struct kvm_nested_guest *gp)
> >  {
> > +	struct kvm *kvm = gp->l1_host;
> > +
> > +	if (gp->shadow_pgtable) {
> > +		/*
> > +		 * No vcpu is using this struct and no call to
> > +		 * kvmhv_remove_nest_rmap can find this struct,

It's kind of dubious that you're referring to kvmhv_remove_nest_rmap()
a patch before it is introduced.
Paul Mackerras Oct. 4, 2018, 9:21 a.m. UTC | #3
On Wed, Oct 03, 2018 at 03:39:13PM +1000, David Gibson wrote:
> On Tue, Oct 02, 2018 at 09:31:21PM +1000, Paul Mackerras wrote:
> > From: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> > @@ -367,7 +367,9 @@ struct kvmppc_pte {
> >  	bool may_write		: 1;
> >  	bool may_execute	: 1;
> >  	unsigned long wimg;
> > +	unsigned long rc;
> >  	u8 page_size;		/* MMU_PAGE_xxx */
> > +	u16 page_shift;
> 
> It's a bit ugly that this has both page_size and page_shift, which is
> redundant information AFAICT.  Also, why does page_shift need to be
> u16 - given that 2^255 bytes is much more than our supported address
> space, let alone a plausible page size.

These values are all essentially function outputs, so I don't think
it's ugly to have the same information in different forms.  I actually
don't like using the MMU_PAGE_xxx values, because the information in
the mmu_psize_defs[] array depends on the MMU mode of the host, but
KVM needs to be able to work with guests in both MMU modes.  More
generally I don't think it's a good idea that the KVM <-> guest
interface depends so much on what the host firmware tells us about the
physical machine we're on.  Thus I'm trying to move away from using
MMU_PSIZE_xxx values and mmu_psize_defs[] in KVM code.

I'll change the type to u8.

> > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > index bd06a95..ee6f493 100644
> > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > @@ -29,43 +29,16 @@
> >   */
> >  static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
> >  
> > -/*
> > - * Used to walk a partition or process table radix tree in guest memory
> > - * Note: We exploit the fact that a partition table and a process
> > - * table have the same layout, a partition-scoped page table and a
> > - * process-scoped page table have the same layout, and the 2nd
> > - * doubleword of a partition table entry has the same layout as
> > - * the PTCR register.
> > - */
> > -int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
> > -				     struct kvmppc_pte *gpte, u64 table,
> > -				     int table_index, u64 *pte_ret_p)
> > +int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
> > +			       struct kvmppc_pte *gpte, u64 root,
> > +			       u64 *pte_ret_p)
> >  {
> >  	struct kvm *kvm = vcpu->kvm;
> >  	int ret, level, ps;
> > -	unsigned long ptbl, root;
> > -	unsigned long rts, bits, offset;
> > -	unsigned long size, index;
> > -	struct prtb_entry entry;
> > +	unsigned long rts, bits, offset, index;
> >  	u64 pte, base, gpa;
> >  	__be64 rpte;
> >  
> > -	if ((table & PRTS_MASK) > 24)
> > -		return -EINVAL;
> > -	size = 1ul << ((table & PRTS_MASK) + 12);
> > -
> > -	/* Is the table big enough to contain this entry? */
> > -	if ((table_index * sizeof(entry)) >= size)
> > -		return -EINVAL;
> > -
> > -	/* Read the table to find the root of the radix tree */
> > -	ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
> > -	ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
> > -	if (ret)
> > -		return ret;
> > -
> > -	/* Root is stored in the first double word */
> > -	root = be64_to_cpu(entry.prtb0);
> 
> This refactoring somewhat obscures the changes directly relevant to
> the nested guest handling.  Ideally it would be nice to fold some of
> this into the earlier reworkings.

True, but given the rapidly approaching merge window, I'm not inclined
to rework it.

> > +	if (ret) {
> > +		/* We didn't find a pte */
> > +		if (ret == -EINVAL) {
> > +			/* Unsupported mmu config */
> > +			flags |= DSISR_UNSUPP_MMU;
> > +		} else if (ret == -ENOENT) {
> > +			/* No translation found */
> > +			flags |= DSISR_NOHPTE;
> > +		} else if (ret == -EFAULT) {
> > +			/* Couldn't access L1 real address */
> > +			flags |= DSISR_PRTABLE_FAULT;
> > +			vcpu->arch.fault_gpa = fault_addr;
> > +		} else {
> > +			/* Unknown error */
> > +			return ret;
> > +		}
> > +		goto resume_host;
> 
> This is effectively forwarding the fault to L1, yes?  In which case a
> different name might be better than the ambiguous "resume_host".

I'll change it to "forward_to_l1".

Paul.
David Gibson Oct. 5, 2018, 2:46 a.m. UTC | #4
On Thu, Oct 04, 2018 at 07:21:20PM +1000, Paul Mackerras wrote:
> On Wed, Oct 03, 2018 at 03:39:13PM +1000, David Gibson wrote:
> > On Tue, Oct 02, 2018 at 09:31:21PM +1000, Paul Mackerras wrote:
> > > From: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> > > @@ -367,7 +367,9 @@ struct kvmppc_pte {
> > >  	bool may_write		: 1;
> > >  	bool may_execute	: 1;
> > >  	unsigned long wimg;
> > > +	unsigned long rc;
> > >  	u8 page_size;		/* MMU_PAGE_xxx */
> > > +	u16 page_shift;
> > 
> > It's a bit ugly that this has both page_size and page_shift, which is
> > redundant information AFAICT.  Also, why does page_shift need to be
> > u16 - given that 2^255 bytes is much more than our supported address
> > space, let alone a plausible page size.
> 
> These values are all essentially function outputs, so I don't think
> it's ugly to have the same information in different forms.  I actually
> don't like using the MMU_PAGE_xxx values, because the information in
> the mmu_psize_defs[] array depends on the MMU mode of the host, but
> KVM needs to be able to work with guests in both MMU modes.  More
> generally I don't think it's a good idea that the KVM <-> guest
> interface depends so much on what the host firmware tells us about the
> physical machine we're on.  Thus I'm trying to move away from using
> MMU_PSIZE_xxx values and mmu_psize_defs[] in KVM code.

Fair enough.

> I'll change the type to u8.
> 
> > > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > index bd06a95..ee6f493 100644
> > > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > @@ -29,43 +29,16 @@
> > >   */
> > >  static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
> > >  
> > > -/*
> > > - * Used to walk a partition or process table radix tree in guest memory
> > > - * Note: We exploit the fact that a partition table and a process
> > > - * table have the same layout, a partition-scoped page table and a
> > > - * process-scoped page table have the same layout, and the 2nd
> > > - * doubleword of a partition table entry has the same layout as
> > > - * the PTCR register.
> > > - */
> > > -int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
> > > -				     struct kvmppc_pte *gpte, u64 table,
> > > -				     int table_index, u64 *pte_ret_p)
> > > +int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
> > > +			       struct kvmppc_pte *gpte, u64 root,
> > > +			       u64 *pte_ret_p)
> > >  {
> > >  	struct kvm *kvm = vcpu->kvm;
> > >  	int ret, level, ps;
> > > -	unsigned long ptbl, root;
> > > -	unsigned long rts, bits, offset;
> > > -	unsigned long size, index;
> > > -	struct prtb_entry entry;
> > > +	unsigned long rts, bits, offset, index;
> > >  	u64 pte, base, gpa;
> > >  	__be64 rpte;
> > >  
> > > -	if ((table & PRTS_MASK) > 24)
> > > -		return -EINVAL;
> > > -	size = 1ul << ((table & PRTS_MASK) + 12);
> > > -
> > > -	/* Is the table big enough to contain this entry? */
> > > -	if ((table_index * sizeof(entry)) >= size)
> > > -		return -EINVAL;
> > > -
> > > -	/* Read the table to find the root of the radix tree */
> > > -	ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
> > > -	ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
> > > -	if (ret)
> > > -		return ret;
> > > -
> > > -	/* Root is stored in the first double word */
> > > -	root = be64_to_cpu(entry.prtb0);
> > 
> > This refactoring somewhat obscures the changes directly relevant to
> > the nested guest handling.  Ideally it would be nice to fold some of
> > this into the earlier reworkings.
> 
> True, but given the rapidly approaching merge window, I'm not inclined
> to rework it.

Yeah, ok.

> 
> > > +	if (ret) {
> > > +		/* We didn't find a pte */
> > > +		if (ret == -EINVAL) {
> > > +			/* Unsupported mmu config */
> > > +			flags |= DSISR_UNSUPP_MMU;
> > > +		} else if (ret == -ENOENT) {
> > > +			/* No translation found */
> > > +			flags |= DSISR_NOHPTE;
> > > +		} else if (ret == -EFAULT) {
> > > +			/* Couldn't access L1 real address */
> > > +			flags |= DSISR_PRTABLE_FAULT;
> > > +			vcpu->arch.fault_gpa = fault_addr;
> > > +		} else {
> > > +			/* Unknown error */
> > > +			return ret;
> > > +		}
> > > +		goto resume_host;
> > 
> > This is effectively forwarding the fault to L1, yes?  In which case a
> > different name might be better than the ambiguous "resume_host".
> 
> I'll change it to "forward_to_l1".

Thanks.

> 
> Paul.
> 
>
diff mbox series

Patch

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 1154a6d..671316f 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -53,6 +53,7 @@  extern void radix__flush_tlb_lpid_page(unsigned int lpid,
 					unsigned long addr,
 					unsigned long page_size);
 extern void radix__flush_pwc_lpid(unsigned int lpid);
+extern void radix__flush_tlb_lpid(unsigned int lpid);
 extern void radix__local_flush_tlb_lpid(unsigned int lpid);
 extern void radix__local_flush_tlb_lpid_guest(unsigned int lpid);
 
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 0a97446..d983778 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,17 +188,34 @@  extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
 			struct kvm_vcpu *vcpu,
 			unsigned long ea, unsigned long dsisr);
+extern int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
+				      struct kvmppc_pte *gpte, u64 root,
+				      u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
 			struct kvmppc_pte *gpte, u64 table,
 			int table_index, u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 			struct kvmppc_pte *gpte, bool data, bool iswrite);
+extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
+				    bool writing, unsigned long gpa,
+				    unsigned int lpid);
+extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
+				unsigned long gpa,
+				struct kvm_memory_slot *memslot,
+				bool writing, bool kvm_ro,
+				pte_t *inserted_pte, unsigned int *levelp);
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
 extern void kvmppc_free_radix(struct kvm *kvm);
+extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
+				      unsigned int lpid);
 extern int kvmppc_radix_init(void);
 extern void kvmppc_radix_exit(void);
 extern int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			unsigned long gfn);
+extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
+			     unsigned long gpa, unsigned int shift,
+			     struct kvm_memory_slot *memslot,
+			     unsigned int lpid);
 extern int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			unsigned long gfn);
 extern int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
@@ -289,6 +306,8 @@  void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
 long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
 int kvmhv_emulate_priv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 			unsigned int instr);
+int kvmhv_handle_nested_trap(struct kvm_run *run, struct kvm_vcpu *vcpu,
+			     struct task_struct *tsk);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 6d67b6a..5496152 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -549,6 +549,10 @@  static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
 
+extern int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
+			     unsigned long gpa, unsigned int level,
+			     unsigned long mmu_seq, unsigned int lpid);
+
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ceb9f20..64c4807 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -367,7 +367,9 @@  struct kvmppc_pte {
 	bool may_write		: 1;
 	bool may_execute	: 1;
 	unsigned long wimg;
+	unsigned long rc;
 	u8 page_size;		/* MMU_PAGE_xxx */
+	u16 page_shift;
 };
 
 struct kvmppc_mmu {
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index bd06a95..ee6f493 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -29,43 +29,16 @@ 
  */
 static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
 
-/*
- * Used to walk a partition or process table radix tree in guest memory
- * Note: We exploit the fact that a partition table and a process
- * table have the same layout, a partition-scoped page table and a
- * process-scoped page table have the same layout, and the 2nd
- * doubleword of a partition table entry has the same layout as
- * the PTCR register.
- */
-int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
-				     struct kvmppc_pte *gpte, u64 table,
-				     int table_index, u64 *pte_ret_p)
+int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
+			       struct kvmppc_pte *gpte, u64 root,
+			       u64 *pte_ret_p)
 {
 	struct kvm *kvm = vcpu->kvm;
 	int ret, level, ps;
-	unsigned long ptbl, root;
-	unsigned long rts, bits, offset;
-	unsigned long size, index;
-	struct prtb_entry entry;
+	unsigned long rts, bits, offset, index;
 	u64 pte, base, gpa;
 	__be64 rpte;
 
-	if ((table & PRTS_MASK) > 24)
-		return -EINVAL;
-	size = 1ul << ((table & PRTS_MASK) + 12);
-
-	/* Is the table big enough to contain this entry? */
-	if ((table_index * sizeof(entry)) >= size)
-		return -EINVAL;
-
-	/* Read the table to find the root of the radix tree */
-	ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
-	ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
-	if (ret)
-		return ret;
-
-	/* Root is stored in the first double word */
-	root = be64_to_cpu(entry.prtb0);
 	rts = ((root & RTS1_MASK) >> (RTS1_SHIFT - 3)) |
 		((root & RTS2_MASK) >> RTS2_SHIFT);
 	bits = root & RPDS_MASK;
@@ -79,6 +52,7 @@  int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
 
 	/* Walk each level of the radix tree */
 	for (level = 3; level >= 0; --level) {
+		u64 addr;
 		/* Check a valid size */
 		if (level && bits != p9_supported_radix_bits[level])
 			return -EINVAL;
@@ -90,10 +64,13 @@  int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
 		if (base & ((1UL << (bits + 3)) - 1))
 			return -EINVAL;
 		/* Read the entry from guest memory */
-		ret = kvm_read_guest(kvm, base + (index * sizeof(rpte)),
-				     &rpte, sizeof(rpte));
-		if (ret)
+		addr = base + (index * sizeof(rpte));
+		ret = kvm_read_guest(kvm, addr, &rpte, sizeof(rpte));
+		if (ret) {
+			if (pte_ret_p)
+				*pte_ret_p = addr;
 			return ret;
+		}
 		pte = __be64_to_cpu(rpte);
 		if (!(pte & _PAGE_PRESENT))
 			return -ENOENT;
@@ -119,6 +96,7 @@  int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
 		if (offset == mmu_psize_defs[ps].shift)
 			break;
 	gpte->page_size = ps;
+	gpte->page_shift = offset;
 
 	gpte->eaddr = eaddr;
 	gpte->raddr = gpa;
@@ -128,12 +106,51 @@  int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
 	gpte->may_write = !!(pte & _PAGE_WRITE);
 	gpte->may_execute = !!(pte & _PAGE_EXEC);
 
+	gpte->rc = pte & (_PAGE_ACCESSED | _PAGE_DIRTY);
+
 	if (pte_ret_p)
 		*pte_ret_p = pte;
 
 	return 0;
 }
 
+/*
+ * Used to walk a partition or process table radix tree in guest memory
+ * Note: We exploit the fact that a partition table and a process
+ * table have the same layout, a partition-scoped page table and a
+ * process-scoped page table have the same layout, and the 2nd
+ * doubleword of a partition table entry has the same layout as
+ * the PTCR register.
+ */
+int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
+				     struct kvmppc_pte *gpte, u64 table,
+				     int table_index, u64 *pte_ret_p)
+{
+	struct kvm *kvm = vcpu->kvm;
+	int ret;
+	unsigned long size, ptbl, root;
+	struct prtb_entry entry;
+
+	if ((table & PRTS_MASK) > 24)
+		return -EINVAL;
+	size = 1ul << ((table & PRTS_MASK) + 12);
+
+	/* Is the table big enough to contain this entry? */
+	if ((table_index * sizeof(entry)) >= size)
+		return -EINVAL;
+
+	/* Read the table to find the root of the radix tree */
+	ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
+	ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
+	if (ret)
+		return ret;
+
+	/* Root is stored in the first double word */
+	root = be64_to_cpu(entry.prtb0);
+
+	return kvmppc_mmu_walk_radix_tree(vcpu, eaddr, gpte, root, pte_ret_p);
+}
+
 int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 			   struct kvmppc_pte *gpte, bool data, bool iswrite)
 {
@@ -181,7 +198,7 @@  int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 }
 
 static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
-				    unsigned int pshift)
+				    unsigned int pshift, unsigned int lpid)
 {
 	unsigned long psize = PAGE_SIZE;
 
@@ -189,12 +206,12 @@  static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
 		psize = 1UL << pshift;
 
 	addr &= ~(psize - 1);
-	radix__flush_tlb_lpid_page(kvm->arch.lpid, addr, psize);
+	radix__flush_tlb_lpid_page(lpid, addr, psize);
 }
 
-static void kvmppc_radix_flush_pwc(struct kvm *kvm)
+static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned int lpid)
 {
-	radix__flush_pwc_lpid(kvm->arch.lpid);
+	radix__flush_pwc_lpid(lpid);
 }
 
 static unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
@@ -239,16 +256,17 @@  static void kvmppc_pmd_free(pmd_t *pmdp)
 	kmem_cache_free(kvm_pmd_cache, pmdp);
 }
 
-static void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
-			     unsigned long gpa, unsigned int shift,
-			     struct kvm_memory_slot *memslot)
+void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
+		      unsigned long gpa, unsigned int shift,
+		      struct kvm_memory_slot *memslot,
+		      unsigned int lpid)
 
 {
 	unsigned long old;
 
 	old = kvmppc_radix_update_pte(kvm, pte, ~0UL, 0, gpa, shift);
-	kvmppc_radix_tlbie_page(kvm, gpa, shift);
-	if (old & _PAGE_DIRTY) {
+	kvmppc_radix_tlbie_page(kvm, gpa, shift, lpid);
+	if ((old & _PAGE_DIRTY) && (lpid == kvm->arch.lpid)) {
 		unsigned long gfn = gpa >> PAGE_SHIFT;
 		unsigned long page_size = PAGE_SIZE;
 
@@ -271,7 +289,8 @@  static void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
  * and emit a warning if encountered, but there may already be data
  * corruption due to the unexpected mappings.
  */
-static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t *pte, bool full)
+static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t *pte, bool full,
+				  unsigned int lpid)
 {
 	if (full) {
 		memset(pte, 0, sizeof(long) << PTE_INDEX_SIZE);
@@ -285,14 +304,15 @@  static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t *pte, bool full)
 			WARN_ON_ONCE(1);
 			kvmppc_unmap_pte(kvm, p,
 					 pte_pfn(*p) << PAGE_SHIFT,
-					 PAGE_SHIFT, NULL);
+					 PAGE_SHIFT, NULL, lpid);
 		}
 	}
 
 	kvmppc_pte_free(pte);
 }
 
-static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full)
+static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full,
+				  unsigned int lpid)
 {
 	unsigned long im;
 	pmd_t *p = pmd;
@@ -307,20 +327,21 @@  static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full)
 				WARN_ON_ONCE(1);
 				kvmppc_unmap_pte(kvm, (pte_t *)p,
 					 pte_pfn(*(pte_t *)p) << PAGE_SHIFT,
-					 PMD_SHIFT, NULL);
+					 PMD_SHIFT, NULL, lpid);
 			}
 		} else {
 			pte_t *pte;
 
 			pte = pte_offset_map(p, 0);
-			kvmppc_unmap_free_pte(kvm, pte, full);
+			kvmppc_unmap_free_pte(kvm, pte, full, lpid);
 			pmd_clear(p);
 		}
 	}
 	kvmppc_pmd_free(pmd);
 }
 
-static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t *pud)
+static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t *pud,
+				  unsigned int lpid)
 {
 	unsigned long iu;
 	pud_t *p = pud;
@@ -334,36 +355,42 @@  static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t *pud)
 			pmd_t *pmd;
 
 			pmd = pmd_offset(p, 0);
-			kvmppc_unmap_free_pmd(kvm, pmd, true);
+			kvmppc_unmap_free_pmd(kvm, pmd, true, lpid);
 			pud_clear(p);
 		}
 	}
 	pud_free(kvm->mm, pud);
 }
 
-void kvmppc_free_radix(struct kvm *kvm)
+void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd, unsigned int lpid)
 {
 	unsigned long ig;
-	pgd_t *pgd;
 
-	if (!kvm->arch.pgtable)
+	if (!pgd)
 		return;
-	pgd = kvm->arch.pgtable;
 	for (ig = 0; ig < PTRS_PER_PGD; ++ig, ++pgd) {
 		pud_t *pud;
 
 		if (!pgd_present(*pgd))
 			continue;
 		pud = pud_offset(pgd, 0);
-		kvmppc_unmap_free_pud(kvm, pud);
+		kvmppc_unmap_free_pud(kvm, pud, lpid);
 		pgd_clear(pgd);
 	}
-	pgd_free(kvm->mm, kvm->arch.pgtable);
-	kvm->arch.pgtable = NULL;
+}
+
+void kvmppc_free_radix(struct kvm *kvm)
+{
+	if (kvm->arch.pgtable) {
+		kvmppc_free_pgtable_radix(kvm, kvm->arch.pgtable,
+					  kvm->arch.lpid);
+		pgd_free(kvm->mm, kvm->arch.pgtable);
+		kvm->arch.pgtable = NULL;
+	}
 }
 
 static void kvmppc_unmap_free_pmd_entry_table(struct kvm *kvm, pmd_t *pmd,
-					      unsigned long gpa)
+					unsigned long gpa, unsigned int lpid)
 {
 	pte_t *pte = pte_offset_kernel(pmd, 0);
 
@@ -373,13 +400,13 @@  static void kvmppc_unmap_free_pmd_entry_table(struct kvm *kvm, pmd_t *pmd,
 	 * flushing the PWC again.
 	 */
 	pmd_clear(pmd);
-	kvmppc_radix_flush_pwc(kvm);
+	kvmppc_radix_flush_pwc(kvm, lpid);
 
-	kvmppc_unmap_free_pte(kvm, pte, false);
+	kvmppc_unmap_free_pte(kvm, pte, false, lpid);
 }
 
 static void kvmppc_unmap_free_pud_entry_table(struct kvm *kvm, pud_t *pud,
-					unsigned long gpa)
+					unsigned long gpa, unsigned int lpid)
 {
 	pmd_t *pmd = pmd_offset(pud, 0);
 
@@ -389,9 +416,9 @@  static void kvmppc_unmap_free_pud_entry_table(struct kvm *kvm, pud_t *pud,
 	 * so can be freed without flushing the PWC again.
 	 */
 	pud_clear(pud);
-	kvmppc_radix_flush_pwc(kvm);
+	kvmppc_radix_flush_pwc(kvm, lpid);
 
-	kvmppc_unmap_free_pmd(kvm, pmd, false);
+	kvmppc_unmap_free_pmd(kvm, pmd, false, lpid);
 }
 
 /*
@@ -403,9 +430,9 @@  static void kvmppc_unmap_free_pud_entry_table(struct kvm *kvm, pud_t *pud,
  */
 #define PTE_BITS_MUST_MATCH (~(_PAGE_WRITE | _PAGE_DIRTY | _PAGE_ACCESSED))
 
-static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
-			     unsigned long gpa, unsigned int level,
-			     unsigned long mmu_seq)
+int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
+		      unsigned long gpa, unsigned int level,
+		      unsigned long mmu_seq, unsigned int lpid)
 {
 	pgd_t *pgd;
 	pud_t *pud, *new_pud = NULL;
@@ -458,7 +485,7 @@  static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 			WARN_ON_ONCE((pud_val(*pud) ^ pte_val(pte)) &
 							PTE_BITS_MUST_MATCH);
 			kvmppc_radix_update_pte(kvm, (pte_t *)pud,
-					      0, pte_val(pte), hgpa, PUD_SHIFT);
+					0, pte_val(pte), hgpa, PUD_SHIFT);
 			ret = 0;
 			goto out_unlock;
 		}
@@ -471,7 +498,8 @@  static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 			goto out_unlock;
 		}
 		/* Valid 1GB page here already, remove it */
-		kvmppc_unmap_pte(kvm, (pte_t *)pud, hgpa, PUD_SHIFT, NULL);
+		kvmppc_unmap_pte(kvm, (pte_t *)pud, hgpa, PUD_SHIFT, NULL,
+				 lpid);
 	}
 	if (level == 2) {
 		if (!pud_none(*pud)) {
@@ -480,7 +508,7 @@  static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 			 * install a large page, so remove and free the page
 			 * table page.
 			 */
-			kvmppc_unmap_free_pud_entry_table(kvm, pud, gpa);
+			kvmppc_unmap_free_pud_entry_table(kvm, pud, gpa, lpid);
 		}
 		kvmppc_radix_set_pte_at(kvm, gpa, (pte_t *)pud, pte);
 		ret = 0;
@@ -506,7 +534,7 @@  static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 			WARN_ON_ONCE((pmd_val(*pmd) ^ pte_val(pte)) &
 							PTE_BITS_MUST_MATCH);
 			kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd),
-					      0, pte_val(pte), lgpa, PMD_SHIFT);
+					0, pte_val(pte), lgpa, PMD_SHIFT);
 			ret = 0;
 			goto out_unlock;
 		}
@@ -520,7 +548,8 @@  static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 			goto out_unlock;
 		}
 		/* Valid 2MB page here already, remove it */
-		kvmppc_unmap_pte(kvm, pmdp_ptep(pmd), lgpa, PMD_SHIFT, NULL);
+		kvmppc_unmap_pte(kvm, pmdp_ptep(pmd), lgpa, PMD_SHIFT, NULL,
+				 lpid);
 	}
 	if (level == 1) {
 		if (!pmd_none(*pmd)) {
@@ -529,7 +558,7 @@  static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 			 * install a large page, so remove and free the page
 			 * table page.
 			 */
-			kvmppc_unmap_free_pmd_entry_table(kvm, pmd, gpa);
+			kvmppc_unmap_free_pmd_entry_table(kvm, pmd, gpa, lpid);
 		}
 		kvmppc_radix_set_pte_at(kvm, gpa, pmdp_ptep(pmd), pte);
 		ret = 0;
@@ -569,8 +598,8 @@  static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 	return ret;
 }
 
-static bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
-				    bool writing, unsigned long gpa)
+bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing,
+			     unsigned long gpa, unsigned int lpid)
 {
 	unsigned long pgflags;
 	unsigned int shift;
@@ -597,11 +626,11 @@  static bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
 	return false;
 }
 
-static int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
-				unsigned long gpa,
-				struct kvm_memory_slot *memslot,
-				bool writing, bool kvm_ro,
-				pte_t *inserted_pte, unsigned int *levelp)
+int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
+				   unsigned long gpa,
+				   struct kvm_memory_slot *memslot,
+				   bool writing, bool kvm_ro,
+				   pte_t *inserted_pte, unsigned int *levelp)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct page *page = NULL;
@@ -683,7 +712,7 @@  static int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 
 	/* Allocate space in the tree and write the PTE */
 	ret = kvmppc_create_pte(kvm, kvm->arch.pgtable, pte, gpa, level,
-				mmu_seq);
+				mmu_seq, kvm->arch.lpid);
 	if (inserted_pte)
 		*inserted_pte = pte;
 	if (levelp)
@@ -758,7 +787,7 @@  int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	if (dsisr & DSISR_SET_RC) {
 		spin_lock(&kvm->mmu_lock);
 		if (kvmppc_hv_handle_set_rc(kvm, kvm->arch.pgtable,
-					    writing, gpa))
+					    writing, gpa, kvm->arch.lpid))
 			dsisr &= ~DSISR_SET_RC;
 		spin_unlock(&kvm->mmu_lock);
 
@@ -786,7 +815,8 @@  int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep))
-		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot);
+		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
+				 kvm->arch.lpid);
 	return 0;				
 }
 
@@ -841,7 +871,7 @@  static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 			ret = 1 << (shift - PAGE_SHIFT);
 		kvmppc_radix_update_pte(kvm, ptep, _PAGE_DIRTY, 0,
 					gpa, shift);
-		kvmppc_radix_tlbie_page(kvm, gpa, shift);
+		kvmppc_radix_tlbie_page(kvm, gpa, shift, kvm->arch.lpid);
 	}
 	return ret;
 }
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index f8f9fab..c160df3 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -12,10 +12,13 @@ 
 #include <linux/kvm_host.h>
 
 #include <asm/kvm_ppc.h>
+#include <asm/kvm_book3s.h>
 #include <asm/mmu.h>
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
+#include <asm/pte-walk.h>
 #include <asm/disassemble.h>
+#include <asm/reg.h>
 
 static struct patb_entry *pseries_partition_tb;
 
@@ -393,10 +396,20 @@  struct kvm_nested_guest *kvmhv_alloc_nested(struct kvm *kvm, unsigned int lpid)
  */
 static void kvmhv_release_nested(struct kvm_nested_guest *gp)
 {
+	struct kvm *kvm = gp->l1_host;
+
+	if (gp->shadow_pgtable) {
+		/*
+		 * No vcpu is using this struct and no call to
+		 * kvmhv_remove_nest_rmap can find this struct,
+		 * so we don't need to hold kvm->mmu_lock.
+		 */
+		kvmppc_free_pgtable_radix(kvm, gp->shadow_pgtable,
+					  gp->shadow_lpid);
+		pgd_free(kvm->mm, gp->shadow_pgtable);
+	}
 	kvmhv_set_ptbl_entry(gp->shadow_lpid, 0, 0);
 	kvmppc_free_lpid(gp->shadow_lpid);
-	if (gp->shadow_pgtable)
-		pgd_free(gp->l1_host->mm, gp->shadow_pgtable);
 	kfree(gp);
 }
 
@@ -453,6 +466,12 @@  void kvmhv_release_all_nested(struct kvm *kvm)
 /* caller must hold gp->tlb_lock */
 void kvmhv_flush_nested(struct kvm_nested_guest *gp)
 {
+	struct kvm *kvm = gp->l1_host;
+
+	spin_lock(&kvm->mmu_lock);
+	kvmppc_free_pgtable_radix(kvm, gp->shadow_pgtable, gp->shadow_lpid);
+	spin_unlock(&kvm->mmu_lock);
+	radix__flush_tlb_lpid(gp->shadow_lpid);
 	kvmhv_update_ptbl_cache(gp);
 	if (gp->l1_gr_to_hr == 0)
 		kvmhv_remove_nested(gp);
@@ -512,9 +531,28 @@  void kvmhv_put_nested(struct kvm_nested_guest *gp)
 		kvmhv_release_nested(gp);
 }
 
-long kvmhv_nested_page_fault(struct kvm_vcpu *vcpu)
+static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
+					struct kvm_nested_guest *gp,
+					long gpa, int *shift_ret)
 {
-	return RESUME_HOST;
+	struct kvm *kvm = vcpu->kvm;
+	bool ret = false;
+	pte_t *ptep;
+	int shift;
+
+	spin_lock(&kvm->mmu_lock);
+	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
+	if (!shift)
+		shift = PAGE_SHIFT;
+	if (ptep && pte_present(*ptep)) {
+		kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid);
+		ret = true;
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	if (shift_ret)
+		*shift_ret = shift;
+	return ret;
 }
 
 static int kvmhv_emulate_priv_mtspr(struct kvm_run *run, struct kvm_vcpu *vcpu,
@@ -570,3 +608,291 @@  int kvmhv_emulate_priv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 
 	return rc;
 }
+
+/* Used to convert a nested guest real address to a L1 guest real address */
+static int kvmhv_translate_addr_nested(struct kvm_vcpu *vcpu,
+				       struct kvm_nested_guest *gp,
+				       unsigned long n_gpa, unsigned long dsisr,
+				       struct kvmppc_pte *gpte_p)
+{
+	u64 fault_addr, flags = dsisr & DSISR_ISSTORE;
+	int ret;
+
+	ret = kvmppc_mmu_walk_radix_tree(vcpu, n_gpa, gpte_p, gp->l1_gr_to_hr,
+					 &fault_addr);
+
+	if (ret) {
+		/* We didn't find a pte */
+		if (ret == -EINVAL) {
+			/* Unsupported mmu config */
+			flags |= DSISR_UNSUPP_MMU;
+		} else if (ret == -ENOENT) {
+			/* No translation found */
+			flags |= DSISR_NOHPTE;
+		} else if (ret == -EFAULT) {
+			/* Couldn't access L1 real address */
+			flags |= DSISR_PRTABLE_FAULT;
+			vcpu->arch.fault_gpa = fault_addr;
+		} else {
+			/* Unknown error */
+			return ret;
+		}
+		goto resume_host;
+	} else {
+		/* We found a pte -> check permissions */
+		if (dsisr & DSISR_ISSTORE) {
+			/* Can we write? */
+			if (!gpte_p->may_write) {
+				flags |= DSISR_PROTFAULT;
+				goto resume_host;
+			}
+		} else if (vcpu->arch.trap == BOOK3S_INTERRUPT_H_INST_STORAGE) {
+			/* Can we execute? */
+			if (!gpte_p->may_execute) {
+				flags |= SRR1_ISI_N_OR_G;
+				goto resume_host;
+			}
+		} else {
+			/* Can we read? */
+			if (!gpte_p->may_read && !gpte_p->may_write) {
+				flags |= DSISR_PROTFAULT;
+				goto resume_host;
+			}
+		}
+	}
+
+	return 0;
+
+resume_host:
+	vcpu->arch.fault_dsisr = flags;
+	if (vcpu->arch.trap == BOOK3S_INTERRUPT_H_INST_STORAGE) {
+		vcpu->arch.shregs.msr &= ~0x783f0000ul;
+		vcpu->arch.shregs.msr |= flags;
+	}
+	return RESUME_HOST;
+}
+
+static long kvmhv_handle_nested_set_rc(struct kvm_vcpu *vcpu,
+				       struct kvm_nested_guest *gp,
+				       unsigned long n_gpa,
+				       struct kvmppc_pte gpte,
+				       unsigned long dsisr)
+{
+	struct kvm *kvm = vcpu->kvm;
+	bool writing = !!(dsisr & DSISR_ISSTORE);
+	u64 pgflags;
+	bool ret;
+
+	/* Are the rc bits set in the L1 partition scoped pte? */
+	pgflags = _PAGE_ACCESSED;
+	if (writing)
+		pgflags |= _PAGE_DIRTY;
+	if (pgflags & ~gpte.rc)
+		return RESUME_HOST;
+
+	spin_lock(&kvm->mmu_lock);
+	/* Set the rc bit in the pte of our (L0) pgtable for the L1 guest */
+	ret = kvmppc_hv_handle_set_rc(kvm, kvm->arch.pgtable, writing,
+				     gpte.raddr, kvm->arch.lpid);
+	spin_unlock(&kvm->mmu_lock);
+	if (!ret)
+		return -EINVAL;
+
+	/* Set the rc bit in the pte of the shadow_pgtable for the nest guest */
+	ret = kvmppc_hv_handle_set_rc(kvm, gp->shadow_pgtable, writing, n_gpa,
+				      gp->shadow_lpid);
+	if (!ret)
+		return -EINVAL;
+	return 0;
+}
+
+static inline int kvmppc_radix_level_to_shift(int level)
+{
+	switch (level) {
+	case 2:
+		return PUD_SHIFT;
+	case 1:
+		return PMD_SHIFT;
+	default:
+		return PAGE_SHIFT;
+	}
+}
+
+static inline int kvmppc_radix_shift_to_level(int shift)
+{
+	if (shift == PUD_SHIFT)
+		return 2;
+	if (shift == PMD_SHIFT)
+		return 1;
+	if (shift == PAGE_SHIFT)
+		return 0;
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+/* called with gp->tlb_lock held */
+static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
+					  struct kvm_nested_guest *gp)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_memory_slot *memslot;
+	struct kvmppc_pte gpte;
+	pte_t pte, *pte_p;
+	unsigned long mmu_seq;
+	unsigned long dsisr = vcpu->arch.fault_dsisr;
+	unsigned long ea = vcpu->arch.fault_dar;
+	unsigned long n_gpa, gpa, gfn, perm = 0UL;
+	unsigned int shift, l1_shift, level;
+	bool writing = !!(dsisr & DSISR_ISSTORE);
+	bool kvm_ro = false;
+	long int ret;
+
+	if (!gp->l1_gr_to_hr) {
+		kvmhv_update_ptbl_cache(gp);
+		if (!gp->l1_gr_to_hr)
+			return RESUME_HOST;
+	}
+
+	/* Convert the nested guest real address into a L1 guest real address */
+
+	n_gpa = vcpu->arch.fault_gpa & ~0xF000000000000FFFULL;
+	if (!(dsisr & DSISR_PRTABLE_FAULT))
+		n_gpa |= ea & 0xFFF;
+	ret = kvmhv_translate_addr_nested(vcpu, gp, n_gpa, dsisr, &gpte);
+
+	/*
+	 * If the hardware found a translation but we don't now have a usable
+	 * translation in the l1 partition-scoped tree, remove the shadow pte
+	 * and let the guest retry.
+	 */
+	if (ret == RESUME_HOST &&
+	    (dsisr & (DSISR_PROTFAULT | DSISR_BADACCESS | DSISR_NOEXEC_OR_G |
+		      DSISR_BAD_COPYPASTE)))
+		goto inval;
+	if (ret)
+		return ret;
+
+	/* Failed to set the reference/change bits */
+	if (dsisr & DSISR_SET_RC) {
+		ret = kvmhv_handle_nested_set_rc(vcpu, gp, n_gpa, gpte, dsisr);
+		if (ret == RESUME_HOST)
+			return ret;
+		if (ret)
+			goto inval;
+		dsisr &= ~DSISR_SET_RC;
+		if (!(dsisr & (DSISR_BAD_FAULT_64S | DSISR_NOHPTE |
+			       DSISR_PROTFAULT)))
+			return RESUME_GUEST;
+	}
+
+	/*
+	 * We took an HISI or HDSI while we were running a nested guest which
+	 * means we have no partition scoped translation for that. This means
+	 * we need to insert a pte for the mapping into our shadow_pgtable.
+	 */
+
+	l1_shift = gpte.page_shift;
+	if (l1_shift < PAGE_SHIFT) {
+		/* We don't support l1 using a page size smaller than our own */
+		pr_err("KVM: L1 guest page shift (%d) less than our own (%d)\n",
+			l1_shift, PAGE_SHIFT);
+		return -EINVAL;
+	}
+	gpa = gpte.raddr;
+	gfn = gpa >> PAGE_SHIFT;
+
+	/* 1. Get the corresponding host memslot */
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
+		if (dsisr & (DSISR_PRTABLE_FAULT | DSISR_BADACCESS)) {
+			/* unusual error -> reflect to the guest as a DSI */
+			kvmppc_core_queue_data_storage(vcpu, ea, dsisr);
+			return RESUME_GUEST;
+		}
+		/* passthrough of emulated MMIO case... */
+		pr_err("emulated MMIO passthrough?\n");
+		return -EINVAL;
+	}
+	if (memslot->flags & KVM_MEM_READONLY) {
+		if (writing) {
+			/* Give the guest a DSI */
+			kvmppc_core_queue_data_storage(vcpu, ea,
+					DSISR_ISSTORE | DSISR_PROTFAULT);
+			return RESUME_GUEST;
+		}
+		kvm_ro = true;
+	}
+
+	/* 2. Find the host pte for this L1 guest real address */
+
+	/* Used to check for invalidations in progress */
+	mmu_seq = kvm->mmu_notifier_seq;
+	smp_rmb();
+
+	/* See if can find translation in our partition scoped tables for L1 */
+	pte = __pte(0);
+	spin_lock(&kvm->mmu_lock);
+	pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
+	if (!shift)
+		shift = PAGE_SHIFT;
+	if (pte_p)
+		pte = *pte_p;
+	spin_unlock(&kvm->mmu_lock);
+
+	if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) {
+		/* No suitable pte found -> try to insert a mapping */
+		ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot,
+					writing, kvm_ro, &pte, &level);
+		if (ret == -EAGAIN)
+			return RESUME_GUEST;
+		else if (ret)
+			return ret;
+		shift = kvmppc_radix_level_to_shift(level);
+	}
+
+	/* 3. Compute the pte we need to insert for nest_gpa -> host r_addr */
+
+	/* The permissions is the combination of the host and l1 guest ptes */
+	perm |= gpte.may_read ? 0UL : _PAGE_READ;
+	perm |= gpte.may_write ? 0UL : _PAGE_WRITE;
+	perm |= gpte.may_execute ? 0UL : _PAGE_EXEC;
+	pte = __pte(pte_val(pte) & ~perm);
+
+	/* What size pte can we insert? */
+	if (shift > l1_shift) {
+		u64 mask;
+		unsigned int actual_shift = PAGE_SHIFT;
+		if (PMD_SHIFT < l1_shift)
+			actual_shift = PMD_SHIFT;
+		mask = (1UL << shift) - (1UL << actual_shift);
+		pte = __pte(pte_val(pte) | (gpa & mask));
+		shift = actual_shift;
+	}
+	level = kvmppc_radix_shift_to_level(shift);
+	n_gpa &= ~((1UL << shift) - 1);
+
+	/* 4. Insert the pte into our shadow_pgtable */
+
+	ret = kvmppc_create_pte(kvm, gp->shadow_pgtable, pte, n_gpa, level,
+				mmu_seq, gp->shadow_lpid);
+	if (ret == -EAGAIN)
+		ret = RESUME_GUEST;	/* Let the guest try again */
+
+	return ret;
+
+ inval:
+	kvmhv_invalidate_shadow_pte(vcpu, gp, n_gpa, NULL);
+	return RESUME_GUEST;
+}
+
+long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu)
+{
+	struct kvm_nested_guest *gp = vcpu->arch.nested;
+	long int ret;
+
+	mutex_lock(&gp->tlb_lock);
+	ret = __kvmhv_nested_page_fault(vcpu, gp);
+	mutex_unlock(&gp->tlb_lock);
+	return ret;
+}
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index fef3e1e..4c4dfc4 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -833,6 +833,15 @@  EXPORT_SYMBOL_GPL(radix__flush_pwc_lpid);
 /*
  * Flush partition scoped translations from LPID (=LPIDR)
  */
+void radix__flush_tlb_lpid(unsigned int lpid)
+{
+	_tlbie_lpid(lpid, RIC_FLUSH_ALL);
+}
+EXPORT_SYMBOL_GPL(radix__flush_tlb_lpid);
+
+/*
+ * Flush partition scoped translations from LPID (=LPIDR)
+ */
 void radix__local_flush_tlb_lpid(unsigned int lpid)
 {
 	_tlbiel_lpid(lpid, RIC_FLUSH_ALL);