[v1,00/11] KVM: s390: pv: implement lazy destroy

Message ID	20210517200758.22593-1-imbrenda@linux.ibm.com (mailing list archive)
Headers
Return-Path: <kvm-owner@kernel.org>
From: Claudio Imbrenda <imbrenda@linux.ibm.com>
To: kvm@vger.kernel.org
Cc: cohuck@redhat.com, borntraeger@de.ibm.com, frankja@linux.ibm.com, thuth@redhat.com, pasic@linux.ibm.com, david@redhat.com, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
Date: Mon, 17 May 2021 22:07:47 +0200
Message-Id: <20210517200758.22593-1-imbrenda@linux.ibm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Claudio Imbrenda May 17, 2021, 8:07 p.m. UTC

Previously, when a protected VM was rebooted or when it was shut down,
its memory was made unprotected, and then the protected VM itself was
destroyed. Looping over the whole address space can take some time,
considering the overhead of the various Ultravisor Calls (UVCs).  This
means that a reboot or a shutdown would take a potentially long amount
of time, depending on the amount of used memory.

This patchseries implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory is cleared in
background, allowing the guest to restart or terminate significantly
faster than before.

There are 2 possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)

For the reboot case, the reference count of the mm is increased, and
then a background thread is started to clean up. Once the thread has
gone through the whole address space, the protected VM is actually
destroyed.

For the shutdown case, a list of pages to be destroyed is formed when
the mm is torn down. Instead of just unmapping the pages when the
address space is being torn down, they are also set aside. Later when
KVM cleans up the VM, a thread is started to clean up the pages from
the list.

This means that the same address space can have memory belonging to
more than one protected guest, although only one will be running; the
others will in fact not even have any CPUs.

Claudio Imbrenda (11):
  KVM: s390: pv: leak the ASCE page when destroy fails
  KVM: s390: pv: properly handle page flags for protected guests
  KVM: s390: pv: handle secure storage violations for protected guests
  KVM: s390: pv: handle secure storage exceptions for normal guests
  KVM: s390: pv: refactor s390_reset_acc
  KVM: s390: pv: usage counter instead of flag
  KVM: s390: pv: add export before import
  KVM: s390: pv: lazy destroy for reboot
  KVM: s390: pv: extend lazy destroy to handle shutdown
  KVM: s390: pv: module parameter to fence lazy destroy
  KVM: s390: pv: add support for UV feature bits

 arch/s390/boot/uv.c                 |   1 +
 arch/s390/include/asm/gmap.h        |   5 +-
 arch/s390/include/asm/mmu.h         |   3 +
 arch/s390/include/asm/mmu_context.h |   2 +
 arch/s390/include/asm/pgtable.h     |  16 +-
 arch/s390/include/asm/uv.h          |  35 ++++-
 arch/s390/kernel/uv.c               | 133 +++++++++++++++-
 arch/s390/kvm/kvm-s390.c            |   6 +-
 arch/s390/kvm/kvm-s390.h            |   2 +-
 arch/s390/kvm/pv.c                  | 230 ++++++++++++++++++++++++++--
 arch/s390/mm/fault.c                |  22 ++-
 arch/s390/mm/gmap.c                 |  86 +++++++----
 12 files changed, 490 insertions(+), 51 deletions(-)

Cornelia Huck May 18, 2021, 3:05 p.m. UTC | #1

On Mon, 17 May 2021 22:07:47 +0200
Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:

> Previously, when a protected VM was rebooted or when it was shut down,
> its memory was made unprotected, and then the protected VM itself was
> destroyed. Looping over the whole address space can take some time,
> considering the overhead of the various Ultravisor Calls (UVCs).  This
> means that a reboot or a shutdown would take a potentially long amount
> of time, depending on the amount of used memory.
> 
> This patchseries implements a deferred destroy mechanism for protected
> guests. When a protected guest is destroyed, its memory is cleared in
> background, allowing the guest to restart or terminate significantly
> faster than before.
> 
> There are 2 possibilities when a protected VM is torn down:
> * it still has an address space associated (reboot case)
> * it does not have an address space anymore (shutdown case)
> 
> For the reboot case, the reference count of the mm is increased, and
> then a background thread is started to clean up. Once the thread went
> through the whole address space, the protected VM is actually
> destroyed.
> 
> For the shutdown case, a list of pages to be destroyed is formed when
> the mm is torn down. Instead of just unmapping the pages when the
> address space is being torn down, they are also set aside. Later when
> KVM cleans up the VM, a thread is started to clean up the pages from
> the list.

Just to make sure, 'clean up' includes doing uv calls?

> 
> This means that the same address space can have memory belonging to
> more than one protected guest, although only one will be running, the
> others will in fact not even have any CPUs.

Are those set-aside-but-not-yet-cleaned-up pages still possibly
accessible in any way? I would assume that they only belong to the
'zombie' guests, and any new or rebooted guest is a new entity that
needs to get new pages?

Can too many not-yet-cleaned-up pages lead to a (temporary) memory
exhaustion?

Claudio Imbrenda May 18, 2021, 3:36 p.m. UTC | #2

On Tue, 18 May 2021 17:05:37 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> On Mon, 17 May 2021 22:07:47 +0200
> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> 
> > Previously, when a protected VM was rebooted or when it was shut
> > down, its memory was made unprotected, and then the protected VM
> > itself was destroyed. Looping over the whole address space can take
> > some time, considering the overhead of the various Ultravisor Calls
> > (UVCs).  This means that a reboot or a shutdown would take a
> > potentially long amount of time, depending on the amount of used
> > memory.
> > 
> > This patchseries implements a deferred destroy mechanism for
> > protected guests. When a protected guest is destroyed, its memory
> > is cleared in background, allowing the guest to restart or
> > terminate significantly faster than before.
> > 
> > There are 2 possibilities when a protected VM is torn down:
> > * it still has an address space associated (reboot case)
> > * it does not have an address space anymore (shutdown case)
> > 
> > For the reboot case, the reference count of the mm is increased, and
> > then a background thread is started to clean up. Once the thread
> > went through the whole address space, the protected VM is actually
> > destroyed.
> > 
> > For the shutdown case, a list of pages to be destroyed is formed
> > when the mm is torn down. Instead of just unmapping the pages when
> > the address space is being torn down, they are also set aside.
> > Later when KVM cleans up the VM, a thread is started to clean up
> > the pages from the list.  
> 
> Just to make sure, 'clean up' includes doing uv calls?

yes

> > 
> > This means that the same address space can have memory belonging to
> > more than one protected guest, although only one will be running,
> > the others will in fact not even have any CPUs.  
> 
> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> accessible in any way? I would assume that they only belong to the

in case of reboot: yes, they are still in the address space of the
guest, and can be swapped if needed

> 'zombie' guests, and any new or rebooted guest is a new entity that
> needs to get new pages?

the rebooted guest (normal or secure) will re-use the same pages of the
old guest (before or after cleanup, which is the reason for patches 3
and 4)

the KVM guest is not affected in case of reboot, so the userspace
address space is not touched.

> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> exhaustion?

in case of reboot, not much; the pages that were in use are still in use
after the reboot, and they can be swapped.

in case of a shutdown, yes, because the pages are really taken aside
and cleared/destroyed in background. they cannot be swapped. they are
freed immediately as they are processed, to try to mitigate memory
exhaustion scenarios.

in the end, this patchseries is a tradeoff between speed and memory
consumption. the memory needs to be cleared up at some point, and that
requires time.

in cases where this might be an issue, I introduced a new KVM flag to
disable lazy destroy (patch 10)

Christian Borntraeger May 18, 2021, 3:45 p.m. UTC | #3

On 18.05.21 17:36, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 17:05:37 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Mon, 17 May 2021 22:07:47 +0200
>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>
>>> Previously, when a protected VM was rebooted or when it was shut
>>> down, its memory was made unprotected, and then the protected VM
>>> itself was destroyed. Looping over the whole address space can take
>>> some time, considering the overhead of the various Ultravisor Calls
>>> (UVCs).  This means that a reboot or a shutdown would take a
>>> potentially long amount of time, depending on the amount of used
>>> memory.
>>>
>>> This patchseries implements a deferred destroy mechanism for
>>> protected guests. When a protected guest is destroyed, its memory
>>> is cleared in background, allowing the guest to restart or
>>> terminate significantly faster than before.
>>>
>>> There are 2 possibilities when a protected VM is torn down:
>>> * it still has an address space associated (reboot case)
>>> * it does not have an address space anymore (shutdown case)
>>>
>>> For the reboot case, the reference count of the mm is increased, and
>>> then a background thread is started to clean up. Once the thread
>>> went through the whole address space, the protected VM is actually
>>> destroyed.
>>>
>>> For the shutdown case, a list of pages to be destroyed is formed
>>> when the mm is torn down. Instead of just unmapping the pages when
>>> the address space is being torn down, they are also set aside.
>>> Later when KVM cleans up the VM, a thread is started to clean up
>>> the pages from the list.
>>
>> Just to make sure, 'clean up' includes doing uv calls?
> 
> yes
> 
>>>
>>> This means that the same address space can have memory belonging to
>>> more than one protected guest, although only one will be running,
>>> the others will in fact not even have any CPUs.
>>
>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>> accessible in any way? I would assume that they only belong to the
> 
> in case of reboot: yes, they are still in the address space of the
> guest, and can be swapped if needed
> 
>> 'zombie' guests, and any new or rebooted guest is a new entity that
>> needs to get new pages?
> 
> the rebooted guest (normal or secure) will re-use the same pages of the
> old guest (before or after cleanup, which is the reason of patches 3
> and 4)
> 
> the KVM guest is not affected in case of reboot, so the userspace
> address space is not touched.
> 
>> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
>> exhaustion?
> 
> in case of reboot, not much; the pages were in use are still in use
> after the reboot, and they can be swapped.
> 
> in case of a shutdown, yes, because the pages are really taken aside
> and cleared/destroyed in background. they cannot be swapped. they are
> freed immediately as they are processed, to try to mitigate memory
> exhaustion scenarios.
> 
> in the end, this patchseries is a tradeoff between speed and memory
> consumption. the memory needs to be cleared up at some point, and that
> requires time.
> 
> in cases where this might be an issue, I introduced a new KVM flag to
> disable lazy destroy (patch 10)

Maybe we could piggy-back on the OOM-kill notifier and then fall back to
synchronous freeing for some pages?

Cornelia Huck May 18, 2021, 3:52 p.m. UTC | #4

On Tue, 18 May 2021 17:45:18 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 18.05.21 17:36, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:

> >> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> >> exhaustion?  
> > 
> > in case of reboot, not much; the pages were in use are still in use
> > after the reboot, and they can be swapped.
> > 
> > in case of a shutdown, yes, because the pages are really taken aside
> > and cleared/destroyed in background. they cannot be swapped. they are
> > freed immediately as they are processed, to try to mitigate memory
> > exhaustion scenarios.
> > 
> > in the end, this patchseries is a tradeoff between speed and memory
> > consumption. the memory needs to be cleared up at some point, and that
> > requires time.
> > 
> > in cases where this might be an issue, I introduced a new KVM flag to
> > disable lazy destroy (patch 10)  
> 
> Maybe we could piggy-back on the OOM-kill notifier and then fall back to
> synchronous freeing for some pages?

Sounds like a good idea. If delayed cleanup is safe, you probably want
to have the fast shutdown behaviour.

Cornelia Huck May 18, 2021, 4:04 p.m. UTC | #5

On Tue, 18 May 2021 17:36:24 +0200
Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:

> On Tue, 18 May 2021 17:05:37 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Mon, 17 May 2021 22:07:47 +0200
> > Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:

> > > This means that the same address space can have memory belonging to
> > > more than one protected guest, although only one will be running,
> > > the others will in fact not even have any CPUs.    
> > 
> > Are those set-aside-but-not-yet-cleaned-up pages still possibly
> > accessible in any way? I would assume that they only belong to the  
> 
> in case of reboot: yes, they are still in the address space of the
> guest, and can be swapped if needed
> 
> > 'zombie' guests, and any new or rebooted guest is a new entity that
> > needs to get new pages?  
> 
> the rebooted guest (normal or secure) will re-use the same pages of the
> old guest (before or after cleanup, which is the reason of patches 3
> and 4)

Took a look at those patches, makes sense.

> 
> the KVM guest is not affected in case of reboot, so the userspace
> address space is not touched.

'guest' is a bit ambiguous here -- do you mean the vm here, and the
actual guest above?

Claudio Imbrenda May 18, 2021, 4:13 p.m. UTC | #6

On Tue, 18 May 2021 17:45:18 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 18.05.21 17:36, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> >> On Mon, 17 May 2021 22:07:47 +0200
> >> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> >>  
> >>> Previously, when a protected VM was rebooted or when it was shut
> >>> down, its memory was made unprotected, and then the protected VM
> >>> itself was destroyed. Looping over the whole address space can
> >>> take some time, considering the overhead of the various
> >>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
> >>> would take a potentially long amount of time, depending on the
> >>> amount of used memory.
> >>>
> >>> This patchseries implements a deferred destroy mechanism for
> >>> protected guests. When a protected guest is destroyed, its memory
> >>> is cleared in background, allowing the guest to restart or
> >>> terminate significantly faster than before.
> >>>
> >>> There are 2 possibilities when a protected VM is torn down:
> >>> * it still has an address space associated (reboot case)
> >>> * it does not have an address space anymore (shutdown case)
> >>>
> >>> For the reboot case, the reference count of the mm is increased,
> >>> and then a background thread is started to clean up. Once the
> >>> thread went through the whole address space, the protected VM is
> >>> actually destroyed.
> >>>
> >>> For the shutdown case, a list of pages to be destroyed is formed
> >>> when the mm is torn down. Instead of just unmapping the pages when
> >>> the address space is being torn down, they are also set aside.
> >>> Later when KVM cleans up the VM, a thread is started to clean up
> >>> the pages from the list.  
> >>
> >> Just to make sure, 'clean up' includes doing uv calls?  
> > 
> > yes
> >   
> >>>
> >>> This means that the same address space can have memory belonging
> >>> to more than one protected guest, although only one will be
> >>> running, the others will in fact not even have any CPUs.  
> >>
> >> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >> accessible in any way? I would assume that they only belong to the
> >>  
> > 
> > in case of reboot: yes, they are still in the address space of the
> > guest, and can be swapped if needed
> >   
> >> 'zombie' guests, and any new or rebooted guest is a new entity that
> >> needs to get new pages?  
> > 
> > the rebooted guest (normal or secure) will re-use the same pages of
> > the old guest (before or after cleanup, which is the reason of
> > patches 3 and 4)
> > 
> > the KVM guest is not affected in case of reboot, so the userspace
> > address space is not touched.
> >   
> >> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> >> exhaustion?  
> > 
> > in case of reboot, not much; the pages were in use are still in use
> > after the reboot, and they can be swapped.
> > 
> > in case of a shutdown, yes, because the pages are really taken aside
> > and cleared/destroyed in background. they cannot be swapped. they
> > are freed immediately as they are processed, to try to mitigate
> > memory exhaustion scenarios.
> > 
> > in the end, this patchseries is a tradeoff between speed and memory
> > consumption. the memory needs to be cleared up at some point, and
> > that requires time.
> > 
> > in cases where this might be an issue, I introduced a new KVM flag
> > to disable lazy destroy (patch 10)  
> 
> Maybe we could piggy-back on the OOM-kill notifier and then fall back
> to synchronous freeing for some pages?

I'm not sure I follow

once the pages have been set aside, it's too late

while the pages are being set aside, every now and then some memory
needs to be allocated. the allocation is atomic, not allowed to use
emergency reserves, and can fail without warning. if the allocation
fails, we clean up one page and continue, without setting aside
anything (patch 9)

so if the system is low on memory, the lazy destroy should not make the
situation too much worse.

the only issue here is starting a normal process in the host (maybe
a non secure guest) that uses a lot of memory very quickly, right after
a large secure guest has terminated.
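
The fallback described above (try to set a page aside, clean it up on
the spot if the bookkeeping allocation fails) can be sketched in plain
userspace C. This is only an illustration, not the code from patch 9:
all names (struct aside_node, set_aside_or_destroy, SLOTS_PER_NODE, ...)
are hypothetical, and the pluggable allocator stands in for what in the
kernel would presumably be an atomic allocation that may fail silently.

```c
#include <stddef.h>
#include <stdlib.h>

#define SLOTS_PER_NODE 4	/* illustrative; a real node would hold many more */

/* Bookkeeping node that remembers pages set aside for later cleanup. */
struct aside_node {
	struct aside_node *next;
	size_t used;
	unsigned long pfn[SLOTS_PER_NODE];
};

struct teardown {
	struct aside_node *head;	/* pages waiting for the background thread */
	size_t n_deferred;		/* pages currently set aside */
	size_t n_sync;			/* pages cleaned up on the spot */
};

/* Allocation hook, so that a failing allocator can be plugged in. */
typedef void *(*alloc_fn)(size_t size);

/* Simulates memory pressure: every allocation attempt fails. */
static void *alloc_always_fails(size_t size)
{
	(void)size;
	return NULL;
}

/* Stand-in for the synchronous per-page cleanup (UV call plus free). */
static void destroy_page_now(struct teardown *td, unsigned long pfn)
{
	(void)pfn;
	td->n_sync++;
}

/*
 * Set one page aside. A new bookkeeping node is only needed "every now
 * and then"; if that allocation fails, the page is cleaned up right
 * away and nothing is set aside.
 */
static void set_aside_or_destroy(struct teardown *td, unsigned long pfn,
				 alloc_fn try_alloc)
{
	struct aside_node *node = td->head;

	if (!node || node->used == SLOTS_PER_NODE) {
		node = try_alloc(sizeof(*node));
		if (!node) {
			destroy_page_now(td, pfn);
			return;
		}
		node->used = 0;
		node->next = td->head;
		td->head = node;
	}
	node->pfn[node->used++] = pfn;
	td->n_deferred++;
}

/* What the background thread would do: process and release everything. */
static size_t drain_deferred(struct teardown *td)
{
	size_t done = 0;

	while (td->head) {
		struct aside_node *node = td->head;

		td->head = node->next;
		done += node->used;
		free(node);
	}
	td->n_deferred -= done;
	return done;
}
```

Calling set_aside_or_destroy() with malloc as the allocator defers every
page; switching to alloc_always_fails once the current node is full
makes the next page take the synchronous path, so under memory pressure
the set-aside list stops growing.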

Claudio Imbrenda May 18, 2021, 4:19 p.m. UTC | #7

On Tue, 18 May 2021 18:04:11 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 18 May 2021 17:36:24 +0200
> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> 
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> > > On Mon, 17 May 2021 22:07:47 +0200
> > > Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:  
> 
> > > > This means that the same address space can have memory
> > > > belonging to more than one protected guest, although only one
> > > > will be running, the others will in fact not even have any
> > > > CPUs.      
> > > 
> > > Are those set-aside-but-not-yet-cleaned-up pages still possibly
> > > accessible in any way? I would assume that they only belong to
> > > the    
> > 
> > in case of reboot: yes, they are still in the address space of the
> > guest, and can be swapped if needed
> >   
> > > 'zombie' guests, and any new or rebooted guest is a new entity
> > > that needs to get new pages?    
> > 
> > the rebooted guest (normal or secure) will re-use the same pages of
> > the old guest (before or after cleanup, which is the reason of
> > patches 3 and 4)  
> 
> Took a look at those patches, makes sense.
> 
> > 
> > the KVM guest is not affected in case of reboot, so the userspace
> > address space is not touched.  
> 
> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
> actual guest above?
> 

yes this is tricky, because there is the guest OS, which terminates or
reboots, then there is the "secure configuration" entity, handled by the
Ultravisor, and then the KVM VM

when a secure guest reboots, the "secure configuration" is dismantled
(in this case, in a deferred way), and the KVM VM (and its memory) is
not directly affected

what happened before was that the secure configuration was dismantled
synchronously, and then re-created.

now instead, a new secure configuration is created using the same KVM
VM (and thus the same mm), before the old secure configuration has been
completely dismantled. hence the same KVM VM can have multiple secure
configurations associated, sharing the same address space.
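
The relationship between one KVM VM and several secure configurations
can be pictured with a small userspace C model. It is a sketch under
assumptions, not the data structures used in the series: struct vm,
struct secure_config, vm_reboot() and vm_reap_zombies() are made-up
names, and "dismantling" is reduced to freeing a list entry.

```c
#include <stddef.h>
#include <stdlib.h>

/* One "secure configuration" as handled by the Ultravisor (model only). */
struct secure_config {
	unsigned long handle;
	int ncpus;
	struct secure_config *next;
};

/* One KVM VM; its address space is shared by all configurations below. */
struct vm {
	struct secure_config *active;	/* the only one that is running */
	struct secure_config *zombies;	/* not yet dismantled, no CPUs */
	unsigned long next_handle;
	int ncpus;
};

static struct secure_config *config_create(struct vm *vm)
{
	struct secure_config *cfg = calloc(1, sizeof(*cfg));

	if (cfg) {
		cfg->handle = ++vm->next_handle;
		cfg->ncpus = vm->ncpus;
	}
	return cfg;
}

/*
 * (Re)boot: park the old configuration on the zombie list and start a
 * fresh one immediately, without waiting for the old one to be
 * dismantled. Returns 0 on success, -1 if no new configuration could
 * be created.
 */
static int vm_reboot(struct vm *vm)
{
	struct secure_config *fresh = config_create(vm);

	if (!fresh)
		return -1;
	if (vm->active) {
		vm->active->ncpus = 0;
		vm->active->next = vm->zombies;
		vm->zombies = vm->active;
	}
	vm->active = fresh;
	return 0;
}

/* Background cleanup: dismantle every zombie, return how many there were. */
static size_t vm_reap_zombies(struct vm *vm)
{
	size_t reaped = 0;

	while (vm->zombies) {
		struct secure_config *cfg = vm->zombies;

		vm->zombies = cfg->next;
		free(cfg);
		reaped++;
	}
	return reaped;
}
```

After an initial boot and two reboots, the model has one active
configuration with CPUs and two CPU-less entries on the zombie list,
until vm_reap_zombies() has run.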

of course, only the newest one is actually running, the other ones are
"zombies", without CPUs.

Christian Borntraeger May 18, 2021, 4:20 p.m. UTC | #8

On 18.05.21 18:13, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 17:45:18 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 18.05.21 17:36, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 17:05:37 +0200
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>    
>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>   
>>>>> Previously, when a protected VM was rebooted or when it was shut
>>>>> down, its memory was made unprotected, and then the protected VM
>>>>> itself was destroyed. Looping over the whole address space can
>>>>> take some time, considering the overhead of the various
>>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
>>>>> would take a potentially long amount of time, depending on the
>>>>> amount of used memory.
>>>>>
>>>>> This patchseries implements a deferred destroy mechanism for
>>>>> protected guests. When a protected guest is destroyed, its memory
>>>>> is cleared in background, allowing the guest to restart or
>>>>> terminate significantly faster than before.
>>>>>
>>>>> There are 2 possibilities when a protected VM is torn down:
>>>>> * it still has an address space associated (reboot case)
>>>>> * it does not have an address space anymore (shutdown case)
>>>>>
>>>>> For the reboot case, the reference count of the mm is increased,
>>>>> and then a background thread is started to clean up. Once the
>>>>> thread went through the whole address space, the protected VM is
>>>>> actually destroyed.
>>>>>
>>>>> For the shutdown case, a list of pages to be destroyed is formed
>>>>> when the mm is torn down. Instead of just unmapping the pages when
>>>>> the address space is being torn down, they are also set aside.
>>>>> Later when KVM cleans up the VM, a thread is started to clean up
>>>>> the pages from the list.
>>>>
>>>> Just to make sure, 'clean up' includes doing uv calls?
>>>
>>> yes
>>>    
>>>>>
>>>>> This means that the same address space can have memory belonging
>>>>> to more than one protected guest, although only one will be
>>>>> running, the others will in fact not even have any CPUs.
>>>>
>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>> accessible in any way? I would assume that they only belong to the
>>>>   
>>>
>>> in case of reboot: yes, they are still in the address space of the
>>> guest, and can be swapped if needed
>>>    
>>>> 'zombie' guests, and any new or rebooted guest is a new entity that
>>>> needs to get new pages?
>>>
>>> the rebooted guest (normal or secure) will re-use the same pages of
>>> the old guest (before or after cleanup, which is the reason of
>>> patches 3 and 4)
>>>
>>> the KVM guest is not affected in case of reboot, so the userspace
>>> address space is not touched.
>>>    
>>>> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
>>>> exhaustion?
>>>
>>> in case of reboot, not much; the pages were in use are still in use
>>> after the reboot, and they can be swapped.
>>>
>>> in case of a shutdown, yes, because the pages are really taken aside
>>> and cleared/destroyed in background. they cannot be swapped. they
>>> are freed immediately as they are processed, to try to mitigate
>>> memory exhaustion scenarios.
>>>
>>> in the end, this patchseries is a tradeoff between speed and memory
>>> consumption. the memory needs to be cleared up at some point, and
>>> that requires time.
>>>
>>> in cases where this might be an issue, I introduced a new KVM flag
>>> to disable lazy destroy (patch 10)
>>
>> Maybe we could piggy-back on the OOM-kill notifier and then fall back
>> to synchronous freeing for some pages?
> 
> I'm not sure I follow
> 
> once the pages have been set aside, it's too late
> 
> while the pages are being set aside, every now and then some memory
> needs to be allocated. the allocation is atomic, not allowed to use
> emergency reserves, and can fail without warning. if the allocation
> fails, we clean up one page and continue, without setting aside
> anything (patch 9)
> 
> so if the system is low on memory, the lazy destroy should not make the
> situation too much worse.
> 
> the only issue here is starting a normal process in the host (maybe
> a non secure guest) that uses a lot of memory very quickly, right after
> a large secure guest has terminated.

I think page cache page allocations do not need to be atomic.
In that case the kernel might still decide to trigger the OOM killer. We can
let it notify us, free 256 pages synchronously, and avoid the OOM kill.
Have a look at virtio_balloon_oom_notify in virtio-balloon.
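
A rough userspace model of that idea, for illustration only: the names
(struct deferred_pool, lazy_destroy_oom_notify, OOM_SYNC_PAGES) are
hypothetical, and the callback only mimics the shape of a kernel OOM
notifier, which would presumably be registered with
register_oom_notifier() and report the number of freed pages back
through a pointer.

```c
#include <stddef.h>

#define OOM_SYNC_PAGES 256	/* batch size suggested above */

/* Pages that were set aside and still await destruction. */
struct deferred_pool {
	size_t pending;
};

/*
 * Stand-in for destroying and freeing one set-aside page.
 * Returns 1 if a page was processed, 0 if none is left.
 */
static int destroy_one_page(struct deferred_pool *pool)
{
	if (pool->pending == 0)
		return 0;
	pool->pending--;
	return 1;
}

/*
 * OOM callback: synchronously process up to OOM_SYNC_PAGES pages and
 * add the number actually freed to *freed, so that the caller can
 * decide to skip the OOM kill.
 */
static void lazy_destroy_oom_notify(struct deferred_pool *pool,
				    unsigned long *freed)
{
	size_t i;

	for (i = 0; i < OOM_SYNC_PAGES; i++) {
		if (!destroy_one_page(pool))
			break;
		(*freed)++;
	}
}
```

With 1000 pages pending, one call frees 256 and leaves 744 for the
background thread; with an empty pool, *freed is left untouched and the
OOM handling would proceed as usual.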

David Hildenbrand May 18, 2021, 4:22 p.m. UTC | #9

On 18.05.21 18:19, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:04:11 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Tue, 18 May 2021 17:36:24 +0200
>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>
>>> On Tue, 18 May 2021 17:05:37 +0200
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>    
>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>
>>>>> This means that the same address space can have memory
>>>>> belonging to more than one protected guest, although only one
>>>>> will be running, the others will in fact not even have any
>>>>> CPUs.
>>>>
>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>> accessible in any way? I would assume that they only belong to
>>>> the
>>>
>>> in case of reboot: yes, they are still in the address space of the
>>> guest, and can be swapped if needed
>>>    
>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>> that needs to get new pages?
>>>
>>> the rebooted guest (normal or secure) will re-use the same pages of
>>> the old guest (before or after cleanup, which is the reason of
>>> patches 3 and 4)
>>
>> Took a look at those patches, makes sense.
>>
>>>
>>> the KVM guest is not affected in case of reboot, so the userspace
>>> address space is not touched.
>>
>> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
>> actual guest above?
>>
> 
> yes this is tricky, because there is the guest OS, which terminates or
> reboots, then there is the "secure configuration" entity, handled by the
> Ultravisor, and then the KVM VM
> 
> when a secure guest reboots, the "secure configuration" is dismantled
> (in this case, in a deferred way), and the KVM VM (and its memory) is
> not directly affected
> 
> what happened before was that the secure configuration was dismantled
> synchronously, and then re-created.
> 
> now instead, a new secure configuration is created using the same KVM
> VM (and thus the same mm), before the old secure configuration has been
> completely dismantled. hence the same KVM VM can have multiple secure
> configurations associated, sharing the same address space.
> 
> of course, only the newest one is actually running, the other ones are
> "zombies", without CPUs.
> 

Can a guest trigger a DoS?

Claudio Imbrenda May 18, 2021, 4:31 p.m. UTC | #10

On Tue, 18 May 2021 18:22:42 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 18.05.21 18:19, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 18:04:11 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> >> On Tue, 18 May 2021 17:36:24 +0200
> >> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> >>  
> >>> On Tue, 18 May 2021 17:05:37 +0200
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>      
> >>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:  
> >>  
> >>>>> This means that the same address space can have memory
> >>>>> belonging to more than one protected guest, although only one
> >>>>> will be running, the others will in fact not even have any
> >>>>> CPUs.  
> >>>>
> >>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>> accessible in any way? I would assume that they only belong to
> >>>> the  
> >>>
> >>> in case of reboot: yes, they are still in the address space of the
> >>> guest, and can be swapped if needed
> >>>      
> >>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>> that needs to get new pages?  
> >>>
> >>> the rebooted guest (normal or secure) will re-use the same pages
> >>> of the old guest (before or after cleanup, which is the reason of
> >>> patches 3 and 4)  
> >>
> >> Took a look at those patches, makes sense.
> >>  
> >>>
> >>> the KVM guest is not affected in case of reboot, so the userspace
> >>> address space is not touched.  
> >>
> >> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
> >> actual guest above?
> >>  
> > 
> > yes this is tricky, because there is the guest OS, which terminates
> > or reboots, then there is the "secure configuration" entity,
> > handled by the Ultravisor, and then the KVM VM
> > 
> > when a secure guest reboots, the "secure configuration" is
> > dismantled (in this case, in a deferred way), and the KVM VM (and
> > its memory) is not directly affected
> > 
> > what happened before was that the secure configuration was
> > dismantled synchronously, and then re-created.
> > 
> > now instead, a new secure configuration is created using the same
> > KVM VM (and thus the same mm), before the old secure configuration
> > has been completely dismantled. hence the same KVM VM can have
> > multiple secure configurations associated, sharing the same address
> > space.
> > 
> > of course, only the newest one is actually running, the other ones
> > are "zombies", without CPUs.
> >   
> 
> Can a guest trigger a DoS?

I don't see how

a guest can fill its memory and then reboot, and then fill its memory
again and then reboot... but that will take time, and filling the memory
will itself clean up leftover pages from previous boots.

"normal" reboot loops will be fast, because there won't be much memory
to process

I have actually tested mixed reboot/shutdown loops, and the system
behaved as you would expect when under load.

Claudio Imbrenda May 18, 2021, 4:34 p.m. UTC | #11

On Tue, 18 May 2021 18:20:22 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 18.05.21 18:13, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:45:18 +0200
> > Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> >   
> >> On 18.05.21 17:36, Claudio Imbrenda wrote:  
> >>> On Tue, 18 May 2021 17:05:37 +0200
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>      
> >>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> >>>>     
> >>>>> Previously, when a protected VM was rebooted or when it was shut
> >>>>> down, its memory was made unprotected, and then the protected VM
> >>>>> itself was destroyed. Looping over the whole address space can
> >>>>> take some time, considering the overhead of the various
> >>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
> >>>>> would take a potentially long amount of time, depending on the
> >>>>> amount of used memory.
> >>>>>
> >>>>> This patchseries implements a deferred destroy mechanism for
> >>>>> protected guests. When a protected guest is destroyed, its
> >>>>> memory is cleared in background, allowing the guest to restart
> >>>>> or terminate significantly faster than before.
> >>>>>
> >>>>> There are 2 possibilities when a protected VM is torn down:
> >>>>> * it still has an address space associated (reboot case)
> >>>>> * it does not have an address space anymore (shutdown case)
> >>>>>
> >>>>> For the reboot case, the reference count of the mm is increased,
> >>>>> and then a background thread is started to clean up. Once the
> >>>>> thread went through the whole address space, the protected VM is
> >>>>> actually destroyed.
> >>>>>
> >>>>> For the shutdown case, a list of pages to be destroyed is formed
> >>>>> when the mm is torn down. Instead of just unmapping the pages
> >>>>> when the address space is being torn down, they are also set
> >>>>> aside. Later when KVM cleans up the VM, a thread is started to
> >>>>> clean up the pages from the list.  
> >>>>
> >>>> Just to make sure, 'clean up' includes doing uv calls?  
> >>>
> >>> yes
> >>>      
> >>>>>
> >>>>> This means that the same address space can have memory belonging
> >>>>> to more than one protected guest, although only one will be
> >>>>> running, the others will in fact not even have any CPUs.  
> >>>>
> >>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>> accessible in any way? I would assume that they only belong to
> >>>> the 
> >>>
> >>> in case of reboot: yes, they are still in the address space of the
> >>> guest, and can be swapped if needed
> >>>      
> >>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>> that needs to get new pages?  
> >>>
> >>> the rebooted guest (normal or secure) will re-use the same pages
> >>> of the old guest (before or after cleanup, which is the reason of
> >>> patches 3 and 4)
> >>>
> >>> the KVM guest is not affected in case of reboot, so the userspace
> >>> address space is not touched.
> >>>      
> >>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
> >>>> memory exhaustion?  
> >>>
> >>> in case of reboot, not much; the pages that were in use are still
> >>> in use after the reboot, and they can be swapped.
> >>>
> >>> in case of a shutdown, yes, because the pages are really taken
> >>> aside and cleared/destroyed in background. they cannot be
> >>> swapped. they are freed immediately as they are processed, to try
> >>> to mitigate memory exhaustion scenarios.
> >>>
> >>> in the end, this patchseries is a tradeoff between speed and
> >>> memory consumption. the memory needs to be cleared up at some
> >>> point, and that requires time.
> >>>
> >>> in cases where this might be an issue, I introduced a new KVM flag
> >>> to disable lazy destroy (patch 10)  
> >>
> >> Maybe we could piggy-back on the OOM-kill notifier and then fall
> >> back to synchronous freeing for some pages?  
> > 
> > I'm not sure I follow
> > 
> > once the pages have been set aside, it's too late
> > 
> > while the pages are being set aside, every now and then some memory
> > needs to be allocated. the allocation is atomic, not allowed to use
> > emergency reserves, and can fail without warning. if the allocation
> > fails, we clean up one page and continue, without setting aside
> > anything (patch 9)
> > 
> > so if the system is low on memory, the lazy destroy should not make
> > the situation too much worse.
> > 
> > the only issue here is starting a normal process in the host (maybe
> > a non secure guest) that uses a lot of memory very quickly, right
> > after a large secure guest has terminated.  
> 
> I think page cache page allocations do not need to be atomic.
> In that case the kernel might still decide to trigger the oom
> killer. We can let it notify us, free 256 pages synchronously
> and avoid the oom kill. Have a look at the virtio-balloon
> virtio_balloon_oom_notify

the issue is that once the pages have been set aside, it's too late.
the OOM notifier would only be useful if we get notified of the OOM
situation _while_ setting aside the pages.

unless you mean that the notifier should simply wait until the thread
has done (some of) its work?
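
For reference, the set-aside fallback quoted above (patch 9) could be
modelled roughly as follows. This is a toy userspace sketch under
stated assumptions, not the kernel implementation: CHUNK_ENTRIES,
try_alloc_chunk and destroy_page are hypothetical names, and the
"chunk" bookkeeping is only a guess at why memory needs to be
allocated every now and then.

```python
CHUNK_ENTRIES = 4  # assumption: entries per bookkeeping chunk


def set_aside_pages(pages, try_alloc_chunk, destroy_page):
    """Set pages aside for lazy destroy, falling back when allocation fails.

    pages           -- iterable of page identifiers being unmapped
    try_alloc_chunk -- hypothetical callable modelling an atomic
                       allocation that may fail (returns True or False)
    destroy_page    -- hypothetical callable that cleans up one page now
    Returns the list of chunks left for the background thread.
    """
    chunks = []
    for page in pages:
        if not chunks or len(chunks[-1]) == CHUNK_ENTRIES:
            # A new chunk is needed; the allocation is allowed to fail.
            if not try_alloc_chunk():
                # Fallback: clean up this one page synchronously and
                # continue without setting anything aside.
                destroy_page(page)
                continue
            chunks.append([])
        chunks[-1].append(page)
    return chunks
```

Once a page is in a chunk, nothing in this model can pull it back out
synchronously, which is the "too late" point made above.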

Christian Borntraeger May 18, 2021, 4:35 p.m. UTC | #12

On 18.05.21 18:34, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:20:22 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 18.05.21 18:13, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 17:45:18 +0200
>>> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>>    
>>>> On 18.05.21 17:36, Claudio Imbrenda wrote:
>>>>> On Tue, 18 May 2021 17:05:37 +0200
>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>       
>>>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>>>      
>>>>>>> Previously, when a protected VM was rebooted or when it was shut
>>>>>>> down, its memory was made unprotected, and then the protected VM
>>>>>>> itself was destroyed. Looping over the whole address space can
>>>>>>> take some time, considering the overhead of the various
>>>>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
>>>>>>> would take a potentially long amount of time, depending on the
>>>>>>> amount of used memory.
>>>>>>>
>>>>>>> This patchseries implements a deferred destroy mechanism for
>>>>>>> protected guests. When a protected guest is destroyed, its
>>>>>>> memory is cleared in background, allowing the guest to restart
>>>>>>> or terminate significantly faster than before.
>>>>>>>
>>>>>>> There are 2 possibilities when a protected VM is torn down:
>>>>>>> * it still has an address space associated (reboot case)
>>>>>>> * it does not have an address space anymore (shutdown case)
>>>>>>>
>>>>>>> For the reboot case, the reference count of the mm is increased,
>>>>>>> and then a background thread is started to clean up. Once the
>>>>>>> thread went through the whole address space, the protected VM is
>>>>>>> actually destroyed.
>>>>>>>
>>>>>>> For the shutdown case, a list of pages to be destroyed is formed
>>>>>>> when the mm is torn down. Instead of just unmapping the pages
>>>>>>> when the address space is being torn down, they are also set
>>>>>>> aside. Later when KVM cleans up the VM, a thread is started to
>>>>>>> clean up the pages from the list.
>>>>>>
>>>>>> Just to make sure, 'clean up' includes doing uv calls?
>>>>>
>>>>> yes
>>>>>       
>>>>>>>
>>>>>>> This means that the same address space can have memory belonging
>>>>>>> to more than one protected guest, although only one will be
>>>>>>> running, the others will in fact not even have any CPUs.
>>>>>>
>>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>>>> accessible in any way? I would assume that they only belong to
>>>>>> the
>>>>>
>>>>> in case of reboot: yes, they are still in the address space of the
>>>>> guest, and can be swapped if needed
>>>>>       
>>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>>>> that needs to get new pages?
>>>>>
>>>>> the rebooted guest (normal or secure) will re-use the same pages
>>>>> of the old guest (before or after cleanup, which is the reason of
>>>>> patches 3 and 4)
>>>>>
>>>>> the KVM guest is not affected in case of reboot, so the userspace
>>>>> address space is not touched.
>>>>>       
>>>>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
>>>>>> memory exhaustion?
>>>>>
>>>>> in case of reboot, not much; the pages that were in use are still
>>>>> in use after the reboot, and they can be swapped.
>>>>>
>>>>> in case of a shutdown, yes, because the pages are really taken
>>>>> aside and cleared/destroyed in background. they cannot be
>>>>> swapped. they are freed immediately as they are processed, to try
>>>>> to mitigate memory exhaustion scenarios.
>>>>>
>>>>> in the end, this patchseries is a tradeoff between speed and
>>>>> memory consumption. the memory needs to be cleared up at some
>>>>> point, and that requires time.
>>>>>
>>>>> in cases where this might be an issue, I introduced a new KVM flag
>>>>> to disable lazy destroy (patch 10)
>>>>
>>>> Maybe we could piggy-back on the OOM-kill notifier and then fall
>>>> back to synchronous freeing for some pages?
>>>
>>> I'm not sure I follow
>>>
>>> once the pages have been set aside, it's too late
>>>
>>> while the pages are being set aside, every now and then some memory
>>> needs to be allocated. the allocation is atomic, not allowed to use
>>> emergency reserves, and can fail without warning. if the allocation
>>> fails, we clean up one page and continue, without setting aside
>>> anything (patch 9)
>>>
>>> so if the system is low on memory, the lazy destroy should not make
>>> the situation too much worse.
>>>
>>> the only issue here is starting a normal process in the host (maybe
>>> a non secure guest) that uses a lot of memory very quickly, right
>>> after a large secure guest has terminated.
>>
>> I think page cache page allocations do not need to be atomic.
>> In that case the kernel might still decide to trigger the oom
>> killer. We can let it notify us, free 256 pages synchronously
>> and avoid the oom kill. Have a look at the virtio-balloon
>> virtio_balloon_oom_notify
> 
> the issue is that once the pages have been set aside, it's too late.
> the OOM notifier would only be useful if we get notified of the OOM
> situation _while_ setting aside the pages.
> 
> unless you mean that the notifier should simply wait until the thread
> has done (some of) its work?

Exactly. Let the notifier wait until you have freed 256 pages and return
256 to the oom notifier.
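
A minimal userspace sketch of that idea, assuming a single background
thread and a shared counter. The LazyDestroyer class and its methods
are hypothetical; the real thing would be an OOM notifier callback in
the kernel, which is not modelled here.

```python
import threading


class LazyDestroyer:
    """Toy model: a worker frees set-aside pages; a notifier can wait."""

    def __init__(self, pages):
        self._pending = list(pages)
        self._freed = 0
        self._cond = threading.Condition()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()

    def pending(self):
        with self._cond:
            return len(self._pending)

    def freed(self):
        with self._cond:
            return self._freed

    def _run(self):
        # Background cleanup: free one page per iteration and wake waiters.
        while True:
            with self._cond:
                if not self._pending:
                    self._cond.notify_all()
                    return
                self._pending.pop()
                self._freed += 1
                self._cond.notify_all()

    def oom_notify(self, want=256):
        """Block until `want` more pages are freed or nothing is pending.

        Returns how many pages were freed while waiting.
        """
        with self._cond:
            start = self._freed
            self._cond.wait_for(
                lambda: self._freed - start >= want or not self._pending)
            return self._freed - start
```

The notifier does no freeing itself; it only waits for the thread to
make progress and reports that progress back to the caller.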

Christian Borntraeger May 18, 2021, 4:55 p.m. UTC | #13

On 18.05.21 18:31, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:22:42 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.05.21 18:19, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 18:04:11 +0200
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>    
>>>> On Tue, 18 May 2021 17:36:24 +0200
>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>   
>>>>> On Tue, 18 May 2021 17:05:37 +0200
>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>       
>>>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>   
>>>>>>> This means that the same address space can have memory
>>>>>>> belonging to more than one protected guest, although only one
>>>>>>> will be running, the others will in fact not even have any
>>>>>>> CPUs.
>>>>>>
>>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>>>> accessible in any way? I would assume that they only belong to
>>>>>> the
>>>>>
>>>>> in case of reboot: yes, they are still in the address space of the
>>>>> guest, and can be swapped if needed
>>>>>       
>>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>>>> that needs to get new pages?
>>>>>
>>>>> the rebooted guest (normal or secure) will re-use the same pages
>>>>> of the old guest (before or after cleanup, which is the reason of
>>>>> patches 3 and 4)
>>>>
>>>> Took a look at those patches, makes sense.
>>>>   
>>>>>
>>>>> the KVM guest is not affected in case of reboot, so the userspace
>>>>> address space is not touched.
>>>>
>>>> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
>>>> actual guest above?
>>>>   
>>>
>>> yes this is tricky, because there is the guest OS, which terminates
>>> or reboots, then there is the "secure configuration" entity,
>>> handled by the Ultravisor, and then the KVM VM
>>>
>>> when a secure guest reboots, the "secure configuration" is
>>> dismantled (in this case, in a deferred way), and the KVM VM (and
>>> its memory) is not directly affected
>>>
>>> what happened before was that the secure configuration was
>>> dismantled synchronously, and then re-created.
>>>
>>> now instead, a new secure configuration is created using the same
>>> KVM VM (and thus the same mm), before the old secure configuration
>>> has been completely dismantled. hence the same KVM VM can have
>>> multiple secure configurations associated, sharing the same address
>>> space.
>>>
>>> of course, only the newest one is actually running, the other ones
>>> are "zombies", without CPUs.
>>>    
>>
>> Can a guest trigger a DoS?
> 
> I don't see how
> 
> a guest can fill its memory and then reboot, and then fill its memory
> again and then reboot... but that will take time, filling the memory
> will itself clean up leftover pages from previous boots.

In essence this guest will then synchronously wait for the page to be
exported and reimported, correct?
> 
> "normal" reboot loops will be fast, because there won't be much memory
> to process
> 
> I have actually tested mixed reboot/shutdown loops, and the system
> behaved as you would expect when under load.

I guess the memory will continue to be accounted to the memcg? Correct?

Claudio Imbrenda May 18, 2021, 5 p.m. UTC | #14

On Tue, 18 May 2021 18:55:56 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 18.05.21 18:31, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 18:22:42 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> On 18.05.21 18:19, Claudio Imbrenda wrote:  
> >>> On Tue, 18 May 2021 18:04:11 +0200
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>      
> >>>> On Tue, 18 May 2021 17:36:24 +0200
> >>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> >>>>     
> >>>>> On Tue, 18 May 2021 17:05:37 +0200
> >>>>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>>>         
> >>>>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:  
> >>>>     
> >>>>>>> This means that the same address space can have memory
> >>>>>>> belonging to more than one protected guest, although only one
> >>>>>>> will be running, the others will in fact not even have any
> >>>>>>> CPUs.  
> >>>>>>
> >>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>>>> accessible in any way? I would assume that they only belong to
> >>>>>> the  
> >>>>>
> >>>>> in case of reboot: yes, they are still in the address space of
> >>>>> the guest, and can be swapped if needed
> >>>>>         
> >>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>>>> that needs to get new pages?  
> >>>>>
> >>>>> the rebooted guest (normal or secure) will re-use the same pages
> >>>>> of the old guest (before or after cleanup, which is the reason
> >>>>> of patches 3 and 4)  
> >>>>
> >>>> Took a look at those patches, makes sense.
> >>>>     
> >>>>>
> >>>>> the KVM guest is not affected in case of reboot, so the
> >>>>> userspace address space is not touched.  
> >>>>
> >>>> 'guest' is a bit ambiguous here -- do you mean the vm here, and
> >>>> the actual guest above?
> >>>>     
> >>>
> >>> yes this is tricky, because there is the guest OS, which
> >>> terminates or reboots, then there is the "secure configuration"
> >>> entity, handled by the Ultravisor, and then the KVM VM
> >>>
> >>> when a secure guest reboots, the "secure configuration" is
> >>> dismantled (in this case, in a deferred way), and the KVM VM (and
> >>> its memory) is not directly affected
> >>>
> >>> what happened before was that the secure configuration was
> >>> dismantled synchronously, and then re-created.
> >>>
> >>> now instead, a new secure configuration is created using the same
> >>> KVM VM (and thus the same mm), before the old secure configuration
> >>> has been completely dismantled. hence the same KVM VM can have
> >>> multiple secure configurations associated, sharing the same
> >>> address space.
> >>>
> >>> of course, only the newest one is actually running, the other ones
> >>> are "zombies", without CPUs.
> >>>      
> >>
> >> Can a guest trigger a DoS?  
> > 
> > I don't see how
> > 
> > a guest can fill its memory and then reboot, and then fill its
> > memory again and then reboot... but that will take time, filling
> > the memory will itself clean up leftover pages from previous boots.
> >  
> 
> In essence this guest will then synchronously wait for the page to be
> exported and reimported, correct?

correct

> > "normal" reboot loops will be fast, because there won't be much
> > memory to process
> > 
> > I have actually tested mixed reboot/shutdown loops, and the system
> > behaved as you would expect when under load.  
> 
> I guess the memory will continue to be accounted to the memcg?
> Correct?

for the reboot case, yes, since the mm is not directly affected.
for the shutdown case, I'm not sure.
