diff mbox

[v3,07/15] KVM: MMU: introduce nulls desc

Message ID 1382534973-13197-8-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Xiao Guangrong Oct. 23, 2013, 1:29 p.m. UTC
It is like a nulls list: we use the pte-list head as the nulls value, which
helps us detect whether the "desc" has been moved to another rmap; if that
happened we re-walk the rmap.

kvm->slots_lock is held when we do the lockless walking, which prevents the
rmap from being reused (freeing an rmap needs to hold that lock), so we can
not see the same nulls value used on different rmaps.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)
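A minimal userspace model of the "nulls desc" idea may make it easier to follow the discussion below. This is an illustrative sketch, not the real KVM structures: the last desc's next pointer is a tagged copy of the rmap head it belongs to, so a lockless walker that has been migrated into a foreign chain detects the mismatch at the end of the walk and restarts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLS_TAG 1UL

/* Simplified desc: up to 3 sptes per node, singly linked. */
struct desc {
	uint64_t *sptes[3];
	struct desc *more;	/* next desc, or the tagged nulls */
};

static struct desc *make_nulls(void *rmap_head)
{
	return (struct desc *)((uintptr_t)rmap_head | NULLS_TAG);
}

static int is_nulls(struct desc *d)
{
	return ((uintptr_t)d & NULLS_TAG) != 0;
}

static void *nulls_value(struct desc *d)
{
	return (void *)((uintptr_t)d & ~NULLS_TAG);
}

/*
 * Count the sptes in a desc chain; return -1 if the terminating nulls
 * does not point back at the rmap we started from, i.e. the desc was
 * moved to another rmap and the walk must be restarted.
 */
static int walk_lockless(void **rmap)
{
	struct desc *d = (struct desc *)*rmap;
	int n = 0;

	while (!is_nulls(d)) {
		for (int i = 0; i < 3 && d->sptes[i]; i++)
			n++;
		d = d->more;
	}
	return nulls_value(d) == (void *)rmap ? n : -1;
}
```

The key property, which the thread below debates, is that the mismatch check only works if a desc cannot be recycled into a differently-typed list while the walker holds no lock.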

Comments

Marcelo Tosatti Nov. 22, 2013, 7:14 p.m. UTC | #1
On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
> It likes nulls list and we use the pte-list as the nulls which can help us to
> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
> if that happened
> 
> kvm->slots_lock is held when we do lockless walking that prevents rmap
> is reused (free rmap need to hold that lock) so that we can not see the same
> nulls used on different rmaps
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

How about simplified lockless walk on the slot while rmapp entry
contains a single spte? (which should be the case with two-dimensional
paging).

That is, grab the lock when finding a rmap with more than one spte in
it (and then keep it locked until the end).
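The suggested simplification could be sketched as follows (illustrative encoding that mirrors KVM's pte-list convention, where a zero entry is empty, a plain pointer is a single spte, and a low-bit-tagged pointer marks a desc list):

```c
#include <assert.h>
#include <stdint.h>

enum walk_path { WALK_EMPTY, WALK_LOCKLESS, WALK_LOCKED };

/*
 * Classify an rmap entry: the common single-spte case (typical with
 * two-dimensional paging) can be handled locklessly with a cmpxchg on
 * the one spte; a desc list (low bit set) falls back to the mmu lock.
 */
static enum walk_path classify_rmap(uintptr_t rmap_val)
{
	if (!rmap_val)
		return WALK_EMPTY;	/* nothing mapped at this gfn */
	if (rmap_val & 1)
		return WALK_LOCKED;	/* desc list: more than one spte */
	return WALK_LOCKLESS;		/* single spte pointer */
}
```

Under this scheme only the pointer stored directly in the rmap slot is ever dereferenced without a lock, so no list-walk restart logic is needed at all.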

For example, nothing prevents lockless walker to move into some
parent_ptes chain, right?

Also, there is no guarantee of termination (as long as sptes are
deleted with the correct timing). BTW, can't see any guarantee of
termination for rculist nulls either (a writer can race with a lockless
reader indefinitely, restarting the lockless walk every time).

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong Nov. 25, 2013, 6:11 a.m. UTC | #2
On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>> It likes nulls list and we use the pte-list as the nulls which can help us to
>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>> if that happened
>> 
>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>> is reused (free rmap need to hold that lock) so that we can not see the same
>> nulls used on different rmaps
>> 
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> 
> How about simplified lockless walk on the slot while rmapp entry
> contains a single spte? (which should be the case with two-dimensional
> paging).
> 
> That is, grab the lock when finding a rmap with more than one spte in
> it (and then keep it locked until the end).

Hmm… that isn't straightforward, and it is more complex than the approach
in this patchset. It would also drop the improvement for the shadow MMU,
which benefits greatly from this patchset.

> 
> For example, nothing prevents lockless walker to move into some
> parent_ptes chain, right?

No.

The nulls can help us detect this case: for parent_ptes, the nulls points
to the "shadow page", but for rmaps it points to slot.arch.rmap. There is
no chance that the "rmap" is used as a shadow page while the slot lock is held.

> 
> Also, there is no guarantee of termination (as long as sptes are
> deleted with the correct timing). BTW, can't see any guarantee of
> termination for rculist nulls either (a writer can race with a lockless
> reader indefinately, restarting the lockless walk every time).

Hmm, that can be avoided by checking the dirty-bitmap before the rewalk;
that means, if the dirty-bitmap has been set during the lockless
write-protection, it's unnecessary to write-protect its sptes. What do you think?

But… do we really need to care about it? :(
Xiao Guangrong Nov. 25, 2013, 6:29 a.m. UTC | #3
On 11/25/2013 02:11 PM, Xiao Guangrong wrote:
> 
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
>> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>>> It likes nulls list and we use the pte-list as the nulls which can help us to
>>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>>> if that happened
>>>
>>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>>> is reused (free rmap need to hold that lock) so that we can not see the same
>>> nulls used on different rmaps
>>>
>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>
>> How about simplified lockless walk on the slot while rmapp entry
>> contains a single spte? (which should be the case with two-dimensional
>> paging).
>>
>> That is, grab the lock when finding a rmap with more than one spte in
>> it (and then keep it locked until the end).
> 
> Hmm… that isn't straightforward and more complex than the approach
> in this patchset. Also it can drop the improvement for shadow mmu that
> gets great improvement by this patchset.
> 
>>
>> For example, nothing prevents lockless walker to move into some
>> parent_ptes chain, right?
> 
> No.
> 
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the “rmap" is used as shadow page when slot-lock is held.
> 
>>
>> Also, there is no guarantee of termination (as long as sptes are
>> deleted with the correct timing). BTW, can't see any guarantee of
>> termination for rculist nulls either (a writer can race with a lockless
>> reader indefinately, restarting the lockless walk every time).
> 
> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> that means, if the dirty-bitmap has been set during lockless write-protection,
> it’s unnecessary to write-protect its sptes. Your idea?

This idea is based on the fact that the number of rmap entries is limited
by RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte to the
rmap, we can break out of the rewalk at once; in the case of deleting, we
can rewalk at most RMAP_RECYCLE_THRESHOLD times.


Peter Zijlstra Nov. 25, 2013, 9:31 a.m. UTC | #4
On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
> Also, there is no guarantee of termination (as long as sptes are
> deleted with the correct timing). BTW, can't see any guarantee of
> termination for rculist nulls either (a writer can race with a lockless
> reader indefinately, restarting the lockless walk every time).

What's an rculist null? rculists have regular termination conditions,
they'll reach the end (also head, due to circular etc..) in N steps,
where N is the number of elements.

Of course you can keep adding elements to protract this situation, but
you'll run out of memory eventually -- you also have to make sure to
insert them after the element being read, otherwise the iteration will
simply miss them.

Deleting an element preserves the element itself -- it has to be
RCU-freed to be part of an rculist, and the element also preserves its
fwd link, so any iterator will either not see the element, or if they're
at the element, they'll continue their iteration as normal (rculist
doesn't have backward iterators).

A deleted element may not be re-used before an RCU grace period has
lapsed. Same as for freeing such an element. So if you want to move an
rculist element you need to:

  list_del_rcu()
  synchronize_rcu();
  list_add_rcu();

Or use the list_splice_init_rcu() primitive which also explicitly takes
a @sync argument.


Gleb Natapov Nov. 25, 2013, 10:19 a.m. UTC | #5
On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
> > 
> > For example, nothing prevents lockless walker to move into some
> > parent_ptes chain, right?
> 
> No.
> 
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the “rmap" is used as shadow page when slot-lock is held.
> 
But meanwhile we will write protect non-last level sptes, no? Better to
create separate slab caches for rmap and parent_ptes lists.

--
			Gleb.
Xiao Guangrong Nov. 25, 2013, 10:25 a.m. UTC | #6
On 11/25/2013 06:19 PM, Gleb Natapov wrote:
> On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>>>
>>> For example, nothing prevents lockless walker to move into some
>>> parent_ptes chain, right?
>>
>> No.
>>
>> The nulls can help us to detect this case, for parent_ptes, the nulls points
>> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
>> is no chance that the “rmap" is used as shadow page when slot-lock is held.
>>
> But meanwhile we will write protect non-last level sptes, no? Better to

The walker will encounter the non-last sptes but will not write-protect
them, since we do an is_last_spte() check before the cmpxchg.

> create separate slab caches for rmap and parent_ptes lists.

Yes, this is a good idea. Thanks you, Gleb!

Xiao Guangrong Nov. 25, 2013, 10:59 a.m. UTC | #7
Hi Peter,

On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
>> Also, there is no guarantee of termination (as long as sptes are
>> deleted with the correct timing). BTW, can't see any guarantee of
>> termination for rculist nulls either (a writer can race with a lockless
>> reader indefinately, restarting the lockless walk every time).
> 
> What's an rculist null? 

I guess Marcelo was talking about rculist_nulls.h
(Documentation/RCU/rculist_nulls.txt).

> rculists have regular termination conditions,
> they'll reach the end (also head, due to circular etc..) in N steps,
> where N is the number of elements.
> 
> Of course you can keep adding elements to protract this situation, but
> you'll run out of memory eventually -- you also have to make sure to
> insert them after the element being read, otherwise the iteration will
> simply miss them.
> 
> Deleting an element preserves the element itself -- it has to be
> RCU-freed to be part of an rculist, and the element also preserves its
> fwd link, so any iterator will either not see the element, or if they're
> at the element, they'll continue their iteration as normal (rculist
> doesn't have backward iterators).
> 
> A deleted element may not be re-used before an RCU grace period has
> lapsed. Same as for freeing such an element. So if you want to move an
> rculist element you need to:
> 
>   list_del_rcu()
>   synchronize_rcu();
>   list_add_rcu();
> 
> Or use the list_splice_init_rcu() primitive which also explicitly takes
> a @sync argument.

Thanks for your detailed explanation, Peter!

What if the element is allocated from a SLAB_DESTROY_BY_RCU slab? In that
case the element may be reused while we iterate.


Peter Zijlstra Nov. 25, 2013, 11:05 a.m. UTC | #8
On Mon, Nov 25, 2013 at 06:59:24PM +0800, Xiao Guangrong wrote:
> 
> Hi Peter,
> 
> On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
> > On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
> >> Also, there is no guarantee of termination (as long as sptes are
> >> deleted with the correct timing). BTW, can't see any guarantee of
> >> termination for rculist nulls either (a writer can race with a lockless
> >> reader indefinately, restarting the lockless walk every time).
> > 
> > What's an rculist null? 
> 
> I guess Marcelo was talking about rculist_nulls.h
> (Documentation/RCU/rculist_nulls.txt).

Oh, let me have a look, I don't think I've really looked at that yet.

> > rculists have regular termination conditions,
> > they'll reach the end (also head, due to circular etc..) in N steps,
> > where N is the number of elements.
> > 
> > Of course you can keep adding elements to protract this situation, but
> > you'll run out of memory eventually -- you also have to make sure to
> > insert them after the element being read, otherwise the iteration will
> > simply miss them.
> > 
> > Deleting an element preserves the element itself -- it has to be
> > RCU-freed to be part of an rculist, and the element also preserves its
> > fwd link, so any iterator will either not see the element, or if they're
> > at the element, they'll continue their iteration as normal (rculist
> > doesn't have backward iterators).
> > 
> > A deleted element may not be re-used before an RCU grace period has
> > lapsed. Same as for freeing such an element. So if you want to move an
> > rculist element you need to:
> > 
> >   list_del_rcu()
> >   synchronize_rcu();
> >   list_add_rcu();
> > 
> > Or use the list_splice_init_rcu() primitive which also explicitly takes
> > a @sync argument.
> 
> Thanks for your detailed explanation, Peter!
> 
> What about if the element is allocated from SLAB_DESTROY_BY_RCU slab? That
> means the element may be reused while do iteration.

Then its broken, SLAB_DESTROY_BY_RCU rarely does what people expect it
to; which is why I wrote that honkin' huge comment right next to it.


Peter Zijlstra Nov. 25, 2013, 11:29 a.m. UTC | #9
On Mon, Nov 25, 2013 at 12:05:00PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 06:59:24PM +0800, Xiao Guangrong wrote:
> > I guess Marcelo was talking about rculist_nulls.h
> > (Documentation/RCU/rculist_nulls.txt).
> 
> Oh, let me have a look, I don't think I've really looked at that yet.

Egads, that's far too clever ;-)

Yes, if that's what Marcelo was referring to, he's right: it doesn't
have a proper termination condition in the face of unlimited
modification.

And it can indeed use SLAB_DESTROY_BY_RCU as you alluded to -- although
the documentation is a tad scarce explaining why this is so.
Avi Kivity Nov. 25, 2013, 12:48 p.m. UTC | #10
On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
<xiaoguangrong@linux.vnet.ibm.com> wrote:
>
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

<snip complicated stuff about parent_pte>

I'm not really following, but note that parent_pte predates EPT (and
the use of rcu in kvm), so all the complexity that is the result of
trying to pack as many list entries into a cache line can be dropped.
Most setups now would have exactly one list entry, which is handled
specially anyway.

Alternatively, the trick of storing multiple entries in one list entry
can be moved to generic code, it may be useful to others.
Marcelo Tosatti Nov. 25, 2013, 2:08 p.m. UTC | #11
On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
> 
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
> >> It likes nulls list and we use the pte-list as the nulls which can help us to
> >> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
> >> if that happened
> >> 
> >> kvm->slots_lock is held when we do lockless walking that prevents rmap
> >> is reused (free rmap need to hold that lock) so that we can not see the same
> >> nulls used on different rmaps
> >> 
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> > 
> > How about simplified lockless walk on the slot while rmapp entry
> > contains a single spte? (which should be the case with two-dimensional
> > paging).
> > 
> > That is, grab the lock when finding a rmap with more than one spte in
> > it (and then keep it locked until the end).
> 
> Hmm… that isn't straightforward and more complex than the approach
> in this patchset. Also it can drop the improvement for shadow mmu that
> gets great improvement by this patchset.

It is not more complex, since it would remove the lockless list walk. Only
the spte pointer at rmap[spte] is accessed without a lock. It's much, much
simpler.

> > For example, nothing prevents lockless walker to move into some
> > parent_ptes chain, right?
> 
> No.
> 
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the “rmap" is used as shadow page when slot-lock is held.

The SLAB cache is the same, so entries can be reused. What prevents
a desc entry living in slot.arch.rmap to be freed and reused by a
parent_ptes desc?

> > Also, there is no guarantee of termination (as long as sptes are
> > deleted with the correct timing). BTW, can't see any guarantee of
> > termination for rculist nulls either (a writer can race with a lockless
> > reader indefinately, restarting the lockless walk every time).
> 
> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> that means, if the dirty-bitmap has been set during lockless write-protection,
> it’s unnecessary to write-protect its sptes. Your idea?
> 
> But… do we really need to care it. :(

See my reply to Avi.

Marcelo Tosatti Nov. 25, 2013, 2:23 p.m. UTC | #12
On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> >
> > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> <snip complicated stuff about parent_pte>
> 
> I'm not really following, but note that parent_pte predates EPT (and
> the use of rcu in kvm), so all the complexity that is the result of
> trying to pack as many list entries into a cache line can be dropped.
> Most setups now would have exactly one list entry, which is handled
> specially antyway.
> 
> Alternatively, the trick of storing multiple entries in one list entry
> can be moved to generic code, it may be useful to others.

Yes, can the lockless list walking code be transformed into generic
single-linked list walking? So the correctness can be verified
independently, and KVM becomes a simple user of that interface.

The simpler version is to maintain lockless walk on depth-1 rmap entries
(and grab the lock once depth-2 entry is found).

Gleb Natapov Nov. 25, 2013, 2:29 p.m. UTC | #13
On Mon, Nov 25, 2013 at 12:23:51PM -0200, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> > On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> > <xiaoguangrong@linux.vnet.ibm.com> wrote:
> > >
> > > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > 
> > <snip complicated stuff about parent_pte>
> > 
> > I'm not really following, but note that parent_pte predates EPT (and
> > the use of rcu in kvm), so all the complexity that is the result of
> > trying to pack as many list entries into a cache line can be dropped.
> > Most setups now would have exactly one list entry, which is handled
> > specially antyway.
> > 
> > Alternatively, the trick of storing multiple entries in one list entry
> > can be moved to generic code, it may be useful to others.
> 
> Yes, can the lockless list walking code be transformed into generic
> single-linked list walking? So the correctness can be verified
> independently, and KVM becomes a simple user of that interface.
> 
The code will become simpler, but the problem of the never-ending walk of
rculist_nulls will remain.

> The simpler version is to maintain lockless walk on depth-1 rmap entries
> (and grab the lock once depth-2 entry is found).
And release it between each rmap walk or at the very end of write
protect?

--
			Gleb.
Marcelo Tosatti Nov. 25, 2013, 6:06 p.m. UTC | #14
On Mon, Nov 25, 2013 at 04:29:28PM +0200, Gleb Natapov wrote:
> On Mon, Nov 25, 2013 at 12:23:51PM -0200, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> > > On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> > > <xiaoguangrong@linux.vnet.ibm.com> wrote:
> > > >
> > > > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > 
> > > <snip complicated stuff about parent_pte>
> > > 
> > > I'm not really following, but note that parent_pte predates EPT (and
> > > the use of rcu in kvm), so all the complexity that is the result of
> > > trying to pack as many list entries into a cache line can be dropped.
> > > Most setups now would have exactly one list entry, which is handled
> > > specially antyway.
> > > 
> > > Alternatively, the trick of storing multiple entries in one list entry
> > > can be moved to generic code, it may be useful to others.
> > 
> > Yes, can the lockless list walking code be transformed into generic
> > single-linked list walking? So the correctness can be verified
> > independently, and KVM becomes a simple user of that interface.
> > 
> The code will become simpler but the problem of never ending walk of
> rculist_nulls will remain.

Yes. Hopefully it can be fixed in some way.

> > The simpler version is to maintain lockless walk on depth-1 rmap entries
> > (and grab the lock once depth-2 entry is found).
> And release it between each rmap walk or at the very end of write
> protect?

Or keep it locked until 10 consecutive sptes with depth 1 are found.

Marcelo Tosatti Nov. 25, 2013, 6:12 p.m. UTC | #15
On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >> Also, there is no guarantee of termination (as long as sptes are
> >> deleted with the correct timing). BTW, can't see any guarantee of
> >> termination for rculist nulls either (a writer can race with a lockless
> >> reader indefinately, restarting the lockless walk every time).
> > 
> > Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> > that means, if the dirty-bitmap has been set during lockless write-protection,
> > it’s unnecessary to write-protect its sptes. Your idea?
> This idea is based on the fact that the number of rmap is limited by
> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> we can break the rewalk at once, in the case of deleting, we can only
> rewalk RMAP_RECYCLE_THRESHOLD times.

Please explain in more detail.

Xiao Guangrong Nov. 26, 2013, 3:02 a.m. UTC | #16
On 11/25/2013 10:08 PM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>>
>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>>> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>>>> It likes nulls list and we use the pte-list as the nulls which can help us to
>>>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>>>> if that happened
>>>>
>>>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>>>> is reused (free rmap need to hold that lock) so that we can not see the same
>>>> nulls used on different rmaps
>>>>
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>>
>>> How about simplified lockless walk on the slot while rmapp entry
>>> contains a single spte? (which should be the case with two-dimensional
>>> paging).
>>>
>>> That is, grab the lock when finding a rmap with more than one spte in
>>> it (and then keep it locked until the end).
>>
>> Hmm… that isn't straightforward and more complex than the approach
>> in this patchset. Also it can drop the improvement for shadow mmu that
>> gets great improvement by this patchset.
> 
> It is not more complex, since it would remove list lockless walk. Only
> the spte pointer at rmap[spte] is accessed without a lock. Its much much
> simpler.
> 
>>> For example, nothing prevents lockless walker to move into some
>>> parent_ptes chain, right?
>>
>> No.
>>
>> The nulls can help us to detect this case, for parent_ptes, the nulls points
>> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
>> is no chance that the “rmap" is used as shadow page when slot-lock is held.
> 
> The SLAB cache is the same, so entries can be reused. What prevents
> a desc entry living in slot.arch.rmap to be freed and reused by a
> parent_ptes desc?
> 

We will check is_last_spte(); all the sptes on parent_ptes will fail that
check. And Gleb suggested using a separate slab for rmap, which should be excellent.

Xiao Guangrong Nov. 26, 2013, 3:10 a.m. UTC | #17
On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
>> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>
>>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>> <snip complicated stuff about parent_pte>
>>
>> I'm not really following, but note that parent_pte predates EPT (and
>> the use of rcu in kvm), so all the complexity that is the result of
>> trying to pack as many list entries into a cache line can be dropped.
>> Most setups now would have exactly one list entry, which is handled
>> specially antyway.
>>
>> Alternatively, the trick of storing multiple entries in one list entry
>> can be moved to generic code, it may be useful to others.
> 
> Yes, can the lockless list walking code be transformed into generic
> single-linked list walking? So the correctness can be verified
> independently, and KVM becomes a simple user of that interface.

I'm afraid the single-entry list is not as good as we expected. In my
experience, there are too many entries on the rmap, sometimes more than 300
(consider the case of a lib shared by all processes).

> 
> The simpler version is to maintain lockless walk on depth-1 rmap entries
> (and grab the lock once depth-2 entry is found).

I still think the lockless rmap is more graceful: the soft MMU also
benefits from it, and it is promising for use in some mmu-notifier functions. :)


Xiao Guangrong Nov. 26, 2013, 3:21 a.m. UTC | #18
On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>> Also, there is no guarantee of termination (as long as sptes are
>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>> termination for rculist nulls either (a writer can race with a lockless
>>>> reader indefinately, restarting the lockless walk every time).
>>>
>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>> that means, if the dirty-bitmap has been set during lockless write-protection,
> >>> it’s unnecessary to write-protect its sptes. Your idea?
>> This idea is based on the fact that the number of rmap is limited by
>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>> we can break the rewalk at once, in the case of deleting, we can only
>> rewalk RMAP_RECYCLE_THRESHOLD times.
> 
> Please explain in more detail.

Okay.

My proposal is like this:

pte_list_walk_lockless()
{
restart:

+	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
+		return;

	code-doing-lockless-walking;
	......
}

Before doing the lockless walk, we check the dirty-bitmap first; if it is
set we can simply skip write-protection for the gfn. That covers the case
where a new spte is being added to the rmap while we access the rmap
locklessly.

For the case of deleting an spte from the rmap, the number of entries is
bounded by RMAP_RECYCLE_THRESHOLD, so the rewalk is not endless.
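The proposal above can be modeled in a small, self-contained sketch (helper names are assumptions, not the real KVM API): before each attempt, the walker tests the slot's dirty bitmap for the gfn and gives up early when the bit is already set, since the gfn will be reported dirty anyway.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the RMAP_RECYCLE_THRESHOLD bound on deletions. */
#define MAX_RESTARTS 64

static bool test_bit_(const unsigned long *bitmap, unsigned gfn_index)
{
	return bitmap[gfn_index / (8 * sizeof(long))] &
	       (1UL << (gfn_index % (8 * sizeof(long))));
}

/* Demo walker that always completes without the desc moving. */
static bool walk_completes(void)
{
	return true;
}

/*
 * Returns true if the gfn's sptes were write-protected, false if the
 * walk was skipped because the gfn is already dirty.  walk_once()
 * returns false when the desc chain changed under us and we must
 * re-check the bitmap and re-walk.
 */
static bool write_protect_lockless(const unsigned long *dirty_bitmap,
				   unsigned gfn_index,
				   bool (*walk_once)(void))
{
	for (int i = 0; i < MAX_RESTARTS; i++) {
		if (test_bit_(dirty_bitmap, gfn_index))
			return false;	/* already dirty: skip protection */
		if (walk_once())
			return true;	/* walk finished undisturbed */
	}
	return true;			/* deletions are bounded, per above */
}
```

Gleb's objection in the follow-up still applies to the unbounded-modification case: the restart loop terminates only if concurrent changes to the chain are themselves bounded.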

Gleb Natapov Nov. 26, 2013, 10:12 a.m. UTC | #19
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >>>> Also, there is no guarantee of termination (as long as sptes are
> >>>> deleted with the correct timing). BTW, can't see any guarantee of
> >>>> termination for rculist nulls either (a writer can race with a lockless
> >>>> reader indefinately, restarting the lockless walk every time).
> >>>
> >>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> >>> that means, if the dirty-bitmap has been set during lockless write-protection,
> >>> it’s unnecessary to write-protect its sptes. Your idea?
> >> This idea is based on the fact that the number of rmap is limited by
> >> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> >> we can break the rewalk at once, in the case of deleting, we can only
> >> rewalk RMAP_RECYCLE_THRESHOLD times.
> > 
> > Please explain in more detail.
> 
> Okay.
> 
> My proposal is like this:
> 
> pte_list_walk_lockless()
> {
> restart:
> 
> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
> +		return;
> 
> 	code-doing-lockless-walking;
> 	......
> }
> 
> Before do lockless-walking, we check the dirty-bitmap first, if
> it is set we can simply skip write-protection for the gfn, that
> is the case that new spte is being added into rmap when we lockless
> access the rmap.
> 
> For the case of deleting spte from rmap, the number of entry is limited
> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
The point is that the rmap entry you are inspecting can be constantly
deleted and re-added to the beginning of some other list, so the code that
traverses the list may never reach the end.

--
			Gleb.
Gleb Natapov Nov. 26, 2013, 10:15 a.m. UTC | #20
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> >> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> >> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> >>>
> >>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >>
> >> <snip complicated stuff about parent_pte>
> >>
> >> I'm not really following, but note that parent_pte predates EPT (and
> >> the use of rcu in kvm), so all the complexity that is the result of
> >> trying to pack as many list entries into a cache line can be dropped.
> >> Most setups now would have exactly one list entry, which is handled
> >> specially anyway.
> >>
> >> Alternatively, the trick of storing multiple entries in one list entry
> >> can be moved to generic code, it may be useful to others.
> > 
> > Yes, can the lockless list walking code be transformed into generic
> > single-linked list walking? So the correctness can be verified
> > independently, and KVM becomes a simple user of that interface.
> 
> I'm afraid the single-entry list is not as good as we expected. In my
> experience, there're too many entries on rmap, more than 300 sometimes.
> (consider a case that a lib shared by all processes).
> 
This is without EPT though, and non-EPT HW is not a performance king
anyway. Nested EPT uses shadow paging too, but VMs hardly share any
pages. With KSM they may, though.

--
			Gleb.
Marcelo Tosatti Nov. 26, 2013, 7:31 p.m. UTC | #21
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >>>> Also, there is no guarantee of termination (as long as sptes are
> >>>> deleted with the correct timing). BTW, can't see any guarantee of
> >>>> termination for rculist nulls either (a writer can race with a lockless
> >>>> reader indefinitely, restarting the lockless walk every time).
> >>>
> >>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> >>> that means, if the dirty-bitmap has been set during lockless write-protection,
> >>> it's unnecessary to write-protect its sptes. Your idea?
> >> This idea is based on the fact that the number of rmap is limited by
> >> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> >> we can break the rewalk at once, in the case of deleting, we can only
> >> rewalk RMAP_RECYCLE_THRESHOLD times.
> > 
> > Please explain in more detail.
> 
> Okay.
> 
> My proposal is like this:
> 
> pte_list_walk_lockless()
> {
> restart:
> 
> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
> +		return;
> 
> 	code-doing-lockless-walking;
> 	......
> }
> 
> Before do lockless-walking, we check the dirty-bitmap first, if
> it is set we can simply skip write-protection for the gfn, that
> is the case that new spte is being added into rmap when we lockless
> access the rmap.

The dirty bit could be set after the check.

> For the case of deleting spte from rmap, the number of entry is limited
> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.

It can shrink and grow while lockless walk is performed.

Marcelo Tosatti Nov. 26, 2013, 7:58 p.m. UTC | #22
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> >> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> >> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> >>>
> >>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >>
> >> <snip complicated stuff about parent_pte>
> >>
> >> I'm not really following, but note that parent_pte predates EPT (and
> >> the use of rcu in kvm), so all the complexity that is the result of
> >> trying to pack as many list entries into a cache line can be dropped.
> >> Most setups now would have exactly one list entry, which is handled
> >> specially anyway.
> >>
> >> Alternatively, the trick of storing multiple entries in one list entry
> >> can be moved to generic code, it may be useful to others.
> > 
> > Yes, can the lockless list walking code be transformed into generic
> > single-linked list walking? So the correctness can be verified
> > independently, and KVM becomes a simple user of that interface.
> 
> I'm afraid the single-entry list is not as good as we expected. In my
> experience, there're too many entries on rmap, more than 300 sometimes.
> (consider a case that a lib shared by all processes).

single linked list was about moving singly-linked lockless walking
to generic code.

http://www.spinics.net/lists/linux-usb/msg39643.html
http://marc.info/?l=linux-kernel&m=103305635013575&w=3

> > The simpler version is to maintain lockless walk on depth-1 rmap entries
> > (and grab the lock once depth-2 entry is found).
> 
> I still think rmap-lockless is more graceful: soft mmu can benefit
> from it, and it is promising to be used in some mmu-notify functions. :)

OK.

Xiao Guangrong Nov. 28, 2013, 8:32 a.m. UTC | #23
On 11/27/2013 03:58 AM, Marcelo Tosatti wrote:
> On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
>> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
>>> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
>>>> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
>>>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>>>
>>>>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>>>
>>>> <snip complicated stuff about parent_pte>
>>>>
>>>> I'm not really following, but note that parent_pte predates EPT (and
>>>> the use of rcu in kvm), so all the complexity that is the result of
>>>> trying to pack as many list entries into a cache line can be dropped.
>>>> Most setups now would have exactly one list entry, which is handled
>>>> specially anyway.
>>>>
>>>> Alternatively, the trick of storing multiple entries in one list entry
>>>> can be moved to generic code, it may be useful to others.
>>>
>>> Yes, can the lockless list walking code be transformed into generic
>>> single-linked list walking? So the correctness can be verified
>>> independently, and KVM becomes a simple user of that interface.
>>
>> I'm afraid the single-entry list is not as good as we expected. In my
>> experience, there're too many entries on rmap, more than 300 sometimes.
>> (consider a case that a lib shared by all processes).
> 
> single linked list was about moving singly-linked lockless walking
> to generic code.
> 
> http://www.spinics.net/lists/linux-usb/msg39643.html
> http://marc.info/?l=linux-kernel&m=103305635013575&w=3
> 

Oh, I confused "single linked" and "single entry", sorry about that.

Xiao Guangrong Nov. 28, 2013, 8:53 a.m. UTC | #24
On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>>>> Also, there is no guarantee of termination (as long as sptes are
>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>>>> termination for rculist nulls either (a writer can race with a lockless
>>>>>> reader indefinitely, restarting the lockless walk every time).
>>>>>
>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>>>> it's unnecessary to write-protect its sptes. Your idea?
>>>> This idea is based on the fact that the number of rmap is limited by
>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>>>> we can break the rewalk at once, in the case of deleting, we can only
>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
>>>
>>> Please explain in more detail.
>>
>> Okay.
>>
>> My proposal is like this:
>>
>> pte_list_walk_lockless()
>> {
>> restart:
>>
>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
>> +		return;
>>
>> 	code-doing-lockless-walking;
>> 	......
>> }
>>
>> Before do lockless-walking, we check the dirty-bitmap first, if
>> it is set we can simply skip write-protection for the gfn, that
>> is the case that new spte is being added into rmap when we lockless
>> access the rmap.
> 
> The dirty bit could be set after the check.
> 
>> For the case of deleting spte from rmap, the number of entry is limited
>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
> 
> It can shrink and grow while lockless walk is performed.

Yes, indeed.

Hmmm, another idea in my mind to fix this is encoding the position into
the reserved bits of the desc->more pointer, for example:

          +------+    +------+    +------+
rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
          +------+    +------+    +------+

There are 3 descs on the rmap, and:
rmapp = &desc0 | 1UL | 3UL << 50;
desc0->more = desc1 | 2UL << 50;
desc1->more = desc2 | 1UL << 50;
desc2->more = &rmapp | 1UL; (The nulls pointer)

We will walk to the next desc only if the "position" of the current desc
is >= the position of the next desc. That makes sure we always reach the
last desc.

And in order to avoid doing too many "rewalks", we will go to the slow
path (walk while holding the lock) instead once the walk has been retried
more than N times.

Thanks all you guys in thanksgiving day. :)

Xiao Guangrong Dec. 3, 2013, 7:10 a.m. UTC | #25
On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
> On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
>> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
>>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
>>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>>>>> Also, there is no guarantee of termination (as long as sptes are
>>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>>>>> termination for rculist nulls either (a writer can race with a lockless
>>>>>>> reader indefinitely, restarting the lockless walk every time).
>>>>>>
>>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>>>>> it's unnecessary to write-protect its sptes. Your idea?
>>>>> This idea is based on the fact that the number of rmap is limited by
>>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>>>>> we can break the rewalk at once, in the case of deleting, we can only
>>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
>>>>
>>>> Please explain in more detail.
>>>
>>> Okay.
>>>
>>> My proposal is like this:
>>>
>>> pte_list_walk_lockless()
>>> {
>>> restart:
>>>
>>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
>>> +		return;
>>>
>>> 	code-doing-lockless-walking;
>>> 	......
>>> }
>>>
>>> Before do lockless-walking, we check the dirty-bitmap first, if
>>> it is set we can simply skip write-protection for the gfn, that
>>> is the case that new spte is being added into rmap when we lockless
>>> access the rmap.
>>
>> The dirty bit could be set after the check.
>>
>>> For the case of deleting spte from rmap, the number of entry is limited
>>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
>>
>> It can shrink and grow while lockless walk is performed.
> 
> Yes, indeed.
> 
> Hmmm, another idea in my mind to fix this is encoding the position into
> the reserved bits of desc->more pointer, for example:
> 
>           +------+    +------+    +------+
> rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
>           +------+    +------+    +------+
> 
> There are 3 descs on the rmap, and:
> rmapp = &desc0 | 1UL | 3UL << 50;
> desc0->more = desc1 | 2UL << 50;
> desc1->more = desc0 | 1UL << 50
> desc2->more = &rmapp | 1UL; (The nulls pointer)
> 
> We will walk to the next desc only if the "position" of current desc
> is >= the position of next desc. That can make sure we can reach the
> last desc anyway.
> 
> And in order to avoiding doing too many "rewalk", we will goto the
> slow path (do walk with holding the lock) instead when retry the walk
> more that N times.

How about this idea? Or do you guys still prefer the idea of lockless on
the first level?


Marcelo Tosatti Dec. 5, 2013, 1:50 p.m. UTC | #26
On Tue, Dec 03, 2013 at 03:10:48PM +0800, Xiao Guangrong wrote:
> On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
> > On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
> >> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
> >>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> >>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >>>>>>> Also, there is no guarantee of termination (as long as sptes are
> >>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
> >>>>>>> termination for rculist nulls either (a writer can race with a lockless
> >>>>>>> reader indefinitely, restarting the lockless walk every time).
> >>>>>>
> >>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> >>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
> >>>>>> it's unnecessary to write-protect its sptes. Your idea?
> >>>>> This idea is based on the fact that the number of rmap is limited by
> >>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> >>>>> we can break the rewalk at once, in the case of deleting, we can only
> >>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
> >>>>
> >>>> Please explain in more detail.
> >>>
> >>> Okay.
> >>>
> >>> My proposal is like this:
> >>>
> >>> pte_list_walk_lockless()
> >>> {
> >>> restart:
> >>>
> >>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
> >>> +		return;
> >>>
> >>> 	code-doing-lockless-walking;
> >>> 	......
> >>> }
> >>>
> >>> Before do lockless-walking, we check the dirty-bitmap first, if
> >>> it is set we can simply skip write-protection for the gfn, that
> >>> is the case that new spte is being added into rmap when we lockless
> >>> access the rmap.
> >>
> >> The dirty bit could be set after the check.
> >>
> >>> For the case of deleting spte from rmap, the number of entry is limited
> >>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
> >>
> >> It can shrink and grow while lockless walk is performed.
> > 
> > Yes, indeed.
> > 
> > Hmmm, another idea in my mind to fix this is encoding the position into
> > the reserved bits of desc->more pointer, for example:
> > 
> >           +------+    +------+    +------+
> > rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
> >           +------+    +------+    +------+
> > 
> > There are 3 descs on the rmap, and:
> > rmapp = &desc0 | 1UL | 3UL << 50;
> > desc0->more = desc1 | 2UL << 50;
> > desc1->more = desc0 | 1UL << 50
> > desc2->more = &rmapp | 1UL; (The nulls pointer)
> > 
> > We will walk to the next desc only if the "position" of current desc
> > is >= the position of next desc. That can make sure we can reach the
> > last desc anyway.
> > 
> > And in order to avoiding doing too many "rewalk", we will goto the
> > slow path (do walk with holding the lock) instead when retry the walk
> > more that N times.
> 
> How about this idea? Or you guys still prefer to the idea of lockless on
> first-level?

Xiao,

Is it not the case that simply moving to the slow path once a maximum
number of rewalks has been reached is enough? (looks like a good solution).

Please move lockless rcu walking code to generic code where it belongs.

Xiao Guangrong Dec. 5, 2013, 3:30 p.m. UTC | #27
On Dec 5, 2013, at 9:50 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Tue, Dec 03, 2013 at 03:10:48PM +0800, Xiao Guangrong wrote:
>> On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
>>> On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
>>>> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
>>>>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
>>>>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>>>>>>> Also, there is no guarantee of termination (as long as sptes are
>>>>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>>>>>>> termination for rculist nulls either (a writer can race with a lockless
>>>>>>>>> reader indefinitely, restarting the lockless walk every time).
>>>>>>>> 
>>>>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>>>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>>>>>>> it's unnecessary to write-protect its sptes. Your idea?
>>>>>>> This idea is based on the fact that the number of rmap is limited by
>>>>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>>>>>>> we can break the rewalk at once, in the case of deleting, we can only
>>>>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
>>>>>> 
>>>>>> Please explain in more detail.
>>>>> 
>>>>> Okay.
>>>>> 
>>>>> My proposal is like this:
>>>>> 
>>>>> pte_list_walk_lockless()
>>>>> {
>>>>> restart:
>>>>> 
>>>>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
>>>>> +		return;
>>>>> 
>>>>> 	code-doing-lockless-walking;
>>>>> 	......
>>>>> }
>>>>> 
>>>>> Before do lockless-walking, we check the dirty-bitmap first, if
>>>>> it is set we can simply skip write-protection for the gfn, that
>>>>> is the case that new spte is being added into rmap when we lockless
>>>>> access the rmap.
>>>> 
>>>> The dirty bit could be set after the check.
>>>> 
>>>>> For the case of deleting spte from rmap, the number of entry is limited
>>>>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
>>>> 
>>>> It can shrink and grow while lockless walk is performed.
>>> 
>>> Yes, indeed.
>>> 
>>> Hmmm, another idea in my mind to fix this is encoding the position into
>>> the reserved bits of desc->more pointer, for example:
>>> 
>>>         +------+    +------+    +------+
>>> rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
>>>         +------+    +------+    +------+
>>> 
>>> There are 3 descs on the rmap, and:
>>> rmapp = &desc0 | 1UL | 3UL << 50;
>>> desc0->more = desc1 | 2UL << 50;
>>> desc1->more = desc0 | 1UL << 50
>>> desc2->more = &rmapp | 1UL; (The nulls pointer)
>>> 
>>> We will walk to the next desc only if the "position" of current desc
>>> is >= the position of next desc. That can make sure we can reach the
>>> last desc anyway.
>>> 
>>> And in order to avoiding doing too many "rewalk", we will goto the
>>> slow path (do walk with holding the lock) instead when retry the walk
>>> more that N times.
>> 
>> How about this idea? Or you guys still prefer to the idea of lockless on
>> first-level?
> 
> Xiao,
> 
> Is it not the case that simply moving to the slow path once a maximum of
> rewalks has been reached enough? (looks a like a good solution).

In some cases, the lockless walker will walk the descs endlessly without
any rewalk; consider this case:

there are two descs, desc1 and desc2, where desc2 is pointed to by desc1->next:
desc1->next = desc2.

CPU 0                                                                            CPU 1

lockless walk on desc1
                                                                 deleting desc1 from the rmap
lockless walk on desc2 (desc1->next)
                                                                delete desc2 from the rmap
                                                                add desc1
                                                                add desc2, then desc2->next = desc1

lockless walk on desc1
                                                               delete desc2
                                                               delete desc1
                                                               add desc2
                                                               add desc1; the desc1->next = desc2
lockless walk on desc2

……

Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.


> 
> Please move lockless rcu walking code to generic code where it belongs.

Okay.



Marcelo Tosatti Dec. 6, 2013, 12:15 a.m. UTC | #28
On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
> > Is it not the case that simply moving to the slow path once a maximum of
> > rewalks has been reached enough? (looks a like a good solution).
> 
> In some cases, the lockless walker will do endless-walking on desc and
> without rewalk, consider this case:
> 
> there are two descs: desc1 and desc2 who is pointed by desc1->next:
> desc1->next = desc2.
> 
> CPU 0                                                                            CPU 1
> 
> lockless walk on desc1
>                                                                  deleting desc1 from the rmap
> lockless walk on desc2 (desc1->next)
>                                                                 delete desc2 from the rmap
>                                                                 add desc1
>                                                                 add desc2, then desc2->next = desc1
> 
> lockless walk on desc1
>                                                                delete desc2
>                                                                delete desc1
>                                                                add desc2
>                                                                add desc1; the desc1->next = desc2
> lockless walk on desc2
> 
> ……
> 
> Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.

Ouch, this is the sort of thing that is worrisome. Do you still think
that having the benefit for shadow is important enough to justify this
complexity?

Also, another separate possibility is to use the dirty bit (which recent
Intel processors support). Then there are no faults at all to handle.


Marcelo Tosatti Dec. 6, 2013, 12:22 a.m. UTC | #29
On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
> In some cases, the lockless walker will do endless-walking on desc and
> without rewalk, consider this case:
> 
> there are two descs: desc1 and desc2 who is pointed by desc1->next:
> desc1->next = desc2.
> 
> CPU 0                                                                            CPU 1
> 
> lockless walk on desc1
>                                                                  deleting desc1 from the rmap
> lockless walk on desc2 (desc1->next)
>                                                                 delete desc2 from the rmap
>                                                                 add desc1
>                                                                 add desc2, then desc2->next = desc1
> 
> lockless walk on desc1
>                                                                delete desc2
>                                                                delete desc1
>                                                                add desc2
>                                                                add desc1; the desc1->next = desc2
> lockless walk on desc2
> 
> ……
> 
> Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.

The counter can be local to the walker. If it's more than the maximum
rmap size, break.

(but still, please consider carefully whether lockless list walking is
really necessary or can be avoided).

Xiao Guangrong Dec. 10, 2013, 6:58 a.m. UTC | #30
On 12/06/2013 08:22 AM, Marcelo Tosatti wrote:
> On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
>> In some cases, the lockless walker will do endless-walking on desc and
>> without rewalk, consider this case:
>>
>> there are two descs: desc1 and desc2 who is pointed by desc1->next:
>> desc1->next = desc2.
>>
>> CPU 0                                                                            CPU 1
>>
>> lockless walk on desc1
>>                                                                  deleting desc1 from the rmap
>> lockless walk on desc2 (desc1->next)
>>                                                                 delete desc2 from the rmap
>>                                                                 add desc1
>>                                                                 add desc2, then desc2->next = desc1
>>
>> lockless walk on desc1
>>                                                                delete desc2
>>                                                                delete desc1
>>                                                                add desc2
>>                                                                add desc1; the desc1->next = desc2
>> lockless walk on desc2
>>
>> ……
>>
>> Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.
> 
> The counter can be local to the walker. If it's more than the maximum
> rmap size, break.
> 
> (but still, please consider carefully whether lockless list walking is
> really necessary or can be avoided).

Yep, Marcelo, you're right.

After thinking more, I do not have any idea to simplify this. Your approach
(lockless on the first level) seems a better solution. Will do it based on
that approach.

Thanks!



Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cce039..4687329 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@  static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 	return level - 1;
 }
 
+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
+{
+	unsigned long marker;
+
+	marker = (unsigned long)pte_list | 1UL;
+	desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+	return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+	return (unsigned long *)((unsigned long)desc & ~1);
+}
+
 static int __find_first_free(struct pte_list_desc *desc)
 {
 	int i;
@@ -951,7 +969,7 @@  static int count_spte_number(struct pte_list_desc *desc)
 
 	first_free = __find_first_free(desc);
 
-	for (desc_num = 0; desc->more; desc = desc->more)
+	for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
 		desc_num++;
 
 	return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@  static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		desc = mmu_alloc_pte_list_desc(vcpu);
 		desc->sptes[0] = (u64 *)*pte_list;
 		desc->sptes[1] = spte;
+		desc_mark_nulls(pte_list, desc);
 		*pte_list = (unsigned long)desc | 1;
 		return 1;
 	}
@@ -1030,7 +1049,7 @@  pte_list_desc_remove_entry(unsigned long *pte_list,
 		/*
 		 * Only one entry existing but still use a desc to store it?
 		 */
-		WARN_ON(!next_desc);
+		WARN_ON(desc_is_a_nulls(next_desc));
 
 		mmu_free_pte_list_desc(first_desc);
 		*pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@  pte_list_desc_remove_entry(unsigned long *pte_list,
 	 * Only one entry in this desc, move the entry to the head
 	 * then the desc can be freed.
 	 */
-	if (!first_desc->sptes[1] && !first_desc->more) {
+	if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
 		*pte_list = (unsigned long)first_desc->sptes[0];
 		mmu_free_pte_list_desc(first_desc);
 	}
@@ -1070,7 +1089,7 @@  static void pte_list_remove(u64 *spte, unsigned long *pte_list)
 
 	rmap_printk("pte_list_remove:  %p many->many\n", spte);
 	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-	while (desc) {
+	while (!desc_is_a_nulls(desc)) {
 		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
 			if (desc->sptes[i] == spte) {
 				pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@  static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
 		return fn((u64 *)*pte_list);
 
 	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-	while (desc) {
+	while (!desc_is_a_nulls(desc)) {
 		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
 			fn(desc->sptes[i]);
 		desc = desc->more;
 	}
+
+	WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@  static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
 
 	iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
 	iter->pos = 0;
+	WARN_ON(desc_is_a_nulls(iter->desc));
 	return iter->desc->sptes[iter->pos];
 }
 
@@ -1204,7 +1226,8 @@  static u64 *rmap_get_next(struct rmap_iterator *iter)
 				return sptep;
 		}
 
-		iter->desc = iter->desc->more;
+		iter->desc = desc_is_a_nulls(iter->desc->more) ?
+				NULL : iter->desc->more;
 
 		if (iter->desc) {
 			iter->pos = 0;