Message ID | 1382534973-13197-8-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
---|---
State | New, archived
On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
> It likes nulls list and we use the pte-list as the nulls which can help us to
> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
> if that happened
>
> kvm->slots_lock is held when we do lockless walking that prevents rmap
> is reused (free rmap need to hold that lock) so that we can not see the same
> nulls used on different rmaps
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

How about simplified lockless walk on the slot while rmapp entry
contains a single spte? (which should be the case with two-dimensional
paging).

That is, grab the lock when finding a rmap with more than one spte in
it (and then keep it locked until the end).

For example, nothing prevents lockless walker to move into some
parent_ptes chain, right?

Also, there is no guarantee of termination (as long as sptes are
deleted with the correct timing). BTW, can't see any guarantee of
termination for rculist nulls either (a writer can race with a lockless
reader indefinitely, restarting the lockless walk every time).

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>> It likes nulls list and we use the pte-list as the nulls which can help us to
>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>> if that happened
>>
>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>> is reused (free rmap need to hold that lock) so that we can not see the same
>> nulls used on different rmaps
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>
> How about simplified lockless walk on the slot while rmapp entry
> contains a single spte? (which should be the case with two-dimensional
> paging).
>
> That is, grab the lock when finding a rmap with more than one spte in
> it (and then keep it locked until the end).

Hmm… that isn't straightforward and more complex than the approach
in this patchset. Also it can drop the improvement for shadow mmu that
gets great improvement by this patchset.

> For example, nothing prevents lockless walker to move into some
> parent_ptes chain, right?

No.

The nulls can help us to detect this case, for parent_ptes, the nulls points
to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
is no chance that the "rmap" is used as shadow page when slot-lock is held.

> Also, there is no guarantee of termination (as long as sptes are
> deleted with the correct timing). BTW, can't see any guarantee of
> termination for rculist nulls either (a writer can race with a lockless
> reader indefinately, restarting the lockless walk every time).

Hmm, that can be avoided by checking dirty-bitmap before rewalk,
that means, if the dirty-bitmap has been set during lockless write-protection,
it's unnecessary to write-protect its sptes. Your idea?

But… do we really need to care it. :(
On 11/25/2013 02:11 PM, Xiao Guangrong wrote:
>
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
>> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>>> It likes nulls list and we use the pte-list as the nulls which can help us to
>>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>>> if that happened
>>>
>>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>>> is reused (free rmap need to hold that lock) so that we can not see the same
>>> nulls used on different rmaps
>>>
>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>
>> How about simplified lockless walk on the slot while rmapp entry
>> contains a single spte? (which should be the case with two-dimensional
>> paging).
>>
>> That is, grab the lock when finding a rmap with more than one spte in
>> it (and then keep it locked until the end).
>
> Hmm… that isn't straightforward and more complex than the approach
> in this patchset. Also it can drop the improvement for shadow mmu that
> gets great improvement by this patchset.
>
>> For example, nothing prevents lockless walker to move into some
>> parent_ptes chain, right?
>
> No.
>
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the "rmap" is used as shadow page when slot-lock is held.
>
>> Also, there is no guarantee of termination (as long as sptes are
>> deleted with the correct timing). BTW, can't see any guarantee of
>> termination for rculist nulls either (a writer can race with a lockless
>> reader indefinately, restarting the lockless walk every time).
>
> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> that means, if the dirty-bitmap has been set during lockless write-protection,
> it's unnecessary to write-protect its sptes. Your idea?

This idea is based on the fact that the number of rmap is limited by
RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
we can break the rewalk at once, in the case of deleting, we can only
rewalk RMAP_RECYCLE_THRESHOLD times.
On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
> Also, there is no guarantee of termination (as long as sptes are
> deleted with the correct timing). BTW, can't see any guarantee of
> termination for rculist nulls either (a writer can race with a lockless
> reader indefinately, restarting the lockless walk every time).

What's an rculist null? rculists have regular termination conditions,
they'll reach the end (also head, due to circular etc..) in N steps,
where N is the number of elements.

Of course you can keep adding elements to protract this situation, but
you'll run out of memory eventually -- you also have to make sure to
insert them after the element being read, otherwise the iteration will
simply miss them.

Deleting an element preserves the element itself -- it has to be
RCU-freed to be part of an rculist, and the element also preserves its
fwd link, so any iterator will either not see the element, or if they're
at the element, they'll continue their iteration as normal (rculist
doesn't have backward iterators).

A deleted element may not be re-used before an RCU grace period has
lapsed. Same as for freeing such an element. So if you want to move an
rculist element you need to:

	list_del_rcu()
	synchronize_rcu();
	list_add_rcu();

Or use the list_splice_init_rcu() primitive which also explicitly takes
a @sync argument.
On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
> >
> > For example, nothing prevents lockless walker to move into some
> > parent_ptes chain, right?
>
> No.
>
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the "rmap" is used as shadow page when slot-lock is held.
>
But meanwhile we will write protect non-last level sptes, no? Better to
create separate slab caches for rmap and parent_ptes lists.

--
	Gleb.
On 11/25/2013 06:19 PM, Gleb Natapov wrote:
> On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>>>
>>> For example, nothing prevents lockless walker to move into some
>>> parent_ptes chain, right?
>>
>> No.
>>
>> The nulls can help us to detect this case, for parent_ptes, the nulls points
>> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
>> is no chance that the "rmap" is used as shadow page when slot-lock is held.
>>
> But meanwhile we will write protect non-last level sptes, no? Better to

It will meet the non-last sptes but does not write-protect them, since
we do the is_last_spte() check before cmpxchg.

> create separate slab caches for rmap and parent_ptes lists.

Yes, this is a good idea. Thank you, Gleb!
Hi Peter,

On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
>> Also, there is no guarantee of termination (as long as sptes are
>> deleted with the correct timing). BTW, can't see any guarantee of
>> termination for rculist nulls either (a writer can race with a lockless
>> reader indefinately, restarting the lockless walk every time).
>
> What's an rculist null?

I guess Marcelo was talking about rculist_nulls.h
(Documentation/RCU/rculist_nulls.txt).

> rculists have regular termination conditions,
> they'll reach the end (also head, due to circular etc..) in N steps,
> where N is the number of elements.
>
> Of course you can keep adding elements to protract this situation, but
> you'll run out of memory eventually -- you also have to make sure to
> insert them after the element being read, otherwise the iteration will
> simply miss them.
>
> Deleting an element preserves the element itself -- it has to be
> RCU-freed to be part of an rculist, and the element also preserves its
> fwd link, so any iterator will either not see the element, or if they're
> at the element, they'll continue their iteration as normal (rculist
> doesn't have backward iterators).
>
> A deleted element may not be re-used before an RCU grace period has
> lapsed. Same as for freeing such an element. So if you want to move an
> rculist element you need to:
>
> 	list_del_rcu()
> 	synchronize_rcu();
> 	list_add_rcu();
>
> Or use the list_splice_init_rcu() primitive which also explicitly takes
> a @sync argument.

Thanks for your detailed explanation, Peter!

What about if the element is allocated from a SLAB_DESTROY_BY_RCU slab?
That means the element may be reused while we do the iteration.
On Mon, Nov 25, 2013 at 06:59:24PM +0800, Xiao Guangrong wrote:
>
> Hi Peter,
>
> On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
> > On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
> > > Also, there is no guarantee of termination (as long as sptes are
> > > deleted with the correct timing). BTW, can't see any guarantee of
> > > termination for rculist nulls either (a writer can race with a lockless
> > > reader indefinately, restarting the lockless walk every time).
> >
> > What's an rculist null?
>
> I guess Marcelo was talking about rculist_nulls.h
> (Documentation/RCU/rculist_nulls.txt).

Oh, let me have a look, I don't think I've really looked at that yet.

> > rculists have regular termination conditions,
> > they'll reach the end (also head, due to circular etc..) in N steps,
> > where N is the number of elements.
> >
> > Of course you can keep adding elements to protract this situation, but
> > you'll run out of memory eventually -- you also have to make sure to
> > insert them after the element being read, otherwise the iteration will
> > simply miss them.
> >
> > Deleting an element preserves the element itself -- it has to be
> > RCU-freed to be part of an rculist, and the element also preserves its
> > fwd link, so any iterator will either not see the element, or if they're
> > at the element, they'll continue their iteration as normal (rculist
> > doesn't have backward iterators).
> >
> > A deleted element may not be re-used before an RCU grace period has
> > lapsed. Same as for freeing such an element. So if you want to move an
> > rculist element you need to:
> >
> > 	list_del_rcu()
> > 	synchronize_rcu();
> > 	list_add_rcu();
> >
> > Or use the list_splice_init_rcu() primitive which also explicitly takes
> > a @sync argument.
>
> Thanks for your detailed explanation, Peter!
>
> What about if the element is allocated from a SLAB_DESTROY_BY_RCU slab?
> That means the element may be reused while we do the iteration.

Then it's broken, SLAB_DESTROY_BY_RCU rarely does what people expect it
to; which is why I wrote that honkin' huge comment right next to it.
On Mon, Nov 25, 2013 at 12:05:00PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 06:59:24PM +0800, Xiao Guangrong wrote:
> > I guess Marcelo was talking about rculist_nulls.h
> > (Documentation/RCU/rculist_nulls.txt).
>
> Oh, let me have a look, I don't think I've really looked at that yet.

Egads, that's far too clever ;-) Yes, if that's what Marcelo was
referring to, he's right, it doesn't have a proper termination condition
in the face of unlimited modification.

And it can indeed use SLAB_DESTROY_BY_RCU as you alluded to -- although
the documentation is a tad scarce explaining why this is so.
On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
<xiaoguangrong@linux.vnet.ibm.com> wrote:
>
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

<snip complicated stuff about parent_pte>

I'm not really following, but note that parent_pte predates EPT (and
the use of rcu in kvm), so all the complexity that is the result of
trying to pack as many list entries into a cache line can be dropped.
Most setups now would have exactly one list entry, which is handled
specially anyway.

Alternatively, the trick of storing multiple entries in one list entry
can be moved to generic code, it may be useful to others.
On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> > On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
> > > It likes nulls list and we use the pte-list as the nulls which can help us to
> > > detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
> > > if that happened
> > >
> > > kvm->slots_lock is held when we do lockless walking that prevents rmap
> > > is reused (free rmap need to hold that lock) so that we can not see the same
> > > nulls used on different rmaps
> > >
> > > Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >
> > How about simplified lockless walk on the slot while rmapp entry
> > contains a single spte? (which should be the case with two-dimensional
> > paging).
> >
> > That is, grab the lock when finding a rmap with more than one spte in
> > it (and then keep it locked until the end).
>
> Hmm… that isn't straightforward and more complex than the approach
> in this patchset. Also it can drop the improvement for shadow mmu that
> gets great improvement by this patchset.

It is not more complex, since it would remove list lockless walk. Only
the spte pointer at rmap[spte] is accessed without a lock. It's much,
much simpler.

> > For example, nothing prevents lockless walker to move into some
> > parent_ptes chain, right?
>
> No.
>
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the "rmap" is used as shadow page when slot-lock is held.

The SLAB cache is the same, so entries can be reused. What prevents
a desc entry living in slot.arch.rmap to be freed and reused by a
parent_ptes desc?

> > Also, there is no guarantee of termination (as long as sptes are
> > deleted with the correct timing). BTW, can't see any guarantee of
> > termination for rculist nulls either (a writer can race with a lockless
> > reader indefinately, restarting the lockless walk every time).
>
> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> that means, if the dirty-bitmap has been set during lockless write-protection,
> it's unnecessary to write-protect its sptes. Your idea?
>
> But… do we really need to care it. :(

See my reply to Avi.
On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> >
> > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> <snip complicated stuff about parent_pte>
>
> I'm not really following, but note that parent_pte predates EPT (and
> the use of rcu in kvm), so all the complexity that is the result of
> trying to pack as many list entries into a cache line can be dropped.
> Most setups now would have exactly one list entry, which is handled
> specially antyway.
>
> Alternatively, the trick of storing multiple entries in one list entry
> can be moved to generic code, it may be useful to others.

Yes, can the lockless list walking code be transformed into generic
single-linked list walking? So the correctness can be verified
independently, and KVM becomes a simple user of that interface.

The simpler version is to maintain lockless walk on depth-1 rmap entries
(and grab the lock once depth-2 entry is found).
On Mon, Nov 25, 2013 at 12:23:51PM -0200, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> > On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> > <xiaoguangrong@linux.vnet.ibm.com> wrote:
> > >
> > > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > <snip complicated stuff about parent_pte>
> >
> > I'm not really following, but note that parent_pte predates EPT (and
> > the use of rcu in kvm), so all the complexity that is the result of
> > trying to pack as many list entries into a cache line can be dropped.
> > Most setups now would have exactly one list entry, which is handled
> > specially antyway.
> >
> > Alternatively, the trick of storing multiple entries in one list entry
> > can be moved to generic code, it may be useful to others.
>
> Yes, can the lockless list walking code be transformed into generic
> single-linked list walking? So the correctness can be verified
> independently, and KVM becomes a simple user of that interface.
>
The code will become simpler but the problem of never ending walk of
rculist_nulls will remain.

> The simpler version is to maintain lockless walk on depth-1 rmap entries
> (and grab the lock once depth-2 entry is found).

And release it between each rmap walk or at the very end of write
protect?

--
	Gleb.
On Mon, Nov 25, 2013 at 04:29:28PM +0200, Gleb Natapov wrote:
> On Mon, Nov 25, 2013 at 12:23:51PM -0200, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> > > On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> > > <xiaoguangrong@linux.vnet.ibm.com> wrote:
> > > >
> > > > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > <snip complicated stuff about parent_pte>
> > >
> > > I'm not really following, but note that parent_pte predates EPT (and
> > > the use of rcu in kvm), so all the complexity that is the result of
> > > trying to pack as many list entries into a cache line can be dropped.
> > > Most setups now would have exactly one list entry, which is handled
> > > specially antyway.
> > >
> > > Alternatively, the trick of storing multiple entries in one list entry
> > > can be moved to generic code, it may be useful to others.
> >
> > Yes, can the lockless list walking code be transformed into generic
> > single-linked list walking? So the correctness can be verified
> > independently, and KVM becomes a simple user of that interface.
> >
> The code will become simpler but the problem of never ending walk of
> rculist_nulls will remain.

Yes. Hopefully it can be fixed in some way.

> > The simpler version is to maintain lockless walk on depth-1 rmap entries
> > (and grab the lock once depth-2 entry is found).
> And release it between each rmap walk or at the very end of write
> protect?

Or keep it locked until 10 consecutive sptes with depth 1 are found.
On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >> Also, there is no guarantee of termination (as long as sptes are
> >> deleted with the correct timing). BTW, can't see any guarantee of
> >> termination for rculist nulls either (a writer can race with a lockless
> >> reader indefinately, restarting the lockless walk every time).
> >
> > Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> > that means, if the dirty-bitmap has been set during lockless write-protection,
> > it's unnecessary to write-protect its sptes. Your idea?
> This idea is based on the fact that the number of rmap is limited by
> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> we can break the rewalk at once, in the case of deleting, we can only
> rewalk RMAP_RECYCLE_THRESHOLD times.

Please explain in more detail.
On 11/25/2013 10:08 PM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>>
>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>>> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>>>> It likes nulls list and we use the pte-list as the nulls which can help us to
>>>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>>>> if that happened
>>>>
>>>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>>>> is reused (free rmap need to hold that lock) so that we can not see the same
>>>> nulls used on different rmaps
>>>>
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>>
>>> How about simplified lockless walk on the slot while rmapp entry
>>> contains a single spte? (which should be the case with two-dimensional
>>> paging).
>>>
>>> That is, grab the lock when finding a rmap with more than one spte in
>>> it (and then keep it locked until the end).
>>
>> Hmm… that isn't straightforward and more complex than the approach
>> in this patchset. Also it can drop the improvement for shadow mmu that
>> gets great improvement by this patchset.
>
> It is not more complex, since it would remove list lockless walk. Only
> the spte pointer at rmap[spte] is accessed without a lock. It's much,
> much simpler.
>
>>> For example, nothing prevents lockless walker to move into some
>>> parent_ptes chain, right?
>>
>> No.
>>
>> The nulls can help us to detect this case, for parent_ptes, the nulls points
>> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
>> is no chance that the "rmap" is used as shadow page when slot-lock is held.
>
> The SLAB cache is the same, so entries can be reused. What prevents
> a desc entry living in slot.arch.rmap to be freed and reused by a
> parent_ptes desc?
>

We will check is_last_spte(), so all the sptes on parent_ptes should
fail that check.

And Gleb suggested to use a separate slab for rmap, that should be
excellent.
On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
>> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>
>>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>> <snip complicated stuff about parent_pte>
>>
>> I'm not really following, but note that parent_pte predates EPT (and
>> the use of rcu in kvm), so all the complexity that is the result of
>> trying to pack as many list entries into a cache line can be dropped.
>> Most setups now would have exactly one list entry, which is handled
>> specially antyway.
>>
>> Alternatively, the trick of storing multiple entries in one list entry
>> can be moved to generic code, it may be useful to others.
>
> Yes, can the lockless list walking code be transformed into generic
> single-linked list walking? So the correctness can be verified
> independently, and KVM becomes a simple user of that interface.

I'm afraid the single-entry list is not so good as we expected. In my
experience, there are too many entries on rmap, more than 300 sometimes
(consider a case that a lib is shared by all processes).

> The simpler version is to maintain lockless walk on depth-1 rmap entries
> (and grab the lock once depth-2 entry is found).

I still think rmap-lockless is more graceful: soft mmu can get benefit
from it, and it is promising to be used in some mmu-notify functions. :)
On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>> Also, there is no guarantee of termination (as long as sptes are
>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>> termination for rculist nulls either (a writer can race with a lockless
>>>> reader indefinately, restarting the lockless walk every time).
>>>
>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>> it's unnecessary to write-protect its sptes. Your idea?
>> This idea is based on the fact that the number of rmap is limited by
>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>> we can break the rewalk at once, in the case of deleting, we can only
>> rewalk RMAP_RECYCLE_THRESHOLD times.
>
> Please explain in more detail.

Okay.

My proposal is like this:

pte_list_walk_lockless()
{
restart:

+	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
+		return;

	code-doing-lockless-walking;
	......
}

Before doing the lockless walk, we check the dirty-bitmap first; if it
is set, we can simply skip write-protection for the gfn. That covers
the case where a new spte is being added into the rmap while we access
the rmap locklessly.

For the case of deleting a spte from the rmap, the number of entries is
limited by RMAP_RECYCLE_THRESHOLD, so the re-walk is not endless.
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >>>> Also, there is no guarantee of termination (as long as sptes are
> >>>> deleted with the correct timing). BTW, can't see any guarantee of
> >>>> termination for rculist nulls either (a writer can race with a lockless
> >>>> reader indefinately, restarting the lockless walk every time).
> >>>
> >>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> >>> that means, if the dirty-bitmap has been set during lockless write-protection,
> >>> it's unnecessary to write-protect its sptes. Your idea?
> >> This idea is based on the fact that the number of rmap is limited by
> >> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> >> we can break the rewalk at once, in the case of deleting, we can only
> >> rewalk RMAP_RECYCLE_THRESHOLD times.
> >
> > Please explain in more detail.
>
> Okay.
>
> My proposal is like this:
>
> pte_list_walk_lockless()
> {
> restart:
>
> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
> +		return;
>
> 	code-doing-lockless-walking;
> 	......
> }
>
> Before doing the lockless walk, we check the dirty-bitmap first; if it
> is set, we can simply skip write-protection for the gfn. That covers
> the case where a new spte is being added into the rmap while we access
> the rmap locklessly.
>
> For the case of deleting a spte from the rmap, the number of entries is
> limited by RMAP_RECYCLE_THRESHOLD, so the re-walk is not endless.

The point is that the rmap entry you are inspecting can be constantly
deleted and added to the beginning of some other list, so the code that
traverses the list will never reach the end.

--
	Gleb.
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> >> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> >> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> >>>
> >>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >>
> >> <snip complicated stuff about parent_pte>
> >>
> >> I'm not really following, but note that parent_pte predates EPT (and
> >> the use of rcu in kvm), so all the complexity that is the result of
> >> trying to pack as many list entries into a cache line can be dropped.
> >> Most setups now would have exactly one list entry, which is handled
> >> specially antyway.
> >>
> >> Alternatively, the trick of storing multiple entries in one list entry
> >> can be moved to generic code, it may be useful to others.
> >
> > Yes, can the lockless list walking code be transformed into generic
> > single-linked list walking? So the correctness can be verified
> > independently, and KVM becomes a simple user of that interface.
>
> I'm afraid the single-entry list is not so good as we expected. In my
> experience, there are too many entries on rmap, more than 300 sometimes
> (consider a case that a lib is shared by all processes).
>
This is without EPT though, and non-EPT HW is not a performance king
anyway. Nested EPT uses shadow paging too, but VMs hardly share any
pages. With KSM they may, though.

--
	Gleb.
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: > On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: > > On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: > >>>> Also, there is no guarantee of termination (as long as sptes are > >>>> deleted with the correct timing). BTW, can't see any guarantee of > >>>> termination for rculist nulls either (a writer can race with a lockless > >>>> reader indefinately, restarting the lockless walk every time). > >>> > >>> Hmm, that can be avoided by checking dirty-bitmap before rewalk, > >>> that means, if the dirty-bitmap has been set during lockless write-protection, > >>> it's unnecessary to write-protect its sptes. Your idea? > >> This idea is based on the fact that the number of rmap is limited by > >> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap, > >> we can break the rewalk at once, in the case of deleting, we can only > >> rewalk RMAP_RECYCLE_THRESHOLD times. > > > > Please explain in more detail. > > Okay. > > My proposal is like this: > > pte_list_walk_lockless() > { > restart: > > + if (__test_bit(slot->arch.dirty_bitmap, gfn-index)) > + return; > > code-doing-lockless-walking; > ...... > } > > Before do lockless-walking, we check the dirty-bitmap first, if > it is set we can simply skip write-protection for the gfn, that > is the case that new spte is being added into rmap when we lockless > access the rmap. The dirty bit could be set after the check. > For the case of deleting spte from rmap, the number of entry is limited > by RMAP_RECYCLE_THRESHOLD, that is not endlessly. It can shrink and grow while lockless walk is performed.
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote: > On 11/25/2013 10:23 PM, Marcelo Tosatti wrote: > > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote: > >> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong > >> <xiaoguangrong@linux.vnet.ibm.com> wrote: > >>> > >>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > >> > >> <snip complicated stuff about parent_pte> > >> > >> I'm not really following, but note that parent_pte predates EPT (and > >> the use of rcu in kvm), so all the complexity that is the result of > >> trying to pack as many list entries into a cache line can be dropped. > >> Most setups now would have exactly one list entry, which is handled > >> specially antyway. > >> > >> Alternatively, the trick of storing multiple entries in one list entry > >> can be moved to generic code, it may be useful to others. > > > > Yes, can the lockless list walking code be transformed into generic > > single-linked list walking? So the correctness can be verified > > independently, and KVM becomes a simple user of that interface. > > I'am afraid the signle-entry list is not so good as we expected. In my > experience, there're too many entries on rmap, more than 300 sometimes. > (consider a case that a lib shared by all processes). single linked list was about moving singly-linked lockless walking to generic code. http://www.spinics.net/lists/linux-usb/msg39643.html http://marc.info/?l=linux-kernel&m=103305635013575&w=3 > > The simpler version is to maintain lockless walk on depth-1 rmap entries > > (and grab the lock once depth-2 entry is found). > > I still think rmap-lockless is more graceful: soft mmu can get benefit > from it also it is promising to be used in some mmu-notify functions. :) OK.
On 11/27/2013 03:58 AM, Marcelo Tosatti wrote: > On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote: >> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote: >>> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote: >>>> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong >>>> <xiaoguangrong@linux.vnet.ibm.com> wrote: >>>>> >>>>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote: >>>> >>>> <snip complicated stuff about parent_pte> >>>> >>>> I'm not really following, but note that parent_pte predates EPT (and >>>> the use of rcu in kvm), so all the complexity that is the result of >>>> trying to pack as many list entries into a cache line can be dropped. >>>> Most setups now would have exactly one list entry, which is handled >>>> specially antyway. >>>> >>>> Alternatively, the trick of storing multiple entries in one list entry >>>> can be moved to generic code, it may be useful to others. >>> >>> Yes, can the lockless list walking code be transformed into generic >>> single-linked list walking? So the correctness can be verified >>> independently, and KVM becomes a simple user of that interface. >> >> I'am afraid the signle-entry list is not so good as we expected. In my >> experience, there're too many entries on rmap, more than 300 sometimes. >> (consider a case that a lib shared by all processes). > > single linked list was about moving singly-linked lockless walking > to generic code. > > http://www.spinics.net/lists/linux-usb/msg39643.html > http://marc.info/?l=linux-kernel&m=103305635013575&w=3 > Oh, i confused "single linked" and "single entry", sorry about that.
On 11/27/2013 03:31 AM, Marcelo Tosatti wrote: > On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: >> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: >>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: >>>>>> Also, there is no guarantee of termination (as long as sptes are >>>>>> deleted with the correct timing). BTW, can't see any guarantee of >>>>>> termination for rculist nulls either (a writer can race with a lockless >>>>>> reader indefinately, restarting the lockless walk every time). >>>>> >>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk, >>>>> that means, if the dirty-bitmap has been set during lockless write-protection, >>>>> it's unnecessary to write-protect its sptes. Your idea? >>>> This idea is based on the fact that the number of rmap is limited by >>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap, >>>> we can break the rewalk at once, in the case of deleting, we can only >>>> rewalk RMAP_RECYCLE_THRESHOLD times. >>> >>> Please explain in more detail. >> >> Okay. >> >> My proposal is like this: >> >> pte_list_walk_lockless() >> { >> restart: >> >> + if (__test_bit(slot->arch.dirty_bitmap, gfn-index)) >> + return; >> >> code-doing-lockless-walking; >> ...... >> } >> >> Before do lockless-walking, we check the dirty-bitmap first, if >> it is set we can simply skip write-protection for the gfn, that >> is the case that new spte is being added into rmap when we lockless >> access the rmap. > > The dirty bit could be set after the check. > >> For the case of deleting spte from rmap, the number of entry is limited >> by RMAP_RECYCLE_THRESHOLD, that is not endlessly. > > It can shrink and grow while lockless walk is performed. Yes, indeed.
Hmmm, another idea in my mind to fix this is encoding the position into the reserved bits of desc->more pointer, for example: +------+ +------+ +------+ rmapp -> |Desc 0| -> |Desc 1| -> |Desc 2| +------+ +------+ +------+ There are 3 descs on the rmap, and: rmapp = &desc0 | 1UL | 3UL << 50; desc0->more = desc1 | 2UL << 50; desc1->more = desc2 | 1UL << 50; desc2->more = &rmapp | 1UL; (The nulls pointer) We will walk to the next desc only if the "position" of current desc is >= the position of next desc. That can make sure we can reach the last desc anyway. And in order to avoid doing too many "rewalk", we will goto the slow path (do walk with holding the lock) instead when retry the walk more than N times. Thanks to all you guys on Thanksgiving day. :)
On 11/28/2013 04:53 PM, Xiao Guangrong wrote: > On 11/27/2013 03:31 AM, Marcelo Tosatti wrote: >> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: >>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: >>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: >>>>>>> Also, there is no guarantee of termination (as long as sptes are >>>>>>> deleted with the correct timing). BTW, can't see any guarantee of >>>>>>> termination for rculist nulls either (a writer can race with a lockless >>>>>>> reader indefinately, restarting the lockless walk every time). >>>>>> >>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk, >>>>>> that means, if the dirty-bitmap has been set during lockless write-protection, >>>>>> it's unnecessary to write-protect its sptes. Your idea? >>>>> This idea is based on the fact that the number of rmap is limited by >>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap, >>>>> we can break the rewalk at once, in the case of deleting, we can only >>>>> rewalk RMAP_RECYCLE_THRESHOLD times. >>>> >>>> Please explain in more detail. >>> >>> Okay. >>> >>> My proposal is like this: >>> >>> pte_list_walk_lockless() >>> { >>> restart: >>> >>> + if (__test_bit(slot->arch.dirty_bitmap, gfn-index)) >>> + return; >>> >>> code-doing-lockless-walking; >>> ...... >>> } >>> >>> Before do lockless-walking, we check the dirty-bitmap first, if >>> it is set we can simply skip write-protection for the gfn, that >>> is the case that new spte is being added into rmap when we lockless >>> access the rmap. >> >> The dirty bit could be set after the check. >> >>> For the case of deleting spte from rmap, the number of entry is limited >>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly. >> >> It can shrink and grow while lockless walk is performed. > > Yes, indeed.
> > Hmmm, another idea in my mind to fix this is encoding the position into > the reserved bits of desc->more pointer, for example: > > +------+ +------+ +------+ > rmapp -> |Desc 0| -> |Desc 1| -> |Desc 2| > +------+ +------+ +------+ > > There are 3 descs on the rmap, and: > rmapp = &desc0 | 1UL | 3UL << 50; > desc0->more = desc1 | 2UL << 50; > desc1->more = desc0 | 1UL << 50 > desc2->more = &rmapp | 1UL; (The nulls pointer) > > We will walk to the next desc only if the "position" of current desc > is >= the position of next desc. That can make sure we can reach the > last desc anyway. > > And in order to avoiding doing too many "rewalk", we will goto the > slow path (do walk with holding the lock) instead when retry the walk > more that N times. How about this idea? Or you guys still prefer to the idea of lockless on first-level?
On Tue, Dec 03, 2013 at 03:10:48PM +0800, Xiao Guangrong wrote: > On 11/28/2013 04:53 PM, Xiao Guangrong wrote: > > On 11/27/2013 03:31 AM, Marcelo Tosatti wrote: > >> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: > >>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: > >>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: > >>>>>>> Also, there is no guarantee of termination (as long as sptes are > >>>>>>> deleted with the correct timing). BTW, can't see any guarantee of > >>>>>>> termination for rculist nulls either (a writer can race with a lockless > >>>>>>> reader indefinately, restarting the lockless walk every time). > >>>>>> > >>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk, > >>>>>> that means, if the dirty-bitmap has been set during lockless write-protection, > >>>>>> it's unnecessary to write-protect its sptes. Your idea? > >>>>> This idea is based on the fact that the number of rmap is limited by > >>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap, > >>>>> we can break the rewalk at once, in the case of deleting, we can only > >>>>> rewalk RMAP_RECYCLE_THRESHOLD times. > >>>> > >>>> Please explain in more detail. > >>> > >>> Okay. > >>> > >>> My proposal is like this: > >>> > >>> pte_list_walk_lockless() > >>> { > >>> restart: > >>> > >>> + if (__test_bit(slot->arch.dirty_bitmap, gfn-index)) > >>> + return; > >>> > >>> code-doing-lockless-walking; > >>> ...... > >>> } > >>> > >>> Before do lockless-walking, we check the dirty-bitmap first, if > >>> it is set we can simply skip write-protection for the gfn, that > >>> is the case that new spte is being added into rmap when we lockless > >>> access the rmap. > >> > >> The dirty bit could be set after the check. > >> > >>> For the case of deleting spte from rmap, the number of entry is limited > >>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
> > > > Yes, indeed. > > > > Hmmm, another idea in my mind to fix this is encoding the position into > > the reserved bits of desc->more pointer, for example: > > > > +------+ +------+ +------+ > > rmapp -> |Desc 0| -> |Desc 1| -> |Desc 2| > > +------+ +------+ +------+ > > > > There are 3 descs on the rmap, and: > > rmapp = &desc0 | 1UL | 3UL << 50; > > desc0->more = desc1 | 2UL << 50; > > desc1->more = desc0 | 1UL << 50 > > desc2->more = &rmapp | 1UL; (The nulls pointer) > > > > We will walk to the next desc only if the "position" of current desc > > is >= the position of next desc. That can make sure we can reach the > > last desc anyway. > > > > And in order to avoiding doing too many "rewalk", we will goto the > > slow path (do walk with holding the lock) instead when retry the walk > > more that N times. > > How about this idea? Or you guys still prefer to the idea of lockless on > first-level? Xiao, Is it not the case that simply moving to the slow path once a maximum of rewalks has been reached enough? (looks a like a good solution). Please move lockless rcu walking code to generic code where it belongs.
On Dec 5, 2013, at 9:50 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > On Tue, Dec 03, 2013 at 03:10:48PM +0800, Xiao Guangrong wrote: >> On 11/28/2013 04:53 PM, Xiao Guangrong wrote: >>> On 11/27/2013 03:31 AM, Marcelo Tosatti wrote: >>>> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: >>>>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: >>>>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: >>>>>>>>> Also, there is no guarantee of termination (as long as sptes are >>>>>>>>> deleted with the correct timing). BTW, can't see any guarantee of >>>>>>>>> termination for rculist nulls either (a writer can race with a lockless >>>>>>>>> reader indefinately, restarting the lockless walk every time). >>>>>>>> >>>>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk, >>>>>>>> that means, if the dirty-bitmap has been set during lockless write-protection, >>>>>>>> it's unnecessary to write-protect its sptes. Your idea? >>>>>>> This idea is based on the fact that the number of rmap is limited by >>>>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap, >>>>>>> we can break the rewalk at once, in the case of deleting, we can only >>>>>>> rewalk RMAP_RECYCLE_THRESHOLD times. >>>>>> >>>>>> Please explain in more detail. >>>>> >>>>> Okay. >>>>> >>>>> My proposal is like this: >>>>> >>>>> pte_list_walk_lockless() >>>>> { >>>>> restart: >>>>> >>>>> + if (__test_bit(slot->arch.dirty_bitmap, gfn-index)) >>>>> + return; >>>>> >>>>> code-doing-lockless-walking; >>>>> ...... >>>>> } >>>>> >>>>> Before do lockless-walking, we check the dirty-bitmap first, if >>>>> it is set we can simply skip write-protection for the gfn, that >>>>> is the case that new spte is being added into rmap when we lockless >>>>> access the rmap. >>>> >>>> The dirty bit could be set after the check. >>>> >>>>> For the case of deleting spte from rmap, the number of entry is limited >>>>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
>>>> >>>> It can shrink and grow while lockless walk is performed. >>> >>> Yes, indeed. >>> >>> Hmmm, another idea in my mind to fix this is encoding the position into >>> the reserved bits of desc->more pointer, for example: >>> >>> +------+ +------+ +------+ >>> rmapp -> |Desc 0| -> |Desc 1| -> |Desc 2| >>> +------+ +------+ +------+ >>> >>> There are 3 descs on the rmap, and: >>> rmapp = &desc0 | 1UL | 3UL << 50; >>> desc0->more = desc1 | 2UL << 50; >>> desc1->more = desc0 | 1UL << 50 >>> desc2->more = &rmapp | 1UL; (The nulls pointer) >>> >>> We will walk to the next desc only if the "position" of current desc >>> is >= the position of next desc. That can make sure we can reach the >>> last desc anyway. >>> >>> And in order to avoiding doing too many "rewalk", we will goto the >>> slow path (do walk with holding the lock) instead when retry the walk >>> more that N times. >> >> How about this idea? Or you guys still prefer to the idea of lockless on >> first-level? > > Xiao, > > Is it not the case that simply moving to the slow path once a maximum of > rewalks has been reached enough? (looks a like a good solution). In some cases, the lockless walker will do endless-walking on desc and without rewalk, consider this case: there are two descs: desc1 and desc2 who is pointed by desc1->next; desc1->next = desc2. CPU 0 CPU 1 lockless walk on desc1 deleting desc1 from the rmap lockless walk on desc2 (desc1->next) delete desc2 from the rmap add desc1 add desc2, then desc2->next = desc1 lockless walk on desc1 delete desc2 delete desc1 add desc2 add desc1; the desc1->next = desc2 lockless walk on desc2 …… Then, the walker is endlessly walking on desc1 and desc2 without any rewalk. > > Please move lockless rcu walking code to generic code where it belongs. Okay.
On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote: > > Is it not the case that simply moving to the slow path once a maximum of > > rewalks has been reached enough? (looks a like a good solution). > > In some cases, the lockless walker will do endless-walking on desc and > without rewalk, consider this case: > > there are two descs: desc1 and desc2 who is pointed by desc1->next; > desc1->next = desc2. > > CPU 0 CPU 1 > > lockless walk on desc1 > deleting desc1 from the rmap > lockless walk on desc2 (desc1->next) > delete desc2 from the rmap > add desc1 > add desc2, then desc2->next = desc1 > > lockless walk on desc1 > delete desc2 > delete desc1 > add desc2 > add desc1; the desc1->next = desc2 > lockless walk on desc2 > > …… > > Then, the walker is endlessly walking on desc1 and desc2 without any rewalk. Ouch, this is the sort of thing that is worrisome. Do you still think that having the benefit for shadow is important enough to justify this complexity? Also, another separate possibility is to use the dirty bit (which recent Intel processors support). Then there are no faults at all to handle.
On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote: > In some cases, the lockless walker will do endless-walking on desc and > without rewalk, consider this case: > > there are two descs: desc1 and desc2 who is pointed by desc1->next; > desc1->next = desc2. > > CPU 0 CPU 1 > > lockless walk on desc1 > deleting desc1 from the rmap > lockless walk on desc2 (desc1->next) > delete desc2 from the rmap > add desc1 > add desc2, then desc2->next = desc1 > > lockless walk on desc1 > delete desc2 > delete desc1 > add desc2 > add desc1; the desc1->next = desc2 > lockless walk on desc2 > > …… > > Then, the walker is endlessly walking on desc1 and desc2 without any rewalk. The counter can be local to the walker. If it's more than maximum rmap size, break. (but still, please consider carefully whether lockless list walking is really necessary or can be avoided).
On 12/06/2013 08:22 AM, Marcelo Tosatti wrote: > On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote: >> In some cases, the lockless walker will do endless-walking on desc and >> without rewalk, consider this case: >> >> there are two descs: desc1 and desc2 who is pointed by desc1->next; >> desc1->next = desc2. >> >> CPU 0 CPU 1 >> >> lockless walk on desc1 >> deleting desc1 from the rmap >> lockless walk on desc2 (desc1->next) >> delete desc2 from the rmap >> add desc1 >> add desc2, then desc2->next = desc1 >> >> lockless walk on desc1 >> delete desc2 >> delete desc1 >> add desc2 >> add desc1; the desc1->next = desc2 >> lockless walk on desc2 >> >> …… >> >> Then, the walker is endlessly walking on desc1 and desc2 without any rewalk. > > The counter can be local to the walker. If it's more than maximum rmap > size, break. > > (but still, please consider carefully whether lockless list walking is > really necessary or can be avoided). Yep, Marcelo, you're right. After thinking more, I do not have any idea to simplify this. Your approach (lockless on the first level) seems a better solution. Will do it based on that approach. Thanks!
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5cce039..4687329 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn) return level - 1; } +static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc) +{ + unsigned long marker; + + marker = (unsigned long)pte_list | 1UL; + desc->more = (struct pte_list_desc *)marker; +} + +static bool desc_is_a_nulls(struct pte_list_desc *desc) +{ + return (unsigned long)desc & 1; +} + +static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc) +{ + return (unsigned long *)((unsigned long)desc & ~1); +} + static int __find_first_free(struct pte_list_desc *desc) { int i; @@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc) first_free = __find_first_free(desc); - for (desc_num = 0; desc->more; desc = desc->more) + for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more) desc_num++; return first_free + desc_num * PTE_LIST_EXT; @@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, desc = mmu_alloc_pte_list_desc(vcpu); desc->sptes[0] = (u64 *)*pte_list; desc->sptes[1] = spte; + desc_mark_nulls(pte_list, desc); *pte_list = (unsigned long)desc | 1; return 1; } @@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list, /* * Only one entry existing but still use a desc to store it? */ - WARN_ON(!next_desc); + WARN_ON(desc_is_a_nulls(next_desc)); mmu_free_pte_list_desc(first_desc); *pte_list = (unsigned long)next_desc | 1ul; @@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list, * Only one entry in this desc, move the entry to the head * then the desc can be freed. 
*/ - if (!first_desc->sptes[1] && !first_desc->more) { + if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) { *pte_list = (unsigned long)first_desc->sptes[0]; mmu_free_pte_list_desc(first_desc); } @@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list) rmap_printk("pte_list_remove: %p many->many\n", spte); desc = (struct pte_list_desc *)(*pte_list & ~1ul); - while (desc) { + while (!desc_is_a_nulls(desc)) { for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) if (desc->sptes[i] == spte) { pte_list_desc_remove_entry(pte_list, @@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn) return fn((u64 *)*pte_list); desc = (struct pte_list_desc *)(*pte_list & ~1ul); - while (desc) { + while (!desc_is_a_nulls(desc)) { for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) fn(desc->sptes[i]); desc = desc->more; } + + WARN_ON(desc_get_nulls_value(desc) != pte_list); } static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, @@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter) iter->desc = (struct pte_list_desc *)(rmap & ~1ul); iter->pos = 0; + WARN_ON(desc_is_a_nulls(iter->desc)); return iter->desc->sptes[iter->pos]; } @@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter) return sptep; } - iter->desc = iter->desc->more; + iter->desc = desc_is_a_nulls(iter->desc->more) ? + NULL : iter->desc->more; if (iter->desc) { iter->pos = 0;
This is like a nulls list: we use the pte-list head as the nulls value, which helps us detect whether the "desc" has been moved to another rmap, in which case we re-walk the rmap. kvm->slots_lock is held when we do the lockless walking, which prevents the rmap from being reused (freeing an rmap requires holding that lock), so we cannot see the same nulls value used on different rmaps. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> --- arch/x86/kvm/mmu.c | 35 +++++++++++++++++++++++++++++------ 1 file changed, 29 insertions(+), 6 deletions(-)