diff mbox

[v20,4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

Message ID 1513685879-21823-5-git-send-email-wei.w.wang@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Wang, Wei W Dec. 19, 2017, 12:17 p.m. UTC
Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
host.

The implementation of the previous virtio-balloon is not very efficient,
because the balloon pages are transferred to the host by one array each
time. Here is the breakdown of the time in percentage spent on each step
of the balloon inflating process (inflating 7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete. The above
profiling shows that the bottlenecks are stage 2) and stage 4).

This patch optimizes step 2) by transferring pages to host in sgs. An sg
describes a chunk of guest physically continuous pages. With this
mechanism, step 4) can also be optimized by doing address translation and
madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~460ms resulting
in an improvement of ~89%.

TODO:
- optimize stage 1) by allocating/freeing a chunk of pages instead of a
single page each time.
- sort the internal balloon page queue.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/virtio/virtio_balloon.c     | 234 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 217 insertions(+), 18 deletions(-)

Comments

Matthew Wilcox Dec. 24, 2017, 3:21 a.m. UTC | #1
On Tue, Dec 19, 2017 at 08:17:56PM +0800, Wei Wang wrote:
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)

I'm not crazy about the naming here.  I'd use pfn_min and pfn_max like
you use in the caller.

> +{
> +	unsigned long pfn_start, pfn_end;
> +	uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
> +	uint64_t len;
> +
> +	pfn_start = page_xb_start;

And I think pfn_start is actually just 'pfn'.

'pfn_end' is perhaps just 'end'.  Or 'gap'.

> +	while (pfn_start < page_xb_end) {
> +		pfn_start = xb_find_set(&vb->page_xb, page_xb_end + 1,
> +					pfn_start);
> +		if (pfn_start == page_xb_end + 1)
> +			break;
> +		pfn_end = xb_find_zero(&vb->page_xb, page_xb_end + 1,
> +				       pfn_start);
> +		len = (pfn_end - pfn_start) << PAGE_SHIFT;

> +static inline int xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{

I really don't like it that you're naming things after the 'xb'.
Things should be named by something that makes sense to the user, not
after the particular implementation.  If you changed the underlying
representation from an xbitmap to, say, a BTree, you wouldn't want to
rename this function to 'btree_set_page'.  Maybe this function is really
"vb_set_page".  Or "record_page".  Or something.  Someone who understands
this driver better than I do can probably weigh in with a better name.

> +	unsigned long pfn = page_to_pfn(page);
> +	int ret;
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +
> +	do {
> +		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
> +			return -ENOMEM;
> +
> +		ret = xb_set_bit(&vb->page_xb, pfn);
> +		xb_preload_end();
> +	} while (unlikely(ret == -EAGAIN));

OK, so you don't need a spinlock because you're under a mutex?  But you
can't allocate memory because you're in the balloon driver, and so a
GFP_KERNEL allocation might recurse into your driver?  Would GFP_NOIO
do the job?  I'm a little hazy on exactly how the balloon driver works.

If you can't preload with anything better than that, I think that
xb_set_bit() should attempt an allocation with GFP_NOWAIT | __GFP_NOWARN,
and then you can skip the preload; it has no value for you.

> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	while ((page = balloon_page_pop(&pages))) {
>  		balloon_page_enqueue(&vb->vb_dev_info, page);
> +		if (use_sg) {
> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> +				__free_page(page);
> +				continue;
> +			}
> +		} else {
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		}

Is this the right behaviour?  If we can't record the page in the xb,
wouldn't we rather send it across as a single page?
Tetsuo Handa Dec. 24, 2017, 4:45 a.m. UTC | #2
Matthew Wilcox wrote:
> > +	unsigned long pfn = page_to_pfn(page);
> > +	int ret;
> > +
> > +	*pfn_min = min(pfn, *pfn_min);
> > +	*pfn_max = max(pfn, *pfn_max);
> > +
> > +	do {
> > +		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
> > +			return -ENOMEM;
> > +
> > +		ret = xb_set_bit(&vb->page_xb, pfn);
> > +		xb_preload_end();
> > +	} while (unlikely(ret == -EAGAIN));
> 
> OK, so you don't need a spinlock because you're under a mutex?  But you
> can't allocate memory because you're in the balloon driver, and so a
> GFP_KERNEL allocation might recurse into your driver?

Right. We can't (directly or indirectly) depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
allocations because the balloon driver needs to handle OOM notifier callback.

>                                                        Would GFP_NOIO
> do the job?  I'm a little hazy on exactly how the balloon driver works.

GFP_NOIO implies __GFP_DIRECT_RECLAIM. In the worst case, it can lockup due to
the too small to fail memory allocation rule. GFP_NOIO | __GFP_NORETRY would work
if there is really a guarantee that GFP_NOIO | __GFP_NORETRY never depend on
__GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations, which is too subtle for me to
validate. The direct reclaim dependency is too complicated to validate.
I consider that !__GFP_DIRECT_RECLAIM is the future-safe choice.

> 
> If you can't preload with anything better than that, I think that
> xb_set_bit() should attempt an allocation with GFP_NOWAIT | __GFP_NOWARN,
> and then you can skip the preload; it has no value for you.

Yes, that's why I suggest directly using kzalloc() at xb_set_bit().

> 
> > @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> >  
> >  	while ((page = balloon_page_pop(&pages))) {
> >  		balloon_page_enqueue(&vb->vb_dev_info, page);
> > +		if (use_sg) {
> > +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> > +				__free_page(page);
> > +				continue;
> > +			}
> > +		} else {
> > +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> > +		}
> 
> Is this the right behaviour?

I don't think so. In the worst case, we can set no bit using xb_set_page().

>                               If we can't record the page in the xb,
> wouldn't we rather send it across as a single page?
> 

I think that we need to be able to fallback to !use_sg path when OOM.
Wang, Wei W Dec. 24, 2017, 7:42 a.m. UTC | #3
On 12/24/2017 12:45 PM, Tetsuo Handa wrote:
> Matthew Wilcox wrote:
>>> +	unsigned long pfn = page_to_pfn(page);
>>> +	int ret;
>>> +
>>> +	*pfn_min = min(pfn, *pfn_min);
>>> +	*pfn_max = max(pfn, *pfn_max);
>>> +
>>> +	do {
>>> +		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
>>> +			return -ENOMEM;
>>> +
>>> +		ret = xb_set_bit(&vb->page_xb, pfn);
>>> +		xb_preload_end();
>>> +	} while (unlikely(ret == -EAGAIN));
>> OK, so you don't need a spinlock because you're under a mutex?  But you
>> can't allocate memory because you're in the balloon driver, and so a
>> GFP_KERNEL allocation might recurse into your driver?
> Right. We can't (directly or indirectly) depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> allocations because the balloon driver needs to handle OOM notifier callback.
>
>>                                                         Would GFP_NOIO
>> do the job?  I'm a little hazy on exactly how the balloon driver works.
> GFP_NOIO implies __GFP_DIRECT_RECLAIM. In the worst case, it can lockup due to
> the too small to fail memory allocation rule. GFP_NOIO | __GFP_NORETRY would work
> if there is really a guarantee that GFP_NOIO | __GFP_NORETRY never depend on
> __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations, which is too subtle for me to
> validate. The direct reclaim dependency is too complicated to validate.
> I consider that !__GFP_DIRECT_RECLAIM is the future-safe choice.

What's the problem with (or how is it better than) the "GFP_NOWAIT | 
__GFP_NOWARN" we are using here?


>> If you can't preload with anything better than that, I think that
>> xb_set_bit() should attempt an allocation with GFP_NOWAIT | __GFP_NOWARN,
>> and then you can skip the preload; it has no value for you.
> Yes, that's why I suggest directly using kzalloc() at xb_set_bit().

It has some possibilities to remove that preload if we also do the 
bitmap allocation in the xb_set_bit():

bitmap = rcu_dereference_raw(*slot);
if (!bitmap) {
     bitmap = this_cpu_xchg(ida_bitmap, NULL);
     if (!bitmap) {
         bitmap = kmalloc(sizeof(*bitmap), gfp);
         if (!bitmap)
             return -ENOMEM;
     }
}

But why not just follow the radix tree implementation style that puts 
the allocation in preload, which would be invoked with a more relaxed 
gfp in other use cases?
Its usage in virtio_balloon is just a little special that we need to put 
the allocation within the balloon_lock, which doesn't give us the 
benefit of using a relaxed gfp in preload, but it doesn't prevent us 
from living with the current APIs (i.e. the preload + xb_set pattern).
On the other side, if we do it as above, we have more things that need 
to consider. For example, what if the a use case just want the radix 
tree implementation style, which means it doesn't want allocation within 
xb_set(), then would we be troubled with how to avoid the allocation 
path in that case?

So, I think it is better to stick with the convention by putting the 
allocation in preload. Breaking the convention should show obvious 
advantages, IMHO.


>
>>> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>>>   
>>>   	while ((page = balloon_page_pop(&pages))) {
>>>   		balloon_page_enqueue(&vb->vb_dev_info, page);
>>> +		if (use_sg) {
>>> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
>>> +				__free_page(page);
>>> +				continue;
>>> +			}
>>> +		} else {
>>> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>>> +		}
>> Is this the right behaviour?
> I don't think so. In the worst case, we can set no bit using xb_set_page().

>
>>                                If we can't record the page in the xb,
>> wouldn't we rather send it across as a single page?
>>
> I think that we need to be able to fallback to !use_sg path when OOM.

I also have different thoughts:

1) For OOM, we have leak_balloon_sg_oom (oom has nothing to do with 
fill_balloon), which does not use xbitmap to record pages, thus no 
memory allocation.

2) If the memory is already under pressure, it is pointless to continue 
inflating memory to the host. We need to give thanks to the memory 
allocation failure reported by xbitmap, which gets us a chance to 
release the inflated pages that have been demonstrated to cause the 
memory pressure of the guest.


Best,
Wei
Wang, Wei W Dec. 24, 2017, 8:16 a.m. UTC | #4
On 12/24/2017 03:42 PM, Wei Wang wrote:
> On 12/24/2017 12:45 PM, Tetsuo Handa wrote:
>> Matthew Wilcox wrote:
>>>> +    unsigned long pfn = page_to_pfn(page);
>>>> +    int ret;
>>>> +
>>>> +    *pfn_min = min(pfn, *pfn_min);
>>>> +    *pfn_max = max(pfn, *pfn_max);
>>>> +
>>>> +    do {
>>>> +        if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
>>>> +            return -ENOMEM;
>>>> +
>>>> +        ret = xb_set_bit(&vb->page_xb, pfn);
>>>> +        xb_preload_end();
>>>> +    } while (unlikely(ret == -EAGAIN));
>>> OK, so you don't need a spinlock because you're under a mutex?  But you
>>> can't allocate memory because you're in the balloon driver, and so a
>>> GFP_KERNEL allocation might recurse into your driver?
>> Right. We can't (directly or indirectly) depend on 
>> __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
>> allocations because the balloon driver needs to handle OOM notifier 
>> callback.
>>
>>> Would GFP_NOIO
>>> do the job?  I'm a little hazy on exactly how the balloon driver works.
>> GFP_NOIO implies __GFP_DIRECT_RECLAIM. In the worst case, it can 
>> lockup due to
>> the too small to fail memory allocation rule. GFP_NOIO | 
>> __GFP_NORETRY would work
>> if there is really a guarantee that GFP_NOIO | __GFP_NORETRY never 
>> depend on
>> __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations, which is too 
>> subtle for me to
>> validate. The direct reclaim dependency is too complicated to validate.
>> I consider that !__GFP_DIRECT_RECLAIM is the future-safe choice.
>
> What's the problem with (or how is it better than) the "GFP_NOWAIT | 
> __GFP_NOWARN" we are using here?
>
>
>>> If you can't preload with anything better than that, I think that
>>> xb_set_bit() should attempt an allocation with GFP_NOWAIT | 
>>> __GFP_NOWARN,
>>> and then you can skip the preload; it has no value for you.
>> Yes, that's why I suggest directly using kzalloc() at xb_set_bit().
>
> It has some possibilities to remove that preload if we also do the 
> bitmap allocation in the xb_set_bit():
>
> bitmap = rcu_dereference_raw(*slot);
> if (!bitmap) {
>     bitmap = this_cpu_xchg(ida_bitmap, NULL);
>     if (!bitmap) {
>         bitmap = kmalloc(sizeof(*bitmap), gfp);
>         if (!bitmap)
>             return -ENOMEM;
>     }
> }
>
> But why not just follow the radix tree implementation style that puts 
> the allocation in preload, which would be invoked with a more relaxed 
> gfp in other use cases?
> Its usage in virtio_balloon is just a little special that we need to 
> put the allocation within the balloon_lock, which doesn't give us the 
> benefit of using a relaxed gfp in preload, but it doesn't prevent us 
> from living with the current APIs (i.e. the preload + xb_set pattern).
> On the other side, if we do it as above, we have more things that need 
> to consider. For example, what if the a use case just want the radix 
> tree implementation style, which means it doesn't want allocation 
> within xb_set(), then would we be troubled with how to avoid the 
> allocation path in that case?
>
> So, I think it is better to stick with the convention by putting the 
> allocation in preload. Breaking the convention should show obvious 
> advantages, IMHO.
>
>
>>
>>>> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct 
>>>> virtio_balloon *vb, size_t num)
>>>>         while ((page = balloon_page_pop(&pages))) {
>>>>           balloon_page_enqueue(&vb->vb_dev_info, page);
>>>> +        if (use_sg) {
>>>> +            if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
>>>> +                __free_page(page);
>>>> +                continue;
>>>> +            }
>>>> +        } else {
>>>> +            set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>>>> +        }
>>> Is this the right behaviour?
>> I don't think so. In the worst case, we can set no bit using 
>> xb_set_page().
>
>>
>>>                                If we can't record the page in the xb,
>>> wouldn't we rather send it across as a single page?
>>>
>> I think that we need to be able to fallback to !use_sg path when OOM.
>
> I also have different thoughts:
>
> 1) For OOM, we have leak_balloon_sg_oom (oom has nothing to do with 
> fill_balloon), which does not use xbitmap to record pages, thus no 
> memory allocation.
>
> 2) If the memory is already under pressure, it is pointless to 
> continue inflating memory to the host. We need to give thanks to the 
> memory allocation failure reported by xbitmap, which gets us a chance 
> to release the inflated pages that have been demonstrated to cause the 
> memory pressure of the guest.
>

Forgot to add my conclusion: I think the above behavior is correct.

Best,
Wei
Tetsuo Handa Dec. 25, 2017, 2:51 p.m. UTC | #5
Wei Wang wrote:
> >>>> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct 
> >>>> virtio_balloon *vb, size_t num)
> >>>>         while ((page = balloon_page_pop(&pages))) {
> >>>>           balloon_page_enqueue(&vb->vb_dev_info, page);
> >>>> +        if (use_sg) {
> >>>> +            if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> >>>> +                __free_page(page);
> >>>> +                continue;
> >>>> +            }
> >>>> +        } else {
> >>>> +            set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> >>>> +        }
> >>> Is this the right behaviour?
> >> I don't think so. In the worst case, we can set no bit using 
> >> xb_set_page().
> >
> >>
> >>>                                If we can't record the page in the xb,
> >>> wouldn't we rather send it across as a single page?
> >>>
> >> I think that we need to be able to fallback to !use_sg path when OOM.
> >
> > I also have different thoughts:
> >
> > 1) For OOM, we have leak_balloon_sg_oom (oom has nothing to do with 
> > fill_balloon), which does not use xbitmap to record pages, thus no 
> > memory allocation.
> >
> > 2) If the memory is already under pressure, it is pointless to 
> > continue inflating memory to the host. We need to give thanks to the 
> > memory allocation failure reported by xbitmap, which gets us a chance 
> > to release the inflated pages that have been demonstrated to cause the 
> > memory pressure of the guest.
> >
> 
> Forgot to add my conclusion: I think the above behavior is correct.
> 

What is the desired behavior when hitting OOM path during inflate/deflate?
Once inflation started, the inflation logic is called again and again
until the balloon inflates to the requested size. Such situation will
continue wasting CPU resource between inflate-due-to-host's-request versus
deflate-due-to-guest's-OOM. It is pointless but cannot stop doing pointless
thing.

Also, as of Linux 4.15, only up to VIRTIO_BALLOON_ARRAY_PFNS_MAX pages (i.e.
1MB) are invisible from deflate request. That amount would be an acceptable
error. But your patch makes more pages being invisible, for pages allocated
by balloon_page_alloc() without holding balloon_lock are stored into a local
variable "LIST_HEAD(pages)" (which means that balloon_page_dequeue() with
balloon_lock held won't be able to find pages not yet queued by
balloon_page_enqueue()), doesn't it? What if all memory pages were held in
"LIST_HEAD(pages)" and balloon_page_dequeue() was called before
balloon_page_enqueue() is called?

So, I think we need to consider how to handle such situation.
Wang, Wei W Dec. 26, 2017, 3:06 a.m. UTC | #6
On 12/25/2017 10:51 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>>>>>> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct
>>>>>> virtio_balloon *vb, size_t num)
>>>>>>          while ((page = balloon_page_pop(&pages))) {
>>>>>>            balloon_page_enqueue(&vb->vb_dev_info, page);
>>>>>> +        if (use_sg) {
>>>>>> +            if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
>>>>>> +                __free_page(page);
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +        } else {
>>>>>> +            set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>>>>>> +        }
>>>>> Is this the right behaviour?
>>>> I don't think so. In the worst case, we can set no bit using
>>>> xb_set_page().
>>>>>                                 If we can't record the page in the xb,
>>>>> wouldn't we rather send it across as a single page?
>>>>>
>>>> I think that we need to be able to fallback to !use_sg path when OOM.
>>> I also have different thoughts:
>>>
>>> 1) For OOM, we have leak_balloon_sg_oom (oom has nothing to do with
>>> fill_balloon), which does not use xbitmap to record pages, thus no
>>> memory allocation.
>>>
>>> 2) If the memory is already under pressure, it is pointless to
>>> continue inflating memory to the host. We need to give thanks to the
>>> memory allocation failure reported by xbitmap, which gets us a chance
>>> to release the inflated pages that have been demonstrated to cause the
>>> memory pressure of the guest.
>>>
>> Forgot to add my conclusion: I think the above behavior is correct.
>>
> What is the desired behavior when hitting OOM path during inflate/deflate?
> Once inflation started, the inflation logic is called again and again
> until the balloon inflates to the requested size.

The above is true, but I can't agree with the following. Please see below.

> Such situation will
> continue wasting CPU resource between inflate-due-to-host's-request versus
> deflate-due-to-guest's-OOM. It is pointless but cannot stop doing pointless
> thing.

What we are doing here is to free the pages that were just allocated in 
this round of inflating. Next round will be sometime later when the 
balloon work item gets its turn to run. Yes, it will then continue to 
inflate.
Here are the two cases that will happen then:
1) the guest is still under memory pressure, the inflate will fail at 
memory allocation, which results in a msleep(200), and then it exists 
for another time to run.
2) the guest isn't under memory pressure any more (e.g. the task which 
consumes the huge amount of memory is gone), it will continue to inflate 
as normal till the requested size.

I think what we are doing is a quite sensible behavior, except a small 
change I plan to make:

         while ((page = balloon_page_pop(&pages))) {
-               balloon_page_enqueue(&vb->vb_dev_info, page);
                 if (use_sg) {
                         if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 
0) {
                                 __free_page(page);
                                 continue;
                         }
                 } else {
                         set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
                 }
+             balloon_page_enqueue(&vb->vb_dev_info, page);

>
> Also, as of Linux 4.15, only up to VIRTIO_BALLOON_ARRAY_PFNS_MAX pages (i.e.
> 1MB) are invisible from deflate request. That amount would be an acceptable
> error. But your patch makes more pages being invisible, for pages allocated
> by balloon_page_alloc() without holding balloon_lock are stored into a local
> variable "LIST_HEAD(pages)" (which means that balloon_page_dequeue() with
> balloon_lock held won't be able to find pages not yet queued by
> balloon_page_enqueue()), doesn't it? What if all memory pages were held in
> "LIST_HEAD(pages)" and balloon_page_dequeue() was called before
> balloon_page_enqueue() is called?
>

If we think of the balloon driver just as a regular driver or 
application, that will be a pretty nature thing. A regular driver can 
eat a huge amount of memory for its own usages, would this amount of 
memory be treated as an error as they are invisible to the 
balloon_page_enqueue?

Best,
Wei
Tetsuo Handa Dec. 26, 2017, 10:38 a.m. UTC | #7
Wei Wang wrote:
> On 12/25/2017 10:51 PM, Tetsuo Handa wrote:
> > Wei Wang wrote:
> >>>>>> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct
> >>>>>> virtio_balloon *vb, size_t num)
> >>>>>>          while ((page = balloon_page_pop(&pages))) {
> >>>>>>            balloon_page_enqueue(&vb->vb_dev_info, page);
> >>>>>> +        if (use_sg) {
> >>>>>> +            if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> >>>>>> +                __free_page(page);
> >>>>>> +                continue;
> >>>>>> +            }
> >>>>>> +        } else {
> >>>>>> +            set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> >>>>>> +        }
> >>>>> Is this the right behaviour?
> >>>> I don't think so. In the worst case, we can set no bit using
> >>>> xb_set_page().
> >>>>>                                 If we can't record the page in the xb,
> >>>>> wouldn't we rather send it across as a single page?
> >>>>>
> >>>> I think that we need to be able to fallback to !use_sg path when OOM.
> >>> I also have different thoughts:
> >>>
> >>> 1) For OOM, we have leak_balloon_sg_oom (oom has nothing to do with
> >>> fill_balloon), which does not use xbitmap to record pages, thus no
> >>> memory allocation.
> >>>
> >>> 2) If the memory is already under pressure, it is pointless to
> >>> continue inflating memory to the host. We need to give thanks to the
> >>> memory allocation failure reported by xbitmap, which gets us a chance
> >>> to release the inflated pages that have been demonstrated to cause the
> >>> memory pressure of the guest.
> >>>
> >> Forgot to add my conclusion: I think the above behavior is correct.
> >>
> > What is the desired behavior when hitting OOM path during inflate/deflate?
> > Once inflation started, the inflation logic is called again and again
> > until the balloon inflates to the requested size.
> 
> The above is true, but I can't agree with the following. Please see below.
> 
> > Such situation will
> > continue wasting CPU resource between inflate-due-to-host's-request versus
> > deflate-due-to-guest's-OOM. It is pointless but cannot stop doing pointless
> > thing.
> 
> What we are doing here is to free the pages that were just allocated in 
> this round of inflating. Next round will be sometime later when the 
> balloon work item gets its turn to run. Yes, it will then continue to 
> inflate.
> Here are the two cases that will happen then:
> 1) the guest is still under memory pressure, the inflate will fail at 
> memory allocation, which results in a msleep(200), and then it exists 
> for another time to run.
> 2) the guest isn't under memory pressure any more (e.g. the task which 
> consumes the huge amount of memory is gone), it will continue to inflate 
> as normal till the requested size.
> 

How likely does 2) occur? It is not so likely. msleep(200) is enough to spam
the guest with puff messages. Next round is starting too quickly.

> I think what we are doing is a quite sensible behavior, except a small 
> change I plan to make:
> 
>          while ((page = balloon_page_pop(&pages))) {
> -               balloon_page_enqueue(&vb->vb_dev_info, page);
>                  if (use_sg) {
>                          if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 
> 0) {
>                                  __free_page(page);
>                                  continue;
>                          }
>                  } else {
>                          set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>                  }
> +             balloon_page_enqueue(&vb->vb_dev_info, page);
> 
> >
> > Also, as of Linux 4.15, only up to VIRTIO_BALLOON_ARRAY_PFNS_MAX pages (i.e.
> > 1MB) are invisible from deflate request. That amount would be an acceptable
> > error. But your patch makes more pages being invisible, for pages allocated
> > by balloon_page_alloc() without holding balloon_lock are stored into a local
> > variable "LIST_HEAD(pages)" (which means that balloon_page_dequeue() with
> > balloon_lock held won't be able to find pages not yet queued by
> > balloon_page_enqueue()), doesn't it? What if all memory pages were held in
> > "LIST_HEAD(pages)" and balloon_page_dequeue() was called before
> > balloon_page_enqueue() is called?
> >
> 
> If we think of the balloon driver just as a regular driver or 
> application, that will be a pretty nature thing. A regular driver can 
> eat a huge amount of memory for its own usages, would this amount of 
> memory be treated as an error as they are invisible to the 
> balloon_page_enqueue?
> 

No. Memory used by applications which consumed a lot of memory in their
mm_struct is reclaimed by the OOM killer/reaper. Drivers try to avoid
allocating more memory than they need. If drivers allocate more memory
than they need, they have a hook for releasing unused memory (i.e.
register_shrinker() or OOM notifier). What I'm saying here is that
the hook for releasing unused memory does not work unless memory held in
LIST_HEAD(pages) becomes visible to balloon_page_dequeue().

If a system has 128GB of memory, and 127GB of memory was stored into
LIST_HEAD(pages) upon first fill_balloon() request, and somebody held
balloon_lock from OOM notifier path from out_of_memory() before
fill_balloon() holds balloon_lock, leak_balloon_sg_oom() finds that
no memory can be freed because balloon_page_enqueue() was never called,
and allows the caller of out_of_memory() to invoke the OOM killer despite
there is 127GB of memory which can be freed if fill_balloon() was able
to hold balloon_lock before leak_balloon_sg_oom() holds balloon_lock.
I don't think that that amount is an acceptable error.
Wang, Wei W Dec. 26, 2017, 11:36 a.m. UTC | #8
On 12/26/2017 06:38 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> On 12/25/2017 10:51 PM, Tetsuo Handa wrote:
>>> Wei Wang wrote:
>>>
>> What we are doing here is to free the pages that were just allocated in
>> this round of inflating. Next round will be sometime later when the
>> balloon work item gets its turn to run. Yes, it will then continue to
>> inflate.
>> Here are the two cases that will happen then:
>> 1) the guest is still under memory pressure, the inflate will fail at
>> memory allocation, which results in a msleep(200), and then it exists
>> for another time to run.
>> 2) the guest isn't under memory pressure any more (e.g. the task which
>> consumes the huge amount of memory is gone), it will continue to inflate
>> as normal till the requested size.
>>
> How likely does 2) occur? It is not so likely. msleep(200) is enough to spam
> the guest with puff messages. Next round is starting too quickly.

I meant one of the two cases, 1) or 2), would happen, rather than 2) 
happens after 1).

If 2) doesn't happen, then 1) happens. It will continue to try to 
inflate round by round. But the memory allocation won't succeed, so 
there will be no pages to inflate to the host. That is, the inflating is 
simply a code path to the msleep(200) as long as the guest is under 
memory pressure.

Back to our code change, it doesn't result in incorrect behavior as 
explained above.

>> I think what we are doing is a quite sensible behavior, except a small
>> change I plan to make:
>>
>>           while ((page = balloon_page_pop(&pages))) {
>> -               balloon_page_enqueue(&vb->vb_dev_info, page);
>>                   if (use_sg) {
>>                           if (xb_set_page(vb, page, &pfn_min, &pfn_max) <
>> 0) {
>>                                   __free_page(page);
>>                                   continue;
>>                           }
>>                   } else {
>>                           set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>>                   }
>> +             balloon_page_enqueue(&vb->vb_dev_info, page);
>>
>>> Also, as of Linux 4.15, only up to VIRTIO_BALLOON_ARRAY_PFNS_MAX pages (i.e.
>>> 1MB) are invisible from deflate request. That amount would be an acceptable
>>> error. But your patch makes more pages being invisible, for pages allocated
>>> by balloon_page_alloc() without holding balloon_lock are stored into a local
>>> variable "LIST_HEAD(pages)" (which means that balloon_page_dequeue() with
>>> balloon_lock held won't be able to find pages not yet queued by
>>> balloon_page_enqueue()), doesn't it? What if all memory pages were held in
>>> "LIST_HEAD(pages)" and balloon_page_dequeue() was called before
>>> balloon_page_enqueue() is called?
>>>
>> If we think of the balloon driver just as a regular driver or
>> application, that will be a pretty nature thing. A regular driver can
>> eat a huge amount of memory for its own usages, would this amount of
>> memory be treated as an error as they are invisible to the
>> balloon_page_enqueue?
>>
> No. Memory used by applications which consumed a lot of memory in their
> mm_struct is reclaimed by the OOM killer/reaper. Drivers try to avoid
> allocating more memory than they need. If drivers allocate more memory
> than they need, they have a hook for releasing unused memory (i.e.
> register_shrinker() or OOM notifier). What I'm saying here is that
> the hook for releasing unused memory does not work unless memory held in
> LIST_HEAD(pages) becomes visible to balloon_page_dequeue().
>
> If a system has 128GB of memory, and 127GB of memory was stored into
> LIST_HEAD(pages) upon first fill_balloon() request, and somebody held
> balloon_lock from OOM notifier path from out_of_memory() before
> fill_balloon() holds balloon_lock, leak_balloon_sg_oom() finds that
> no memory can be freed because balloon_page_enqueue() was never called,
> and allows the caller of out_of_memory() to invoke the OOM killer despite
> there is 127GB of memory which can be freed if fill_balloon() was able
> to hold balloon_lock before leak_balloon_sg_oom() holds balloon_lock.
> I don't think that that amount is an acceptable error.

I understand you are worried that OOM couldn't get balloon pages while 
there are some in the local list. This is a debatable issue, and it may 
lead to a long discussion. If this is considered to be a big issue, we 
can make the local list to be global in vb, and accessed by oom 
notifier, this won't affect this patch, and can be achieved with an 
add-on patch. How about leaving this discussion as a second step outside 
this series? Balloon has something more that can be improved, and this 
patch series is already big.

Best,
Wei
Tetsuo Handa Dec. 26, 2017, 1:40 p.m. UTC | #9
Wei Wang wrote:
> On 12/26/2017 06:38 PM, Tetsuo Handa wrote:
> > Wei Wang wrote:
> >> On 12/25/2017 10:51 PM, Tetsuo Handa wrote:
> >>> Wei Wang wrote:
> >>>
> >> What we are doing here is to free the pages that were just allocated in
> >> this round of inflating. Next round will be sometime later when the
> >> balloon work item gets its turn to run. Yes, it will then continue to
> >> inflate.
> >> Here are the two cases that will happen then:
> >> 1) the guest is still under memory pressure, the inflate will fail at
> >> memory allocation, which results in a msleep(200), and then it exists
> >> for another time to run.
> >> 2) the guest isn't under memory pressure any more (e.g. the task which
> >> consumes the huge amount of memory is gone), it will continue to inflate
> >> as normal till the requested size.
> >>
> > How likely does 2) occur? It is not so likely. msleep(200) is enough to spam
> > the guest with puff messages. Next round is starting too quickly.
> 
> I meant one of the two cases, 1) or 2), would happen, rather than 2) 
> happens after 1).
> 
> If 2) doesn't happen, then 1) happens. It will continue to try to 
> inflate round by round. But the memory allocation won't succeed, so 
> there will be no pages to inflate to the host. That is, the inflating is 
> simply a code path to the msleep(200) as long as the guest is under 
> memory pressure.

No. See http://lkml.kernel.org/r/201710181959.ACI05296.JLMVQOOFtHSOFF@I-love.SAKURA.ne.jp .
Did you try how unlikely 2) occurs if once 1) started?

> 
> Back to our code change, it doesn't result in incorrect behavior as 
> explained above.

The guest will be effectively unusable due to spam.

> 
> >> I think what we are doing is a quite sensible behavior, except a small
> >> change I plan to make:
> >>
> >>           while ((page = balloon_page_pop(&pages))) {
> >> -               balloon_page_enqueue(&vb->vb_dev_info, page);
> >>                   if (use_sg) {
> >>                           if (xb_set_page(vb, page, &pfn_min, &pfn_max) <
> >> 0) {
> >>                                   __free_page(page);
> >>                                   continue;
> >>                           }
> >>                   } else {
> >>                           set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> >>                   }
> >> +             balloon_page_enqueue(&vb->vb_dev_info, page);
> >>
> >>> Also, as of Linux 4.15, only up to VIRTIO_BALLOON_ARRAY_PFNS_MAX pages (i.e.
> >>> 1MB) are invisible from deflate request. That amount would be an acceptable
> >>> error. But your patch makes more pages being invisible, for pages allocated
> >>> by balloon_page_alloc() without holding balloon_lock are stored into a local
> >>> variable "LIST_HEAD(pages)" (which means that balloon_page_dequeue() with
> >>> balloon_lock held won't be able to find pages not yet queued by
> >>> balloon_page_enqueue()), doesn't it? What if all memory pages were held in
> >>> "LIST_HEAD(pages)" and balloon_page_dequeue() was called before
> >>> balloon_page_enqueue() is called?
> >>>
> >> If we think of the balloon driver just as a regular driver or
> >> application, that will be a pretty nature thing. A regular driver can
> >> eat a huge amount of memory for its own usages, would this amount of
> >> memory be treated as an error as they are invisible to the
> >> balloon_page_enqueue?
> >>
> > No. Memory used by applications which consumed a lot of memory in their
> > mm_struct is reclaimed by the OOM killer/reaper. Drivers try to avoid
> > allocating more memory than they need. If drivers allocate more memory
> > than they need, they have a hook for releasing unused memory (i.e.
> > register_shrinker() or OOM notifier). What I'm saying here is that
> > the hook for releasing unused memory does not work unless memory held in
> > LIST_HEAD(pages) becomes visible to balloon_page_dequeue().
> >
> > If a system has 128GB of memory, and 127GB of memory was stored into
> > LIST_HEAD(pages) upon first fill_balloon() request, and somebody held
> > balloon_lock from OOM notifier path from out_of_memory() before
> > fill_balloon() holds balloon_lock, leak_balloon_sg_oom() finds that
> > no memory can be freed because balloon_page_enqueue() was never called,
> > and allows the caller of out_of_memory() to invoke the OOM killer despite
> > there is 127GB of memory which can be freed if fill_balloon() was able
> > to hold balloon_lock before leak_balloon_sg_oom() holds balloon_lock.
> > I don't think that that amount is an acceptable error.
> 
> I understand you are worried that OOM couldn't get balloon pages while 
> there are some in the local list. This is a debatable issue, and it may 
> lead to a long discussion. If this is considered to be a big issue, we 
> can make the local list to be global in vb, and accessed by oom 
> notifier, this won't affect this patch, and can be achieved with an 
> add-on patch. How about leaving this discussion as a second step outside 
> this series?

No. This is a big issue. Even changing balloon_page_alloc() to exclude
__GFP_DIRECT_RECLAIM might be the better, for we don't want to try so hard.
Reclaiming all reclaimable memory results in hitting OOM notifier path which
after all releases memory reclaimed by a lot of effort. Though I don't know whether
  (GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM
has undesirable side effect.

>              Balloon has something more that can be improved, and this 
> patch series is already big.

The reason this patch series becomes big is that you are doing a lot of
changes in this series. This series is too optimistic about worst/corner
cases and difficult for me to check. Please always consider the worst case,
and write patches in a way that can survive the worst case.

Please compose this series with patch 1/2 for xbitmap and patch 2/2 for
VIRTIO_BALLOON_F_SG. Nothing more to append. Of course, after we came to
an agreement about whether virtio_balloon should use preload. (We are
waiting for response from Matthew Wilcox, aren't we?) Also, adding some
cond_resched() might be needed. Also, comparing (maybe benchmarking)
Matthew's radix tree implementation and my B+ tree implementation is
another TODO thing before merging this series.
Matthew Wilcox Jan. 2, 2018, 1:24 p.m. UTC | #10
On Sun, Dec 24, 2017 at 03:42:02PM +0800, Wei Wang wrote:
> On 12/24/2017 12:45 PM, Tetsuo Handa wrote:
> > Matthew Wilcox wrote:
> > > If you can't preload with anything better than that, I think that
> > > xb_set_bit() should attempt an allocation with GFP_NOWAIT | __GFP_NOWARN,
> > > and then you can skip the preload; it has no value for you.
> > Yes, that's why I suggest directly using kzalloc() at xb_set_bit().
> 
> It has some possibilities to remove that preload if we also do the bitmap
> allocation in the xb_set_bit():
> 
> bitmap = rcu_dereference_raw(*slot);
> if (!bitmap) {
>     bitmap = this_cpu_xchg(ida_bitmap, NULL);
>     if (!bitmap) {
>         bitmap = kmalloc(sizeof(*bitmap), gfp);
>         if (!bitmap)
>             return -ENOMEM;
>     }
> }
> 
> But why not just follow the radix tree implementation style that puts the
> allocation in preload, which would be invoked with a more relaxed gfp in
> other use cases?

Actually, the radix tree does attempt allocation, and falls back to the
preload.  The IDA was the odd one out that doesn't attempt allocation,
and that was where I copied the code from.  For other users, the preload
API is beneficial, so I will leave it in.

> Its usage in virtio_balloon is just a little special that we need to put the
> allocation within the balloon_lock, which doesn't give us the benefit of
> using a relaxed gfp in preload, but it doesn't prevent us from living with
> the current APIs (i.e. the preload + xb_set pattern).
> On the other side, if we do it as above, we have more things that need to
> consider. For example, what if the a use case just want the radix tree
> implementation style, which means it doesn't want allocation within
> xb_set(), then would we be troubled with how to avoid the allocation path in
> that case?
> 
> So, I think it is better to stick with the convention by putting the
> allocation in preload. Breaking the convention should show obvious
> advantages, IMHO.

The radix tree convention is objectively awful, which is why I'm working
to change it.  Specifying the GFP flags at radix tree initialisation time
rather than allocation time leads to all kinds of confusion.  The preload
API is a pretty awful workaround, and it will go away once the XArray
is working correctly.  That said, there's no alternative to it without
making XBitmap depend on XArray, and I don't want to hold you up there.
So there's an xb_preload for the moment.
Tetsuo Handa Jan. 3, 2018, 2:29 a.m. UTC | #11
Matthew Wilcox wrote:
> The radix tree convention is objectively awful, which is why I'm working
> to change it.  Specifying the GFP flags at radix tree initialisation time
> rather than allocation time leads to all kinds of confusion.  The preload
> API is a pretty awful workaround, and it will go away once the XArray
> is working correctly.  That said, there's no alternative to it without
> making XBitmap depend on XArray, and I don't want to hold you up there.
> So there's an xb_preload for the moment.

I'm ready to propose cvbmp shown below as an alternative to xbitmap (but
specialized for virtio-balloon case). Wei, can you do some benchmarking
between xbitmap and cvbmp?
----------------------------------------
cvbmp: clustered values bitmap

This patch provides simple API for recording any "unsigned long" value and
for fetching recorded values in ascendant order, in order to allow handling
chunk of unique values efficiently.

The virtio-balloon driver manages memory pages (where the page frame number
is in unique "unsigned long" value range) between the host and the guest.
Currently that communication is using fixed sized array, and allowing that
communication to use scatter-gather API can improve performance a lot.

This patch is implemented for virtio-balloon driver as initial user. Since
the ballooning operation gathers many pages at once, gathered pages tend to
form a cluster (i.e. their values tend to fit bitmap based management).

This API will fail only when memory allocation failed while trying to record
an "unsigned long" value. All operations provided by this API never sleeps.
Also, this API does not provide exclusion control.
Therefore, the callers are responsible for e.g. inserting cond_resched() and
handling out of memory situation, and using rwlock or plain lock as needed.

Since virtio-balloon driver uses OOM notifier callback, in order to avoid
potential deadlock, the ballooning operation must not directly or indirectly
depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation.
Also, there should be no need to use __GFP_HIGH for memory allocation for
the ballooning operation, for if there is already little free memory such
that normal memory allocation requests will fail, the OOM notifier callback
will be fired by normal memory allocation requests, and the ballooning
operation will have to release memory just allocated. Therefore, this API
uses GFP_NOWAIT | __GFP_NOWARN for memory allocation.

If gathered pages tend to form a cluster, a bitmap for recording next
"unsigned long" value could be found at neighbor of the bitmap used for
recording previous "unsigned long" value. Therefore, this API uses
sequential seek rather than using some tree based algorithm (e.g. radix
tree or B+ tree) when finding a bitmap for recording an "unsigned long"
value. If values changes sequentially, this approach is much faster than
tree based algorithm.
----------------------------------------
/*
 * Clustered values bitmap.
 *
 * This file provides simple API for recording any "unsigned long" value and
 * for fetching recorded values in ascendant order, in order to allow handling
 * chunk of unique values efficiently.
 *
 * This API will fail only when memory allocation failed while trying to record
 * an "unsigned long" value. All operations provided by this API never sleeps.
 * Also, this API does not provide exclusion control.
 * Therefore, the callers are responsible for e.g. inserting cond_resched() and
 * handling out of memory situation, and using rwlock or plain lock as needed.
 */

/* Header file part. */

#include <linux/list.h>
 
/* Tune this size between 8 and PAGE_SIZE * 8, in power of 2. */
#define CVBMP_SIZE 1024

struct cvbmp_node;
struct cvbmp_head {
	/*
	 * list[0] is used by currently used "struct cvbmp_node" list.
	 * list[1] is used by currently unused "struct cvbmp_node" list.
	 */
	struct list_head list[2];
	/*
	 * Previously used "struct cvbmp_node" which is used as a hint for
	 * next operation.
	 */
	struct cvbmp_node *last_used;
};

void cvbmp_init(struct cvbmp_head *head);
void cvbmp_destroy(struct cvbmp_head *head);
bool __must_check cvbmp_set_bit(struct cvbmp_head *head,
				const unsigned long value);
bool cvbmp_test_bit(struct cvbmp_head *head, const unsigned long value);
void cvbmp_clear_bit(struct cvbmp_head *head, const unsigned long value);
unsigned long cvbmp_get_bit_range(struct cvbmp_head *head,
				  unsigned long *start);
bool __must_check cvbmp_set_segment(struct cvbmp_head *head,
				    const unsigned long segment);
void cvbmp_clear_segment(struct cvbmp_head *head, const unsigned long segment);

/* C file part. */

#ifdef __KERNEL__
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/slab.h>
#define assert(x) WARN_ON(!x)
#else
#include <linux/bitops.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#define kfree free
#define kzalloc(size, flag) calloc(size, 1)
#define pr_info printf
#define cond_resched() do { } while (0)
#endif

struct cvbmp_node {
	/* List for chaining to "struct cvbmp_head". */
	struct list_head list;
	/* Starting segment for this offset bitmap. */
	unsigned long segment;
	/*
	 * Number of bits set in this offset bitmap. If this value is larger
	 * than CVBMP_SIZE, this value must be multiple of CVBMP_SIZE, for
	 * this entry represents start of segments with all "1" set.
	 */
	unsigned long bits;
	/*
	 * Offset bitmap of CVBMP_SIZE bits. This bitmap can be modified
	 * only if "bits <= CVBMP_SIZE" is true.
	 */
	unsigned long *bitmap;
};

/**
 * cvbmp_init - Initialize "struct cvbmp_head".
 *
 * @head: Pointer to "struct cvbmp_head".
 *
 * Returns nothing.
 */
void cvbmp_init(struct cvbmp_head *head)
{
	INIT_LIST_HEAD(&head->list[0]);
	INIT_LIST_HEAD(&head->list[1]);
	head->last_used = NULL;
}

/**
 * cvbmp_destroy - Finalize "struct cvbmp_head".
 *
 * @head: Pointer to "struct cvbmp_head".
 *
 * Returns nothing.
 */
void cvbmp_destroy(struct cvbmp_head *head)
{
	struct cvbmp_node *ptr;
	struct cvbmp_node *tmp;
	unsigned int i;

	for (i = 0; i < 2; i++) {
		list_for_each_entry_safe(ptr, tmp, &head->list[i], list) {
			list_del(&ptr->list);
			kfree(ptr->bitmap);
			kfree(ptr);
		}
	}
	head->last_used = NULL;
}

/**
 * __cvbmp_merge - Merge "struct cvbmp_node" with all "1" set.
 *
 * @head:    Pointer to "struct cvbmp_head".
 * @segment: Segment number to merge.
 * @ptr:     Pointer to "struct cvbmp_node" with all "1" set.
 *
 * Returns nothing.
 */
static void __cvbmp_merge(struct cvbmp_head *head, const unsigned long segment,
			  struct cvbmp_node *ptr)
{
	if (ptr != list_first_entry(&head->list[0], typeof(*ptr), list)) {
		struct cvbmp_node *prev = list_prev_entry(ptr, list);

		if (prev->segment + prev->bits / CVBMP_SIZE == segment) {
			list_del(&ptr->list);
			list_add(&ptr->list, &head->list[1]);
			prev->bits += CVBMP_SIZE;
			head->last_used = prev;
			ptr = prev;
		}
	}
	if (ptr != list_last_entry(&head->list[0], typeof(*ptr), list)) {
		struct cvbmp_node *next = list_next_entry(ptr, list);

		if (next->bits >= CVBMP_SIZE && next->segment == segment + 1) {
			list_del(&next->list);
			list_add(&next->list, &head->list[1]);
			ptr->bits += next->bits;
		}
	}
}

/**
 * __cvbmp_unmerge - Unmerge "struct cvbmp_node" with all "1" set.
 *
 * @head:    Pointer to "struct cvbmp_head".
 * @segment: Segment number to unmerge.
 * @ptr:     Pointer to "struct cvbmp_node" with all "1" set.
 *
 * Returns nothing.
 */
static struct cvbmp_node *__cvbmp_unmerge(struct cvbmp_head *head,
					  const unsigned long segment,
					  struct cvbmp_node *ptr)
{
	unsigned long diff = segment - ptr->segment;
	struct cvbmp_node *new;

again:
	new = list_first_entry(&head->list[1], typeof(*ptr), list);
	list_del(&new->list);
	if (!diff) {
		list_add_tail(&new->list, &ptr->list);
		new->segment = segment;
		new->bits = CVBMP_SIZE;
		ptr->bits -= CVBMP_SIZE;
		ptr->segment++;
		return new;
	}
	list_add(&new->list, &ptr->list);
	new->segment = segment;
	new->bits = ptr->bits - CVBMP_SIZE * diff;
	ptr->bits -= new->bits;
	if (new->bits <= CVBMP_SIZE)
		return new;
	ptr = new;
	diff = 0;
	goto again;
}

/**
 * __cvbmp_in_data - Check "struct cvbmp_node" segment.
 *
 * @ptr:     Pointer to "struct cvbmp_node".
 * @segment: Segment number to check.
 *
 * Returns true if @segment is in @ptr, false otherwise.
 */
static inline bool __cvbmp_segment_in_data(struct cvbmp_node *ptr,
				    const unsigned long segment)
{
	return ptr->segment <= segment &&
		segment <= ptr->segment + (ptr->bits - 1) / CVBMP_SIZE;
}

/**
 * __cvbmp_lookup - Find "struct cvbmp_node" segment.
 *
 * @head:    Pointer to "struct cvbmp_head".
 * @segment: Segment number to find.
 * @create:  Whether to create one if not found.
 *
 * Returns pointer to "struct cvbmp_node" on success, NULL otherwise.
 */
static struct cvbmp_node *__cvbmp_lookup(struct cvbmp_head *head,
					 const unsigned long segment,
					 const bool create)
{
	struct cvbmp_node *ptr = head->last_used;
	struct list_head *insert_pos;
	bool add_tail = false;

	if (!ptr) {
		insert_pos = &head->list[0];
		list_for_each_entry(ptr, &head->list[0], list) {
			if (ptr->segment >= segment)
				break;
			insert_pos = &ptr->list;
		}
	} else if (__cvbmp_segment_in_data(ptr, segment)) {
		return ptr;
	} else if (ptr->segment < segment) {
		insert_pos = &ptr->list;
		list_for_each_entry_continue(ptr, &head->list[0], list) {
			if (ptr->segment >= segment)
				break;
			insert_pos = &ptr->list;
		}
	} else {
		add_tail = true;
		insert_pos = &ptr->list;
		list_for_each_entry_continue_reverse(ptr, &head->list[0],
						     list) {
			if (ptr->segment <= segment)
				break;
			insert_pos = &ptr->list;
		}
	}
	if (&ptr->list != &head->list[0]) {
		if (__cvbmp_segment_in_data(ptr, segment))
			return ptr;
		ptr = list_prev_entry(ptr, list);
		if (__cvbmp_segment_in_data(ptr, segment))
			return ptr;
	}
	if (!create)
		return NULL;
	ptr = kzalloc(sizeof(*ptr), GFP_NOWAIT | __GFP_NOWARN);
	if (!ptr)
		return NULL;
	ptr->bitmap = kzalloc(CVBMP_SIZE / 8, GFP_NOWAIT | __GFP_NOWARN);
	if (!ptr->bitmap) {
		kfree(ptr);
		return NULL;
	}
	ptr->segment = segment;
	if (!add_tail)
		list_add(&ptr->list, insert_pos);
	else
		list_add_tail(&ptr->list, insert_pos);
	return ptr;
}

/**
 * cvbmp_set_bit - Set one "1" bit in the bitmap.
 *
 * @head:  Pointer to "struct cvbmp_head".
 * @value: Value to set bit.
 *
 * Returns true on success, false otherwise.
 */
bool cvbmp_set_bit(struct cvbmp_head *head, const unsigned long value)
{
	struct cvbmp_node *ptr;
	const unsigned long segment = value / CVBMP_SIZE;
	const unsigned long offset = value % CVBMP_SIZE;

	ptr = __cvbmp_lookup(head, segment, true);
	if (!ptr)
		return false;
	head->last_used = ptr;
	if (test_bit(offset, ptr->bitmap))
		return true;
	__set_bit(offset, ptr->bitmap);
	ptr->bits++;
	if (ptr->bits == CVBMP_SIZE)
		__cvbmp_merge(head, segment, ptr);
	return true;
}

/**
 * cvbmp_test_bit - Test one "1" bit in the bitmap.
 *
 * @head:  Pointer to "struct cvbmp_head".
 * @value: Value to test bit.
 *
 * Returns true if "1" bit is set, false otherwise.
 */
bool cvbmp_test_bit(struct cvbmp_head *head, const unsigned long value)
{
	struct cvbmp_node *ptr;
	const unsigned long segment = value / CVBMP_SIZE;
	const unsigned long offset = value % CVBMP_SIZE;

	ptr = __cvbmp_lookup(head, segment, false);
	return ptr && test_bit(offset, ptr->bitmap);
}

/**
 * __cvbmp_neighbor - Find neighbor "struct cvbmp_node" segment.
 *
 * @head: Pointer to "struct cvbmp_head".
 * @ptr:  Pointer to "struct cvbmp_node".
 *
 * Returns pointer to "struct cvbmp_node" on success, NULL otherwise.
 */
static struct cvbmp_node *__cvbmp_neighbor(struct cvbmp_head *head,
					   struct cvbmp_node *ptr)
{
	if (ptr != list_first_entry(&head->list[0], typeof(*ptr), list))
		return list_prev_entry(ptr, list);
	if (ptr != list_last_entry(&head->list[0], typeof(*ptr), list))
		return list_next_entry(ptr, list);
	return NULL;
}

/**
 * cvbmp_clear_bit - Clear one "1" bit in the bitmap.
 *
 * @head:  Pointer to "struct cvbmp_head".
 * @value: Value to clear bit.
 *
 * Returns nothing.
 */
void cvbmp_clear_bit(struct cvbmp_head *head, const unsigned long value)
{
	struct cvbmp_node *ptr;
	const unsigned long segment = value / CVBMP_SIZE;
	const unsigned long offset = value % CVBMP_SIZE;

	ptr = __cvbmp_lookup(head, segment, false);
	if (!ptr)
		return;
	if (ptr->bits > CVBMP_SIZE)
		ptr = __cvbmp_unmerge(head, segment, ptr);
	head->last_used = ptr;
	if (!test_bit(offset, ptr->bitmap))
		return;
	__clear_bit(offset, ptr->bitmap);
	if (--ptr->bits)
		return;
	head->last_used = __cvbmp_neighbor(head, ptr);
	list_del(&ptr->list);
	kfree(ptr->bitmap);
	kfree(ptr);
}

/**
 * cvbmp_get_bit_range - Get range of "1" bits.
 *
 * @head:  Pointer to "struct cvbmp_head".
 * @start: Pointer to "unsigned long" which holds starting bit upon entry and
 *         holds found bit upon return.
 *
 * Returns length of consecutive "1" bits which start from @start or higher.
 * @start is updated to hold actual location where "1" bit started. The caller
 * can call this function again after adding return value of this function to
 * @start, but be sure to check for overflow which will happen when the last
 * bit (the ULONG_MAX'th bit) is "1".
 *
 * Returns 0 if no "1" bit was found in @start and afterwords bits. It does not
 * make sense to call this function again in that case.
 */
unsigned long cvbmp_get_bit_range(struct cvbmp_head *head, unsigned long *start)
{
	struct cvbmp_node *ptr;
	unsigned long segment = *start / CVBMP_SIZE;
	unsigned int offset = *start % CVBMP_SIZE;
	unsigned long ret = CVBMP_SIZE;

	list_for_each_entry(ptr, &head->list[0], list) {
		if (ptr->segment + ((ptr->bits - 1) / CVBMP_SIZE) < segment)
			continue;
		if (ptr->segment > segment)
			offset = 0;
		ret = find_next_bit(ptr->bitmap, CVBMP_SIZE, offset);
		if (ret < CVBMP_SIZE)
			break;
	}
	if (ret == CVBMP_SIZE)
		return 0;
	if (segment < ptr->segment)
		segment = ptr->segment;
	*start = segment * CVBMP_SIZE + ret;
	ret = find_next_zero_bit(ptr->bitmap, CVBMP_SIZE, ret);
	if (ret == CVBMP_SIZE) {
		segment = ptr->segment + ((ptr->bits - 1) / CVBMP_SIZE);
		list_for_each_entry_continue(ptr, &head->list[0], list) {
			if (ptr->segment != segment + 1)
				break;
			segment = ptr->segment + ((ptr->bits - 1) / CVBMP_SIZE);
			if (ptr->bits >= CVBMP_SIZE)
				continue;
			ret = find_next_zero_bit(ptr->bitmap, CVBMP_SIZE, 0);
			break;
		}
	}
	return segment * CVBMP_SIZE + ret - *start;
}

/**
 * cvbmp_set_segment - Set CVBMP_SIZE "1" bits in the bitmap.
 *
 * @head:    Pointer to "struct cvbmp_head".
 * @segment: Segment to set.
 *
 * Returns true on success, false otherwise.
 */
bool cvbmp_set_segment(struct cvbmp_head *head, const unsigned long segment)
{
	struct cvbmp_node *ptr;

	if (!cvbmp_set_bit(head, segment * CVBMP_SIZE))
		return false;
	ptr = head->last_used;
	if (ptr->bits >= CVBMP_SIZE)
		return true;
	ptr->bits = CVBMP_SIZE;
	memset(ptr->bitmap, 0xFF, CVBMP_SIZE / 8);
	__cvbmp_merge(head, segment, ptr);
	return true;
}

/**
 * cvbmp_clear_segment - Clear CVBMP_SIZE "1" bits in the bitmap.
 *
 * @head:    Pointer to "struct cvbmp_head".
 * @segment: Segment to set.
 *
 * Returns nothing.
 */
void cvbmp_clear_segment(struct cvbmp_head *head, const unsigned long segment)
{
	struct cvbmp_node *ptr;

	cvbmp_clear_bit(head, segment * CVBMP_SIZE);
	ptr = head->last_used;
	if (!ptr || ptr->segment != segment)
		return;
	head->last_used = __cvbmp_neighbor(head, ptr);
	list_del(&ptr->list);
	kfree(ptr->bitmap);
	kfree(ptr);
}

/* Module testing part. */

struct expect {
	unsigned long start;
	unsigned long end;
};

static void dump(struct cvbmp_head *head)
{
	struct cvbmp_node *ptr;

	pr_info("Debug dump start\n");
	list_for_each_entry(ptr, &head->list[0], list)
		pr_info("  %20lu %20lu (%20lu)\n", ptr->bits,
			ptr->segment * CVBMP_SIZE, ptr->segment);
	pr_info("Debug dump end\n");
}

static void check_result(struct cvbmp_head *head, const struct expect *expect,
			 const unsigned int num)
{
	unsigned long start = 0;
	unsigned long len;
	unsigned int i;

	for (i = 0; i < num; i++) {
		len = cvbmp_get_bit_range(head, &start);
		if (len == 0 || start != expect[i].start ||
		    len != expect[i].end - expect[i].start + 1) {
			dump(head);
			pr_info("start=%lu/%lu end=%lu/%lu\n", start,
				expect[i].start, start + len - 1,
				expect[i].end);
			assert(len != 0 && start == expect[i].start &&
			     len == expect[i].end - expect[i].start + 1);
		}
		start += len;
	}
	len = !num || start ? cvbmp_get_bit_range(head, &start) : 0;
	if (len) {
		dump(head);
		assert(len == 0);
	}
}

#define SET_BIT(i) assert(cvbmp_set_bit(&head, i))
#define CLEAR_BIT(i) cvbmp_clear_bit(&head, i)
#define SET_SEGMENT(i) assert(cvbmp_set_segment(&head, i))
#define CLEAR_SEGMENT(i) cvbmp_clear_segment(&head, i)

#define TEST_BIT(i) {				\
	SET_BIT(i);				\
	assert(cvbmp_test_bit(&head, i));	\
	CLEAR_BIT(i);				\
	assert(!cvbmp_test_bit(&head, i));	\
	SET_SEGMENT(i / CVBMP_SIZE);		\
	assert(cvbmp_test_bit(&head, i));	\
	CLEAR_SEGMENT(i / CVBMP_SIZE);		\
	assert(!cvbmp_test_bit(&head, i));	\
	}

static void test1(void)
{
	struct cvbmp_head head;
	unsigned long i;

	for (i = 1; i; i *= 2) {
		cvbmp_init(&head);
		TEST_BIT(i);
		cvbmp_destroy(&head);
	}

	for (i = ULONG_MAX; i; i /= 2) {
		cvbmp_init(&head);
		TEST_BIT(i);
		cvbmp_destroy(&head);
	}

	{
		const struct expect expect[] = {
			{ 100 * CVBMP_SIZE, 100 * CVBMP_SIZE }
		};

		cvbmp_init(&head);
		check_result(&head, NULL, 0);
		SET_BIT(100 * CVBMP_SIZE);
		check_result(&head, expect, ARRAY_SIZE(expect));
		CLEAR_BIT(100 * CVBMP_SIZE);
		check_result(&head, NULL, 0);
		cvbmp_destroy(&head);
	}
		
	{
		const struct expect expect[] = { { 0, 16 * CVBMP_SIZE - 1 } };

		cvbmp_init(&head);
		for (i = 0; i < 16 * CVBMP_SIZE; i += 2)
			SET_BIT(i);
		for (i = 1; i < 16 * CVBMP_SIZE; i += 2)
			SET_BIT(i);
		check_result(&head, expect, ARRAY_SIZE(expect));
		for (i = 0; i < 16 * CVBMP_SIZE; i++)
			CLEAR_BIT(i);
		cvbmp_destroy(&head);
	}

	{
		const struct expect expect[] = { { 0, 16 * CVBMP_SIZE - 1 } };

		cvbmp_init(&head);
		for (i = 0; i < 16; i += 2)
			SET_SEGMENT(i);
		for (i = 1; i < 16; i += 2)
			SET_SEGMENT(i);
		check_result(&head, expect, ARRAY_SIZE(expect));
		for (i = 0; i < 16; i++)
			CLEAR_SEGMENT(i);
		cvbmp_destroy(&head);
	}

	{
		const struct expect expect[] = {
			{ 100 * CVBMP_SIZE, 116 * CVBMP_SIZE - 1 }
		};

		cvbmp_init(&head);
		for (i = 101; i < 116; i += 2)
			SET_SEGMENT(i);
		for (i = 100; i < 116; i += 2)
			SET_SEGMENT(i);
		check_result(&head, expect, ARRAY_SIZE(expect));
		for (i = 100; i < 116; i++)
			CLEAR_SEGMENT(i);
		cvbmp_destroy(&head);
	}

	{
		const struct expect expect[] = {
			{ 50 * CVBMP_SIZE, 50 * CVBMP_SIZE },
			{ 100 * CVBMP_SIZE, 100 * CVBMP_SIZE },
			{ 200 * CVBMP_SIZE, 200 * CVBMP_SIZE },
			{ 300 * CVBMP_SIZE, 300 * CVBMP_SIZE }
		};
		
		cvbmp_init(&head);
		SET_BIT(100 * CVBMP_SIZE);
		SET_BIT(300 * CVBMP_SIZE);
		SET_BIT(50 * CVBMP_SIZE);
		SET_BIT(200 * CVBMP_SIZE);
		check_result(&head, expect, ARRAY_SIZE(expect));
		CLEAR_BIT(100 * CVBMP_SIZE);
		CLEAR_BIT(300 * CVBMP_SIZE);
		CLEAR_BIT(50 * CVBMP_SIZE);
		CLEAR_BIT(200 * CVBMP_SIZE);
		cvbmp_destroy(&head);
	}

	{
		const struct expect expect[] = {
			{ 100 * CVBMP_SIZE, 100 * CVBMP_SIZE },
			{ 200 * CVBMP_SIZE, 200 * CVBMP_SIZE },
			{ 250 * CVBMP_SIZE, 250 * CVBMP_SIZE },
			{ 300 * CVBMP_SIZE, 300 * CVBMP_SIZE }
		};
		
		cvbmp_init(&head);
		SET_BIT(100 * CVBMP_SIZE);
		SET_BIT(300 * CVBMP_SIZE);
		SET_BIT(250 * CVBMP_SIZE);
		SET_BIT(200 * CVBMP_SIZE);
		check_result(&head, expect, ARRAY_SIZE(expect));
		CLEAR_BIT(100 * CVBMP_SIZE);
		CLEAR_BIT(300 * CVBMP_SIZE);
		CLEAR_BIT(250 * CVBMP_SIZE);
		CLEAR_BIT(200 * CVBMP_SIZE);
		cvbmp_destroy(&head);
	}

	{
		const struct expect expect[] = {
			{ 0, CVBMP_SIZE - 1},
			{ LONG_MAX / 2 - CVBMP_SIZE + 1, LONG_MAX / 2 },
			{ ULONG_MAX - CVBMP_SIZE + 1, ULONG_MAX }
		};

		cvbmp_init(&head);
		SET_SEGMENT(ULONG_MAX / CVBMP_SIZE);
		SET_SEGMENT(0);
		SET_SEGMENT(LONG_MAX / 2 / CVBMP_SIZE);
		check_result(&head, expect, ARRAY_SIZE(expect));
		CLEAR_SEGMENT(ULONG_MAX / CVBMP_SIZE);
		CLEAR_SEGMENT(0);
		CLEAR_SEGMENT(LONG_MAX / 2 / CVBMP_SIZE);
		cvbmp_destroy(&head);
	}

	{
		unsigned long bit;
		struct expect expect[] = {
			{ 100 * CVBMP_SIZE, 0 },
			{ 0, 110 * CVBMP_SIZE - 1 }
		};

		cvbmp_init(&head);
		for (i = 100; i < 110; i++)
			SET_SEGMENT(i);
		for (i = 100; i < 110; i++) {
			bit = i * CVBMP_SIZE + CVBMP_SIZE / 2;
			CLEAR_BIT(bit);
			expect[0].end = bit - 1;
			expect[1].start = bit + 1;
			check_result(&head, expect, 2);
			SET_BIT(bit);
		}
		for (i = 101; i < 109; i++) {
			bit = i * CVBMP_SIZE;
			CLEAR_BIT(bit);
			expect[0].end = bit - 1;
			expect[1].start = bit + 1;
			check_result(&head, expect, 2);
			SET_BIT(bit);
			bit = i * CVBMP_SIZE + CVBMP_SIZE - 1;
			CLEAR_BIT(bit);
			expect[0].end = bit - 1;
			expect[1].start = bit + 1;
			check_result(&head, expect, 2);
			SET_BIT(bit);
		}
		for (i = 100; i < 110; i++)
			CLEAR_SEGMENT(i);
		cvbmp_destroy(&head);
	}
}

static void test2(void)
{
	struct cvbmp_head head;
	unsigned long start;
	unsigned long i;
	const struct expect expect[] = {
		{ 0, 0 },
		{ 10, 10 },
		{ 12, 12 },
		{ 1000000, 1000000 },
		{ 1048576, 1048600 + CVBMP_SIZE * 2 - 2 - 1 },
		{ 1048600 + CVBMP_SIZE * 2 - 2 + 1,
		  1048576 + CVBMP_SIZE * 3 - 1 },
		{ 1000000000, 1234567890 - 1 },
		{ 1234567890 + 1, 1000000000 * 2 - 2 },
		{ 2000000000, 2000000000 },
		{ ULONG_MAX - CVBMP_SIZE * 3 + 1, ULONG_MAX }
	};

	cvbmp_init(&head);

	SET_BIT(0);
	SET_BIT(10);
	SET_BIT(11);
	SET_BIT(1000000);
	SET_BIT(12);
	SET_BIT(2000000000);
	CLEAR_BIT(11);
	for (i = 1048576; i < 1048576 + CVBMP_SIZE * 3; i += CVBMP_SIZE)
		SET_SEGMENT(i / CVBMP_SIZE);
	for (i = 1048576 + CVBMP_SIZE; i < 1048576 + CVBMP_SIZE * 2;
	     i += CVBMP_SIZE)
		SET_SEGMENT(i / CVBMP_SIZE);
	CLEAR_BIT(1048600 + CVBMP_SIZE * 2 - 2);
	for (i = ULONG_MAX; i > ULONG_MAX - CVBMP_SIZE * 3; i--)
		SET_BIT(i);
	SET_BIT(ULONG_MAX);
	for (start = 0; start < 1; start++)
		for (i = 1000000000; i <= 1000000000 * 2 - 2; i += 1) {
			if (start + i <= 1000000000 * 2 - 2)
				SET_BIT(start + i);
			cond_resched();
		}
	CLEAR_BIT(1234567890);

	check_result(&head, expect, ARRAY_SIZE(expect));

	cvbmp_destroy(&head);
}

#ifdef __KERNEL__
static int __init test_init(void)
{
	test1();
	test2();
	return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
#else
int __weak main(int argc, char *argv[])
{
	test1();
	test2();
	return 0;
}
#endif
Wang, Wei W Jan. 3, 2018, 9 a.m. UTC | #12
On 01/03/2018 10:29 AM, Tetsuo Handa wrote:
> Matthew Wilcox wrote:
>> The radix tree convention is objectively awful, which is why I'm working
>> to change it.  Specifying the GFP flags at radix tree initialisation time
>> rather than allocation time leads to all kinds of confusion.  The preload
>> API is a pretty awful workaround, and it will go away once the XArray
>> is working correctly.  That said, there's no alternative to it without
>> making XBitmap depend on XArray, and I don't want to hold you up there.
>> So there's an xb_preload for the moment.
> I'm ready to propose cvbmp shown below as an alternative to xbitmap (but
> specialized for virtio-balloon case). Wei, can you do some benchmarking
> between xbitmap and cvbmp?
> ----------------------------------------
> cvbmp: clustered values bitmap

I don't think we need to replace xbitmap, at least at this stage. The 
new implementation doesn't look simpler at all, and virtio-balloon has 
worked well with xbitmap.

I would suggest you to send out the new implementation for discussion 
after this series ends, and justify with better performance results if 
you could get.

Best,
Wei
Tetsuo Handa Jan. 3, 2018, 10:19 a.m. UTC | #13
Wei Wang wrote:
> On 01/03/2018 10:29 AM, Tetsuo Handa wrote:
> > Matthew Wilcox wrote:
> >> The radix tree convention is objectively awful, which is why I'm working
> >> to change it.  Specifying the GFP flags at radix tree initialisation time
> >> rather than allocation time leads to all kinds of confusion.  The preload
> >> API is a pretty awful workaround, and it will go away once the XArray
> >> is working correctly.  That said, there's no alternative to it without
> >> making XBitmap depend on XArray, and I don't want to hold you up there.
> >> So there's an xb_preload for the moment.
> > I'm ready to propose cvbmp shown below as an alternative to xbitmap (but
> > specialized for virtio-balloon case). Wei, can you do some benchmarking
> > between xbitmap and cvbmp?
> > ----------------------------------------
> > cvbmp: clustered values bitmap
> 
> I don't think we need to replace xbitmap, at least at this stage. The 
> new implementation doesn't look simpler at all, and virtio-balloon has 
> worked well with xbitmap.
> 
> I would suggest you to send out the new implementation for discussion 
> after this series ends, and justify with better performance results if 
> you could get.

I'm VMware Workstation Player user, and I don't have environment for doing
performance test using virtio-balloon. Thus, I need to ask you.

Also, please look at
http://lkml.kernel.org/r/1514904621-39186-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
diff mbox

Patch

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index a1fb52c..fff0a3f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@ 
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@  struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,15 +146,129 @@  static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static void add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
+{
+	struct scatterlist sg;
+	unsigned int unused;
+	int err;
+
+	sg_init_table(&sg, 1);
+	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	err = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+	/*
+	 * This is expected to never fail: there is always at least 1 entry
+	 * available on the vq, because when the vq is full the worker thread
+	 * that adds the sg will be put into sleep until at least 1 entry is
+	 * available to use.
+	 */
+	BUG_ON(err);
+}
+
+static void batch_balloon_page_sg(struct virtio_balloon *vb,
+				  struct virtqueue *vq,
+				  unsigned long pfn,
+				  uint32_t len)
+{
+	add_one_sg(vq, pfn, len);
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long pfn_start, pfn_end;
+	uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
+	uint64_t len;
+
+	pfn_start = page_xb_start;
+	while (pfn_start < page_xb_end) {
+		pfn_start = xb_find_set(&vb->page_xb, page_xb_end + 1,
+					pfn_start);
+		if (pfn_start == page_xb_end + 1)
+			break;
+		pfn_end = xb_find_zero(&vb->page_xb, page_xb_end + 1,
+				       pfn_start);
+		len = (pfn_end - pfn_start) << PAGE_SHIFT;
+		while (len > max_len) {
+			batch_balloon_page_sg(vb, vq, pfn_start, max_len);
+			pfn_start += max_len >> PAGE_SHIFT;
+			len -= max_len;
+		}
+		batch_balloon_page_sg(vb, vq, pfn_start, (uint32_t)len);
+		pfn_start = pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+}
+
+static inline int xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int ret;
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+
+	do {
+		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
+			return -ENOMEM;
+
+		ret = xb_set_bit(&vb->page_xb, pfn);
+		xb_preload_end();
+	} while (unlikely(ret == -EAGAIN));
+
+	return ret;
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	unsigned num_allocated_pages;
 	unsigned num_pfns;
 	struct page *page;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (num_pfns = 0; num_pfns < num;
 	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
@@ -173,8 +292,15 @@  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	while ((page = balloon_page_pop(&pages))) {
 		balloon_page_enqueue(&vb->vb_dev_info, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				__free_page(page);
+				continue;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -184,8 +310,12 @@  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -211,9 +341,12 @@  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -223,7 +356,14 @@  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				balloon_page_enqueue(&vb->vb_dev_info, page);
+				break;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -234,13 +374,55 @@  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
 
+/*
+ * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
+ * for xbitmap, which is not suitable for the oom case. This function does not
+ * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
+ * pages when VIRTIO_BALLOON_F_SG is negotiated.
+ */
+static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
+{
+	unsigned int n;
+	struct page *page;
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	struct virtqueue *vq = vb->deflate_vq;
+	LIST_HEAD(pages);
+
+	mutex_lock(&vb->balloon_lock);
+	for (n = 0; n < oom_pages; n++) {
+		page = balloon_page_dequeue(vb_dev_info);
+		if (!page)
+			break;
+
+		list_add(&page->lru, &pages);
+		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+		batch_balloon_page_sg(vb, vb->deflate_vq, page_to_pfn(page),
+				      PAGE_SIZE);
+		release_pages_balloon(vb, &pages);
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+	mutex_unlock(&vb->balloon_lock);
+
+	return n;
+}
+
 static inline void update_stat(struct virtio_balloon *vb, int idx,
 			       u16 tag, u64 val)
 {
@@ -380,7 +562,10 @@  static int virtballoon_oom_notify(struct notifier_block *self,
 		return NOTIFY_OK;
 
 	freed = parm;
-	num_freed_pages = leak_balloon(vb, oom_pages);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG))
+		num_freed_pages = leak_balloon_sg_oom(vb);
+	else
+		num_freed_pages = leak_balloon(vb, oom_pages);
 	update_balloon_size(vb);
 	*freed += num_freed_pages;
 
@@ -477,6 +662,7 @@  static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -498,16 +684,24 @@  static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		add_one_sg(vb->inflate_vq, page_to_pfn(newpage), PAGE_SIZE);
+		kick_and_wait(vb->inflate_vq, vb->acked);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		add_one_sg(vb->deflate_vq, page_to_pfn(page), PAGE_SIZE);
+		kick_and_wait(vb->deflate_vq, vb->acked);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -566,6 +760,9 @@  static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -682,6 +879,7 @@  static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@ 
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12