
block, migration: Use qemu_madvise in place of madvise

Message ID 1487318764-29513-1-git-send-email-pagupta@redhat.com (mailing list archive)
State New, archived

Commit Message

Pankaj Gupta Feb. 17, 2017, 8:06 a.m. UTC
To maintain consistency everywhere, use the qemu_madvise wrapper
in place of the madvise call.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 block/qcow2-cache.c      | 2 +-
 migration/postcopy-ram.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Kevin Wolf Feb. 17, 2017, 8:49 a.m. UTC | #1
On 17.02.2017 at 09:06, Pankaj Gupta wrote:
>  To maintain consistency at all the places use qemu_madvise wrapper
>  inplace of madvise call.
> 
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

Juan/Dave, if one of you can give an Acked-by, I can take this through
my tree.

Kevin
Dr. David Alan Gilbert Feb. 17, 2017, 9:48 a.m. UTC | #2
* Kevin Wolf (kwolf@redhat.com) wrote:
> On 17.02.2017 at 09:06, Pankaj Gupta wrote:
> >  To maintain consistency at all the places use qemu_madvise wrapper
> >  inplace of madvise call.
> > 
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> 
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> 
> Juan/Dave, if one of you can give an Acked-by, I can take this through
> my tree.

NACK

That's wrong; qemu_madvise can end up going through posix_madvise and
using POSIX_MADV_DONTNEED, which has different semantics from
madvise(MADV_DONTNEED). We need the semantics of madvise, i.e. it's
guaranteed to throw away the pages, whereas posix_madvise *may* throw
away the pages if the kernel feels like it.

Dave

> Kevin
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
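
The difference David describes can be checked with a small C probe. This is
a minimal sketch, not code from QEMU: the function name page_zeroed_after is
hypothetical, and it assumes Linux and a private anonymous mapping. It fills
one page with a pattern, applies either madvise(MADV_DONTNEED) or
posix_madvise(POSIX_MADV_DONTNEED), and reports whether the page reads back
as zero afterwards.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical probe (not a QEMU function).
 * use_posix == 0: call madvise(MADV_DONTNEED)
 * use_posix != 0: call posix_madvise(POSIX_MADV_DONTNEED)
 * Returns 1 if the page reads back as zero, 0 if the old contents
 * survived, and -1 on any error.
 */
int page_zeroed_after(int use_posix)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) {
        return -1;
    }
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }

    /* Dirty the page so that a discard is observable. */
    memset(p, 0xAB, len);

    /* posix_madvise() returns an error number, madvise() returns -1;
     * both return 0 on success. */
    int rc = use_posix ? posix_madvise(p, len, POSIX_MADV_DONTNEED)
                       : madvise(p, len, MADV_DONTNEED);

    int result = -1;
    if (rc == 0) {
        result = (p[0] == 0 && p[len - 1] == 0) ? 1 : 0;
    }

    munmap(p, len);
    return result;
}
```

On Linux, page_zeroed_after(0) should report a zeroed page, because
madvise(MADV_DONTNEED) drops private anonymous pages. POSIX specifies that
posix_madvise() does not change the semantics of memory access, so with
page_zeroed_after(1) the pattern should survive.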
Alberto Garcia Feb. 17, 2017, 10:06 a.m. UTC | #3
On Fri 17 Feb 2017 09:06:04 AM CET, Pankaj Gupta wrote:
>  To maintain consistency at all the places use qemu_madvise wrapper
>  inplace of madvise call.

>      if (length > 0) {
> -        madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
> +        qemu_madvise((uint8_t *) t + offset, length, QEMU_MADV_DONTNEED);

This was changed two months ago from qemu_madvise() to madvise(), is
there any reason why you want to revert that change? Those two calls are
not equivalent, please see commit 2f2c8d6b371cfc6689affb0b7e for an
explanation.

> -    if (madvise(start, length, MADV_DONTNEED)) {
> +    if (qemu_madvise(start, length, QEMU_MADV_DONTNEED)) {
>          error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));

And this is the same case.

Berto
Pankaj Gupta Feb. 17, 2017, 11:30 a.m. UTC | #4
Thanks for your comments. I have a question below.
> 
> On Fri 17 Feb 2017 09:06:04 AM CET, Pankaj Gupta wrote:
> >  To maintain consistency at all the places use qemu_madvise wrapper
> >  inplace of madvise call.
> 
> >      if (length > 0) {
> > -        madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
> > +        qemu_madvise((uint8_t *) t + offset, length, QEMU_MADV_DONTNEED);
> 
> This was changed two months ago from qemu_madvise() to madvise(), is
> there any reason why you want to revert that change? Those two calls are
> not equivalent, please see commit 2f2c8d6b371cfc6689affb0b7e for an
> explanation.
> 
> > -    if (madvise(start, length, MADV_DONTNEED)) {
> > +    if (qemu_madvise(start, length, QEMU_MADV_DONTNEED)) {
> >          error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));

I only checked the history of the change related to 'postcopy'.

On my Linux machine:

./config-host.mak

CONFIG_MADVISE=y
CONFIG_POSIX_MADVISE=y

As both these options are set on Linux, every call to 'qemu_madvise' compiles to "madvise(addr, len, advice);".
I don't understand why '2f2c8d6b371cfc6689affb0b7e' explicitly changed this under "#ifdef CONFIG_LINUX".
I think it's better to write a generic function, maybe in a wrapper, than to conditionally set something in different places.

int qemu_madvise(void *addr, size_t len, int advice)
{
    if (advice == QEMU_MADV_INVALID) {
        errno = EINVAL;
        return -1;
    }
#if defined(CONFIG_MADVISE)
    return madvise(addr, len, advice);
#elif defined(CONFIG_POSIX_MADVISE)
    return posix_madvise(addr, len, advice);
#else
    errno = EINVAL;
    return -1;
#endif
}

> 
> And this is the same case.
> 
> Berto
>
Dr. David Alan Gilbert Feb. 17, 2017, 11:36 a.m. UTC | #5
* Pankaj Gupta (pagupta@redhat.com) wrote:
> 
> Thanks for your comments. I have below query.
> > 
> > On Fri 17 Feb 2017 09:06:04 AM CET, Pankaj Gupta wrote:
> > >  To maintain consistency at all the places use qemu_madvise wrapper
> > >  inplace of madvise call.
> > 
> > >      if (length > 0) {
> > > -        madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
> > > +        qemu_madvise((uint8_t *) t + offset, length, QEMU_MADV_DONTNEED);
> > 
> > This was changed two months ago from qemu_madvise() to madvise(), is
> > there any reason why you want to revert that change? Those two calls are
> > not equivalent, please see commit 2f2c8d6b371cfc6689affb0b7e for an
> > explanation.
> > 
> > > -    if (madvise(start, length, MADV_DONTNEED)) {
> > > +    if (qemu_madvise(start, length, QEMU_MADV_DONTNEED)) {
> > >          error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));
> 
> I checked history of only change related to 'postcopy'.
> 
> For my linux machine:
> 
> ./config-host.mak
> 
> CONFIG_MADVISE=y
> CONFIG_POSIX_MADVISE=y
> 
> As both these options are set for Linux, every time we call call 'qemu_madvise' ==>"madvise(addr, len, advice);" will 
> be compiled/called. I don't understand why '2f2c8d6b371cfc6689affb0b7e' explicitly changed for :"#ifdef CONFIG_LINUX"
> I think its better to write generic function maybe in a wrapper then to conditionally set something at different places.

No; the problem is that the behaviours are different.
You're right that the current build on Linux defines CONFIG_MADVISE, and thus we are
safe because qemu_madvise takes the CONFIG_MADVISE/madvise route. But we need to be
explicit that it's only the madvise() route that's safe, not any of the calls
implemented by qemu_madvise, because if in the future someone were to rearrange
qemu_madvise to prefer posix_madvise, postcopy would break in a very subtle way.

IMHO it might even be better to remove the definition of QEMU_MADV_DONTNEED altogether
and make a name that wasn't ambiguous between the two, since the posix definition is
so different.

Dave

> int qemu_madvise(void *addr, size_t len, int advice)
> {
>     if (advice == QEMU_MADV_INVALID) {
>         errno = EINVAL;
>         return -1;
>     }
> #if defined(CONFIG_MADVISE)
>     return madvise(addr, len, advice);
> #elif defined(CONFIG_POSIX_MADVISE)
>     return posix_madvise(addr, len, advice);
> #else
>     errno = EINVAL;
>     return -1;
> #endif
> }
> 
> > 
> > And this is the same case.
> > 
> > Berto
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
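
One way to make the "only the madvise() route is safe" requirement explicit
is a helper whose name does not suggest a portable DONTNEED. The following
is only a sketch of that idea: discard_range_strict is a hypothetical name,
not an existing QEMU function, and the choice of ENOTSUP for other platforms
is an assumption made for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not part of QEMU): discard the pages in
 * [addr, addr + len) with the guaranteed Linux semantics, or fail.
 * It never falls back to posix_madvise(), so a caller cannot silently
 * get the weaker "advice only" behaviour.
 * Returns 0 on success, -1 with errno set on failure.
 */
int discard_range_strict(void *addr, size_t len)
{
#if defined(__linux__)
    return madvise(addr, len, MADV_DONTNEED);
#else
    (void)addr;
    (void)len;
    errno = ENOTSUP; /* assumption: refuse rather than degrade */
    return -1;
#endif
}
```

A caller such as postcopy would then see a hard error on platforms without
the required semantics, instead of a discard that may not have happened.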
Pankaj Gupta Feb. 17, 2017, 12:30 p.m. UTC | #6
> 
> * Pankaj Gupta (pagupta@redhat.com) wrote:
> > 
> > Thanks for your comments. I have below query.
> > > 
> > > On Fri 17 Feb 2017 09:06:04 AM CET, Pankaj Gupta wrote:
> > > >  To maintain consistency at all the places use qemu_madvise wrapper
> > > >  inplace of madvise call.
> > > 
> > > >      if (length > 0) {
> > > > -        madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
> > > > +        qemu_madvise((uint8_t *) t + offset, length,
> > > > QEMU_MADV_DONTNEED);
> > > 
> > > This was changed two months ago from qemu_madvise() to madvise(), is
> > > there any reason why you want to revert that change? Those two calls are
> > > not equivalent, please see commit 2f2c8d6b371cfc6689affb0b7e for an
> > > explanation.
> > > 
> > > > -    if (madvise(start, length, MADV_DONTNEED)) {
> > > > +    if (qemu_madvise(start, length, QEMU_MADV_DONTNEED)) {
> > > >          error_report("%s MADV_DONTNEED: %s", __func__,
> > > >          strerror(errno));
> > 
> > I checked history of only change related to 'postcopy'.
> > 
> > For my linux machine:
> > 
> > ./config-host.mak
> > 
> > CONFIG_MADVISE=y
> > CONFIG_POSIX_MADVISE=y
> > 
> > As both these options are set for Linux, every time we call call
> > 'qemu_madvise' ==>"madvise(addr, len, advice);" will
> > be compiled/called. I don't understand why '2f2c8d6b371cfc6689affb0b7e'
> > explicitly changed for :"#ifdef CONFIG_LINUX"
> > I think its better to write generic function maybe in a wrapper then to
> > conditionally set something at different places.
> 
> No; the problem is that the behaviours are different.
> You're right that the current build on Linux defines MADVISE and thus we are
> safe because qemu_madvise
> takes teh CONFIG_MADVISE/madvise route - but we need to be explicit that it's
> only
> the madvise() route that's safe, not any of the calls implemented by
> qemu_madvise, because if in the future someone was to rearrange qemu_madvise
> to prefer posix_madvise postcopy would break in a very subtle way.

Agreed.
Can we add a comment explaining this?

> 
> IMHO it might even be better to remove the definition of QEMU_MADV_DONTNEED
> altogether
> and make a name that wasn't ambiguous between the two, since the posix
> definition is
> so different.

I think 'posix_madvise' was added for systems which did not have 'madvise'.
If I look at the makefile, we first check which calls are available and then
set the config options accordingly. We give 'madvise' precedence over 'posix_madvise'
if both are present.

On systems which don't have the madvise call, 'posix_madvise' is called, which as per
this discussion is not the right thing for the 'DONTNEED' option. It will not give the desired results.

Either we have to find the right alternative, or else it is already broken on systems which
don't support madvise.

> 
> Dave
> 
> > int qemu_madvise(void *addr, size_t len, int advice)
> > {
> >     if (advice == QEMU_MADV_INVALID) {
> >         errno = EINVAL;
> >         return -1;
> >     }
> > #if defined(CONFIG_MADVISE)
> >     return madvise(addr, len, advice);
> > #elif defined(CONFIG_POSIX_MADVISE)
> >     return posix_madvise(addr, len, advice);
> > #else
> >     errno = EINVAL;
> >     return -1;
> > #endif
> > }
> > 
> > > 
> > > And this is the same case.
> > > 
> > > Berto
> > > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
>
Alberto Garcia Feb. 17, 2017, 12:31 p.m. UTC | #7
On Fri 17 Feb 2017 12:30:28 PM CET, Pankaj Gupta wrote:
>> >  To maintain consistency at all the places use qemu_madvise wrapper
>> >  inplace of madvise call.
>> 
>> > -        madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
>> > +        qemu_madvise((uint8_t *) t + offset, length, QEMU_MADV_DONTNEED);
>> 
>> Those two calls are not equivalent, please see commit
>> 2f2c8d6b371cfc6689affb0b7e for an explanation.

> I don't understand why '2f2c8d6b371cfc6689affb0b7e' explicitly changed
> for :"#ifdef CONFIG_LINUX" I think its better to write generic
> function maybe in a wrapper then to conditionally set something at
> different places.

The problem with qemu_madvise(QEMU_MADV_DONTNEED) is that it can mean
different things depending on the platform:

   posix_madvise(POSIX_MADV_DONTNEED)
   madvise(MADV_DONTNEED)

The first call is standard but it doesn't do what we need, so we cannot
use it.

The second call -- madvise(MADV_DONTNEED) -- is not standard, and it
doesn't do the same on all platforms. The only platform on which it does
what we need is Linux, hence the #ifdef CONFIG_LINUX and #if
defined(__linux__) that you see in the code.

I agree with David's comment that maybe it's better to remove
QEMU_MADV_DONTNEED altogether since it's not reliable.

Berto
Alberto Garcia Feb. 17, 2017, 12:49 p.m. UTC | #8
On Fri 17 Feb 2017 01:30:09 PM CET, Pankaj Gupta wrote:
> I think 'posix_madvise' was added for systems which didnot have
> 'madvise' [...] For the systems which don't have madvise call
> 'posix_madvise' is called which as per discussion is not right thing
> for 'DONTNEED' option. It will not give desired results.
>
> Either we have to find right alternative or else it is already broken
> for systems which don't support madvise.

Do you have an example of a call that is currently broken in the QEMU
code?

Berto

Patch

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 1d25147..4991ca5 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -74,7 +74,7 @@  static void qcow2_cache_table_release(BlockDriverState *bs, Qcow2Cache *c,
     size_t offset = QEMU_ALIGN_UP((uintptr_t) t, align) - (uintptr_t) t;
     size_t length = QEMU_ALIGN_DOWN(mem_size - offset, align);
     if (length > 0) {
-        madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
+        qemu_madvise((uint8_t *) t + offset, length, QEMU_MADV_DONTNEED);
     }
 #endif
 }
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a40dddb..558fec1 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -213,7 +213,7 @@  int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
                                size_t length)
 {
     trace_postcopy_ram_discard_range(start, length);
-    if (madvise(start, length, MADV_DONTNEED)) {
+    if (qemu_madvise(start, length, QEMU_MADV_DONTNEED)) {
         error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));
         return -1;
     }
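
The offset/length computation in the qcow2_cache_table_release() hunk above
trims the buffer to the largest sub-range that starts and ends on an
alignment boundary, since madvise() needs a page-aligned start address. The
sketch below reproduces that arithmetic in isolation. ALIGN_DOWN and
ALIGN_UP are local stand-ins assumed to behave like QEMU_ALIGN_DOWN and
QEMU_ALIGN_UP, aligned_subrange is a hypothetical name, and the guard
against an offset larger than the buffer is an addition for this sketch
that is not present in the hunk.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins; assumed equivalent to QEMU_ALIGN_DOWN / QEMU_ALIGN_UP. */
#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
#define ALIGN_UP(n, m)   ALIGN_DOWN((n) + (m) - 1, (m))

/*
 * Hypothetical helper: for a buffer of mem_size bytes starting at address
 * base, compute the offset of the first aligned byte and the length of the
 * aligned region that fits entirely inside the buffer.
 * Returns 0 on success, -1 if align is zero.
 */
int aligned_subrange(uintptr_t base, size_t mem_size, size_t align,
                     size_t *offset, size_t *length)
{
    if (align == 0) {
        return -1;
    }

    size_t off = (size_t)(ALIGN_UP(base, (uintptr_t)align) - base);
    *offset = off;

    if (off > mem_size) {
        /* Buffer ends before the first boundary: nothing to release. */
        *length = 0;
        return 0;
    }

    *length = ALIGN_DOWN(mem_size - off, align);
    return 0;
}
```

For example, with base 0x1010, mem_size 0x3000 and align 0x1000, the first
boundary is at 0x2000, so the offset is 0xff0 and the aligned length is
0x2000; the trailing 0x10 bytes are left alone.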