kmemleak: don't use __GFP_NOFAIL

Message ID	1684479370.5483281.1527680579781.JavaMail.zimbra@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of chuhu@redhat.com designates 209.132.183.28 as permitted sender) client-ip=209.132.183.28; Date: Wed, 30 May 2018 07:42:59 -0400 (EDT) From: Chunyu Hu <chuhu@redhat.com> Reply-To: Chunyu Hu <chuhu@redhat.com> To: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>, malat@debian.org, dvyukov@google.com, linux-mm@kvack.org, catalin marinas <catalin.marinas@arm.com>, Akinobu Mita <akinobu.mita@gmail.com> Message-ID: <1684479370.5483281.1527680579781.JavaMail.zimbra@redhat.com> In-Reply-To: <20180530104637.GC27180@dhcp22.suse.cz> References: <CA+7wUswp_Sr=hHqi1bwRZ3FE2wY5ozZWZ8Z1BgrFnSAmijUKjA@mail.gmail.com> <f054219d-6daa-68b1-0c60-0acd9ad8c5ab@i-love.sakura.ne.jp> <20180528132410.GD27180@dhcp22.suse.cz> <201805290605.DGF87549.LOVFMFJQSOHtFO@I-love.SAKURA.ne.jp> <1126233373.5118805.1527600426174.JavaMail.zimbra@redhat.com> <f3d58cbd-29ca-7a23-69e0-59690b9cd4fb@i-love.sakura.ne.jp> <1730157334.5467848.1527672937617.JavaMail.zimbra@redhat.com> <20180530104637.GC27180@dhcp22.suse.cz> Subject: Re: [PATCH] kmemleak: don't use __GFP_NOFAIL MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Thread-Topic: kmemleak: don't use __GFP_NOFAIL Thread-Index: jNkZpf29iE7XjgVWu4YUXEszv3bdTg== Sender: owner-linux-mm@kvack.org Precedence: bulk

Chunyu Hu May 30, 2018, 11:42 a.m. UTC

----- Original Message -----
> From: "Michal Hocko" <mhocko@suse.com>
> To: "Chunyu Hu" <chuhu@redhat.com>
> Cc: "Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp>, malat@debian.org, dvyukov@google.com, linux-mm@kvack.org,
> "catalin marinas" <catalin.marinas@arm.com>
> Sent: Wednesday, May 30, 2018 6:46:37 PM
> Subject: Re: [PATCH] kmemleak: don't use __GFP_NOFAIL
> 
> On Wed 30-05-18 05:35:37, Chunyu Hu wrote:
> [...]
> > I'm trying to reuse the make_it_fail field in task for fault injection. As
> > adding
> > an extra memory alloc flag is not thought so good,  I think adding task
> > flag
> > is either?
> 
> Yeah, task flag will be reduced to KMEMLEAK enabled configurations
> without an additional maint. overhead. Anyway, you should really think
> about how to guarantee trackability for atomic allocation requests. You
> cannot simply assume that GFP_NOWAIT will succeed. I guess you really

Sure. While I'm using task->make_it_fail, I'm still in the direction of 
making kmemleak avoid fault inject with task flag instead of page alloc flag.

> want to have a pre-populated pool of objects for those requests. The
> obvious question is how to balance such a pool. It ain't easy to track
> memory by allocating more memory...

This solution is going to make kmemleak trace really nofail. We can think
later.

while I'm thinking about if fault inject can be disabled via flag in task.

Actually, I'm doing something like below, the disable_fault_inject() is
just setting a flag in task->make_it_fail. But this will depend on if
fault injection accept a change like this. CCing Akinobu 

[1] http://lkml.kernel.org/r/1524243513-29118-1-git-send-email-chuhu@redhat.com
[2] http://lkml.kernel.org/r/CA+7wUswp_Sr=hHqi1bwRZ3FE2wY5ozZWZ8Z1BgrFnSAmijUKjA@mail.gmail.com
[3] commit d9570ee3bd1d ("kmemleak: allow to coexist with fault injection")


+#define disable_fault_inject() \
+do {   \
+   unsigned long flag; \
+   local_irq_save(flag);   \
+   if (in_irq())   \
+       current->make_it_fail |= HARDIRQ_NOFAULT_OFFSET;    \
+   else    \
+       current->make_it_fail |= TASK_NOFAULT_OFFSET;   \
+   local_irq_restore(flag);        \
+} while (0)




> 
> --
> Michal Hocko
> SUSE Labs
>

Michal Hocko May 30, 2018, 12:38 p.m. UTC | #1

On Wed 30-05-18 07:42:59, Chunyu Hu wrote:
> 
> ----- Original Message -----
> > From: "Michal Hocko" <mhocko@suse.com>
> > To: "Chunyu Hu" <chuhu@redhat.com>
> > Cc: "Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp>, malat@debian.org, dvyukov@google.com, linux-mm@kvack.org,
> > "catalin marinas" <catalin.marinas@arm.com>
> > Sent: Wednesday, May 30, 2018 6:46:37 PM
> > Subject: Re: [PATCH] kmemleak: don't use __GFP_NOFAIL
> > 
> > On Wed 30-05-18 05:35:37, Chunyu Hu wrote:
> > [...]
> > > I'm trying to reuse the make_it_fail field in task for fault injection. As
> > > adding
> > > an extra memory alloc flag is not thought so good,  I think adding task
> > > flag
> > > is either?
> > 
> > Yeah, task flag will be reduced to KMEMLEAK enabled configurations
> > without an additional maint. overhead. Anyway, you should really think
> > about how to guarantee trackability for atomic allocation requests. You
> > cannot simply assume that GFP_NOWAIT will succeed. I guess you really
> 
> Sure. While I'm using task->make_it_fail, I'm still in the direction of 
> making kmemleak avoid fault inject with task flag instead of page alloc flag.
> 
> > want to have a pre-populated pool of objects for those requests. The
> > obvious question is how to balance such a pool. It ain't easy to track
> > memory by allocating more memory...
> 
> This solution is going to make kmemleak trace really nofail. We can think
> later.
> 
> while I'm thinking about if fault inject can be disabled via flag in task.
> 
> Actually, I'm doing something like below, the disable_fault_inject() is
> just setting a flag in task->make_it_fail. But this will depend on if
> fault injection accept a change like this. CCing Akinobu 

You still seem to be missing my point I am afraid (or I am ;). So say
that you want to track a GFP_NOWAIT allocation request. So create_object
will get called with that gfp mask and no matter what you try here your
tracking object will be allocated in a weak allocation context as well
and disable kmemleak. So it only takes a more heavy memory pressure and
the tracing is gone...

Catalin Marinas May 31, 2018, 3:22 p.m. UTC | #2

Hi Michal,

I'm catching up with this thread.

On Wed, May 30, 2018 at 02:38:26PM +0200, Michal Hocko wrote:
> On Wed 30-05-18 07:42:59, Chunyu Hu wrote:
> > From: "Michal Hocko" <mhocko@suse.com>
> > > want to have a pre-populated pool of objects for those requests. The
> > > obvious question is how to balance such a pool. It ain't easy to track
> > > memory by allocating more memory...
> > 
> > This solution is going to make kmemleak trace really nofail. We can think
> > later.
> > 
> > while I'm thinking about if fault inject can be disabled via flag in task.
> > 
> > Actually, I'm doing something like below, the disable_fault_inject() is
> > just setting a flag in task->make_it_fail. But this will depend on if
> > fault injection accept a change like this. CCing Akinobu 
> 
> You still seem to be missing my point I am afraid (or I am ;). So say
> that you want to track a GFP_NOWAIT allocation request. So create_object
> will get called with that gfp mask and no matter what you try here your
> tracking object will be allocated in a weak allocation context as well
> and disable kmemleak. So it only takes a more heavy memory pressure and
> the tracing is gone...

create_object() indeed gets the originating gfp but it only cares
whether it was GFP_KERNEL or GFP_ATOMIC. gfp_kmemleak_mask() masks out
all the other flags when allocating a struct kmemleak_object (and adds
some of its own).

This has worked ok so far. There is a higher risk of GFP_ATOMIC
allocations failing but I haven't seen issues with kmemleak unless you
run it under heavy memory pressure (and kmemleak just disables itself).
With fault injection, such pressure is simulated with the side effect of
rendering kmemleak unusable.

Kmemleak could implement its own allocation mechanism (maybe even
skipping the slab altogether) but I feel it's overkill just to cope with
the specific case of fault injection. Also, it's not easy to figure out
how to balance such pool and it may end up calling alloc_pages() which
in turn can inject a fault.

Could we tweak gfp_kmemleak_mask() to still pass __GFP_NOFAIL but in a
compatible way (e.g. by setting __GFP_DIRECT_RECLAIM) when fault
injection is enabled?

Otherwise, I'd prefer some per-thread temporary fault injection
disabling.

Michal Hocko May 31, 2018, 6:41 p.m. UTC | #3

On Thu 31-05-18 16:22:26, Catalin Marinas wrote:
> Hi Michal,
> 
> I'm catching up with this thread.
> 
> On Wed, May 30, 2018 at 02:38:26PM +0200, Michal Hocko wrote:
> > On Wed 30-05-18 07:42:59, Chunyu Hu wrote:
> > > From: "Michal Hocko" <mhocko@suse.com>
> > > > want to have a pre-populated pool of objects for those requests. The
> > > > obvious question is how to balance such a pool. It ain't easy to track
> > > > memory by allocating more memory...
> > > 
> > > This solution is going to make kmemleak trace really nofail. We can think
> > > later.
> > > 
> > > while I'm thinking about if fault inject can be disabled via flag in task.
> > > 
> > > Actually, I'm doing something like below, the disable_fault_inject() is
> > > just setting a flag in task->make_it_fail. But this will depend on if
> > > fault injection accept a change like this. CCing Akinobu 
> > 
> > You still seem to be missing my point I am afraid (or I am ;). So say
> > that you want to track a GFP_NOWAIT allocation request. So create_object
> > will get called with that gfp mask and no matter what you try here your
> > tracking object will be allocated in a weak allocation context as well
> > and disable kmemleak. So it only takes a more heavy memory pressure and
> > the tracing is gone...
> 
> create_object() indeed gets the originating gfp but it only cares
> whether it was GFP_KERNEL or GFP_ATOMIC. gfp_kmemleak_mask() masks out
> all the other flags when allocating a struct kmemleak_object (and adds
> some of its own).
> 
> This has worked ok so far. There is a higher risk of GFP_ATOMIC
> allocations failing but I haven't seen issues with kmemleak unless you
> run it under heavy memory pressure (and kmemleak just disables itself).
> With fault injection, such pressure is simulated with the side effect of
> rendering kmemleak unusable.
> 
> Kmemleak could implement its own allocation mechanism (maybe even
> skipping the slab altogether) but I feel it's overkill just to cope with
> the specific case of fault injection. Also, it's not easy to figure out
> how to balance such pool and it may end up calling alloc_pages() which
> in turn can inject a fault.
> 
> Could we tweak gfp_kmemleak_mask() to still pass __GFP_NOFAIL but in a
> compatible way (e.g. by setting __GFP_DIRECT_RECLAIM) when fault
> injection is enabled?
> 
> Otherwise, I'd prefer some per-thread temporary fault injection
> disabling.

Well, there are two issues (which boil down to the one in the end) here.
Fault injection or a GFP_NOWAIT or any other weaker than GFP_KERNEL
context is something to care about. A weaker allocation context can and
will lead to kmemleak meta data allocation failures regardless of the
fault injection. The way how those objects are allocated directly in the
allacator context makes this really hard and inherently subtle. I can
see only two ways around. Either you have a placeholder for "this object
is not tracked so do not throw false positives" or have a preallocated
pool to use if the direct context allocation failes for whatever reason.
Abusing __GFP_NOFAIL is simply a crude hack which will lead to all kind
of problems.

Chunyu Hu June 1, 2018, 1:50 a.m. UTC | #4

----- Original Message -----
> From: "Michal Hocko" <mhocko@suse.com>
> To: "Catalin Marinas" <catalin.marinas@arm.com>
> Cc: "Chunyu Hu" <chuhu@redhat.com>, "Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp>, malat@debian.org,
> dvyukov@google.com, linux-mm@kvack.org, "Akinobu Mita" <akinobu.mita@gmail.com>
> Sent: Friday, June 1, 2018 2:41:04 AM
> Subject: Re: [PATCH] kmemleak: don't use __GFP_NOFAIL
> 
> On Thu 31-05-18 16:22:26, Catalin Marinas wrote:
> > Hi Michal,
> > 
> > I'm catching up with this thread.
> > 
> > On Wed, May 30, 2018 at 02:38:26PM +0200, Michal Hocko wrote:
> > > On Wed 30-05-18 07:42:59, Chunyu Hu wrote:
> > > > From: "Michal Hocko" <mhocko@suse.com>
> > > > > want to have a pre-populated pool of objects for those requests. The
> > > > > obvious question is how to balance such a pool. It ain't easy to
> > > > > track
> > > > > memory by allocating more memory...
> > > > 
> > > > This solution is going to make kmemleak trace really nofail. We can
> > > > think
> > > > later.
> > > > 
> > > > while I'm thinking about if fault inject can be disabled via flag in
> > > > task.
> > > > 
> > > > Actually, I'm doing something like below, the disable_fault_inject() is
> > > > just setting a flag in task->make_it_fail. But this will depend on if
> > > > fault injection accept a change like this. CCing Akinobu
> > > 
> > > You still seem to be missing my point I am afraid (or I am ;). So say
> > > that you want to track a GFP_NOWAIT allocation request. So create_object
> > > will get called with that gfp mask and no matter what you try here your
> > > tracking object will be allocated in a weak allocation context as well
> > > and disable kmemleak. So it only takes a more heavy memory pressure and
> > > the tracing is gone...
> > 
> > create_object() indeed gets the originating gfp but it only cares
> > whether it was GFP_KERNEL or GFP_ATOMIC. gfp_kmemleak_mask() masks out
> > all the other flags when allocating a struct kmemleak_object (and adds
> > some of its own).
> > 
> > This has worked ok so far. There is a higher risk of GFP_ATOMIC
> > allocations failing but I haven't seen issues with kmemleak unless you
> > run it under heavy memory pressure (and kmemleak just disables itself).
> > With fault injection, such pressure is simulated with the side effect of
> > rendering kmemleak unusable.
> > 
> > Kmemleak could implement its own allocation mechanism (maybe even
> > skipping the slab altogether) but I feel it's overkill just to cope with
> > the specific case of fault injection. Also, it's not easy to figure out
> > how to balance such pool and it may end up calling alloc_pages() which
> > in turn can inject a fault.


it would benefit kmemleak trace, I see in my test that kmemleak even can work in
user pressure cases, such as in my test, stress-ng to consume
nearly all the swap space. kmemleak is still working. but 1M objects pool
is consuming around 400M + memory. So this is just a experiment try, as you
said, how to balance it's size is the issue or ther issues has to be resolved,
such as when to add pool, the speed, how big, and so on ...

And I fault injected 20000 times fail_page_alloc, and 2148 times happened
in create_object, and in such a case, kmemleak is till working after the
2000+ calloc failures. 

[root@dhcp-12-244 fail_page_alloc]# grep create_object /var/log/messages | wc -l
2148

[60498.299412] FAULT_INJECTION: forcing a failure.
name fail_page_alloc, interval 0, probability 80, space 0, times 2

So this way is not just for fault injection, it's about making kmemleak
a bit stronger under memory failure case. It would be an exciting experience we
see if kmemleak still work even after mem pressure, as a user, I experienced
the good usage.


> > 
> > Could we tweak gfp_kmemleak_mask() to still pass __GFP_NOFAIL but in a
> > compatible way (e.g. by setting __GFP_DIRECT_RECLAIM) when fault
> > injection is enabled?

Maybe I can have a try on this..

> > 
> > Otherwise, I'd prefer some per-thread temporary fault injection
> > disabling.

I tried in make_it_fail flag, kmemleak can avoid fault injection, but I
can see kmemleak diabled itself...

> 
> Well, there are two issues (which boil down to the one in the end) here.
> Fault injection or a GFP_NOWAIT or any other weaker than GFP_KERNEL
> context is something to care about. A weaker allocation context can and
> will lead to kmemleak meta data allocation failures regardless of the
> fault injection. The way how those objects are allocated directly in the
> allacator context makes this really hard and inherently subtle. I can
> see only two ways around. Either you have a placeholder for "this object
> is not tracked so do not throw false positives" or have a preallocated
> pool to use if the direct context allocation failes for whatever reason.
> Abusing __GFP_NOFAIL is simply a crude hack which will lead to all kind
> of problems.
> --
> Michal Hocko
> SUSE Labs
>

Chunyu Hu June 1, 2018, 4:53 a.m. UTC | #5

----- Original Message -----
> From: "Chunyu Hu" <chuhu@redhat.com>
> To: "Catalin Marinas" <catalin.marinas@arm.com>
> Cc: "Michal Hocko" <mhocko@suse.com>, "Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp>, malat@debian.org,
> dvyukov@google.com, linux-mm@kvack.org, "Akinobu Mita" <akinobu.mita@gmail.com>
> Sent: Friday, June 1, 2018 9:50:20 AM
> Subject: Re: [PATCH] kmemleak: don't use __GFP_NOFAIL
> 
> 
> 
> ----- Original Message -----
> > From: "Michal Hocko" <mhocko@suse.com>
> > To: "Catalin Marinas" <catalin.marinas@arm.com>
> > Cc: "Chunyu Hu" <chuhu@redhat.com>, "Tetsuo Handa"
> > <penguin-kernel@i-love.sakura.ne.jp>, malat@debian.org,
> > dvyukov@google.com, linux-mm@kvack.org, "Akinobu Mita"
> > <akinobu.mita@gmail.com>
> > Sent: Friday, June 1, 2018 2:41:04 AM
> > Subject: Re: [PATCH] kmemleak: don't use __GFP_NOFAIL
> > 
> > On Thu 31-05-18 16:22:26, Catalin Marinas wrote:
> > > Hi Michal,
> > > 
> > > I'm catching up with this thread.
> > > 
> > > On Wed, May 30, 2018 at 02:38:26PM +0200, Michal Hocko wrote:
> > > > On Wed 30-05-18 07:42:59, Chunyu Hu wrote:
> > > > > From: "Michal Hocko" <mhocko@suse.com>
> > > > > > want to have a pre-populated pool of objects for those requests.
> > > > > > The
> > > > > > obvious question is how to balance such a pool. It ain't easy to
> > > > > > track
> > > > > > memory by allocating more memory...
> > > > > 
> > > > > This solution is going to make kmemleak trace really nofail. We can
> > > > > think
> > > > > later.
> > > > > 
> > > > > while I'm thinking about if fault inject can be disabled via flag in
> > > > > task.
> > > > > 
> > > > > Actually, I'm doing something like below, the disable_fault_inject()
> > > > > is
> > > > > just setting a flag in task->make_it_fail. But this will depend on if
> > > > > fault injection accept a change like this. CCing Akinobu
> > > > 
> > > > You still seem to be missing my point I am afraid (or I am ;). So say
> > > > that you want to track a GFP_NOWAIT allocation request. So
> > > > create_object
> > > > will get called with that gfp mask and no matter what you try here your
> > > > tracking object will be allocated in a weak allocation context as well
> > > > and disable kmemleak. So it only takes a more heavy memory pressure and
> > > > the tracing is gone...
> > > 
> > > create_object() indeed gets the originating gfp but it only cares
> > > whether it was GFP_KERNEL or GFP_ATOMIC. gfp_kmemleak_mask() masks out
> > > all the other flags when allocating a struct kmemleak_object (and adds
> > > some of its own).
> > > 
> > > This has worked ok so far. There is a higher risk of GFP_ATOMIC
> > > allocations failing but I haven't seen issues with kmemleak unless you
> > > run it under heavy memory pressure (and kmemleak just disables itself).
> > > With fault injection, such pressure is simulated with the side effect of
> > > rendering kmemleak unusable.
> > > 
> > > Kmemleak could implement its own allocation mechanism (maybe even
> > > skipping the slab altogether) but I feel it's overkill just to cope with
> > > the specific case of fault injection. Also, it's not easy to figure out
> > > how to balance such pool and it may end up calling alloc_pages() which
> > > in turn can inject a fault.
> 
> 
> it would benefit kmemleak trace, I see in my test that kmemleak even can work
> in
> user pressure cases, such as in my test, stress-ng to consume
> nearly all the swap space. kmemleak is still working. but 1M objects pool
> is consuming around 400M + memory. So this is just a experiment try, as you
> said, how to balance it's size is the issue or ther issues has to be
> resolved,
> such as when to add pool, the speed, how big, and so on ...
> 
> And I fault injected 20000 times fail_page_alloc, and 2148 times happened
> in create_object, and in such a case, kmemleak is till working after the
> 2000+ calloc failures.
> 
> [root@dhcp-12-244 fail_page_alloc]# grep create_object /var/log/messages | wc
> -l
> 2148
> 
> [60498.299412] FAULT_INJECTION: forcing a failure.
> name fail_page_alloc, interval 0, probability 80, space 0, times 2
> 
> So this way is not just for fault injection, it's about making kmemleak
> a bit stronger under memory failure case. It would be an exciting experience
> we
> see if kmemleak still work even after mem pressure, as a user, I experienced
> the good usage.
> 
> 
> > > 
> > > Could we tweak gfp_kmemleak_mask() to still pass __GFP_NOFAIL but in a
> > > compatible way (e.g. by setting __GFP_DIRECT_RECLAIM) when fault
> > > injection is enabled?
> 
> Maybe I can have a try on this..

looks like I tried this in my first report thread, and it failed. As it
can sleep in irq() ..

https://marc.info/?l=linux-mm&m=152482400222806&w=2

> 
> > > 
> > > Otherwise, I'd prefer some per-thread temporary fault injection
> > > disabling.
> 
> I tried in make_it_fail flag, kmemleak can avoid fault injection, but I
> can see kmemleak diabled itself...
> 
> > 
> > Well, there are two issues (which boil down to the one in the end) here.
> > Fault injection or a GFP_NOWAIT or any other weaker than GFP_KERNEL
> > context is something to care about. A weaker allocation context can and
> > will lead to kmemleak meta data allocation failures regardless of the
> > fault injection. The way how those objects are allocated directly in the
> > allacator context makes this really hard and inherently subtle. I can
> > see only two ways around. Either you have a placeholder for "this object
> > is not tracked so do not throw false positives" or have a preallocated
> > pool to use if the direct context allocation failes for whatever reason.
> > Abusing __GFP_NOFAIL is simply a crude hack which will lead to all kind
> > of problems.
> > --
> > Michal Hocko
> > SUSE Labs
> > 
> 
> --
> Regards,
> Chunyu Hu
> 
>

Dmitry Vyukov June 4, 2018, 8:41 a.m. UTC | #6

On Fri, Jun 1, 2018 at 6:53 AM, Chunyu Hu <chuhu@redhat.com> wrote:
> ----- Original Message -----
>> From: "Chunyu Hu" <chuhu@redhat.com>
>> To: "Catalin Marinas" <catalin.marinas@arm.com>
>> Cc: "Michal Hocko" <mhocko@suse.com>, "Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp>, malat@debian.org,
>> dvyukov@google.com, linux-mm@kvack.org, "Akinobu Mita" <akinobu.mita@gmail.com>
>> Sent: Friday, June 1, 2018 9:50:20 AM
>> Subject: Re: [PATCH] kmemleak: don't use __GFP_NOFAIL
>>
>>
>>
>> ----- Original Message -----
>> > From: "Michal Hocko" <mhocko@suse.com>
>> > To: "Catalin Marinas" <catalin.marinas@arm.com>
>> > Cc: "Chunyu Hu" <chuhu@redhat.com>, "Tetsuo Handa"
>> > <penguin-kernel@i-love.sakura.ne.jp>, malat@debian.org,
>> > dvyukov@google.com, linux-mm@kvack.org, "Akinobu Mita"
>> > <akinobu.mita@gmail.com>
>> > Sent: Friday, June 1, 2018 2:41:04 AM
>> > Subject: Re: [PATCH] kmemleak: don't use __GFP_NOFAIL
>> >
>> > On Thu 31-05-18 16:22:26, Catalin Marinas wrote:
>> > > Hi Michal,
>> > >
>> > > I'm catching up with this thread.
>> > >
>> > > On Wed, May 30, 2018 at 02:38:26PM +0200, Michal Hocko wrote:
>> > > > On Wed 30-05-18 07:42:59, Chunyu Hu wrote:
>> > > > > From: "Michal Hocko" <mhocko@suse.com>
>> > > > > > want to have a pre-populated pool of objects for those requests.
>> > > > > > The
>> > > > > > obvious question is how to balance such a pool. It ain't easy to
>> > > > > > track
>> > > > > > memory by allocating more memory...
>> > > > >
>> > > > > This solution is going to make kmemleak trace really nofail. We can
>> > > > > think
>> > > > > later.
>> > > > >
>> > > > > while I'm thinking about if fault inject can be disabled via flag in
>> > > > > task.
>> > > > >
>> > > > > Actually, I'm doing something like below, the disable_fault_inject()
>> > > > > is
>> > > > > just setting a flag in task->make_it_fail. But this will depend on if
>> > > > > fault injection accept a change like this. CCing Akinobu
>> > > >
>> > > > You still seem to be missing my point I am afraid (or I am ;). So say
>> > > > that you want to track a GFP_NOWAIT allocation request. So
>> > > > create_object
>> > > > will get called with that gfp mask and no matter what you try here your
>> > > > tracking object will be allocated in a weak allocation context as well
>> > > > and disable kmemleak. So it only takes a more heavy memory pressure and
>> > > > the tracing is gone...
>> > >
>> > > create_object() indeed gets the originating gfp but it only cares
>> > > whether it was GFP_KERNEL or GFP_ATOMIC. gfp_kmemleak_mask() masks out
>> > > all the other flags when allocating a struct kmemleak_object (and adds
>> > > some of its own).
>> > >
>> > > This has worked ok so far. There is a higher risk of GFP_ATOMIC
>> > > allocations failing but I haven't seen issues with kmemleak unless you
>> > > run it under heavy memory pressure (and kmemleak just disables itself).
>> > > With fault injection, such pressure is simulated with the side effect of
>> > > rendering kmemleak unusable.
>> > >
>> > > Kmemleak could implement its own allocation mechanism (maybe even
>> > > skipping the slab altogether) but I feel it's overkill just to cope with
>> > > the specific case of fault injection. Also, it's not easy to figure out
>> > > how to balance such pool and it may end up calling alloc_pages() which
>> > > in turn can inject a fault.
>>
>>
>> it would benefit kmemleak trace, I see in my test that kmemleak even can work
>> in
>> user pressure cases, such as in my test, stress-ng to consume
>> nearly all the swap space. kmemleak is still working. but 1M objects pool
>> is consuming around 400M + memory. So this is just a experiment try, as you
>> said, how to balance it's size is the issue or ther issues has to be
>> resolved,
>> such as when to add pool, the speed, how big, and so on ...
>>
>> And I fault injected 20000 times fail_page_alloc, and 2148 times happened
>> in create_object, and in such a case, kmemleak is till working after the
>> 2000+ calloc failures.
>>
>> [root@dhcp-12-244 fail_page_alloc]# grep create_object /var/log/messages | wc
>> -l
>> 2148
>>
>> [60498.299412] FAULT_INJECTION: forcing a failure.
>> name fail_page_alloc, interval 0, probability 80, space 0, times 2
>>
>> So this way is not just for fault injection, it's about making kmemleak
>> a bit stronger under memory failure case. It would be an exciting experience
>> we
>> see if kmemleak still work even after mem pressure, as a user, I experienced
>> the good usage.
>>
>>
>> > >
>> > > Could we tweak gfp_kmemleak_mask() to still pass __GFP_NOFAIL but in a
>> > > compatible way (e.g. by setting __GFP_DIRECT_RECLAIM) when fault
>> > > injection is enabled?
>>
>> Maybe I can have a try on this..
>
> looks like I tried this in my first report thread, and it failed. As it
> can sleep in irq() ..
>
> https://marc.info/?l=linux-mm&m=152482400222806&w=2
>
>>
>> > >
>> > > Otherwise, I'd prefer some per-thread temporary fault injection
>> > > disabling.
>>
>> I tried in make_it_fail flag, kmemleak can avoid fault injection, but I
>> can see kmemleak diabled itself...
>>
>> >
>> > Well, there are two issues (which boil down to the one in the end) here.
>> > Fault injection or a GFP_NOWAIT or any other weaker than GFP_KERNEL
>> > context is something to care about. A weaker allocation context can and
>> > will lead to kmemleak meta data allocation failures regardless of the
>> > fault injection. The way how those objects are allocated directly in the
>> > allacator context makes this really hard and inherently subtle. I can
>> > see only two ways around. Either you have a placeholder for "this object
>> > is not tracked so do not throw false positives" or have a preallocated
>> > pool to use if the direct context allocation failes for whatever reason.
>> > Abusing __GFP_NOFAIL is simply a crude hack which will lead to all kind
>> > of problems.



FWIW this problem is traditionally solved in dynamic analysis tools by
embedding meta info right in headers of heap blocks. All of KASAN,
KMSAN, slub debug, LeakSanitizer, asan, valgrind work this way. Then
an object is either allocated or not. If caller has something to
prevent allocations from failing in any context, then the same will be
true for KMEMLEAK meta data.

On a related note, we could also consider switching to
lib/stackdepot.c for alloc stack memorization to reduce header size.
stackdepot is not 100% proof against allocation failures in atomic
contexts, but it does eager memory preallocation and seems to work
well in practice.

Michal Hocko June 4, 2018, 12:42 p.m. UTC | #7

On Mon 04-06-18 10:41:39, Dmitry Vyukov wrote:
[...]
> FWIW this problem is traditionally solved in dynamic analysis tools by
> embedding meta info right in headers of heap blocks. All of KASAN,
> KMSAN, slub debug, LeakSanitizer, asan, valgrind work this way. Then
> an object is either allocated or not. If caller has something to
> prevent allocations from failing in any context, then the same will be
> true for KMEMLEAK meta data.
> 

This makes much more sense, of course. I thought there were some
fundamental reasons why kmemleak needs to have an off-object tracking
which makes the whole thing much more complicated of course.

Catalin Marinas June 4, 2018, 3:08 p.m. UTC | #8

On Mon, Jun 04, 2018 at 02:42:10PM +0200, Michal Hocko wrote:
> On Mon 04-06-18 10:41:39, Dmitry Vyukov wrote:
> [...]
> > FWIW this problem is traditionally solved in dynamic analysis tools by
> > embedding meta info right in headers of heap blocks. All of KASAN,
> > KMSAN, slub debug, LeakSanitizer, asan, valgrind work this way. Then
> > an object is either allocated or not. If caller has something to
> > prevent allocations from failing in any context, then the same will be
> > true for KMEMLEAK meta data.
> 
> This makes much more sense, of course. I thought there were some
> fundamental reasons why kmemleak needs to have an off-object tracking
> which makes the whole thing much more complicated of course.

Kmemleak needs to track all memory blocks that may contain pointers
(otherwise the dependency graph cannot be correctly tracked leading to
lots of false positives). Not all these objects come from the slab
allocator, for example it tracks certain alloc_pages() blocks, all of
memblock_alloc().

An option would be to use separate metadata only for non-slab objects,
though I'd have to see how intrusive this is for mm/sl*b.c. Also there
is RCU freeing for the kmemleak metadata to avoid locking when
traversing the internal lists. If the metadata is in the slab object
itself, we'd have to either defer its freeing or add some bigger lock to
kmemleak.

Dmitry Vyukov June 4, 2018, 3:36 p.m. UTC | #9

On Mon, Jun 4, 2018 at 5:08 PM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Mon, Jun 04, 2018 at 02:42:10PM +0200, Michal Hocko wrote:
>> On Mon 04-06-18 10:41:39, Dmitry Vyukov wrote:
>> [...]
>> > FWIW this problem is traditionally solved in dynamic analysis tools by
>> > embedding meta info right in headers of heap blocks. All of KASAN,
>> > KMSAN, slub debug, LeakSanitizer, asan, valgrind work this way. Then
>> > an object is either allocated or not. If caller has something to
>> > prevent allocations from failing in any context, then the same will be
>> > true for KMEMLEAK meta data.
>>
>> This makes much more sense, of course. I thought there were some
>> fundamental reasons why kmemleak needs to have an off-object tracking
>> which makes the whole thing much more complicated of course.
>
> Kmemleak needs to track all memory blocks that may contain pointers
> (otherwise the dependency graph cannot be correctly tracked leading to
> lots of false positives). Not all these objects come from the slab
> allocator, for example it tracks certain alloc_pages() blocks, all of
> memblock_alloc().

I understand that this will make KMEMLEAK tracking non-uniform, but
heap objects are the most important class of allocations.
page struct already contains stackdepot id if CONFIG_PAGE_OWNER is
enabled. Do we need anything else other than stack trace for pages?
I don't know about memblock's.

> An option would be to use separate metadata only for non-slab objects,
> though I'd have to see how intrusive this is for mm/sl*b.c. Also there
> is RCU freeing for the kmemleak metadata to avoid locking when
> traversing the internal lists. If the metadata is in the slab object
> itself, we'd have to either defer its freeing or add some bigger lock to
> kmemleak.

This relates to scanning without slopped world, right? In our
experience with large-scale systematic testing any tool with false
positives can't be used in practice in systematic way. KMEMLEAK false
positives do not allow to enable it on syzbot. We know there are tons
of leaks, we have the tool, but we are not detecting leaks. I don't
know who/how uses KMEMLEAK in non-stop-the-world mode, but
stop-the-world is pretty much a requirement for deployment for us. And
it would also solve the problem with disappearing under our feet heap
blocks, right?
FWIW In LeakSanitizer we don't specifically keep track of heap blocks.
Instead we stop the world and then ask memory allocator for metainfo.
I would expect that sl*b also have all required info, maybe in not
O(1) accessible form, so it may require some preprocessing (e.g.
collecting all free objects in a slab and then subtracting it from set
of all objects in the slab to get set of allocated objects).
But I understand that all of this turns this from "add a flag" to
almost a complete rewrite of the tool...

Catalin Marinas June 4, 2018, 4:41 p.m. UTC | #10

On Mon, Jun 04, 2018 at 05:36:31PM +0200, Dmitry Vyukov wrote:
> On Mon, Jun 4, 2018 at 5:08 PM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Jun 04, 2018 at 02:42:10PM +0200, Michal Hocko wrote:
> >> On Mon 04-06-18 10:41:39, Dmitry Vyukov wrote:
> >> [...]
> >> > FWIW this problem is traditionally solved in dynamic analysis tools by
> >> > embedding meta info right in headers of heap blocks. All of KASAN,
> >> > KMSAN, slub debug, LeakSanitizer, asan, valgrind work this way. Then
> >> > an object is either allocated or not. If caller has something to
> >> > prevent allocations from failing in any context, then the same will be
> >> > true for KMEMLEAK meta data.
> >>
> >> This makes much more sense, of course. I thought there were some
> >> fundamental reasons why kmemleak needs to have an off-object tracking
> >> which makes the whole thing much more complicated of course.
> >
> > Kmemleak needs to track all memory blocks that may contain pointers
> > (otherwise the dependency graph cannot be correctly tracked leading to
> > lots of false positives). Not all these objects come from the slab
> > allocator, for example it tracks certain alloc_pages() blocks, all of
> > memblock_alloc().
> 
> I understand that this will make KMEMLEAK tracking non-uniform, but
> heap objects are the most important class of allocations.
> page struct already contains stackdepot id if CONFIG_PAGE_OWNER is
> enabled. Do we need anything else other than stack trace for pages?
> I don't know about memblock's.

Well, it needs most of the other stuff that's in struct kmemleak_object
(list_head, rb_node, some counters, spinlock_t).

> > An option would be to use separate metadata only for non-slab objects,
> > though I'd have to see how intrusive this is for mm/sl*b.c. Also there
> > is RCU freeing for the kmemleak metadata to avoid locking when
> > traversing the internal lists. If the metadata is in the slab object
> > itself, we'd have to either defer its freeing or add some bigger lock to
> > kmemleak.
> 
> This relates to scanning without slopped world, right?

Initially the RCU mechanism was added to defer kmemleak freeing its
metadata with another recursive call into the slab freeing routine
(since it does this when the tracked object is freed). This came in
handy for other lists traversal in kmemleak. For the actual memory
scanning, there is some fine-grained locking per metadata object as we
want to block the freeing until the scanning of the specific object
completes (e.g. vfree() must not unmap the object during scanning).

> In our
> experience with large-scale systematic testing any tool with false
> positives can't be used in practice in systematic way. KMEMLEAK false
> positives do not allow to enable it on syzbot. We know there are tons
> of leaks, we have the tool, but we are not detecting leaks. I don't
> know who/how uses KMEMLEAK in non-stop-the-world mode, but
> stop-the-world is pretty much a requirement for deployment for us. And
> it would also solve the problem with disappearing under our feet heap
> blocks, right?

A hard requirement during the early kmemleak development was not to
actually stop the world (as it can even take minutes to complete the
scanning). It employs various heuristics to deal with false positives
like checksumming, delaying the actual reporting, waiting for an object
to be detected as a leak in two successive scans while its checksum is
the same. While not ideal, it works most of the time.

Now, there was indeed a recent requirement to implement stop-the-world
scanning via a "stopscan" command to /sys/kernel/debug/kmemleak (using
stop_machine()) but I never got around to implementing it. This would be
very useful for non-interactive sessions like automated testing.

> FWIW In LeakSanitizer we don't specifically keep track of heap blocks.
> Instead we stop the world and then ask memory allocator for metainfo.
> I would expect that sl*b also have all required info, maybe in not
> O(1) accessible form, so it may require some preprocessing (e.g.
> collecting all free objects in a slab and then subtracting it from set
> of all objects in the slab to get set of allocated objects).
> But I understand that all of this turns this from "add a flag" to
> almost a complete rewrite of the tool...

As I said above, background scanning is still a requirement but we could
add a stopscan command on top, should be too hard.

kmemleak: don't use __GFP_NOFAIL

Commit Message

Comments

Patch