From patchwork Tue Apr  1 11:38:51 2014
X-Patchwork-Submitter: Russell King - ARM Linux
X-Patchwork-Id: 3920851
Date: Tue, 1 Apr 2014 12:38:51 +0100
From: Russell King - ARM Linux
To: Catalin Marinas, Linus Torvalds
Cc: NeilBrown, linux-raid@vger.kernel.org, linux-mm@kvack.org, David Rientjes, Maxime Bizon, linux-arm-kernel@lists.infradead.org
Subject: Re: Recent 3.x kernels: Memory leak causing OOMs
Message-ID: <20140401113851.GA15317@n2100.arm.linux.org.uk>
In-Reply-To: <20140401091959.GA10912@n2100.arm.linux.org.uk>
On Tue, Apr 01, 2014 at 10:19:59AM +0100, Russell King - ARM Linux wrote:
> On Mon, Mar 17, 2014 at 07:33:16PM +0000, Russell King - ARM Linux wrote:
> > On Mon, Mar 17, 2014 at 06:18:13PM +0000, Catalin Marinas wrote:
> > > On Mon, Mar 17, 2014 at 06:07:48PM +1100, NeilBrown wrote:
> > > > On Sat, 15 Mar 2014 10:19:52 +0000 Russell King - ARM Linux wrote:
> > > > > unreferenced object 0xc3c3f880 (size 256):
> > > > >   comm "md2_resync", pid 4680, jiffies 638245 (age 8615.570s)
> > > > >   hex dump (first 32 bytes):
> > > > >     00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 f0  ................
> > > > >     00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > >   backtrace:
> > > > >     [] __save_stack_trace+0x34/0x40
> > > > >     [] create_object+0xf4/0x214
> > > > >     [] kmemleak_alloc+0x3c/0x6c
> > > > >     [] __kmalloc+0xd0/0x124
> > > > >     [] bio_alloc_bioset+0x4c/0x1a4
> > > > >     [] r1buf_pool_alloc+0x40/0x148
> > > > >     [] mempool_alloc+0x54/0xfc
> > > > >     [] sync_request+0x168/0x85c
> > > > >     [] md_do_sync+0x75c/0xbc0
> > > > >     [] md_thread+0x138/0x154
> > > > >     [] kthread+0xb0/0xbc
> > > > >     [] ret_from_fork+0x14/0x24
> > > > >     [] 0xffffffff
> > > > >
> > > > > with 3077 of these in the debug file. 3075 are for "md2_resync" and
> > > > > two are for "md4_resync".
> > > > >
> > > > > /proc/slabinfo shows for this bucket:
> > > > > kmalloc-256  3237  3450  256  15  1 : tunables 120 60 0 : slabdata 230 230 0
> > > > >
> > > > > but this would only account for about 800kB of memory usage, which
> > > > > itself is insignificant - so this is not the whole story.
> > > > >
> > > > > It seems that this is the culprit for the allocations:
> > > > > 	for (j = pi->raid_disks ; j-- ; ) {
> > > > > 		bio = bio_kmalloc(gfp_flags, RESYNC_PAGES);
> > > > >
> > > > > Since RESYNC_PAGES will be 64K/4K=16, each struct bio_vec is 12 bytes
> > > > > (12 * 16 = 192) plus the size of struct bio, which would fall into
> > > > > this bucket.
> > > > > I don't see anything obvious - it looks like it isn't every raid
> > > > > check which loses bios. Not quite sure what to make of this right
> > > > > now.

I now see something very obvious, having had the problem again, dumped
the physical memory to file, and inspected the full leaked struct bio.

What I find is that the leaked struct bios have a bi_cnt of one, which
confirms that they were never freed - a freed struct bio would have a
bi_cnt of zero due to the atomic_dec_and_test() before bio_free()
inside bio_put().

Looking at the bi_inline_vecs, I see that there was a failure to
allocate a page. Now, let's look at what r1buf_pool_alloc() does:

	for (j = pi->raid_disks ; j-- ; ) {
		bio = bio_kmalloc(gfp_flags, RESYNC_PAGES);
		if (!bio)
			goto out_free_bio;
		r1_bio->bios[j] = bio;
	}

	if (test_bit(MD_RECOVERY_REQUESTED, &pi->mddev->recovery))
		j = pi->raid_disks;
	else
		j = 1;
	while(j--) {
		bio = r1_bio->bios[j];
		bio->bi_vcnt = RESYNC_PAGES;
		if (bio_alloc_pages(bio, gfp_flags))
			goto out_free_bio;
	}

out_free_bio:
	while (++j < pi->raid_disks)
		bio_put(r1_bio->bios[j]);
	r1bio_pool_free(r1_bio, data);

Consider what happens when bio_alloc_pages() fails. For non-recovery
operations, j starts off as one, and we enter the loop to allocate the
pages. j is post-decremented to zero, so bio = r1_bio->bios[0].
bio_alloc_pages(bio) fails, and we jump to out_free_bio. The first
thing that does is increment j, so we free from r1_bio->bios[1] up to
the number of raid disks - r1_bio->bios[0] is leaked when the r1_bio
is then freed.

The obvious fix is to set j to -1 before jumping to out_free_bio on
bio_alloc_pages() failure.

However, that's not the end of the story - there are more leaks here.
bio_put() will not free the pages allocated by a previously successful
bio_alloc_pages(). What's more, I don't see any helper in the BIO layer
which frees those pages, which makes me wonder how many other places in
the kernel dealing with BIOs end up leaking like this.
Anyway, this is what I've come up with - it's not particularly nice,
but hopefully it will plug this leak. I'm now running with this patch
in place, and time will tell.

 drivers/md/raid1.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index aacf6bf352d8..604bad2fa442 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -123,8 +123,14 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 		bio = r1_bio->bios[j];
 		bio->bi_vcnt = RESYNC_PAGES;
-		if (bio_alloc_pages(bio, gfp_flags))
-			goto out_free_bio;
+		if (bio_alloc_pages(bio, gfp_flags)) {
+			/*
+			 * Mark this as having no pages - bio_alloc_pages
+			 * removes any it allocated.
+			 */
+			bio->bi_vcnt = 0;
+			goto out_free_all_bios;
+		}
 	}
 	/* If not user-requests, copy the page pointers to all bios */
 	if (!test_bit(MD_RECOVERY_REQUESTED, &pi->mddev->recovery)) {
@@ -138,9 +144,25 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 
 	return r1_bio;
 
+out_free_all_bios:
+	j = -1;
 out_free_bio:
-	while (++j < pi->raid_disks)
-		bio_put(r1_bio->bios[j]);
+	while (++j < pi->raid_disks) {
+		bio = r1_bio->bios[j];
+		if (bio->bi_vcnt) {
+			struct bio_vec *bv;
+			int i;
+			/*
+			 * Annoyingly, BIO has no way to do this, so we have
+			 * to do it manually.  Given the trouble here, and
+			 * the lack of BIO support for cleaning up, I don't
+			 * care about linux/bio.h's comment about this helper.
+			 */
+			bio_for_each_segment_all(bv, bio, i)
+				__free_page(bv->bv_page);
+		}
+		bio_put(bio);
+	}
 	r1bio_pool_free(r1_bio, data);
 	return NULL;
 }