fix crash when using XFS on loopback

Message ID alpine.LRH.2.02.1401041241590.4648@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State Awaiting Upstream, archived

Commit Message

Mikulas Patocka Jan. 4, 2014, 5:45 p.m. UTC
Commit 8456a648cf44f14365f1f44de90a3da2526a4776 causes a crash in the
LVM2 testsuite on PA-RISC (the crashing test is fsadm.sh). The testsuite
doesn't crash on 3.12, but crashes on 3.13-rc1 and later.

 Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
 CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
 task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000

      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
 PSW: 00001000000001101111100100001110 Not tainted
 r00-03  000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
 r04-07  00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
 r08-11  0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
 r12-15  0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
 r16-19  000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
 r20-23  0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
 r24-27  00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
 r28-31  202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
 sr00-03  000000000532c000 0000000000000000 0000000000000000 000000000532c000
 sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000

 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
  IIR: 539c0030    ISR: 00000000202d6000  IOR: 000006202224647d
  CPU:        3   CR30: 000000413edd8000 CR31: 0000000000000000
  ORIG_R28: 00000000405a95e0
  IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
  IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
  RP(r2): flush_dcache_page+0x128/0x388
 Backtrace:
  [<000000004013e6c0>] flush_dcache_page+0x128/0x388
  [<0000000010fe6ca0>] lo_splice_actor+0x90/0x148 [loop]
  [<00000000402579b0>] splice_from_pipe_feed+0xc0/0x1d0
  [<00000000402580a4>] __splice_from_pipe+0xac/0xc0
  [<0000000010fe6bbc>] lo_direct_splice_actor+0x1c/0x70 [loop]
  [<000000004025854c>] splice_direct_to_actor+0xec/0x228
  [<0000000010fe63ac>] lo_receive+0xe4/0x298 [loop]
  [<0000000010fe69d8>] loop_thread+0x478/0x640 [loop]
  [<000000004018975c>] kthread+0x134/0x168
  [<000000004012c020>] end_fault_vector+0x20/0x28
  [<00000000115e0098>] xfs_setsize_buftarg+0x0/0x90 [xfs]

 Kernel panic - not syncing: Bad Address (null pointer deref?)

The patch 8456a648cf44f14365f1f44de90a3da2526a4776 changes the page
structure so that the slab subsystem reuses the page->mapping field.

The crash happens in the following way:
* XFS allocates some memory from slab and issues a bio to read data into
  it.
* the bio is sent to the loopback device.
* lo_receive creates an actor and calls splice_direct_to_actor.
* lo_splice_actor copies data to the target page.
* lo_splice_actor calls flush_dcache_page because the page may be mapped
  by userspace. In that case we need to flush the kernel cache.
* flush_dcache_page asks for the list of userspace mappings; however, the
  page->mapping field is reused by the slab subsystem for a different
  purpose. This causes the crash.

Note that other architectures without coherent caches (sparc, arm, mips)
also call page_mapping from flush_dcache_page, so they may crash in the
same way.

This patch fixes this bug by testing if the page is a slab page in
page_mapping and returning NULL if it is.
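
As an illustration only, the idea of the fix can be sketched in a small
userspace model. The types below (struct page_model, its union and the
function page_mapping_model) are hypothetical stand-ins, not the kernel's
real struct page layout, and the swap-cache and anonymous-page handling of
the real page_mapping is deliberately left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel types (not the real layout). */
struct address_space { int dummy; };

struct page_model {
	bool slab;                             /* models PageSlab(page) */
	union {
		struct address_space *mapping; /* page cache use of the word */
		void *slab_data;               /* slab's private use of the same word */
	} u;
};

/*
 * Models page_mapping() with the slab check: a slab page reports no mapping,
 * so callers never interpret slab-private data as an address_space pointer.
 */
static struct address_space *page_mapping_model(const struct page_model *page)
{
	assert(page != NULL);
	if (page->slab)
		return NULL;
	return page->u.mapping;
}
```

In this model a caller such as flush_dcache_page sees NULL for a slab page
and therefore never walks a bogus list of userspace mappings.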


The patch also fixes a VM_BUG_ON(PageSlab(page)) that could trigger in
earlier kernels in the same scenario on architectures without cache
coherence when CONFIG_DEBUG_VM is enabled, so it should be backported to
stable kernels.


In older kernels, the function page_mapping is defined in
include/linux/mm.h, so the patch must be adjusted accordingly when
backporting it.


Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 mm/util.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-3.13-rc6/mm/util.c
===================================================================
--- linux-3.13-rc6.orig/mm/util.c	2014-01-04 00:06:07.000000000 +0100
+++ linux-3.13-rc6/mm/util.c	2014-01-04 00:24:42.000000000 +0100
@@ -390,7 +390,10 @@ struct address_space *page_mapping(struc
 {
 	struct address_space *mapping = page->mapping;
 
-	VM_BUG_ON(PageSlab(page));
+	/* This happens if someone calls flush_dcache_page on slab page */
+	if (unlikely(PageSlab(page)))
+		return NULL;
+
 	if (unlikely(PageSwapCache(page))) {
 		swp_entry_t entry;
 

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

John David Anglin Jan. 4, 2014, 6:48 p.m. UTC | #1
On 4-Jan-14, at 12:45 PM, Mikulas Patocka wrote:

> * flush_dcache_page asks for the list of userspace mappings, however  
> that
>  page->mapping field is reused by the slab subsystem for a different
>  purpose. This causes the crash.

I'd noticed the other day that the parisc implementation of
flush_dcache_page() should return if "!mapping || mapping != page->mapping"
is true.  This would have avoided the crash.

Dave
--
John David Anglin	dave.anglin@bell.net



Mikulas Patocka Jan. 4, 2014, 7:55 p.m. UTC | #2
On Sat, 4 Jan 2014, John David Anglin wrote:

> On 4-Jan-14, at 12:45 PM, Mikulas Patocka wrote:
> 
> > * flush_dcache_page asks for the list of userspace mappings, however that
> > page->mapping field is reused by the slab subsystem for a different
> > purpose. This causes the crash.
> 
> I'd noticed the other day that the parisc implementation of
> flush_dcache_page()
> should return if "!mapping || mapping != page->mapping" is true.  This would
> have avoided crash.
> 
> Dave

I don't think so.

page_mapping returns NULL if the page has only an anonymous mapping and is
not in the swap cache. In this case, you need to flush the kernel cache.

Maybe you could skip cache flush if the page is neither anonymous nor 
file-backed, but I haven't seen this condition in other architectures' 
flush_dcache_page.

Mikulas
John David Anglin Jan. 4, 2014, 8:31 p.m. UTC | #3
On 4-Jan-14, at 2:55 PM, Mikulas Patocka wrote:

> On Sat, 4 Jan 2014, John David Anglin wrote:
>
>> On 4-Jan-14, at 12:45 PM, Mikulas Patocka wrote:
>>
>>> * flush_dcache_page asks for the list of userspace mappings,  
>>> however that
>>> page->mapping field is reused by the slab subsystem for a different
>>> purpose. This causes the crash.
>>
>> I'd noticed the other day that the parisc implementation of
>> flush_dcache_page()
>> should return if "!mapping || mapping != page->mapping" is true.   
>> This would
>> have avoided crash.
>>
>> Dave
>
> I think no.
>
> page_mapping returns NULL if the page has only anonymous mapping and  
> it is
> not placed in the swap cache. In this case, you need to flush the  
> kernel
> cache.


The suggestion is to add the "mapping != page->mapping" test to the current
NULL check.  It occurs after the kernel cache flush.

It doesn't seem right to flush the vma mappings associated with the swap
address space, and that appears to be happening with the current code.

Dave
--
John David Anglin	dave.anglin@bell.net



Mikulas Patocka Jan. 4, 2014, 8:52 p.m. UTC | #4
On Sat, 4 Jan 2014, John David Anglin wrote:

> On 4-Jan-14, at 2:55 PM, Mikulas Patocka wrote:
> 
> > On Sat, 4 Jan 2014, John David Anglin wrote:
> > 
> > > On 4-Jan-14, at 12:45 PM, Mikulas Patocka wrote:
> > > 
> > > > * flush_dcache_page asks for the list of userspace mappings, however
> > > > that
> > > > page->mapping field is reused by the slab subsystem for a different
> > > > purpose. This causes the crash.
> > > 
> > > I'd noticed the other day that the parisc implementation of
> > > flush_dcache_page()
> > > should return if "!mapping || mapping != page->mapping" is true.  This
> > > would
> > > have avoided crash.
> > > 
> > > Dave
> > 
> > I think no.
> > 
> > page_mapping returns NULL if the page has only anonymous mapping and it is
> > not placed in the swap cache. In this case, you need to flush the kernel
> > cache.
> 
> 
> The suggestion is to add the "mapping != page->mapping" to the current NULL
> check.
> It occurs after the kernel cache flush.

"if (!mapping || mapping != page->mapping) return;"
returns if the mapping is NULL (and that is wrong because the variable 
mapping is NULL for anonymous pages).

You could probably return "if (!mapping && !PageAnon(page))", but the 
other architectures aren't doing it.
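
To make the two early-return conditions easier to compare, here is a toy
model, not kernel code. struct toy_page and both function names are made up
for illustration; the fields merely stand in for the raw page->mapping word,
the result of page_mapping(page), and PageAnon(page). Whether the early
return sits before or after the kernel cache flush is not modelled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three values the conditions look at. */
struct toy_page {
	const void *raw_mapping; /* models the raw page->mapping word */
	const void *mapping;     /* models the result of page_mapping(page) */
	bool anon;               /* models PageAnon(page) */
};

/* Condition "!mapping || mapping != page->mapping". */
static bool skip_if_null_or_mismatch(const struct toy_page *p)
{
	assert(p != NULL);
	return !p->mapping || p->mapping != p->raw_mapping;
}

/* Condition "!mapping && !PageAnon(page)". */
static bool skip_if_null_and_not_anon(const struct toy_page *p)
{
	assert(p != NULL);
	return !p->mapping && !p->anon;
}
```

In this model the two conditions differ only for an anonymous page outside
the swap cache (mapping is NULL, anon is true): the first returns early, the
second does not.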

> It doesn't seem right to flush the vma mappings associated with swap address
> space
> and that appears to be happening with current code.
>
> Dave
> --
> John David Anglin	dave.anglin@bell.net

I suppose that "vma_interval_tree_foreach" is an empty operation for the swap
address space. Or isn't it?

Mikulas
Joonsoo Kim Jan. 6, 2014, 7:35 a.m. UTC | #5
On Sat, Jan 04, 2014 at 12:45:45PM -0500, Mikulas Patocka wrote:
> The patch 8456a648cf44f14365f1f44de90a3da2526a4776 causes crash in the
> LVM2 testsuite on PA-RISC (the crashing test is fsadm.sh). The testsuite
> doesn't crash on 3.12, crashes on 3.13-rc1 and later.
> 
>  Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
>  CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
>  task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000
> 
>       YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
>  PSW: 00001000000001101111100100001110 Not tainted
>  r00-03  000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
>  r04-07  00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
>  r08-11  0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
>  r12-15  0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
>  r16-19  000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
>  r20-23  0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
>  r24-27  00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
>  r28-31  202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
>  sr00-03  000000000532c000 0000000000000000 0000000000000000 000000000532c000
>  sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> 
>  IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
>   IIR: 539c0030    ISR: 00000000202d6000  IOR: 000006202224647d
>   CPU:        3   CR30: 000000413edd8000 CR31: 0000000000000000
>   ORIG_R28: 00000000405a95e0
>   IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
>   IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
>   RP(r2): flush_dcache_page+0x128/0x388
>  Backtrace:
>   [<000000004013e6c0>] flush_dcache_page+0x128/0x388
>   [<0000000010fe6ca0>] lo_splice_actor+0x90/0x148 [loop]
>   [<00000000402579b0>] splice_from_pipe_feed+0xc0/0x1d0
>   [<00000000402580a4>] __splice_from_pipe+0xac/0xc0
>   [<0000000010fe6bbc>] lo_direct_splice_actor+0x1c/0x70 [loop]
>   [<000000004025854c>] splice_direct_to_actor+0xec/0x228
>   [<0000000010fe63ac>] lo_receive+0xe4/0x298 [loop]
>   [<0000000010fe69d8>] loop_thread+0x478/0x640 [loop]
>   [<000000004018975c>] kthread+0x134/0x168
>   [<000000004012c020>] end_fault_vector+0x20/0x28
>   [<00000000115e0098>] xfs_setsize_buftarg+0x0/0x90 [xfs]
> 
>  Kernel panic - not syncing: Bad Address (null pointer deref?)
> 
> The patch 8456a648cf44f14365f1f44de90a3da2526a4776 changes the page
> structure so that the slab subsystem reuses the page->mapping field.
> 
> The crash happens in the following way:
> * XFS allocates some memory from slab and issues a bio to read data into
>   it.
> * the bio is sent to the loopback device.
> * lo_receive creates an actor and calls splice_direct_to_actor.
> * lo_splice_actor copies data to the target page.
> * lo_splice_actor calls flush_dcache_page because the page may be mapped
>   by userspace. In that case we need to flush the kernel cache.
> * flush_dcache_page asks for the list of userspace mappings, however that
>   page->mapping field is reused by the slab subsystem for a different
>   purpose. This causes the crash.
> 
> Note that other architectures without coherent caches (sparc, arm, mips)
> also call page_mapping from flush_dcache_page, so they may crash in the
> same way.
> 
> This patch fixes this bug by testing if the page is a slab page in
> page_mapping and returning NULL if it is.
> 
> 
> The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
> earlier kernels in the same scenario on architectures without cache
> coherence when CONFIG_DEBUG_VM is enabled - so it should be backported to
> stable kernels.
> 
> 
> In the old kernels, the function page_mapping is placed in
> include/linux/mm.h, so you should modify the patch accordingly when
> backporting it.
> 
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
> 
> ---
>  mm/util.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> Index: linux-3.13-rc6/mm/util.c
> ===================================================================
> --- linux-3.13-rc6.orig/mm/util.c	2014-01-04 00:06:07.000000000 +0100
> +++ linux-3.13-rc6/mm/util.c	2014-01-04 00:24:42.000000000 +0100
> @@ -390,7 +390,10 @@ struct address_space *page_mapping(struc
>  {
>  	struct address_space *mapping = page->mapping;
>  
> -	VM_BUG_ON(PageSlab(page));
> +	/* This happens if someone calls flush_dcache_page on slab page */
> +	if (unlikely(PageSlab(page)))
> +		return NULL;
> +
>  	if (unlikely(PageSwapCache(page))) {
>  		swp_entry_t entry;
>  
> --

Hello,

I'm surprised that this VM_BUG_ON() has not been triggered until now. It was
introduced in 2007 by commit (b5fab14). Maybe nobody tests with
CONFIG_DEBUG_VM.

There is one more bug report describing the same problem:
* possible regression on 3.13 when calling flush_dcache_page
  (lkml.org/lkml/2013/12/12/255)

As mentioned in the description of commit (b5fab14), a slab object may not be
properly aligned, and using page-oriented functions on such an object can be
dangerous. I searched the XFS code and found that it only allocates multiples
of 512 bytes, so there is no problem for now. But, IMHO, it is better not to
use slab objects for this purpose.

I also quickly checked every call site of page_mapping() and, IMHO, this patch
would work correctly. But reverting the original commit is possibly a better
solution.

Hello, Pekka and Christoph.
Could you advise which direction we should take?

Thanks.
Mikulas Patocka Jan. 6, 2014, 5:54 p.m. UTC | #6
Hi

On Mon, 6 Jan 2014, Joonsoo Kim wrote:

> Hello,
> 
> I'm surprised that this VM_BUG_ON() has not been triggered until now. It was
> introduced in 2007 by commit (b5fab14). Maybe there is no person who test
> with CONFIG_DEBUG_VM.

Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.

> There is one more bug report same as this.
> * possible regression on 3.13 when calling flush_dcache_page
>   (lkml.org/lkml/2013/12/12/255)

That link doesn't show anything.

> As mentioned in the description of commit (b5fab14), slab object may not be
> properly aligned and use of page oriented function to this object can be
> dangerous. I searched the XFS code and found that they only try to allocate
> multiple of 512 bytes, so there is no problem for now. But, IMHO, it is better
> not to use slab objects for this purpose.

If slab debugging is enabled, kmalloc memory is not aligned.

In XFS, xfs_buf_allocate_memory tests whether the kmalloc memory crosses a
page boundary - if it does, it frees the kmalloc memory and allocates a
full page. Maybe this approach could still run into problems with some
bus-master adapters that assume alignment in hardware...
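
A page-boundary test of this kind can be sketched in userspace as follows.
This is not the XFS code: the helper name crosses_page_boundary and the
constant TOY_PAGE_SIZE are made up for illustration, and a 4 KiB page size
is assumed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/*
 * Returns true if the buffer [addr, addr + size) touches more than one page,
 * i.e. its first and last byte lie in different pages.
 */
static bool crosses_page_boundary(uintptr_t addr, size_t size)
{
	assert(size > 0);
	uintptr_t mask = ~(uintptr_t)(TOY_PAGE_SIZE - 1);
	return (addr & mask) != ((addr + size - 1) & mask);
}
```

For example, a 512-byte buffer starting at offset 0xF00 within a page ends
in the next page, so the check reports true and a caller would fall back to
allocating a full page.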


dm-bufio also does I/O to slab-allocated buffers, but it allocates the 
object from slab (not kmalloc) with proper alignment.

> And I rapidly searched every callsites of page_mapping() and, IMHO, this 
> patch would work correctly. But possibly reverting original commit is 
> better solution.

Reverting the original commit wouldn't fix that VM_BUG_ON.

> Hello, Pekka and Christoph.
> Could you teach me which direction we have to go?
> 
> Thanks.

Mikulas
Joonsoo Kim Jan. 7, 2014, 1:41 a.m. UTC | #7
On Mon, Jan 06, 2014 at 12:54:22PM -0500, Mikulas Patocka wrote:
> Hi
> 
> On Mon, 6 Jan 2014, Joonsoo Kim wrote:
> 
> > Hello,
> > 
> > I'm surprised that this VM_BUG_ON() has not been triggered until now. It was
> > introduced in 2007 by commit (b5fab14). Maybe there is no person who test
> > with CONFIG_DEBUG_VM.
> 
> Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.
> 
> > There is one more bug report same as this.
> > * possible regression on 3.13 when calling flush_dcache_page
> >   (lkml.org/lkml/2013/12/12/255)
> 
> That link doesn't show anything.
> 
> > As mentioned in the description of commit (b5fab14), slab object may not be
> > properly aligned and use of page oriented function to this object can be
> > dangerous. I searched the XFS code and found that they only try to allocate
> > multiple of 512 bytes, so there is no problem for now. But, IMHO, it is better
> > not to use slab objects for this purpose.
> 
> If slab debugging is enabled, kmalloc memory is not aligned.
> 
> In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses 
> page boundary - if it does, they free the kmalloc memory and allocate a 
> full page. Maybe this approach could still run into problems with some 
> bus-master adapters that assume alignment in hardware...
> 
> 
> dm-bufio also does I/O to slab-allocated buffers, but it allocates the 
> object from slab (not kmalloc) with proper alignment.

Hello,

Okay. I see.
Thanks for the good explanation.

> 
> > And I rapidly searched every callsites of page_mapping() and, IMHO, this 
> > patch would work correctly. But possibly reverting original commit is 
> > better solution.
> 
> Reverting the original commit wouldn't fix that VM_BUG_ON.

Initially, I thought that the VM_BUG_ON() was not wrong and that it was better
to remove the call sites that do I/O with slab-allocated buffers, because doing
I/O with slab-allocated buffers needs great care. So I didn't fully agree with
your patch and recommended reverting the original commit yesterday. After
reverting it, I would have attempted to remove those call sites.

But now I have changed my mind because of your explanation. There are already
some users that do I/O with slab-allocated buffers, and they already do it with
some care, so I guess that allowing this usage is more beneficial than
forbidding it.

Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Thanks.
Helge Deller Jan. 8, 2014, 9:05 p.m. UTC | #8
On 01/07/2014 02:41 AM, Joonsoo Kim wrote:
> On Mon, Jan 06, 2014 at 12:54:22PM -0500, Mikulas Patocka wrote:
>> Hi
>>
>> On Mon, 6 Jan 2014, Joonsoo Kim wrote:
>>
>>> Hello,
>>>
>>> I'm surprised that this VM_BUG_ON() has not been triggered until now. It was
>>> introduced in 2007 by commit (b5fab14). Maybe there is no person who test
>>> with CONFIG_DEBUG_VM.
>> Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.
>>
>>> There is one more bug report same as this.
>>> * possible regression on 3.13 when calling flush_dcache_page
>>>    (lkml.org/lkml/2013/12/12/255)
>> That link doesn't show anything.
>>
>>> As mentioned in the description of commit (b5fab14), slab object may not be
>>> properly aligned and use of page oriented function to this object can be
>>> dangerous. I searched the XFS code and found that they only try to allocate
>>> multiple of 512 bytes, so there is no problem for now. But, IMHO, it is better
>>> not to use slab objects for this purpose.
>> If slab debugging is enabled, kmalloc memory is not aligned.
>>
>> In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses
>> page boundary - if it does, they free the kmalloc memory and allocate a
>> full page. Maybe this approach could still run into problems with some
>> bus-master adapters that assume alignment in hardware...
>>
>>
>> dm-bufio also does I/O to slab-allocated buffers, but it allocates the
>> object from slab (not kmalloc) with proper alignment.
> Hello,
>
> Okay. I see.
> Thanks for good explanation.
>
>>> And I rapidly searched every callsites of page_mapping() and, IMHO, this
>>> patch would work correctly. But possibly reverting original commit is
>>> better solution.
>> Reverting the original commit wouldn't fix that VM_BUG_ON.
> Initially, I thought that VM_BUG_ON() isn't wrong and it was better to remove
> the callsites where do I/O with slab-allocated buffers, because doing I/O
> with slab-allocated buffers needs a great care. So I didn't fully agreed with
> your patch and recommended to revert original commit yesterday. After reverting
> that, I would attempt to remove the callsites.
>
> But, now, I change my thought, because of your explanation. There are already
> some users to do I/O with slab-allocated buffers and they already did it with
> some cares, so I guess that admitting this usage is more beneficial than
> forbidding it.
>
> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

I can queue up this patch in my next pull request for the parisc tree, which I
plan to send tomorrow, unless people want this patch to go via the mm tree or
similar...
Please let me know.

Helge
Pekka Enberg Jan. 8, 2014, 9:37 p.m. UTC | #9
On Wed, Jan 8, 2014 at 11:05 PM, Helge Deller <deller@gmx.de> wrote:
> On 01/07/2014 02:41 AM, Joonsoo Kim wrote:
>>
>> On Mon, Jan 06, 2014 at 12:54:22PM -0500, Mikulas Patocka wrote:
>>>
>>> Hi
>>>
>>> On Mon, 6 Jan 2014, Joonsoo Kim wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm surprised that this VM_BUG_ON() has not been triggered until now. It
>>>> was
>>>> introduced in 2007 by commit (b5fab14). Maybe there is no person who
>>>> test
>>>> with CONFIG_DEBUG_VM.
>>>
>>> Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.
>>>
>>>> There is one more bug report same as this.
>>>> * possible regression on 3.13 when calling flush_dcache_page
>>>>    (lkml.org/lkml/2013/12/12/255)
>>>
>>> That link doesn't show anything.
>>>
>>>> As mentioned in the description of commit (b5fab14), slab object may not
>>>> be
>>>> properly aligned and use of page oriented function to this object can be
>>>> dangerous. I searched the XFS code and found that they only try to
>>>> allocate
>>>> multiple of 512 bytes, so there is no problem for now. But, IMHO, it is
>>>> better
>>>> not to use slab objects for this purpose.
>>>
>>> If slab debugging is enabled, kmalloc memory is not aligned.
>>>
>>> In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses
>>> page boundary - if it does, they free the kmalloc memory and allocate a
>>> full page. Maybe this approach could still run into problems with some
>>> bus-master adapters that assume alignment in hardware...
>>>
>>>
>>> dm-bufio also does I/O to slab-allocated buffers, but it allocates the
>>> object from slab (not kmalloc) with proper alignment.
>>
>> Hello,
>>
>> Okay. I see.
>> Thanks for good explanation.
>>
>>>> And I rapidly searched every callsites of page_mapping() and, IMHO, this
>>>> patch would work correctly. But possibly reverting original commit is
>>>> better solution.
>>>
>>> Reverting the original commit wouldn't fix that VM_BUG_ON.
>>
>> Initially, I thought that VM_BUG_ON() isn't wrong and it was better to
>> remove
>> the callsites where do I/O with slab-allocated buffers, because doing I/O
>> with slab-allocated buffers needs a great care. So I didn't fully agreed
>> with
>> your patch and recommended to revert original commit yesterday. After
>> reverting
>> that, I would attempt to remove the callsites.
>>
>> But, now, I change my thought, because of your explanation. There are
>> already
>> some users to do I/O with slab-allocated buffers and they already did it
>> with
>> some cares, so I guess that admitting this usage is more beneficial than
>> forbidding it.
>>
>> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
>
> I can queue up this patch in my next pull-request for the parisc-tree which
> I plan to
> send tomorrow, unless people want this patch to go via mm-tree or
> similar...
> Please let me know.

The patch looks good to me but it probably should go through Andrew's tree.

Acked-by: Pekka Enberg <penberg@kernel.org>
Helge Deller Jan. 8, 2014, 9:42 p.m. UTC | #10
On 01/08/2014 10:37 PM, Pekka Enberg wrote:
> On Wed, Jan 8, 2014 at 11:05 PM, Helge Deller <deller@gmx.de> wrote:
>> On 01/07/2014 02:41 AM, Joonsoo Kim wrote:
>>> On Mon, Jan 06, 2014 at 12:54:22PM -0500, Mikulas Patocka wrote:
>>>> Hi
>>>>
>>>> On Mon, 6 Jan 2014, Joonsoo Kim wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm surprised that this VM_BUG_ON() has not been triggered until now. It
>>>>> was
>>>>> introduced in 2007 by commit (b5fab14). Maybe there is no person who
>>>>> test
>>>>> with CONFIG_DEBUG_VM.
>>>> Last time I tried it, PA-RISC didn't work with CONFIG_DEBUG_VM at all.
>>>>
>>>>> There is one more bug report same as this.
>>>>> * possible regression on 3.13 when calling flush_dcache_page
>>>>>     (lkml.org/lkml/2013/12/12/255)
>>>> That link doesn't show anything.
>>>>
>>>>> As mentioned in the description of commit (b5fab14), slab object may not
>>>>> be
>>>>> properly aligned and use of page oriented function to this object can be
>>>>> dangerous. I searched the XFS code and found that they only try to
>>>>> allocate
>>>>> multiple of 512 bytes, so there is no problem for now. But, IMHO, it is
>>>>> better
>>>>> not to use slab objects for this purpose.
>>>> If slab debugging is enabled, kmalloc memory is not aligned.
>>>>
>>>> In XFS in xfs_buf_allocate_memory they test if the kmalloc memory crosses
>>>> page boundary - if it does, they free the kmalloc memory and allocate a
>>>> full page. Maybe this approach could still run into problems with some
>>>> bus-master adapters that assume alignment in hardware...
>>>>
>>>>
>>>> dm-bufio also does I/O to slab-allocated buffers, but it allocates the
>>>> object from slab (not kmalloc) with proper alignment.
>>> Hello,
>>>
>>> Okay. I see.
>>> Thanks for good explanation.
>>>
>>>>> And I rapidly searched every callsites of page_mapping() and, IMHO, this
>>>>> patch would work correctly. But possibly reverting original commit is
>>>>> better solution.
>>>> Reverting the original commit wouldn't fix that VM_BUG_ON.
>>> Initially, I thought that VM_BUG_ON() isn't wrong and it was better to
>>> remove
>>> the callsites where do I/O with slab-allocated buffers, because doing I/O
>>> with slab-allocated buffers needs a great care. So I didn't fully agreed
>>> with
>>> your patch and recommended to revert original commit yesterday. After
>>> reverting
>>> that, I would attempt to remove the callsites.
>>>
>>> But, now, I change my thought, because of your explanation. There are
>>> already
>>> some users to do I/O with slab-allocated buffers and they already did it
>>> with
>>> some cares, so I guess that admitting this usage is more beneficial than
>>> forbidding it.
>>>
>>> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>
>> I can queue up this patch in my next pull-request for the parisc-tree which
>> I plan to
>> send tomorrow, unless people want this patch to go via mm-tree or
>> similar...
>> Please let me know.
> The patch looks good to me but it probably should go through Andrew's tree.
>
> Acked-by: Pekka Enberg <penberg@kernel.org>

Absolutely fine with me. Andrew, can you please pick it up for 3.13?
Thanks,
Helge

Andrew Morton Jan. 8, 2014, 9:59 p.m. UTC | #11
On Wed, 8 Jan 2014 23:37:49 +0200 Pekka Enberg <penberg@kernel.org> wrote:

> The patch looks good to me but it probably should go through Andrew's tree.

yup.

page_mapping() will be called quite frequently, and adding a new
test-n-branch in there will be somewhat costly.  We might end up with a
better kernel if we were to instead revert 8456a648cf44f.  How useful
was that patch?
Joonsoo Kim Jan. 9, 2014, 12:13 a.m. UTC | #12
On Wed, Jan 08, 2014 at 01:59:30PM -0800, Andrew Morton wrote:
> On Wed, 8 Jan 2014 23:37:49 +0200 Pekka Enberg <penberg@kernel.org> wrote:
> 
> > The patch looks good to me but it probably should go through Andrew's tree.
> 
> yup.
> 
> page_mapping() will be called quite frequently, and adding a new
> test-n-branch in there will be somewhat costly.  We might end up with a
> better kernel if we were to instead revert 8456a648cf44f.  How useful
> was that patch?

Hello,

The performance effect of this patch was described in the cover letter, but
I neglected to include it in the patch description. Sorry about that.

In summary, this patch saves some memory and decreases the cache footprint,
which improves performance.

Here is the description from the cover letter.

Below are some numbers from 'cat /proc/slabinfo'.

* Before *
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables [snip...]
kmalloc-512          527    600    512    8    1 : tunables   54   27    0 : slabdata     75     75      0
kmalloc-256          210    210    256   15    1 : tunables  120   60    0 : slabdata     14     14      0
kmalloc-192         1040   1040    192   20    1 : tunables  120   60    0 : slabdata     52     52      0
kmalloc-96           750    750    128   30    1 : tunables  120   60    0 : slabdata     25     25      0
kmalloc-64          2773   2773     64   59    1 : tunables  120   60    0 : slabdata     47     47      0
kmalloc-128          660    690    128   30    1 : tunables  120   60    0 : slabdata     23     23      0
kmalloc-32         11200  11200     32  112    1 : tunables  120   60    0 : slabdata    100    100      0
kmem_cache           197    200    192   20    1 : tunables  120   60    0 : slabdata     10     10      0

* After *
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables [snip...]
kmalloc-512          525    640    512    8    1 : tunables   54   27    0 : slabdata     80     80      0
kmalloc-256          210    210    256   15    1 : tunables  120   60    0 : slabdata     14     14      0
kmalloc-192         1016   1040    192   20    1 : tunables  120   60    0 : slabdata     52     52      0
kmalloc-96           560    620    128   31    1 : tunables  120   60    0 : slabdata     20     20      0
kmalloc-64          2148   2280     64   60    1 : tunables  120   60    0 : slabdata     38     38      0
kmalloc-128          647    682    128   31    1 : tunables  120   60    0 : slabdata     22     22      0
kmalloc-32         11360  11413     32  113    1 : tunables  120   60    0 : slabdata    101    101      0
kmem_cache           197    200    192   20    1 : tunables  120   60    0 : slabdata     10     10      0

kmem_caches whose objects are 128 bytes or smaller fit one more object
per slab. You can see it in the objperslab column.
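
To make the objperslab change concrete, here is a rough sketch of the arithmetic. It is a toy model, not kernel code: the 48-byte management header and the 4-byte per-object index are assumptions picked for illustration, and objs_per_slab() is a hypothetical helper. The idea is only that dropping a per-slab header from the page leaves room for one more small object.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of an on-slab layout: an optional management header, then one
 * index entry per object, then the objects themselves.
 * All sizes passed in are illustrative assumptions, not kernel values.
 */
static size_t objs_per_slab(size_t slab_bytes, size_t header_bytes,
                            size_t obj_size, size_t index_bytes)
{
    assert(obj_size > 0);
    if (header_bytes >= slab_bytes)
        return 0;
    /* Each object costs its own size plus one index entry. */
    return (slab_bytes - header_bytes) / (obj_size + index_bytes);
}
```

With a 4096-byte slab, objs_per_slab(4096, 48, 128, 4) gives 30 and objs_per_slab(4096, 0, 128, 4) gives 31, which is the same step seen in the objperslab column for the 128-byte caches. Under the same assumed sizes the 64-byte and 32-byte rows come out as 59 vs. 60 and 112 vs. 113.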


Here are the performance results on my 4-CPU machine.

* Before *

 Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs):

       238,309,671 cache-misses                                                  ( +-  0.40% )

      12.010172090 seconds time elapsed                                          ( +-  0.21% )

* After *

 Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs):

       229,945,138 cache-misses                                                  ( +-  0.23% )

      11.627897174 seconds time elapsed                                          ( +-  0.14% )

cache-misses are reduced by this patchset, roughly 5%.
And elapsed times are also improved by 3.1% to baseline.
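
For anyone who wants to recompute the percentages from the raw perf counters, a minimal sketch follows. percent_reduction() is a hypothetical helper, not part of perf or the kernel.

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before` (positive = improvement). */
static double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Fed the counters quoted above, percent_reduction(238309671.0, 229945138.0) comes out at about 3.5 for cache-misses and percent_reduction(12.010172090, 11.627897174) at about 3.2 for elapsed time.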

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton Jan. 9, 2014, 12:19 a.m. UTC | #13
On Thu, 9 Jan 2014 09:13:31 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> On Wed, Jan 08, 2014 at 01:59:30PM -0800, Andrew Morton wrote:
> > On Wed, 8 Jan 2014 23:37:49 +0200 Pekka Enberg <penberg@kernel.org> wrote:
> > 
> > > The patch looks good to me but it probably should go through Andrew's tree.
> > 
> > yup.
> > 
> > page_mapping() will be called quite frequently, and adding a new
> > test-n-branch in there will be somewhat costly.  We might end up with a
> > better kernel if we were to instead revert 8456a648cf44f.  How useful
> > was that patch?
> 
> Hello,
> 
> The performance effect of this patch was described in the cover letter, but
> I forgot to include it in the patch description. Sorry about that.
> 
> In summary, this patch saves some memory and reduces the cache footprint,
> which improves performance.
> 
> Here is the description from the cover letter.
> 
> ...
>
> cache-misses are reduced by this patchset, roughly 5%.
> And elapsed times are also improved by 3.1% to baseline.

ah, OK, thanks, useful.  A few instructions added to page_mapping()
won't have effects like that!
Pekka Enberg Jan. 9, 2014, 8:35 a.m. UTC | #14
On Thu, Jan 9, 2014 at 2:19 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> cache-misses are reduced by this patchset, roughly 5%.
>> And elapsed times are also improved by 3.1% to baseline.
>
> ah, OK, thanks, useful.  A few instructions added to page_mapping()
> won't have effects like that!

Yup, I merged the series because the numbers were so impressive.

There's a link to the cover letter in merge commit 24f971a but it
would have been better to include the numbers in the changelog itself.

                     Pekka
Simon Baatz Jan. 9, 2014, 8:49 a.m. UTC | #15
Hi Mikulas,

On Sat, Jan 04, 2014 at 12:45:45PM -0500, Mikulas Patocka wrote:
> The patch 8456a648cf44f14365f1f44de90a3da2526a4776 causes crash in the
> LVM2 testsuite on PA-RISC (the crashing test is fsadm.sh). The testsuite
> doesn't crash on 3.12, crashes on 3.13-rc1 and later.
> 
>  Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
>  CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
>  task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000
> 
>       YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
>  PSW: 00001000000001101111100100001110 Not tainted
>  r00-03  000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
>  r04-07  00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
>  r08-11  0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
>  r12-15  0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
>  r16-19  000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
>  r20-23  0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
>  r24-27  00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
>  r28-31  202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
>  sr00-03  000000000532c000 0000000000000000 0000000000000000 000000000532c000
>  sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> 
>  IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
>   IIR: 539c0030    ISR: 00000000202d6000  IOR: 000006202224647d
>   CPU:        3   CR30: 000000413edd8000 CR31: 0000000000000000
>   ORIG_R28: 00000000405a95e0
>   IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
>   IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
>   RP(r2): flush_dcache_page+0x128/0x388
>  Backtrace:
>   [<000000004013e6c0>] flush_dcache_page+0x128/0x388
>   [<0000000010fe6ca0>] lo_splice_actor+0x90/0x148 [loop]
>   [<00000000402579b0>] splice_from_pipe_feed+0xc0/0x1d0
>   [<00000000402580a4>] __splice_from_pipe+0xac/0xc0
>   [<0000000010fe6bbc>] lo_direct_splice_actor+0x1c/0x70 [loop]
>   [<000000004025854c>] splice_direct_to_actor+0xec/0x228
>   [<0000000010fe63ac>] lo_receive+0xe4/0x298 [loop]
>   [<0000000010fe69d8>] loop_thread+0x478/0x640 [loop]
>   [<000000004018975c>] kthread+0x134/0x168
>   [<000000004012c020>] end_fault_vector+0x20/0x28
>   [<00000000115e0098>] xfs_setsize_buftarg+0x0/0x90 [xfs]
> 
>  Kernel panic - not syncing: Bad Address (null pointer deref?)
> 
> The patch 8456a648cf44f14365f1f44de90a3da2526a4776 changes the page
> structure so that the slab subsystem reuses the page->mapping field.
> 
> The crash happens in the following way:
> * XFS allocates some memory from slab and issues a bio to read data into
>   it.
> * the bio is sent to the loopback device.
> * lo_receive creates an actor and calls splice_direct_to_actor.
> * lo_splice_actor copies data to the target page.
> * lo_splice_actor calls flush_dcache_page because the page may be mapped
>   by userspace. In that case we need to flush the kernel cache.
> * flush_dcache_page asks for the list of userspace mappings, however that
>   page->mapping field is reused by the slab subsystem for a different
>   purpose. This causes the crash.
> 
> Note that other architectures without coherent caches (sparc, arm, mips)
> also call page_mapping from flush_dcache_page, so they may crash in the
> same way.
> 
> This patch fixes this bug by testing if the page is a slab page in
> page_mapping and returning NULL if it is.
> 
> 
> The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
> earlier kernels in the same scenario on architectures without cache
> coherence when CONFIG_DEBUG_VM is enabled - so it should be backported to
> stable kernels.
> 
> 
> In the old kernels, the function page_mapping is placed in
> include/linux/mm.h, so you should modify the patch accordingly when
> backporting it.
> 
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
> 
> ---
>  mm/util.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> Index: linux-3.13-rc6/mm/util.c
> ===================================================================
> --- linux-3.13-rc6.orig/mm/util.c	2014-01-04 00:06:07.000000000 +0100
> +++ linux-3.13-rc6/mm/util.c	2014-01-04 00:24:42.000000000 +0100
> @@ -390,7 +390,10 @@ struct address_space *page_mapping(struc
>  {
>  	struct address_space *mapping = page->mapping;
>  
> -	VM_BUG_ON(PageSlab(page));
> +	/* This happens if someone calls flush_dcache_page on slab page */
> +	if (unlikely(PageSlab(page)))
> +		return NULL;
> +
>  	if (unlikely(PageSwapCache(page))) {
>  		swp_entry_t entry;

I don't think that this is the correct fix. According to cachetlb.txt
flush_(kernel_)dcache_page() is not supposed to be called with a slab
page in the first place.  There is code in the kernel to avoid that
(see for example the discussion in [1] and [2]).

Also, on ARM, page_mapping() == NULL results in
flush_(kernel_)dcache_page() assuming that the page is an anon page.
Consequently, it would flush the slab page, which makes no sense.

Thus, I think we either need to add the check to the original caller
of flush_dcache_page(), or we allow flush_(kernel_)dcache_page() to be
called with slab pages and put the check there (Russell King once
proposed this [3], but it would affect multiple architectures).
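
To illustrate the two options, here is a small userspace sketch. Everything in it (struct toy_page, the toy_* functions, the flush counter) is a hypothetical stand-in and not the kernel's real definitions; it only shows where the slab check would sit in each case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's real definitions. */
struct toy_page {
    bool is_slab;      /* models PageSlab(page) */
    void *mapping;     /* models page->mapping; meaningless for slab pages */
    int flush_count;   /* how many times the page was flushed */
};

/* Models an arch that treats a NULL mapping as an anon page and flushes it. */
static void toy_flush_dcache_page(struct toy_page *page)
{
    assert(page != NULL);
    page->flush_count++;
}

/* Option 1: the original caller skips the flush for slab pages. */
static void toy_copy_done(struct toy_page *page)
{
    if (!page->is_slab)
        toy_flush_dcache_page(page);
}

/* Option 2: the flush routine itself tolerates slab pages. */
static void toy_flush_dcache_page_checked(struct toy_page *page)
{
    if (page->is_slab)
        return;
    toy_flush_dcache_page(page);
}
```

In this model a slab page passed to toy_flush_dcache_page() directly still gets flushed, which is the behaviour described above for a NULL mapping; both toy_copy_done() and toy_flush_dcache_page_checked() avoid that, the difference being whether one caller or every architecture carries the check.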

- Simon


[1] https://lkml.org/lkml/2013/10/24/414
[2] https://lkml.org/lkml/2013/10/28/432
[3] https://lkml.org/lkml/2013/10/27/89
Patch

Index: linux-3.13-rc6/mm/util.c
===================================================================
--- linux-3.13-rc6.orig/mm/util.c	2014-01-04 00:06:07.000000000 +0100
+++ linux-3.13-rc6/mm/util.c	2014-01-04 00:24:42.000000000 +0100
@@ -390,7 +390,10 @@  struct address_space *page_mapping(struc
 {
 	struct address_space *mapping = page->mapping;
 
-	VM_BUG_ON(PageSlab(page));
+	/* This happens if someone calls flush_dcache_page on slab page */
+	if (unlikely(PageSlab(page)))
+		return NULL;
+
 	if (unlikely(PageSwapCache(page))) {
 		swp_entry_t entry;
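
As a rough illustration of what the hunk above guards against, the following userspace sketch models a page whose mapping word is shared with slab-private data. The struct layout and all toy_* names are hypothetical, and the swap-cache handling visible in the diff context is left out; it is a sketch of the idea, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_address_space { int id; };

/* Userspace stand-in for struct page; field names are hypothetical. */
struct toy_page {
    bool is_slab;                           /* models PageSlab(page) */
    union {
        struct toy_address_space *mapping;  /* valid for page-cache pages */
        void *slab_data;                    /* slab's private use of the same word */
    };
};

/* Mirrors the shape of the patched check: slab pages report no mapping. */
static struct toy_address_space *toy_page_mapping(const struct toy_page *page)
{
    assert(page != NULL);
    if (page->is_slab)
        return NULL;        /* never hand slab-private bits to the caller */
    return page->mapping;
}
```

Without the is_slab test, a caller would receive whatever the slab allocator stored in the shared word and treat it as an address_space pointer, which matches the crash sequence described in the commit message.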