Message ID | 1448054554-24138-1-git-send-email-idryomov@gmail.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, Nov 20, 2015 at 10:22:34PM +0100, Ilya Dryomov wrote:
> Since 52ebea749aae ("writeback: make backing_dev_info host
> cgroup-specific bdi_writebacks") an inode, at some point in its
> lifetime, gets attached to a wb (struct bdi_writeback). Detaching
> happens on evict, in inode_detach_wb() called from __destroy_inode(),
> and involves updating wb.
>
> However, detaching an internal bdev inode from its wb in
> __destroy_inode() is too late. Its bdi and by extension root wb are
> embedded into struct request_queue, which has different lifetime rules
> and can be freed long before the final bdput() is called (can be from
> __fput() of a corresponding /dev inode, through dput() - evict() -
> bd_forget()). bdevs hold onto the underlying disk/queue pair only
> while opened; as soon as bdev is closed all bets are off. In fact,
> disk/queue can be gone before __blkdev_put() even returns:
>
> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
> 1500 {
> ...
> 1518     if (bdev->bd_contains == bdev) {
> 1519         if (disk->fops->release)
> 1520             disk->fops->release(disk, mode);
>
>              [ Driver puts its references to disk/queue ]
>
> 1521     }
> 1522     if (!bdev->bd_openers) {
> 1523         struct module *owner = disk->fops->owner;
> 1524
> 1525         disk_put_part(bdev->bd_part);
> 1526         bdev->bd_part = NULL;
> 1527         bdev->bd_disk = NULL;
> 1528         if (bdev != bdev->bd_contains)
> 1529             victim = bdev->bd_contains;
> 1530         bdev->bd_contains = NULL;
> 1531
> 1532         put_disk(disk);
>
>              [ We put ours, the queue is gone
>                The last bdput() would result in a write to invalid memory ]
>
> 1533         module_put(owner);
> ...
> 1539 }
>
> Since bdev inodes are special anyway, detach them in __blkdev_put()
> after clearing inode's dirty bits, turning the problematic
> inode_detach_wb() in __destroy_inode() into a noop.
>
> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
> gendisk hold a reference to its queue"), so the old ->release comment
> is removed in favor of the new inode_detach_wb() comment.
>
> Cc: stable@vger.kernel.org # 4.2+, needs backporting
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
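The object relationships the patch describes look roughly like the
simplified sketch below. The structs are trimmed to the relevant
fields and are not the real (much larger) kernel definitions, though
the field names follow the 4.3-era code: the bdi is embedded in the
request_queue, the root wb is embedded in the bdi, and the inode only
caches a pointer into that chain.

/*
 * Simplified sketch of the relationships described above; trimmed to
 * the relevant fields, not the real kernel layouts.
 */
struct bdi_writeback {
	unsigned long state;			/* stand-in for dirty lists, counters, ... */
};

struct backing_dev_info {
	struct bdi_writeback wb;		/* root wb, embedded in the bdi */
};

struct request_queue {
	struct backing_dev_info backing_dev_info;	/* embedded: freed with the queue */
};

struct inode {
	struct bdi_writeback *i_wb;		/* cached by cgroup writeback; for a bdev
						 * inode it points into the queue above and
						 * dangles once the driver drops the
						 * disk/queue pair */
};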
The right fix is to never point from the inode to a BDI. I fixed this
with a lot of effort, and the BDI writeback series put it back a little
later. We need to revert that damage (not necessarily literally, but
the effect).
Hello, Christoph.

On Sun, Nov 22, 2015 at 04:02:16PM +0100, Christoph Hellwig wrote:
> The right fix is to never point from the inode to a BDI. I fixed this
> with a lot of effort, and the BDI writeback series put it back a little
> later. We need to revert that damage (not necessarily literally, but
> the effect).

Can you please explain a bit more why an inode pointing to its
associated bdi_writeback is a bad idea? The only alternative would be
recording a key and looking up each time. We sure can do that but
I don't get why we would want to.

Thanks.
On Mon, Nov 23, 2015 at 10:35:52AM -0500, Tejun Heo wrote:
> Hello, Christoph.
>
> On Sun, Nov 22, 2015 at 04:02:16PM +0100, Christoph Hellwig wrote:
> > The right fix is to never point from the inode to a BDI. I fixed this
> > with a lot of effort, and the BDI writeback series put it back a little
> > later. We need to revert that damage (not necessarily literally, but
> > the effect).
>
> Can you please explain a bit more why an inode pointing to its
> associated bdi_writeback is a bad idea? The only alternative would be
> recording a key and looking up each time. We sure can do that but
> I don't get why we would want to.

Because the writeback context really is a per-super-block concept
(except for the magic block device inodes). Having another pointer in
the inode (or address_space back then) led to all kinds of confusing
bugs and lifetime issues, never mind massively bloating the inode for
no reason. But then again bloating the inode has been rather en vogue
lately anyway.
Hello, Christoph.

On Tue, Nov 24, 2015 at 08:54:57AM +0100, Christoph Hellwig wrote:
> > Can you please explain a bit more why an inode pointing to its
> > associated bdi_writeback is a bad idea? The only alternative would be
> > recording a key and looking up each time. We sure can do that but
> > I don't get why we would want to.
>
> Because the writeback context really is a per-super-block concept
> (except for the magic block device inodes). Having another pointer in

With cgroup writeback, inodes belonging to the same superblock can
belong to different writeback domains, so it is no longer strictly a
per-super-block concept.

> the inode (or address_space back then) led to all kinds of confusing
> bugs and lifetime issues, never mind massively bloating the inode for

The bdev case is different because a bdev's lifetime doesn't coincide
with the actual object it's representing, but the root cause there is
lower-layer objects being destroyed while higher-layer objects
depending on them are still around. Severing the inode -> lower-layer
object dependency may be able to solve some issues, but those
approaches are usually pretty brittle. The right thing to do is
letting objects get torn down in order by maintaining proper refs.

> no reason. But then again bloating the inode has been rather en vogue
> lately anyway.

How is it for no reason when there are clearly purposes for them? If
the extra pointer is problematic, we can try to split sb's so that we
can encode writeback domain information by pointing to
per-writeback-domain split sb's, but I'm not sure the added complexity
would be justifiable.

Thanks.
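To make the trade-off being discussed concrete, here is an
illustrative-only contrast of the two approaches. The cached-pointer
variant reflects what cgroup writeback does today via the inode's
->i_wb field; the lookup variant is purely hypothetical, and
lookup_wb_by_key() and ->i_wb_key are invented names for this sketch,
not real kernel APIs.

/*
 * Illustrative contrast only -- lookup_wb_by_key() and ->i_wb_key are
 * invented for this sketch; the kernel only has the cached ->i_wb.
 */
static struct bdi_writeback *wb_via_cached_pointer(struct inode *inode)
{
	/*
	 * What cgroup writeback does: O(1) on every writeback operation,
	 * but the pointer must be severed (inode_detach_wb()) before the
	 * object it points into is freed.
	 */
	return inode->i_wb;
}

static struct bdi_writeback *wb_via_lookup(struct inode *inode)
{
	/*
	 * The alternative mentioned above: keep only a key in the inode
	 * and resolve it on each use -- no pointer to keep from dangling,
	 * at the cost of a lookup on hot writeback paths.
	 */
	return lookup_wb_by_key(inode->i_sb, inode->i_wb_key);
}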
* Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:

> Since 52ebea749aae ("writeback: make backing_dev_info host
> cgroup-specific bdi_writebacks") an inode, at some point in its
> lifetime, gets attached to a wb (struct bdi_writeback). Detaching
> happens on evict, in inode_detach_wb() called from __destroy_inode(),
> and involves updating wb.
>
> [...]
>
> Cc: stable@vger.kernel.org # 4.2+, needs backporting
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> ---

Feel free to add
Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

I was facing a bad memory access problem while creating thousands of
containers. With this patch I am able to create more than 10k
containers without hitting the problem.

I had reported the problem here: https://lkml.org/lkml/2015/11/19/149
On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
<raghavendra.kt@linux.vnet.ibm.com> wrote:
> * Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:
>
>> [...]
>>
>> Cc: stable@vger.kernel.org # 4.2+, needs backporting
>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>> ---
>
> Feel free to add
> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> I was facing a bad memory access problem while creating thousands of
> containers. With this patch I am able to create more than 10k
> containers without hitting the problem.
>
> I had reported the problem here: https://lkml.org/lkml/2015/11/19/149

Great! Christoph's concern is with ->i_wb as a whole, not this
particular patch - Al, this one is marked for stable, can we get it
merged into -rc4? Or should it go through Jens' tree, as cgroup
writeback patches did?

Thanks,

Ilya
On Mon, Nov 30, 2015 at 11:54 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>> [...]
>> Feel free to add
>> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> [...]
>
> Great! Christoph's concern is with ->i_wb as a whole, not this
> particular patch - Al, this one is marked for stable, can we get it
> merged into -rc4? Or should it go through Jens' tree, as cgroup
> writeback patches did?

Ping?

Thanks,

Ilya
On 12/04/2015 07:07 AM, Ilya Dryomov wrote:
> On Mon, Nov 30, 2015 at 11:54 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
>> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>> [...]
>>> Feel free to add
>>> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>> [...]
>>
>> Great! Christoph's concern is with ->i_wb as a whole, not this
>> particular patch - Al, this one is marked for stable, can we get it
>> merged into -rc4? Or should it go through Jens' tree, as cgroup
>> writeback patches did?
>
> Ping?

Was holding off, but I think we should get the simple fix in for 4.4.
I've applied it for this series.
On Fri, Dec 4, 2015 at 6:44 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 12/04/2015 07:07 AM, Ilya Dryomov wrote:
>> [...]
>>
>> Ping?
>
> Was holding off, but I think we should get the simple fix in for 4.4.
> I've applied it for this series.

I think you missed Raghavendra's Tested-by (I'm looking at
https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=f2148d49930497c66d38b13f137c30d78c60da3d).

Thanks,

Ilya
On 12/04/2015 10:55 AM, Ilya Dryomov wrote:
> On Fri, Dec 4, 2015 at 6:44 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> [...]
>> Was holding off, but I think we should get the simple fix in for 4.4.
>> I've applied it for this series.
>
> I think you missed Raghavendra's Tested-by (I'm looking at
> https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=f2148d49930497c66d38b13f137c30d78c60da3d).

I did, fixed.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..6f8dd407c533 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1509,11 +1509,14 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		WARN_ON_ONCE(bdev->bd_holders);
 		sync_blockdev(bdev);
 		kill_bdev(bdev);
+
+		bdev_write_inode(bdev);
 		/*
-		 * ->release can cause the queue to disappear, so flush all
-		 * dirty data before.
+		 * Detaching bdev inode from its wb in __destroy_inode()
+		 * is too late: the queue which embeds its bdi (along with
+		 * root wb) can be gone as soon as we put_disk() below.
 		 */
-		bdev_write_inode(bdev);
+		inode_detach_wb(bdev->bd_inode);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
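The fix hinges on ordering within the last-close path: the inode must
still have its wb while it is being written out, and must have dropped
it before anything that can free the queue runs. The annotated sketch
below restates that ordering; it is a condensed paraphrase of the hunk
above, not the literal kernel code, and last_close_teardown() is an
invented name for illustration.

/*
 * Condensed paraphrase of the ordering the patch establishes in the
 * !bd_openers branch of __blkdev_put(); not the literal kernel code.
 */
static void last_close_teardown(struct block_device *bdev)
{
	sync_blockdev(bdev);		/* flush dirty pages to the device */
	kill_bdev(bdev);		/* invalidate the bdev page cache */

	/* Write out the bdev inode itself; this may still use ->i_wb. */
	bdev_write_inode(bdev);

	/*
	 * Drop the wb reference now, while the queue that embeds the bdi
	 * (and its root wb) is still around; the inode_detach_wb() later
	 * in __destroy_inode() then becomes a noop.
	 */
	inode_detach_wb(bdev->bd_inode);

	/*
	 * Only after this point may ->release() and put_disk() free the
	 * disk/queue pair.
	 */
}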
Since 52ebea749aae ("writeback: make backing_dev_info host
cgroup-specific bdi_writebacks") an inode, at some point in its
lifetime, gets attached to a wb (struct bdi_writeback). Detaching
happens on evict, in inode_detach_wb() called from __destroy_inode(),
and involves updating wb.

However, detaching an internal bdev inode from its wb in
__destroy_inode() is too late. Its bdi and by extension root wb are
embedded into struct request_queue, which has different lifetime rules
and can be freed long before the final bdput() is called (can be from
__fput() of a corresponding /dev inode, through dput() - evict() -
bd_forget()). bdevs hold onto the underlying disk/queue pair only
while opened; as soon as bdev is closed all bets are off. In fact,
disk/queue can be gone before __blkdev_put() even returns:

1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
1500 {
...
1518     if (bdev->bd_contains == bdev) {
1519         if (disk->fops->release)
1520             disk->fops->release(disk, mode);

             [ Driver puts its references to disk/queue ]

1521     }
1522     if (!bdev->bd_openers) {
1523         struct module *owner = disk->fops->owner;
1524
1525         disk_put_part(bdev->bd_part);
1526         bdev->bd_part = NULL;
1527         bdev->bd_disk = NULL;
1528         if (bdev != bdev->bd_contains)
1529             victim = bdev->bd_contains;
1530         bdev->bd_contains = NULL;
1531
1532         put_disk(disk);

             [ We put ours, the queue is gone
               The last bdput() would result in a write to invalid memory ]

1533         module_put(owner);
...
1539 }

Since bdev inodes are special anyway, detach them in __blkdev_put()
after clearing inode's dirty bits, turning the problematic
inode_detach_wb() in __destroy_inode() into a noop.

add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
gendisk hold a reference to its queue"), so the old ->release comment
is removed in favor of the new inode_detach_wb() comment.

Cc: stable@vger.kernel.org # 4.2+, needs backporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 fs/block_dev.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)