Message ID | CAOi1vP9igmg_jayELVDq_NYXUHtByZz5a4M5+CqCMMHUxF5ofA@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Hello, Ilya.

On Mon, Nov 16, 2015 at 09:59:18PM +0100, Ilya Dryomov wrote:
...
> Looking at __blkdev_put(), the issue becomes clear: we are taking
> precautions to flush before calling out to ->release() because, at
> least according to the comment, ->release() can free queue; we are
> recording owner pointer because put_disk() may free both gendisk and
> queue, and then, after all that, we are calling bdput() which may
> touch the queue through wb_put() in inode_detach_wb().  (The fun part
> is wb_put() is supposed to be a noop for root wbs, but slab debugging
> interferes with that by poisoning wb->bdi pointer.)
>
> 1514                  * dirty data before.
> 1515                  */
> 1516                 bdev_write_inode(bdev);
> 1517         }
> 1518         if (bdev->bd_contains == bdev) {
> 1519                 if (disk->fops->release)
> 1520                         disk->fops->release(disk, mode);
> 1521         }
> 1522         if (!bdev->bd_openers) {
> 1523                 struct module *owner = disk->fops->owner;
> 1524
> 1525                 disk_put_part(bdev->bd_part);
> 1526                 bdev->bd_part = NULL;
> 1527                 bdev->bd_disk = NULL;
> 1528                 if (bdev != bdev->bd_contains)
> 1529                         victim = bdev->bd_contains;
> 1530                 bdev->bd_contains = NULL;
> 1531
> 1532                 put_disk(disk);                <-- may free q
> 1533                 module_put(owner);
> 1534         }
> 1535         mutex_unlock(&bdev->bd_mutex);
> 1536         bdput(bdev);                           <-- may touch q.backing_dev_info.wb

Ah, that's a sneaky bug.  Thanks a lot for chasing it down.  The
scenario sounds completely plausible to me.

> To reproduce, apply the attached patch (systemd-udevd condition is just
> a convenience: udev reacts to change events by getting the bdev which
> it then has to put), boot with slub_debug=,blkdev_queue and do:
>
> $ sudo modprobe loop
> $ sudo losetup /dev/loop0 foo.img
> $ sudo dd if=/dev/urandom of=/dev/loop0 bs=1M count=1
> $ sudo losetup -d /dev/loop0
> $ sudo rmmod loop
>
> (rmmod is key - it's the only way to get loop to do put_disk().  For
> rbd, it's just rbd map - dd - rbd unmap.)
>
> In the past we used to reassign to default_backing_dev_info here, but
> it was nuked in b83ae6d42143 ("fs: remove mapping->backing_dev_info").

Woohoo, it wasn't me. :)

> Shortly after that cgroup-specific writebacks patches from Tejun got
> merged, adding inode::i_wb and inode_detach_wb() call.  The fix seems
> to be to detach the inode earlier, but I'm not familiar enough with
> cgroups code, so sending my findings instead of a patch.  Christoph,
> Tejun?

It's stinky that the bdi is going away while the inode is still there.
Yeah, blkdev inodes are special and created early but I think it makes
sense to keep the underlying structures (queue and bdi) around while
bdev is associated with it.  Would simply moving put_disk() after
bdput() work?

Thanks.
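For illustration, the reordering Tejun is asking about would look roughly
like this at the tail of __blkdev_put() (a sketch only, not a tested patch:
it assumes owner is declared as "struct module *owner = NULL;" at the top of
the function, and the elided parts stay as they are):

        if (!bdev->bd_openers) {
                owner = disk->fops->owner;      /* stash before disk can go away */
                disk_put_part(bdev->bd_part);
                bdev->bd_part = NULL;
                bdev->bd_disk = NULL;
                if (bdev != bdev->bd_contains)
                        victim = bdev->bd_contains;
                bdev->bd_contains = NULL;
        }
        mutex_unlock(&bdev->bd_mutex);
        bdput(bdev);                    /* may touch q->backing_dev_info.wb */
        if (owner) {
                put_disk(disk);         /* queue can now only go away after bdput() */
                module_put(owner);
        }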
On Tue, Nov 17, 2015 at 9:56 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Ilya.
>
> On Mon, Nov 16, 2015 at 09:59:18PM +0100, Ilya Dryomov wrote:
> ...
>> Looking at __blkdev_put(), the issue becomes clear: we are taking
>> precautions to flush before calling out to ->release() because, at
>> least according to the comment, ->release() can free queue; we are
>> recording owner pointer because put_disk() may free both gendisk and
>> queue, and then, after all that, we are calling bdput() which may
>> touch the queue through wb_put() in inode_detach_wb().  (The fun part
>> is wb_put() is supposed to be a noop for root wbs, but slab debugging
>> interferes with that by poisoning wb->bdi pointer.)
>>
>> 1514                  * dirty data before.
>> 1515                  */
>> 1516                 bdev_write_inode(bdev);
>> 1517         }
>> 1518         if (bdev->bd_contains == bdev) {
>> 1519                 if (disk->fops->release)
>> 1520                         disk->fops->release(disk, mode);
>> 1521         }
>> 1522         if (!bdev->bd_openers) {
>> 1523                 struct module *owner = disk->fops->owner;
>> 1524
>> 1525                 disk_put_part(bdev->bd_part);
>> 1526                 bdev->bd_part = NULL;
>> 1527                 bdev->bd_disk = NULL;
>> 1528                 if (bdev != bdev->bd_contains)
>> 1529                         victim = bdev->bd_contains;
>> 1530                 bdev->bd_contains = NULL;
>> 1531
>> 1532                 put_disk(disk);                <-- may free q
>> 1533                 module_put(owner);
>> 1534         }
>> 1535         mutex_unlock(&bdev->bd_mutex);
>> 1536         bdput(bdev);                           <-- may touch q.backing_dev_info.wb
>
> Ah, that's a sneaky bug.  Thanks a lot for chasing it down.  The
> scenario sounds completely plausible to me.
>
>> To reproduce, apply the attached patch (systemd-udevd condition is just
>> a convenience: udev reacts to change events by getting the bdev which
>> it then has to put), boot with slub_debug=,blkdev_queue and do:
>>
>> $ sudo modprobe loop
>> $ sudo losetup /dev/loop0 foo.img
>> $ sudo dd if=/dev/urandom of=/dev/loop0 bs=1M count=1
>> $ sudo losetup -d /dev/loop0
>> $ sudo rmmod loop
>>
>> (rmmod is key - it's the only way to get loop to do put_disk().  For
>> rbd, it's just rbd map - dd - rbd unmap.)
>>
>> In the past we used to reassign to default_backing_dev_info here, but
>> it was nuked in b83ae6d42143 ("fs: remove mapping->backing_dev_info").
>
> Woohoo, it wasn't me. :)

Well, you and Christoph have been pulling it in different directions.
He removed default_backing_dev_info along with mapping->backing_dev_info,
but you then kind of readded this link with inode->i_wb.

>
>> Shortly after that cgroup-specific writebacks patches from Tejun got
>> merged, adding inode::i_wb and inode_detach_wb() call.  The fix seems
>> to be to detach the inode earlier, but I'm not familiar enough with
>> cgroups code, so sending my findings instead of a patch.  Christoph,
>> Tejun?
>
> It's stinky that the bdi is going away while the inode is still there.
> Yeah, blkdev inodes are special and created early but I think it makes
> sense to keep the underlying structures (queue and bdi) around while
> bdev is associated with it.  Would simply moving put_disk() after
> bdput() work?

I'd think so.  struct block_device is essentially a "block device"
pseudo-filesystem inode, and as such, may not be around during the
entire lifetime of gendisk / queue.  It may be kicked out of the inode
cache as soon as the device is closed, so it makes sense to put it
before putting gendisk / queue, which will outlive it.

However, I'm confused by this comment

        /*
         * ->release can cause the queue to disappear, so flush all
         * dirty data before.
         */
        bdev_write_inode(bdev);

It's not true, at least since your 523e1d399ce0 ("block: make gendisk
hold a reference to its queue"), right?  (It used to say "->release can
cause the old bdi to disappear, so must switch it out first" and was
changed by Christoph in the middle of his backing_dev_info series.)

Thanks,

                Ilya
Hello, Ilya.

On Wed, Nov 18, 2015 at 04:12:07PM +0100, Ilya Dryomov wrote:
> > It's stinky that the bdi is going away while the inode is still there.
> > Yeah, blkdev inodes are special and created early but I think it makes
> > sense to keep the underlying structures (queue and bdi) around while
> > bdev is associated with it.  Would simply moving put_disk() after
> > bdput() work?
>
> I'd think so.  struct block_device is essentially a "block device"
> pseudo-filesystem inode, and as such, may not be around during the
> entire lifetime of gendisk / queue.  It may be kicked out of the inode
> cache as soon as the device is closed, so it makes sense to put it
> before putting gendisk / queue, which will outlive it.
>
> However, I'm confused by this comment
>
>         /*
>          * ->release can cause the queue to disappear, so flush all
>          * dirty data before.
>          */
>         bdev_write_inode(bdev);
>
> It's not true, at least since your 523e1d399ce0 ("block: make gendisk
> hold a reference to its queue"), right?  (It used to say "->release can
> cause the old bdi to disappear, so must switch it out first" and was
> changed by Christoph in the middle of his backing_dev_info series.)

Right, it started with each layer going away separately, which tends
to get tricky with hotunplug, and we've been gradually moving towards
a model where the entire stack stays till the last ref is gone, so
yeah the comment isn't true anymore.

Thanks.
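For context on why the comment is stale: since 523e1d399ce0 the gendisk
pins the request_queue, so the queue can only disappear once the last
put_disk() drops the gendisk.  Roughly (a paraphrase of the add_disk() /
disk_release() paths of that era, not verbatim kernel code):

        void add_disk(struct gendisk *disk)
        {
                ...
                /* queue stays pinned for the lifetime of the gendisk */
                WARN_ON_ONCE(!blk_get_queue(disk->queue));
                ...
        }

        static void disk_release(struct device *dev)
        {
                struct gendisk *disk = dev_to_disk(dev);
                ...
                if (disk->queue)
                        blk_put_queue(disk->queue);     /* last ref frees the queue */
                kfree(disk);
        }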
On Wed, Nov 18, 2015 at 4:30 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Ilya.
>
> On Wed, Nov 18, 2015 at 04:12:07PM +0100, Ilya Dryomov wrote:
>> > It's stinky that the bdi is going away while the inode is still there.
>> > Yeah, blkdev inodes are special and created early but I think it makes
>> > sense to keep the underlying structures (queue and bdi) around while
>> > bdev is associated with it.  Would simply moving put_disk() after
>> > bdput() work?
>>
>> I'd think so.  struct block_device is essentially a "block device"
>> pseudo-filesystem inode, and as such, may not be around during the
>> entire lifetime of gendisk / queue.  It may be kicked out of the inode
>> cache as soon as the device is closed, so it makes sense to put it
>> before putting gendisk / queue, which will outlive it.
>>
>> However, I'm confused by this comment
>>
>>         /*
>>          * ->release can cause the queue to disappear, so flush all
>>          * dirty data before.
>>          */
>>         bdev_write_inode(bdev);
>>
>> It's not true, at least since your 523e1d399ce0 ("block: make gendisk
>> hold a reference to its queue"), right?  (It used to say "->release can
>> cause the old bdi to disappear, so must switch it out first" and was
>> changed by Christoph in the middle of his backing_dev_info series.)
>
> Right, it started with each layer going away separately, which tends
> to get tricky with hotunplug, and we've been gradually moving towards
> a model where the entire stack stays till the last ref is gone, so
> yeah the comment isn't true anymore.

OK, I'll try to work up a patch to do bdput before put_disk and also
drop this comment.

Just to be clear, the bdi/wb vs inode lifetime rules are that inodes
should always be within bdi/wb?  There's been a lot of churn in this
and related areas recently, including in block drivers: 6cd18e711dd8
("block: destroy bdi before blockdev is unregistered"), b02176f30cd3
("block: don't release bdi while request_queue has live references"),
so I want to fully get my head around this.

Thanks,

                Ilya
Hello, Ilya.

On Wed, Nov 18, 2015 at 04:48:06PM +0100, Ilya Dryomov wrote:
> Just to be clear, the bdi/wb vs inode lifetime rules are that inodes
> should always be within bdi/wb?  There's been a lot of churn in this

Yes, that's where *I* think we should be headed.  Stuff in lower
layers should stick around while upper layer things are around.

> and related areas recently, including in block drivers: 6cd18e711dd8
> ("block: destroy bdi before blockdev is unregistered"), b02176f30cd3
> ("block: don't release bdi while request_queue has live references"),
> so I want to fully get my head around this.

End-of-life issue has always been a bit of mess in the block layer.
Thanks a lot for working on this.
On Wed, Nov 18, 2015 at 4:48 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Wed, Nov 18, 2015 at 4:30 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello, Ilya.
>>
>> On Wed, Nov 18, 2015 at 04:12:07PM +0100, Ilya Dryomov wrote:
>>> > It's stinky that the bdi is going away while the inode is still there.
>>> > Yeah, blkdev inodes are special and created early but I think it makes
>>> > sense to keep the underlying structures (queue and bdi) around while
>>> > bdev is associated with it.  Would simply moving put_disk() after
>>> > bdput() work?
>>>
>>> I'd think so.  struct block_device is essentially a "block device"
>>> pseudo-filesystem inode, and as such, may not be around during the
>>> entire lifetime of gendisk / queue.  It may be kicked out of the inode
>>> cache as soon as the device is closed, so it makes sense to put it
>>> before putting gendisk / queue, which will outlive it.
>>>
>>> However, I'm confused by this comment
>>>
>>>         /*
>>>          * ->release can cause the queue to disappear, so flush all
>>>          * dirty data before.
>>>          */
>>>         bdev_write_inode(bdev);
>>>
>>> It's not true, at least since your 523e1d399ce0 ("block: make gendisk
>>> hold a reference to its queue"), right?  (It used to say "->release can
>>> cause the old bdi to disappear, so must switch it out first" and was
>>> changed by Christoph in the middle of his backing_dev_info series.)
>>
>> Right, it started with each layer going away separately, which tends
>> to get tricky with hotunplug, and we've been gradually moving towards
>> a model where the entire stack stays till the last ref is gone, so
>> yeah the comment isn't true anymore.
>
> OK, I'll try to work up a patch to do bdput before put_disk and also
> drop this comment.

Doing bdput before put_disk in fs/block_dev.c helps, but isn't enough.
There is nothing guaranteeing that our bdput in __blkdev_put() is the
last one.  One particular issue is the linkage between /dev inodes and
bdev internal inodes.  /dev inodes hold bdev inodes, so:

186 static void __fput(struct file *file)
187 {
188         struct dentry *dentry = file->f_path.dentry;
189         struct vfsmount *mnt = file->f_path.mnt;
190         struct inode *inode = file->f_inode;
191
192         might_sleep();
193
194         fsnotify_close(file);
195         /*
196          * The function eventpoll_release() should be the first called
197          * in the file cleanup chain.
198          */
199         eventpoll_release(file);
200         locks_remove_file(file);
201
202         if (unlikely(file->f_flags & FASYNC)) {
203                 if (file->f_op->fasync)
204                         file->f_op->fasync(-1, file, 0);
205         }
206         ima_file_free(file);
207         if (file->f_op->release)
208                 file->f_op->release(inode, file);

This translates to blkdev_put().  Suppose in response to this release
block device driver dropped gendisk, queue, etc.  Then we, still in
blkdev_put(), did our bdput and put_disk.  The queue is now gone, but
there's still a ref on the bdev inode - from the /dev inode.  When the
latter gets evicted thanks to dput below, we end up in bd_forget(),
which finishes up with iput(bdev->bd_inode)...

209         security_file_free(file);
210         if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
211                      !(file->f_mode & FMODE_PATH))) {
212                 cdev_put(inode->i_cdev);
213         }
214         fops_put(file->f_op);
215         put_pid(file->f_owner.pid);
216         if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
217                 i_readcount_dec(inode);
218         if (file->f_mode & FMODE_WRITER) {
219                 put_write_access(inode);
220                 __mnt_drop_write(mnt);
221         }
222         file->f_path.dentry = NULL;
223         file->f_path.mnt = NULL;
224         file->f_inode = NULL;
225         file_free(file);
226         dput(dentry);
227         mntput(mnt);
228 }

On Wed, Nov 18, 2015 at 4:56 PM, Tejun Heo <tj@kernel.org> wrote:
> On Wed, Nov 18, 2015 at 04:48:06PM +0100, Ilya Dryomov wrote:
>> Just to be clear, the bdi/wb vs inode lifetime rules are that inodes
>> should always be within bdi/wb?  There's been a lot of churn in this
>
> Yes, that's where *I* think we should be headed.  Stuff in lower
> layers should stick around while upper layer things are around

I think the fundamental problem is the embedding of bdi in the queue.
The lifetime rules (or, rather, expectations) for the two seem to be
completely different and, while used together, they belong to different
subsystems.  Even if we find a way to fix this particular race, there
is a good chance someone will reintroduce it in the future, perhaps in
a more subtle way.

Thanks,

                Ilya
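Spelled out, the window Ilya is describing looks roughly like this (an
illustrative timeline for the loop case, assuming the proposed
bdput-before-put_disk ordering; the device name is just an example):

        close(/dev/loop0)                       last userspace close
          __fput()
            blkdev_put() -> __blkdev_put()
              disk->fops->release()             driver drops its gendisk/queue refs
              bdput(bdev)                       not the final ref: the /dev inode
                                                still pins the bdev inode
              put_disk(disk)                    last gendisk ref - queue and its
                                                embedded bdi/wb are freed
            dput(dentry)                        /dev inode evicted some time later
              ...
                bd_forget() -> iput(bdev->bd_inode)
                  inode_detach_wb()             wb_put() on the already-freed wb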
Hello, Ilya.

On Thu, Nov 19, 2015 at 09:56:21PM +0100, Ilya Dryomov wrote:
> > Yes, that's where *I* think we should be headed.  Stuff in lower
> > layers should stick around while upper layer things are around
>
> I think the fundamental problem is the embedding of bdi in the queue.
> The lifetime rules (or, rather, expectations) for the two seem to be
> completely different and, while used together, they belong to different
> subsystems.  Even if we find a way to fix this particular race, there
> is a good chance someone will reintroduce it in the future, perhaps in
> a more subtle way.

You're right.  This is nasty.  Hmmm... the root problem is that beyond
the last __blkdev_put() the bdev and disk don't really have anything
to do with each other but the bdev is still pointing to it.  We are
already guaranteeing that the underlying disk hangs around while there
are bdevs associated with it.

We already know that the bdev is idle once bd_openers hits zero and
the inode gets flushed, so at that point, the problem is bdev's
inode->i_wb is still pointing to something that the bdev doesn't have
anything to do with.  So, can we do inode_detach_wb() after flushing
the inode?

Thanks.
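One shape Tejun's suggestion could take (a sketch only, assuming it is safe
to detach the wb once bd_openers has hit zero and the inode has been flushed;
not necessarily the form of any patch that was eventually merged):

        if (!--bdev->bd_openers) {
                ...
                bdev_write_inode(bdev);
                /*
                 * The bdev is idle now; detach its inode from the wb so that
                 * a later iput() of bdev->bd_inode cannot touch the queue/bdi
                 * that put_disk() may free below.
                 */
                inode_detach_wb(bdev->bd_inode);
        }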
On Thu, Nov 19, 2015 at 10:18 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Ilya.
>
> On Thu, Nov 19, 2015 at 09:56:21PM +0100, Ilya Dryomov wrote:
>> > Yes, that's where *I* think we should be headed.  Stuff in lower
>> > layers should stick around while upper layer things are around
>>
>> I think the fundamental problem is the embedding of bdi in the queue.
>> The lifetime rules (or, rather, expectations) for the two seem to be
>> completely different and, while used together, they belong to different
>> subsystems.  Even if we find a way to fix this particular race, there
>> is a good chance someone will reintroduce it in the future, perhaps in
>> a more subtle way.
>
> You're right.  This is nasty.  Hmmm... the root problem is that beyond
> the last __blkdev_put() the bdev and disk don't really have anything
> to do with each other but the bdev is still pointing to it.  We are
> already guaranteeing that the underlying disk hangs around while there
> are bdevs associated with it.
>
> We already know that the bdev is idle once bd_openers hits zero and
> the inode gets flushed, so at that point, the problem is bdev's
> inode->i_wb is still pointing to something that the bdev doesn't have
> anything to do with.  So, can we do inode_detach_wb() after flushing
> the inode?

Detaching the inode earlier is what I suggested in the first email, but
I didn't know if this kind of special casing was OK.  I'll try it out.

Thanks,

                Ilya
Hello,

On Thu, Nov 19, 2015 at 10:56:43PM +0100, Ilya Dryomov wrote:
> Detaching the inode earlier is what I suggested in the first email, but
> I didn't know if this kind of special casing was OK.  I'll try it out.

Yeah, I was confused.  Sorry about that.  On the surface, it looks like
a special case but everything around bdev is special case anyway and
looking at the underlying lifetime rules, I think this is the right
thing to do.

Thanks.
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 565b8dac5782..5a4ff505fd12 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -552,6 +552,7 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 {
 	struct request_queue *q = container_of(rcu_head, struct request_queue,
 					       rcu_head);
+	printk("freeing q\n");
 	kmem_cache_free(blk_requestq_cachep, q);
 }
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..6475dac5f3bc 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1496,6 +1496,8 @@ static int blkdev_open(struct inode * inode, struct file * filp)
 	return blkdev_get(bdev, filp->f_mode, filp);
 }
 
+#include <linux/delay.h>
+
 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 {
 	struct gendisk *disk = bdev->bd_disk;
@@ -1531,6 +1533,12 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 
 		put_disk(disk);
 		module_put(owner);
+
+		if (!strcmp(current->comm, "systemd-udevd")) {
+			printk("sleep start %d\n", task_pid_nr(current));
+			ssleep(3);
+			printk("sleep end %d\n", task_pid_nr(current));
+		}
 	}
 	mutex_unlock(&bdev->bd_mutex);
 	bdput(bdev);
diff --git a/fs/inode.c b/fs/inode.c
index 1be5f9003eb3..10625eeb7816 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1490,6 +1490,8 @@ void iput(struct inode *inode)
 {
 	if (!inode)
 		return;
+	if (inode->i_wb)
+		BUG_ON(atomic_long_read(&inode->i_wb->refcnt.count) == 0x6b6b6b6b6b6b6b6b);
 	BUG_ON(inode->i_state & I_CLEAR);
 retry:
 	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
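For reference, the reproduction sequence quoted earlier in the thread, run
with the above debug patch applied and the kernel booted with
slub_debug=,blkdev_queue (foo.img is any scratch image file):

        $ sudo modprobe loop
        $ sudo losetup /dev/loop0 foo.img
        $ sudo dd if=/dev/urandom of=/dev/loop0 bs=1M count=1
        $ sudo losetup -d /dev/loop0
        $ sudo rmmod loop      # rmmod is what finally makes loop call put_disk()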