Message ID | 20210511181558.380764-1-gulam.mohamed@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | [V1,1/1] Fix race between iscsi logout and systemd-udevd |
On Tue, May 11, 2021 at 06:15:58PM +0000, Gulam Mohamed wrote:
> Problem description:
>
> During the kernel patching, customer was switching between the iscsi
> disks. To switch between the iscsi disks, it was logging out the
> currently connected iscsi disk and then logging in to the new iscsi
> disk. This was being done using a script. Customer was also using the
> "parted" command in the script to list the partition details just
> before the iscsi logout. This usage of "parted" command was creating
> an issue and we were seeing stale links of the
> disks in /sys/class/block.
>
> Analysis:
>
> As part of iscsi logout, the partitions and the disk will be removed
> in the function del_gendisk() which is done through a kworker. The
> parted command, used to list the partitions, will open the disk in
> RW mode which results in systemd-udevd re-reading the partitions. The
> ioctl used to re-read partitions is BLKRRPART. This will trigger the
> rescanning of partitions which will also delete and re-add the
> partitions. So, both iscsi logout processing (through kworker) and the
> "parted" command (through systemd-udevd) will be involved in
> add/delete of partitions. In our case, the following sequence of
> operations happened (the iscsi device is /dev/sdb with partition sdb1):
>
> 1. sdb1 was removed by PARTED
> 2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was
>    already removed by PARTED
> 3. sdb1 was added by parted
> 4. sdb was NOW removed as part of iscsi logout (the last part of the
>    device removal after removing the partitions)
>
> Since the symlink /sys/class/block/sdb1 points to
> /sys/class/devices/platform/hostx/sessionx/targetx:x/block/sdb/sdb1
> and since sdb is already removed, the symlink /sys/class/block/sdb1
> will be orphan and stale. So, this stale link is a result of the race
> condition in kernel between the systemd-udevd and iscsi-logout
> processing as described above. We were able to reproduce this even
> with latest upstream kernel.
>
> Fix:
>
> While Dropping/Adding partitions as part of BLKRRPART ioctl, take the
> read lock for "bdev_lookup_sem" to sync with del_gendisk().
>
> Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
> ---
>  fs/block_dev.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 09d6f7229db9..e903a7edfd63 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1245,9 +1245,17 @@ int bdev_disk_changed(struct block_device *bdev, bool invalidate)
>  	lockdep_assert_held(&bdev->bd_mutex);
>
>  rescan:
> +	down_read(&bdev_lookup_sem);
> +	if (!(disk->flags & GENHD_FL_UP)) {
> +		up_read(&bdev_lookup_sem);
> +		return -ENXIO;
> +	}

This way might cause deadlock:

1) code path BLKRRPART:
	mutex_lock(bdev->bd_mutex)
	down_read(&bdev_lookup_sem);

2) del_gendisk():
	down_write(&bdev_lookup_sem);
	mutex_lock(&disk->part0->bd_mutex);

Given GENHD_FL_UP is only checked when opening one bdev, and
fsync_bdev() and __invalidate_device() needn't to open bdev, so
the following way may work for your issue:

diff --git a/block/genhd.c b/block/genhd.c
index 39ca97b0edc6..5eb27995d4ab 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -617,6 +617,7 @@ void del_gendisk(struct gendisk *disk)

 	mutex_lock(&disk->part0->bd_mutex);
 	blk_drop_partitions(disk);
+	disk->flags &= ~GENHD_FL_UP;
 	mutex_unlock(&disk->part0->bd_mutex);

 	fsync_bdev(disk->part0);
@@ -629,7 +630,6 @@ void del_gendisk(struct gendisk *disk)

 	remove_inode_hash(disk->part0->bd_inode);
 	set_capacity(disk, 0);
-	disk->flags &= ~GENHD_FL_UP;
 	up_write(&bdev_lookup_sem);

 	if (!(disk->flags & GENHD_FL_HIDDEN)) {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index b8abccd03e5d..06b70b8e3f67 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1245,6 +1245,8 @@ int bdev_disk_changed(struct block_device *bdev, bool invalidate)
 	lockdep_assert_held(&bdev->bd_mutex);

 rescan:
+	if(!(disk->flags & GENHD_FL_UP))
+		return -ENXIO;
 	if (bdev->bd_part_count)
 		return -EBUSY;
 	sync_blockdev(bdev);

Thanks,
Ming
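The lock-ordering concern raised in the reply above can be illustrated with a small userspace sketch. This is not kernel code: the enum labels, the helper names (`lock_pos`, `lock_order_inverted`) and the arrays are hypothetical and only encode the two acquisition orders listed in the reply. The check reports an inversion when one path takes the two locks in the opposite order to the other, which is the classic precondition for an ABBA deadlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the two locks named in the reply. */
enum lock_id { BD_MUTEX, BDEV_LOOKUP_SEM };

/* Position of `lock` in an acquisition sequence, or -1 if it is never taken. */
static int lock_pos(const enum lock_id *seq, size_t n, enum lock_id lock)
{
	assert(seq != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (seq[i] == lock)
			return (int)i;
	return -1;
}

/*
 * True if both paths take both locks, but in opposite relative order.
 * Paths that take only one of the two locks cannot form this inversion.
 */
bool lock_order_inverted(const enum lock_id *a, size_t na,
			 const enum lock_id *b, size_t nb,
			 enum lock_id x, enum lock_id y)
{
	int ax = lock_pos(a, na, x), ay = lock_pos(a, na, y);
	int bx = lock_pos(b, nb, x), by = lock_pos(b, nb, y);

	if (ax < 0 || ay < 0 || bx < 0 || by < 0)
		return false;
	return (ax < ay) != (bx < by);
}

/* BLKRRPART path with the V1 patch: bd_mutex first, then bdev_lookup_sem. */
const enum lock_id blkrrpart_v1[] = { BD_MUTEX, BDEV_LOOKUP_SEM };
/* del_gendisk() path: bdev_lookup_sem first, then bd_mutex. */
const enum lock_id del_gendisk_path[] = { BDEV_LOOKUP_SEM, BD_MUTEX };
/* BLKRRPART path with the alternative diff: only bd_mutex is held. */
const enum lock_id blkrrpart_alt[] = { BD_MUTEX };
```

Calling `lock_order_inverted(blkrrpart_v1, 2, del_gendisk_path, 2, BD_MUTEX, BDEV_LOOKUP_SEM)` flags the V1 ordering, while the same call with `blkrrpart_alt` does not, since that path never takes `bdev_lookup_sem`.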
On Wed, May 12, 2021 at 11:23:59AM +0800, Ming Lei wrote:
>
> 1) code path BLKRRPART:
> 	mutex_lock(bdev->bd_mutex)
> 	down_read(&bdev_lookup_sem);
>
> 2) del_gendisk():
> 	down_write(&bdev_lookup_sem);
> 	mutex_lock(&disk->part0->bd_mutex);
>
> Given GENHD_FL_UP is only checked when opening one bdev, and
> fsync_bdev() and __invalidate_device() needn't to open bdev, so
> the following way may work for your issue:

If we move the clearing of GENHD_FL_UP earlier we can do away with
bdev_lookup_sem entirely I think. Something like this untested patch:

diff --git a/block/genhd.c b/block/genhd.c
index a5847560719c..ef717084b343 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -29,8 +29,6 @@

 static struct kobject *block_depr;

-DECLARE_RWSEM(bdev_lookup_sem);
-
 /* for extended dynamic devt allocation, currently only one major is used */
 #define NR_EXT_DEVT		(1 << MINORBITS)
 static DEFINE_IDA(ext_devt_ida);
@@ -609,13 +607,8 @@ void del_gendisk(struct gendisk *disk)
 	blk_integrity_del(disk);
 	disk_del_events(disk);

-	/*
-	 * Block lookups of the disk until all bdevs are unhashed and the
-	 * disk is marked as dead (GENHD_FL_UP cleared).
-	 */
-	down_write(&bdev_lookup_sem);
-
 	mutex_lock(&disk->open_mutex);
+	disk->flags &= ~GENHD_FL_UP;
 	blk_drop_partitions(disk);
 	mutex_unlock(&disk->open_mutex);

@@ -627,10 +620,7 @@ void del_gendisk(struct gendisk *disk)
 	 * up any more even if openers still hold references to it.
 	 */
 	remove_inode_hash(disk->part0->bd_inode);
-
 	set_capacity(disk, 0);
-	disk->flags &= ~GENHD_FL_UP;
-	up_write(&bdev_lookup_sem);

 	if (!(disk->flags & GENHD_FL_HIDDEN)) {
 		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8dd8e2fd1401..bde23940190f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1377,33 +1377,24 @@ struct block_device *blkdev_get_no_open(dev_t dev)
 	struct block_device *bdev;
 	struct gendisk *disk;

-	down_read(&bdev_lookup_sem);
 	bdev = bdget(dev);
 	if (!bdev) {
-		up_read(&bdev_lookup_sem);
 		blk_request_module(dev);
-		down_read(&bdev_lookup_sem);
-
 		bdev = bdget(dev);
 		if (!bdev)
-			goto unlock;
+			return NULL;
 	}

 	disk = bdev->bd_disk;
 	if (!kobject_get_unless_zero(&disk_to_dev(disk)->kobj))
 		goto bdput;
-	if ((disk->flags & (GENHD_FL_UP | GENHD_FL_HIDDEN)) != GENHD_FL_UP)
-		goto put_disk;
 	if (!try_module_get(bdev->bd_disk->fops->owner))
 		goto put_disk;
-	up_read(&bdev_lookup_sem);
 	return bdev;
 put_disk:
 	put_disk(disk);
 bdput:
 	bdput(bdev);
-unlock:
-	up_read(&bdev_lookup_sem);
 	return NULL;
 }

@@ -1462,7 +1453,10 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)

 	disk_block_events(disk);

+	ret = -ENXIO;
 	mutex_lock(&disk->open_mutex);
+	if ((disk->flags & (GENHD_FL_UP | GENHD_FL_HIDDEN)) != GENHD_FL_UP)
+		goto abort_claiming;
 	if (bdev_is_partition(bdev))
 		ret = blkdev_get_part(bdev, mode);
 	else
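The idea in the untested patch above is that the "disk is dead" flag is cleared in the same critical section that drops the partitions, and that the open path tests the flag under the same mutex. A minimal single-threaded model of that ordering might look like the following. Everything here is hypothetical (`struct disk_model`, `model_del_gendisk`, `model_open`); the `mutex_held` boolean only stands in for `disk->open_mutex` and asserts correct pairing, it does not provide real mutual exclusion.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the gendisk state touched by the patch (not kernel code). */
struct disk_model {
	bool up;         /* stands in for GENHD_FL_UP */
	bool hidden;     /* stands in for GENHD_FL_HIDDEN */
	bool mutex_held; /* stands in for disk->open_mutex */
	int partitions;  /* number of partitions currently registered */
};

static void model_lock(struct disk_model *d)
{
	assert(!d->mutex_held);
	d->mutex_held = true;
}

static void model_unlock(struct disk_model *d)
{
	assert(d->mutex_held);
	d->mutex_held = false;
}

/* Deletion path: mark the disk dead and drop partitions in one section. */
void model_del_gendisk(struct disk_model *d)
{
	model_lock(d);
	d->up = false;
	d->partitions = 0;
	model_unlock(d);
}

/* Open path: under the same mutex, refuse disks that are dead or hidden. */
int model_open(struct disk_model *d)
{
	int ret = -ENXIO;

	model_lock(d);
	if (d->up && !d->hidden)
		ret = 0;
	model_unlock(d);
	return ret;
}
```

In this model an open that runs before `model_del_gendisk()` succeeds and one that runs after it returns `-ENXIO`, because both sides serialize on the same lock and the flag is already cleared by the time the lock is released.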
On Wed, May 12, 2021 at 08:35:05AM +0200, Christoph Hellwig wrote:
> On Wed, May 12, 2021 at 11:23:59AM +0800, Ming Lei wrote:
> >
> > 1) code path BLKRRPART:
> > 	mutex_lock(bdev->bd_mutex)
> > 	down_read(&bdev_lookup_sem);
> >
> > 2) del_gendisk():
> > 	down_write(&bdev_lookup_sem);
> > 	mutex_lock(&disk->part0->bd_mutex);
> >
> > Given GENHD_FL_UP is only checked when opening one bdev, and
> > fsync_bdev() and __invalidate_device() needn't to open bdev, so
> > the following way may work for your issue:
>
> If we move the clearing of GENHD_FL_UP earlier we can do away with
> bdev_lookup_sem entirely I think. Something like this untested patch:
>
> diff --git a/block/genhd.c b/block/genhd.c
> index a5847560719c..ef717084b343 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -29,8 +29,6 @@
>
>  static struct kobject *block_depr;
>
> -DECLARE_RWSEM(bdev_lookup_sem);
> -
>  /* for extended dynamic devt allocation, currently only one major is used */
>  #define NR_EXT_DEVT		(1 << MINORBITS)
>  static DEFINE_IDA(ext_devt_ida);
> @@ -609,13 +607,8 @@ void del_gendisk(struct gendisk *disk)
>  	blk_integrity_del(disk);
>  	disk_del_events(disk);
>
> -	/*
> -	 * Block lookups of the disk until all bdevs are unhashed and the
> -	 * disk is marked as dead (GENHD_FL_UP cleared).
> -	 */
> -	down_write(&bdev_lookup_sem);
> -
>  	mutex_lock(&disk->open_mutex);
> +	disk->flags &= ~GENHD_FL_UP;
>  	blk_drop_partitions(disk);
>  	mutex_unlock(&disk->open_mutex);
>
> @@ -627,10 +620,7 @@ void del_gendisk(struct gendisk *disk)
>  	 * up any more even if openers still hold references to it.
>  	 */
>  	remove_inode_hash(disk->part0->bd_inode);
> -
>  	set_capacity(disk, 0);
> -	disk->flags &= ~GENHD_FL_UP;
> -	up_write(&bdev_lookup_sem);
>
>  	if (!(disk->flags & GENHD_FL_HIDDEN)) {
>  		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 8dd8e2fd1401..bde23940190f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1377,33 +1377,24 @@ struct block_device *blkdev_get_no_open(dev_t dev)
>  	struct block_device *bdev;
>  	struct gendisk *disk;
>
> -	down_read(&bdev_lookup_sem);
>  	bdev = bdget(dev);
>  	if (!bdev) {
> -		up_read(&bdev_lookup_sem);
>  		blk_request_module(dev);
> -		down_read(&bdev_lookup_sem);
> -
>  		bdev = bdget(dev);
>  		if (!bdev)
> -			goto unlock;
> +			return NULL;
>  	}
>
>  	disk = bdev->bd_disk;
>  	if (!kobject_get_unless_zero(&disk_to_dev(disk)->kobj))
>  		goto bdput;
> -	if ((disk->flags & (GENHD_FL_UP | GENHD_FL_HIDDEN)) != GENHD_FL_UP)
> -		goto put_disk;
>  	if (!try_module_get(bdev->bd_disk->fops->owner))
>  		goto put_disk;
> -	up_read(&bdev_lookup_sem);
>  	return bdev;
>  put_disk:
>  	put_disk(disk);
>  bdput:
>  	bdput(bdev);
> -unlock:
> -	up_read(&bdev_lookup_sem);
>  	return NULL;
>  }
>
> @@ -1462,7 +1453,10 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
>
>  	disk_block_events(disk);
>
> +	ret = -ENXIO;
>  	mutex_lock(&disk->open_mutex);
> +	if ((disk->flags & (GENHD_FL_UP | GENHD_FL_HIDDEN)) != GENHD_FL_UP)
> +		goto abort_claiming;
>  	if (bdev_is_partition(bdev))
>  		ret = blkdev_get_part(bdev, mode);
>  	else

This patch looks fine, and new openers can be prevented really with help
of ->open_mutex.

Thanks,
Ming
On Wed, May 12, 2021 at 03:23:10PM +0800, Ming Lei wrote:
> This patch looks fine, and new openers can be prevented really with help
> of ->open_mutex.

Yes. I have a patch directly on top of block-5.13 without my open_mutex
series undergoing testing right now. I'll post it later today.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 09d6f7229db9..e903a7edfd63 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1245,9 +1245,17 @@ int bdev_disk_changed(struct block_device *bdev, bool invalidate)
 	lockdep_assert_held(&bdev->bd_mutex);

 rescan:
+	down_read(&bdev_lookup_sem);
+	if (!(disk->flags & GENHD_FL_UP)) {
+		up_read(&bdev_lookup_sem);
+		return -ENXIO;
+	}
+
 	ret = blk_drop_partitions(bdev);
-	if (ret)
+	if (ret) {
+		up_read(&bdev_lookup_sem);
 		return ret;
+	}

 	clear_bit(GD_NEED_PART_SCAN, &disk->state);

@@ -1270,8 +1278,10 @@ int bdev_disk_changed(struct block_device *bdev, bool invalidate)

 	if (get_capacity(disk)) {
 		ret = blk_add_partitions(disk, bdev);
-		if (ret == -EAGAIN)
+		if (ret == -EAGAIN) {
+			up_read(&bdev_lookup_sem);
 			goto rescan;
+		}
 	} else if (invalidate) {
 		/*
 		 * Tell userspace that the media / partition table may have
@@ -1280,6 +1290,7 @@ int bdev_disk_changed(struct block_device *bdev, bool invalidate)
 		kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
 	}

+	up_read(&bdev_lookup_sem);
 	return ret;
 }
 /*
Problem description:

During the kernel patching, customer was switching between the iscsi
disks. To switch between the iscsi disks, it was logging out the
currently connected iscsi disk and then logging in to the new iscsi
disk. This was being done using a script. Customer was also using the
"parted" command in the script to list the partition details just
before the iscsi logout. This usage of "parted" command was creating
an issue and we were seeing stale links of the
disks in /sys/class/block.

Analysis:

As part of iscsi logout, the partitions and the disk will be removed
in the function del_gendisk() which is done through a kworker. The
parted command, used to list the partitions, will open the disk in
RW mode which results in systemd-udevd re-reading the partitions. The
ioctl used to re-read partitions is BLKRRPART. This will trigger the
rescanning of partitions which will also delete and re-add the
partitions. So, both iscsi logout processing (through kworker) and the
"parted" command (through systemd-udevd) will be involved in
add/delete of partitions. In our case, the following sequence of
operations happened (the iscsi device is /dev/sdb with partition sdb1):

1. sdb1 was removed by PARTED
2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was
   already removed by PARTED
3. sdb1 was added by parted
4. sdb was NOW removed as part of iscsi logout (the last part of the
   device removal after removing the partitions)

Since the symlink /sys/class/block/sdb1 points to
/sys/class/devices/platform/hostx/sessionx/targetx:x/block/sdb/sdb1
and since sdb is already removed, the symlink /sys/class/block/sdb1
will be orphan and stale. So, this stale link is a result of the race
condition in kernel between the systemd-udevd and iscsi-logout
processing as described above. We were able to reproduce this even
with latest upstream kernel.

Fix:

While Dropping/Adding partitions as part of BLKRRPART ioctl, take the
read lock for "bdev_lookup_sem" to sync with del_gendisk().

Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
---
 fs/block_dev.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
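The four-step interleaving in the analysis can be replayed with a toy model of the sysfs objects involved. This is only a sketch under stated assumptions: `struct sysfs_model` and all function names are hypothetical, the three booleans stand in for the sdb directory, the sdb1 directory beneath it, and the /sys/class/block/sdb1 link, and the model assumes that removing the disk takes the partition directory with it but does not revisit the class link.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the sysfs objects named in the analysis (not kernel code). */
struct sysfs_model {
	bool disk_dir;  /* .../block/sdb */
	bool part_dir;  /* .../block/sdb/sdb1 */
	bool part_link; /* /sys/class/block/sdb1 */
};

/* Rescan path (BLKRRPART): drop the partition and its class link. */
void rescan_drop_partition(struct sysfs_model *m)
{
	m->part_dir = false;
	m->part_link = false;
}

/* Rescan path: re-add the partition; the disk must still be present. */
void rescan_add_partition(struct sysfs_model *m)
{
	assert(m->disk_dir);
	m->part_dir = true;
	m->part_link = true;
}

/* Logout path: drop the partition only if it is present right now. */
void logout_drop_partitions(struct sysfs_model *m)
{
	if (m->part_dir) {
		m->part_dir = false;
		m->part_link = false;
	}
}

/* Logout path: remove the disk; the class link is assumed untouched here. */
void logout_remove_disk(struct sysfs_model *m)
{
	m->disk_dir = false;
	m->part_dir = false;
}

/* A link is stale when it exists but its target directory does not. */
bool part_link_is_stale(const struct sysfs_model *m)
{
	return m->part_link && !m->part_dir;
}

/* Steps 1 to 4 from the analysis, in the interleaved order. */
bool racy_sequence_leaves_stale_link(void)
{
	struct sysfs_model m = { true, true, true };

	rescan_drop_partition(&m);  /* 1. sdb1 removed by the rescan */
	logout_drop_partitions(&m); /* 2. logout finds nothing to remove */
	rescan_add_partition(&m);   /* 3. sdb1 re-added by the rescan */
	logout_remove_disk(&m);     /* 4. sdb removed by the logout */
	return part_link_is_stale(&m);
}

/* Same operations, with the rescan finishing before the logout starts. */
bool serialized_sequence_leaves_stale_link(void)
{
	struct sysfs_model m = { true, true, true };

	rescan_drop_partition(&m);
	rescan_add_partition(&m);
	logout_drop_partitions(&m);
	logout_remove_disk(&m);
	return part_link_is_stale(&m);
}
```

In this model the interleaved order leaves the link pointing at a missing target, while the serialized order does not, which matches the stale /sys/class/block/sdb1 symptom described above.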