Message ID | 1448028746-24504-1-git-send-email-fdmanana@kernel.org (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
On Fri, Nov 20, 2015 at 02:12:26PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > During the final phase of a device replace operation, I ran into a > transaction abort that resulted in the following trace: > > [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]() > [23919.664742] BTRFS: Transaction aborted (error -2) > [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs] > [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1 > [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 > [23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98 > [23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48 > [23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0 > [23919.696716] Call Trace: > [23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79 > [23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 > [23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] > [23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 > [23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] > [23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs] > [23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs] > [23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs] > [23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs] > [23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194 > [23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78 > [23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0 > [23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0 > [23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff > [23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b > [23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6 > [23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e > [23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 > [23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 > [23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f > [23919.733330] ---[ end trace 166ef301a335832a ]--- > > This is due to a race between device replace and chunk allocation, which > the following diagram illustrates: > > CPU 1 CPU 2 > > btrfs_dev_replace_finishing() > > at this point > dev_replace->tgtdev->devid == > BTRFS_DEV_REPLACE_DEVID (0ULL) > > ... > > btrfs_start_transaction() > btrfs_commit_transaction() > > btrfs_fallocate() > btrfs_alloc_data_chunk_ondemand() > btrfs_join_transaction() > --> starts a new transaction > do_chunk_alloc() > lock fs_info->chunk_mutex > btrfs_alloc_chunk() > --> creates extent map for > the new chunk with > em->bdev->map->stripes[i]->dev->devid > == X (X > 0) > --> extent map is added to > fs_info->mapping_tree > --> initial phase of bg A > allocation completes > unlock fs_info->chunk_mutex > > lock fs_info->chunk_mutex > > btrfs_dev_replace_update_device_in_mapping_tree() > --> iterates fs_info->mapping_tree and > replaces the device in every extent > map's map->stripes[] with > dev_replace->tgtdev, which still has > an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) > > btrfs_end_transaction() > btrfs_create_pending_block_groups() > --> starts final phase of > bg A creation (update device, > extent, and chunk trees, etc) > btrfs_finish_chunk_alloc() > > btrfs_update_device() > --> attempts to update a device > item with ID == 0ULL > (BTRFS_DEV_REPLACE_DEVID) > which is the current ID of > bg A's > em->bdev->map->stripes[i]->dev->devid > --> doesn't find such item > returns -ENOENT > --> the device id should have been X > and not 0ULL > > got -ENOENT from > btrfs_finish_chunk_alloc() > and aborts current transaction > > finishes setting up the target device, > namely it sets tgtdev->devid to the value > of srcdev->devid, which is X (and X > 0) > > So fix this by taking the device list mutex when processing the chunk's > extent map stripes to update the device items. > > This happened while running fstest btrfs/071. The code looks good to me. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Thanks, -liubo > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > > V2: Made the attribution of a new id from the old device to happen > after replacing the device for the extent maps with the new device, > as obviously it was missing before. > > V3: Make it really race free. The previous approach was still racy, but in a > different way. Need to really synchronize the finishing phase of device > replace that sets up the new device with the final phase of chunk allocation > that processes the stripes and updates the device items. > > V4: Moved critical section a bit further, as there's further usage of a > stripe's device. > > fs/btrfs/volumes.c | 17 +++++++++++++++-- > 1 file changed, 15 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 45f2025..2290469 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -4827,20 +4827,32 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans, > goto out; > } > > + /* > + * Take the device list mutex to prevent races with the final phase of > + * a device replace operation that replaces the device object associated > + * with the map's stripes, because the device object's id can change > + * at any time during that final phase of the device replace operation > + * (dev-replace.c:btrfs_dev_replace_finishing()). > + */ > + mutex_lock(&chunk_root->fs_info->fs_devices->device_list_mutex); > for (i = 0; i < map->num_stripes; i++) { > device = map->stripes[i].dev; > dev_offset = map->stripes[i].physical; > > ret = btrfs_update_device(trans, device); > if (ret) > - goto out; > + break; > ret = btrfs_alloc_dev_extent(trans, device, > chunk_root->root_key.objectid, > BTRFS_FIRST_CHUNK_TREE_OBJECTID, > chunk_offset, dev_offset, > stripe_size); > if (ret) > - goto out; > + break; > + } > + if (ret) { > + mutex_unlock(&chunk_root->fs_info->fs_devices->device_list_mutex); > + goto out; > } > > stripe = &chunk->stripe; > @@ -4853,6 +4865,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans, > memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE); > stripe++; > } > + mutex_unlock(&chunk_root->fs_info->fs_devices->device_list_mutex); > > btrfs_set_stack_chunk_length(chunk, chunk_size); > btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid); > -- > 2.1.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 45f2025..2290469 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4827,20 +4827,32 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans, goto out; } + /* + * Take the device list mutex to prevent races with the final phase of + * a device replace operation that replaces the device object associated + * with the map's stripes, because the device object's id can change + * at any time during that final phase of the device replace operation + * (dev-replace.c:btrfs_dev_replace_finishing()). + */ + mutex_lock(&chunk_root->fs_info->fs_devices->device_list_mutex); for (i = 0; i < map->num_stripes; i++) { device = map->stripes[i].dev; dev_offset = map->stripes[i].physical; ret = btrfs_update_device(trans, device); if (ret) - goto out; + break; ret = btrfs_alloc_dev_extent(trans, device, chunk_root->root_key.objectid, BTRFS_FIRST_CHUNK_TREE_OBJECTID, chunk_offset, dev_offset, stripe_size); if (ret) - goto out; + break; + } + if (ret) { + mutex_unlock(&chunk_root->fs_info->fs_devices->device_list_mutex); + goto out; } stripe = &chunk->stripe; @@ -4853,6 +4865,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans, memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE); stripe++; } + mutex_unlock(&chunk_root->fs_info->fs_devices->device_list_mutex); btrfs_set_stack_chunk_length(chunk, chunk_size); btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);