Message ID | d156fbaf743d5ec2de50a894170f3d9c7b7a146c.1732111502.git.bcodding@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | two fixes for pNFS SCSI device handling | expand |
On Wed, Nov 20, 2024 at 09:09:35AM -0500, Benjamin Coddington wrote: > If we're unable to register a SCSI device, ensure we mark the device as > unavailable so that it will timeout and be re-added via GETDEVINFO. This > avoids repeated doomed attempts to register a device in the IO path. > > Add some clarifying comments as well. > > Fixes: d869da91cccb ("nfs/blocklayout: Fix premature PR key unregistration") > Signed-off-by: Benjamin Coddington <bcodding@redhat.com> > Reviewed-by: Christoph Hellwig <hch@lst.de> > --- > fs/nfs/blocklayout/blocklayout.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c > index 0becdec12970..b36bc2f4f7e2 100644 > --- a/fs/nfs/blocklayout/blocklayout.c > +++ b/fs/nfs/blocklayout/blocklayout.c > @@ -571,19 +571,29 @@ bl_find_get_deviceid(struct nfs_server *server, > if (!node) > return ERR_PTR(-ENODEV); > > + /* > + * Devices that are marked unavailable are left in the cache with a > + * timeout to avoid sending GETDEVINFO after every LAYOUTGET, or > + * constantly attempting to register the device. Once marked as > + * unavailable they must be deleted and never reused. > + */ > if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags)) { > unsigned long end = jiffies; > unsigned long start = end - PNFS_DEVICE_RETRY_TIMEOUT; > > if (!time_in_range(node->timestamp_unavailable, start, end)) { > + /* Force a new GETDEVINFO for this LAYOUT */ Or perhaps: "Uncork subsequent GETDEVINFO operations for this device" <shrug> > nfs4_delete_deviceid(node->ld, node->nfs_client, id); > goto retry; > } > goto out_put; > } > > - if (!bl_register_dev(container_of(node, struct pnfs_block_dev, node))) > + /* If we cannot register, treat this device as transient */ How about "Make a negative cache entry for this device" > + if (!bl_register_dev(container_of(node, struct pnfs_block_dev, node))) { > + nfs4_mark_deviceid_unavailable(node); > goto out_put; > + } > > return node; > > -- > 2.47.0 > It took me a bit to understand what this patch does. It is like setting up a negative dentry so the local device cache absorbs bursts of checks for the device. OK. Just an observation: Negative caching has some consequences too. For instance, there will now be a period where, if the device happens to become available, the layout is still unusable. I wonder if that's going to have some undesirable operational effects.
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c index 0becdec12970..b36bc2f4f7e2 100644 --- a/fs/nfs/blocklayout/blocklayout.c +++ b/fs/nfs/blocklayout/blocklayout.c @@ -571,19 +571,29 @@ bl_find_get_deviceid(struct nfs_server *server, if (!node) return ERR_PTR(-ENODEV); + /* + * Devices that are marked unavailable are left in the cache with a + * timeout to avoid sending GETDEVINFO after every LAYOUTGET, or + * constantly attempting to register the device. Once marked as + * unavailable they must be deleted and never reused. + */ if (test_bit(NFS_DEVICEID_UNAVAILABLE, &node->flags)) { unsigned long end = jiffies; unsigned long start = end - PNFS_DEVICE_RETRY_TIMEOUT; if (!time_in_range(node->timestamp_unavailable, start, end)) { + /* Force a new GETDEVINFO for this LAYOUT */ nfs4_delete_deviceid(node->ld, node->nfs_client, id); goto retry; } goto out_put; } - if (!bl_register_dev(container_of(node, struct pnfs_block_dev, node))) + /* If we cannot register, treat this device as transient */ + if (!bl_register_dev(container_of(node, struct pnfs_block_dev, node))) { + nfs4_mark_deviceid_unavailable(node); goto out_put; + } return node;