Message ID: 20170412161017.GA16590@infradead.org (mailing list archive)
State: Deferred, archived
On 12.04.2017 19:10, Christoph Hellwig wrote:
> Hi Nikolay,
>
> I guess the culprit is that truncate can free up to two extents in
> the same transaction and thus try to lock two different AGs without
> requiring them to be in increasing order.
>
> Does the one liner below fix the problem for you?
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 7605d8396596..29f2cd5afb04 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -58,7 +58,7 @@ kmem_zone_t *xfs_inode_zone;
>   * Used in xfs_itruncate_extents(). This is the maximum number of extents
>   * freed from a file in a single transaction.
>   */
> -#define XFS_ITRUNC_MAX_EXTENTS	2
> +#define XFS_ITRUNC_MAX_EXTENTS	1
>
>  STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *);
>  STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);

I will apply this to 3.12 and 4.4 and run tests since I can reproduce
fairly reliably on those. You don't expect any fallout on older kernels,
do you?

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 12, 2017 at 07:12:50PM +0300, Nikolay Borisov wrote:
> I will apply this to 3.12 and 4.4 and run tests since I can reproduce
> fairly reliably on those. You don't expect any fallout on older kernel,
> yes ?

No.
On 12.04.2017 19:10, Christoph Hellwig wrote:
> Hi Nikolay,
>
> I guess the culprit is that truncate can free up to two extents in
> the same transaction and thus try to lock two different AGs without
> requiring them to be in increasing order.

On the other hand, Darrick suggested that the problem might be in the
allocation path, due to it holding a dirty buffer for AGF1 and then
proceeding to lock AGF0, resulting in a locking order violation. The
bli holding AGF1 in the allocating task is:

crash> struct xfs_buf_log_item.bli_flags 0xffff8800a60b1570
  bli_flags = 2

That's XFS_BLI_DIRTY. In Darrick's opinion, here is what *should*
happen:

"
djwong: either agf1 is clean and it needs to release that before going
for agf0, or agf1 is dirty and thus it cannot go for agf0
"

In this case AGF1 is dirty and the allocation path continues on to
AGF0, which is a clear lock order violation. On the truncation side,
the bli's flags for AGF0 are:

crash> struct -x xfs_buf_log_item.bli_flags 0xffff8801394ed2b8
  bli_flags = 0xa => BLI_DIRTY | BLI_LOGGED

and it then proceeds to lock AGF1, correctly in ascending order. Your
patch is still likely to help this situation, though I'm not sure it is
modifying the right side of the violation.

Regards,
Nikolay
On Wed, Apr 12, 2017 at 08:44:32PM +0300, Nikolay Borisov wrote:
> "
> djwong: either agf1 is clean and it needs to release that before going
> for agf0, or agf1 is dirty and thus it cannot go for agf0
> "

Yes. Older kernels had some bugs in this area due to busy extent
tracking, where xfs_alloc_ag_vextent would fail despite
xfs_alloc_fix_freelist picking an AG and possibly dirtying the AGF. My
busy extent tracking changes for the asynchronous discard code in
4.11-rc should have fixed that.

But even with that in place I think that locking two AGFs in any order
in the truncate path is wrong.
On 12.04.2017 19:10, Christoph Hellwig wrote:
> Hi Nikolay,
>
> I guess the culprit is that truncate can free up to two extents in
> the same transaction and thus try to lock two different AGs without
> requiring them to be in increasing order.
>
> Does the one liner below fix the problem for you?

After 200 runs of generic/299 I didn't hit the deadlock, whereas before
it would hit within the first 30 or so. FWIW:

Tested-by: Nikolay Borisov <nborisov@suse.com>

On a different note - do you think that reducing the unmapped extents
from 2 to 1 would introduce any performance degradation during
truncation? Looking around the code, this define is only used when
doing truncation, so perhaps a better thing to do would be to turn this
xfs_bunmapi argument into a boolean which signals whether we are doing
truncation or not, and if it is set to true have xfs_bunmapi unmap all
possible extents from only a single AG. I'm going to sift through the
git history to figure out where this requirement of a maximum of 2
extents for truncate came from.

Regards,
Nikolay
On Thu, Apr 13, 2017 at 04:52:03PM +0300, Nikolay Borisov wrote:
> On a different note - do you think that reducing the unmapped extents
> from 2 to 1 would introduce any performance degradation during
> truncation?

There will be some. But now that we have the CIL it will just be
additional in-kernel overhead instead of overhead in the on-disk log.

> Looking around the code this define is only used when doing
> truncation, so perhaps a better thing to do would be to turn this
> xfs_bunmapi arg to a boolean which signal whether we are doing
> truncation or not. And if it is set to true have xfs_bunmapi unmap all
> possible extents from only a single AG? I'm going to sift through the
> git history to figure out where this requirement of maximum 2 extent
> came to truncate, came.

We have the problem with all transactions that could lock multiple AGF
headers, so that's not going to cut it. I think we could do multiple
extents per transaction IFF they are in the same AG. I'll need to check
if that's worth it.

And on top of that I have started entirely reworking what is currently
xfs_bunmapi, but that will have to wait until after a fix for your
issue.
FYI, I've been testing with the patch quite a bit and it seems to be
doing fine. But I fear it's not the complete fix: if we do truncates,
hole punches or other complicated operations, the bmap btree blocks
might be from different AGs than the data blocks, and we could still
run into these issues in theory. So I fear we might have to come up
with a solution where we roll into a new chained transaction every time
we encounter a new AG instead.
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7605d8396596..29f2cd5afb04 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -58,7 +58,7 @@ kmem_zone_t *xfs_inode_zone;
  * Used in xfs_itruncate_extents(). This is the maximum number of extents
  * freed from a file in a single transaction.
  */
-#define XFS_ITRUNC_MAX_EXTENTS	2
+#define XFS_ITRUNC_MAX_EXTENTS	1

 STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *);
 STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);