Message ID: 20201111153913.41840-1-mlevitsk@redhat.com (mailing list archive)
Series: RFC: Issue with discards on raw block device without O_DIRECT
On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> clone of "starship_production"
git-publish destroyed the cover letter:
For reference, this is for bz #1872633
The issue is that the current kernel code implementing 'fallocate'
on block devices roughly works like this (a rough code sketch follows the list):
1. Flush the page cache on the range that is about to be discarded.
2. Issue the discard and wait for it to finish (as far as I can see,
   the discard doesn't go through the page cache).
3. Check whether the page cache is dirty for this range; if it is
   (meaning that someone wrote to it in the meantime), return -EBUSY.
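To make the sequence concrete, here is a rough C sketch of that flow (illustrative pseudocode only, not the actual kernel source; the helper names and signatures approximate the kernel functions discussed later in this thread):

    /*
     * Rough sketch of the block-device fallocate flow described above.
     * Illustrative only: names and signatures approximate the real
     * kernel helpers, they are not copied from the source.
     */
    static long blkdev_fallocate_sketch(struct block_device *bdev, fmode_t mode,
                                        loff_t start, loff_t len)
    {
        struct address_space *mapping = bdev->bd_inode->i_mapping;
        loff_t end = start + len - 1;
        int error;

        /* 1. Drop the page cache over the affected byte range. */
        error = truncate_bdev_range(bdev, mode, start, end);
        if (error)
            return error;

        /* 2. Issue the discard and wait for it to complete. */
        error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
                                     GFP_KERNEL, 0);
        if (error)
            return error;

        /*
         * 3. Invalidate again; if any page overlapping the range was
         *    dirtied in the meantime, -EBUSY is returned to the caller.
         */
        return invalidate_inode_pages2_range(mapping,
                                             start >> PAGE_SHIFT,
                                             end >> PAGE_SHIFT);
    }

Note that step 3 works on whole pages, which is exactly why a write that merely shares a page with the discarded range can trip it.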
This means that if qemu (or qemu-img) issues a write and then a
discard to an area that shares a page with that write, the kernel
can return -EBUSY.
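For illustration, a minimal userspace sketch of the failing pattern (a hypothetical reproducer, not taken from the bug report; the device path, offsets, and the FALLOC_FL_PUNCH_HOLE mode are assumptions):

    /*
     * Hypothetical reproducer sketch: buffered write followed by an
     * unaligned discard that shares a page with the written range.
     * Assumes /dev/sdX is a >2 GiB scratch device that supports discard.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/falloc.h>

    int main(void)
    {
        char buf[0xe00];
        int fd = open("/dev/sdX", O_RDWR);   /* note: no O_DIRECT */

        if (fd < 0)
            return 1;
        memset(buf, 0xab, sizeof(buf));

        /* Buffered write at a page-aligned offset, 0xe00 bytes long. */
        if (pwrite(fd, buf, sizeof(buf), 0x7ffff000) < 0)
            perror("pwrite");

        /* Discard starting mid-page, right after the written bytes. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0x7ffffe00, 0x1ff200) < 0)
            perror("fallocate");   /* may fail with EBUSY */

        close(fd);
        return 0;
    }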
On the other hand, for example, the ext4 implementation of discard
doesn't seem to be affected. It does take a lock on the inode to avoid
concurrent IO and flushes O_DIRECT writers before doing the discard, though.
Doing fsync and retrying seems to resolve this issue, but it might be
too big a hammer. Just retrying doesn't work, which suggests that the
code that flushes the page cache in step (1) might not do so correctly.
It can also be racy unless special measures are taken to block I/O
from qemu during this fsync.
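The workaround can be illustrated with a hypothetical helper (illustrative only, not code from this series; FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE is an assumption standing in for whatever discard mode the caller uses):

    /* Headers assumed: <fcntl.h>, <unistd.h>, <errno.h>, <linux/falloc.h> */

    /*
     * Hypothetical "fsync and retry" workaround: flush the device's
     * dirty pages and retry the discard once.  As noted above, this is
     * still racy unless other I/O is quiesced during the fsync.
     */
    static int discard_with_fsync_retry(int fd, off_t start, off_t len)
    {
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      start, len) == 0)
            return 0;
        if (errno != EBUSY)
            return -errno;

        if (fsync(fd) < 0)      /* flush dirty pages that share the range */
            return -errno;

        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      start, len) < 0)
            return -errno;
        return 0;
    }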
This patch series contains two patches:
The first patch just lets file-posix ignore the -EBUSY error, which is
technically enough to fall back to a plain write in this case, but seems wrong.
The second patch adds an optimization to qemu-img to avoid issuing such a
fragmented write/discard pattern in the first place.
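The first patch's fallback can be sketched roughly like this (hypothetical code, not the actual file-posix change; same headers as the sketch above):

    /*
     * Hypothetical sketch of the fallback in patch 1: if the discard
     * fails with EBUSY, emulate it with a plain buffered write of zeroes.
     */
    static int discard_or_write_zeroes(int fd, off_t start, off_t len)
    {
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      start, len) == 0)
            return 0;
        if (errno != EBUSY)
            return -errno;

        /* Fall back to writing explicit zeroes through the page cache. */
        static const char zeroes[4096];
        while (len > 0) {
            size_t chunk = len < (off_t)sizeof(zeroes) ? (size_t)len
                                                       : sizeof(zeroes);
            ssize_t ret = pwrite(fd, zeroes, chunk, start);
            if (ret < 0)
                return -errno;
            start += ret;
            len -= ret;
        }
        return 0;
    }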
Both patches make the reproducer work for this particular bugzilla,
but I don't think they are enough.
What do you think?
Best regards,
Maxim Levitsky
[added some relevant people and lists to CC]

On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> [...]
>
> This means that if qemu (or qemu-img) issues a write and then a
> discard to an area that shares a page with that write, the kernel
> can return -EBUSY.

Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
EBUSY, which seems wrong to me. IMO we should handle this gracefully in
the kernel, so we need to fix this.

> On the other hand, for example, the ext4 implementation of discard
> doesn't seem to be affected. It does take a lock on the inode to avoid
> concurrent IO and flushes O_DIRECT writers before doing the discard, though.

Well, filesystem hole punching is a somewhat different beast than block
device discard (at least implementation-wise).

> Doing fsync and retrying seems to resolve this issue, but it might be
> too big a hammer. [...]
>
> What do you think?

So if the EBUSY error happens because something happened to the page cache
outside of the discarded range (like you describe above), that is a kernel
bug that needs to get fixed. EBUSY should really mean: someone wrote to the
discarded range while the discard was running, and the userspace app has to
deal with that depending on what it aims to do...

								Honza
On Thu 12-11-20 12:19:51, Jan Kara wrote:
> [...]
>
> So if the EBUSY error happens because something happened to the page
> cache outside of the discarded range (like you describe above), that is
> a kernel bug that needs to get fixed. EBUSY should really mean: someone
> wrote to the discarded range while the discard was running, and the
> userspace app has to deal with that depending on what it aims to do...

So I was looking at what it would take to fix this inside the kernel. The
problem is that invalidate_inode_pages2_range() works on page granularity,
and it is non-trivial to extend it to work on byte granularity since we
don't support something like "try to reclaim part of a page". But I'm also
somewhat wondering why we use invalidate_inode_pages2_range() here instead
of truncate_inode_pages_range() again? I mean, the EBUSY detection cannot
be reliable anyway, and userspace has no way of knowing whether a write
happened before the discard or after it, so just discarding the data is
fine from this point of view. Darrick?

								Honza
On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> [...]
>
> So if the EBUSY error happens because something happened to the page
> cache outside of the discarded range (like you describe above), that is
> a kernel bug that needs to get fixed. EBUSY should really mean: someone
> wrote to the discarded range while the discard was running, and the
> userspace app has to deal with that depending on what it aims to do...

I double-checked this; these are the writes/discards according to my
debug prints (I print start and then start+len-1 for each request).
I have attached the patch for this for reference.

ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
fallocate 00007fe00000 00007fffefff
WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
write 00007ffff000 00007ffffdff
ZERO: 0x00007ffffe00 0000801fefff (len:0x1ff200)
fallocate 00007ffffe00 0000801fefff
FALLOCATE failed with error 16
qemu-img: error while writing at byte 2147483136: Device or resource busy

Best regards,
	Maxim Levitsky
On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > [...]
>
> So I was looking at what it would take to fix this inside the kernel.
> The problem is that invalidate_inode_pages2_range() works on page
> granularity, and it is non-trivial to extend it to work on byte
> granularity since we don't support something like "try to reclaim part
> of a page". But I'm also somewhat wondering why we use
> invalidate_inode_pages2_range() here instead of
> truncate_inode_pages_range() again? I mean, the EBUSY detection cannot
> be reliable anyway, and userspace has no way of knowing whether a write
> happened before the discard or after it, so just discarding the data is
> fine from this point of view. Darrick?

Hmmm, I think I overlooked the fact that we can do buffered writes into
a block device's page cache without taking any of the usual locks that
have to be held for filesystem files. This is essentially a race
between a not-page-aligned fallocate and a buffered write to a different
sector that is mapped by a page that caches part of the fallocate range.

So yes, Jan is right that we need to use truncate_bdev_range instead of
invalidate_inode_pages2_range, because the former will zero the sub-page
ranges on either end of the fallocate request instead of returning
-EBUSY because someone dirtied a part of a page that wasn't involved in
the fallocate operation.

I /probably/ just copy-pasta'd that invalidation call from directio
without thinking hard enough about it, sorry about that. :(

--D
On Thu 12-11-20 17:38:36, Maxim Levitsky wrote:
> [...]
>
> I double-checked this; these are the writes/discards according to my
> debug prints (I print start and then start+len-1 for each request).
> I have attached the patch for this for reference.
>
> ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
> fallocate 00007fe00000 00007fffefff

Yeah, the end at 7ffff000 is indeed not 4k aligned...

> WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
> write 00007ffff000 00007ffffdff

.. and this write is following the discarded area in the same page
(7ffff000..7ffffdff).

								Honza
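For reference, the page arithmetic behind the EBUSY in the debug output above (a worked example derived from those numbers, assuming PAGE_SIZE = 0x1000):

    write    0x00007ffff000 .. 0x00007ffffdff  -> dirties the page starting at 0x7ffff000
    discard  0x00007ffffe00 .. 0x0000801fefff  -> begins mid-page, inside that same page

The buffered write dirties the page at 0x7ffff000; the following fallocate only partially covers that page, so the post-discard invalidation finds a dirty page overlapping the range and returns -EBUSY (error 16).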
On Thu, 2020-11-12 at 14:08 -0800, Darrick J. Wong wrote:
> [...]
>
> Hmmm, I think I overlooked the fact that we can do buffered writes into
> a block device's page cache without taking any of the usual locks that
> have to be held for filesystem files. This is essentially a race
> between a not-page-aligned fallocate and a buffered write to a different
> sector that is mapped by a page that caches part of the fallocate range.
>
> So yes, Jan is right that we need to use truncate_bdev_range instead of
> invalidate_inode_pages2_range, because the former will zero the sub-page
> ranges on either end of the fallocate request instead of returning
> -EBUSY because someone dirtied a part of a page that wasn't involved in
> the fallocate operation.
>
> I /probably/ just copy-pasta'd that invalidation call from directio
> without thinking hard enough about it, sorry about that. :(

Any update on this?

Today I took a look at both truncate_bdev_range and
invalidate_inode_pages2_range.

I see that truncate_bdev_range can't really fail, other than the check
for a mounted filesystem. It calls truncate_inode_pages_range, which
writes back all dirty pages and waits for the writeback to finish, so it
won't detect dirty pages the way invalidate_inode_pages2_range does.

Also, AFAIK the kernel page cache only tracks dirtiness per whole page
(e.g. a 512-byte write will cause the whole page to be written back,
unless the write was done using O_DIRECT).

So if a page, part of which is discarded, becomes dirty, we can't really
know whether someone wrote to the discarded region or only near it.

I vote to keep the call to invalidate_inode_pages2_range but exclude the
non-page-aligned areas from it. This way we will still do a best-effort
detection of this case, while not causing any false positives.

If you agree, I'll send a patch for this.

Best regards,
	Maxim Levitsky
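A minimal sketch of that proposal (illustrative only, not a submitted patch; here start/end are the byte range of the fallocate request with end inclusive, and mapping is the block device's page-cache mapping, as in the kernel's block-device fallocate path discussed above):

    /*
     * Illustrative sketch of the proposal above (not a submitted patch):
     * run the post-discard invalidation only over pages that lie fully
     * inside [start, end], so a dirtied partial page at either edge can
     * no longer cause a false -EBUSY.
     */
    pgoff_t first_full = DIV_ROUND_UP(start, PAGE_SIZE);    /* round start up   */
    pgoff_t last_full  = ((end + 1) >> PAGE_SHIFT) - 1;     /* round end+1 down */

    if (first_full <= last_full)
        error = invalidate_inode_pages2_range(mapping, first_full, last_full);

With this, the reproducer's unaligned discard would skip the shared page at 0x7ffff000 during the dirty check, while concurrent writes to the fully covered pages would still be detected.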