Message ID | 1407396229-4785-9-git-send-email-hch@lst.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Aug 7, 2014 at 3:23 PM, Christoph Hellwig <hch@lst.de> wrote: > The Linux VM subsystem can't support block sizes larger than page size > for block based filesystems very well. While this can be hacked around > to some extent for simple filesystems the read-modify-write cycles > required for pnfs block invalid extents are extremly deadlock prone > when operating on multiple pages. Reject this case early on instead > of pretending to support it (badly). > So this kills EMC server support. Can you please share what kind of badly deadlock you saw with large block size support? Thanks, Tao > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > fs/nfs/blocklayout/blocklayout.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c > index cbb1797..6c1a421 100644 > --- a/fs/nfs/blocklayout/blocklayout.c > +++ b/fs/nfs/blocklayout/blocklayout.c > @@ -1115,6 +1115,12 @@ bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh) > dprintk("%s Server did not return blksize\n", __func__); > return -EINVAL; > } > + if (server->pnfs_blksize > PAGE_SIZE) { > + printk(KERN_ERR "%s: pNFS blksize %d not supported.\n", > + __func__, server->pnfs_blksize); > + return -EINVAL; > + } > + > b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_NOFS); > if (!b_mt_id) { > status = -ENOMEM; > -- > 1.9.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote: > So this kills EMC server support. Given the state the code claiming support for any server is a large exaggeration.. > Can you please share what kind of > badly deadlock you saw with large block size support? The read-modify write code (which I'll remove later) can lock arbitary numbers of additional pages from the writeback back code without doing a trylock, which is required for doing this in page writeback. Note that it's not a deadlock, but I can also triv?ally corrupt data in those pages as it doesn't lock against them, you just need a race window where it's modified after writeback has been started for a large extents, which isn't too hard to hit with tools like fsstress. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 7, 2014 at 7:25 PM, Christoph Hellwig <hch@lst.de> wrote: > On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote: >> So this kills EMC server support. > > Given the state the code claiming support for any server is a large > exaggeration.. > >> Can you please share what kind of >> badly deadlock you saw with large block size support? > > The read-modify write code (which I'll remove later) can lock arbitary > numbers of additional pages from the writeback back code without doing > a trylock, which is required for doing this in page writeback. Note > that it's not a deadlock, but I can also triv?ally corrupt data in > those pages as it doesn't lock against them, you just need a race > window where it's modified after writeback has been started for a large > extents, which isn't too hard to hit with tools like fsstress. > Is it bl_find_get_zeroing_page() you are concerning about? I was hoping page flags can tell if some other threads are flushing the same page. And the extra page is always locked before readin or zeroed, after which the page is marked uptodate before unlocking. So the problem is that a page that is being written back gets modified by a new writer, is it correct? How about marking it writeback before unlocking in bl_find_get_zeroing_page()? That should keep new writers from modifying it concurrently. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 07, 2014 at 07:51:57PM +0800, Peng Tao wrote:
> Is it bl_find_get_zeroing_page() you are concerning about?
That's the primary culprit.
If you want to do the read-modify write cycle properly you'll have to
do it at the time the invalid extent is read, not written. Take a look
at my patch to add a flag to do that in page 5, although it would
be a lot more involved to do it for large pnfs block sizes. Given
that the code never really fully worked beforehand I'm not tempted
to come up with that myself, though.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Why don't you send a patch? -----Original Message----- From: Peng Tao [mailto:bergwolf@gmail.com] Sent: Thursday, August 07, 2014 7:52 AM To: Christoph Hellwig Cc: Trond Myklebust; linuxnfs; faibish, sorin Subject: Re: [PATCH 08/17] pnfs/blocklayout: reject pnfs blocksize larger than page size On Thu, Aug 7, 2014 at 7:25 PM, Christoph Hellwig <hch@lst.de> wrote: > On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote: >> So this kills EMC server support. > > Given the state the code claiming support for any server is a large > exaggeration.. > >> Can you please share what kind of >> badly deadlock you saw with large block size support? > > The read-modify write code (which I'll remove later) can lock arbitary > numbers of additional pages from the writeback back code without doing > a trylock, which is required for doing this in page writeback. Note > that it's not a deadlock, but I can also triv?ally corrupt data in > those pages as it doesn't lock against them, you just need a race > window where it's modified after writeback has been started for a > large extents, which isn't too hard to hit with tools like fsstress. > Is it bl_find_get_zeroing_page() you are concerning about? I was hoping page flags can tell if some other threads are flushing the same page. And the extra page is always locked before readin or zeroed, after which the page is marked uptodate before unlocking. So the problem is that a page that is being written back gets modified by a new writer, is it correct? How about marking it writeback before unlocking in bl_find_get_zeroing_page()? That should keep new writers from modifying it concurrently.
On 08/07/2014 07:25 AM, Christoph Hellwig wrote: > On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote: >> > So this kills EMC server support. > Given the state the code claiming support for any server is a large > exaggeration.. > But eliminating the potential of supporting block layouts does not sound like a very good idea to me.... There are rumors of other block layouts implementations coming down the pike.... So hopefully we don't eliminating any and all possible support of block layouts in the future... steved. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/07/2014 09:13 AM, Steve Dickson wrote: > > > On 08/07/2014 07:25 AM, Christoph Hellwig wrote: >> On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote: >>>> So this kills EMC server support. >> Given the state the code claiming support for any server is a large >> exaggeration.. >> > But eliminating the potential of supporting block layouts > does not sound like a very good idea to me.... > > There are rumors of other block layouts implementations > coming down the pike.... So hopefully we don't eliminating > any and all possible support of block layouts in the future... > My bad... I see this not the case... I jumped in with out reading the entire thread.... But EMC server support is still not a good idea IMHO... steved. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 7, 2014 at 8:10 PM, Christoph Hellwig <hch@lst.de> wrote: > On Thu, Aug 07, 2014 at 07:51:57PM +0800, Peng Tao wrote: >> Is it bl_find_get_zeroing_page() you are concerning about? > > That's the primary culprit. > > If you want to do the read-modify write cycle properly you'll have to > do it at the time the invalid extent is read, not written. Take a look > at my patch to add a flag to do that in page 5, although it would > be a lot more involved to do it for large pnfs block sizes. Given > that the code never really fully worked beforehand I'm not tempted > to come up with that myself, though. > we can't assume all pages written back have their pari pages (for 8K block size e.g.) read in read_pagelists(). A page can also be read in via MDS read. So what we need is a hook into nfs_readpage to read or zero additional pages. But we might not even have a layout there. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 7, 2014 at 8:56 PM, faibish, sorin <faibish_sorin@emc.com> wrote: > Why don't you send a patch? > I can't be sure what I proposed is correct and I don't have any server to test against. > -----Original Message----- > From: Peng Tao [mailto:bergwolf@gmail.com] > Sent: Thursday, August 07, 2014 7:52 AM > To: Christoph Hellwig > Cc: Trond Myklebust; linuxnfs; faibish, sorin > Subject: Re: [PATCH 08/17] pnfs/blocklayout: reject pnfs blocksize larger than page size > > On Thu, Aug 7, 2014 at 7:25 PM, Christoph Hellwig <hch@lst.de> wrote: >> On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote: >>> So this kills EMC server support. >> >> Given the state the code claiming support for any server is a large >> exaggeration.. >> >>> Can you please share what kind of >>> badly deadlock you saw with large block size support? >> >> The read-modify write code (which I'll remove later) can lock arbitary >> numbers of additional pages from the writeback back code without doing >> a trylock, which is required for doing this in page writeback. Note >> that it's not a deadlock, but I can also triv?ally corrupt data in >> those pages as it doesn't lock against them, you just need a race >> window where it's modified after writeback has been started for a >> large extents, which isn't too hard to hit with tools like fsstress. >> > Is it bl_find_get_zeroing_page() you are concerning about? I was hoping page flags can tell if some other threads are flushing the same page. And the extra page is always locked before readin or zeroed, after which the page is marked uptodate before unlocking. So the problem is that a page that is being written back gets modified by a new writer, is it correct? How about marking it writeback before unlocking in bl_find_get_zeroing_page()? That should keep new writers from modifying it concurrently. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 07, 2014 at 08:56:04AM -0400, faibish, sorin wrote:
> Why don't you send a patch?
What patch? Rewriting the broken code that was put in to support a server I
don't have access to and that apparently no one cared enough before to
fix even the most trivial bugs for?
I can leave the code for it in, but I can neither test nor really support it,
and from the state of the code it doesn't seem like anyone else really
cares for it.
How will sign up for supporting and testing > PAGE_SIZE servers if we
leave that support in? And with support I mean regular runs of basic
QA like xfstests or just doing a simple dd if=/dev/zero of=/mnt/nfs/bigfile,
which already causes softlocks with the current code.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
What Peng suggested to solve: "marking it writeback before unlocking in bl_find_get_zeroing_page()?" -----Original Message----- From: Christoph Hellwig [mailto:hch@lst.de] Sent: Thursday, August 07, 2014 12:11 PM To: faibish, sorin Cc: Peng Tao; Trond Myklebust; linuxnfs Subject: Re: [PATCH 08/17] pnfs/blocklayout: reject pnfs blocksize larger than page size On Thu, Aug 07, 2014 at 08:56:04AM -0400, faibish, sorin wrote: > Why don't you send a patch? What patch? Rewriting the broken code that was put in to support a server I don't have access to and that apparently no one cared enough before to fix even the most trivial bugs for? I can leave the code for it in, but I can neither test nor really support it, and from the state of the code it doesn't seem like anyone else really cares for it. How will sign up for supporting and testing > PAGE_SIZE servers if we leave that support in? And with support I mean regular runs of basic QA like xfstests or just doing a simple dd if=/dev/zero of=/mnt/nfs/bigfile, which already causes softlocks with the current code. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 07, 2014 at 09:43:09PM +0800, Peng Tao wrote: > we can't assume all pages written back have their pari pages (for 8K > block size e.g.) read in read_pagelists(). A page can also be read in > via MDS read. So what we need is a hook into nfs_readpage to read or > zero additional pages. But we might not even have a layout there. We can't assume the page is there for writeback either, what all this mess exists for. That's why we really shouldn't even attempt to support a a block size large than the page size, and that's also why the local Linux filesystems strictly refuse to support it. If you want to hack around it you will run into problems in either case. I also don't really see why a server would insist on this large block size, there really isn't any major benefit in doing that today (aka the last 20 years) now that we have extent based filesystems. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Aug 8, 2014 at 12:20 AM, Christoph Hellwig <hch@lst.de> wrote: > On Thu, Aug 07, 2014 at 09:43:09PM +0800, Peng Tao wrote: >> we can't assume all pages written back have their pari pages (for 8K >> block size e.g.) read in read_pagelists(). A page can also be read in >> via MDS read. So what we need is a hook into nfs_readpage to read or >> zero additional pages. But we might not even have a layout there. > > We can't assume the page is there for writeback either, what all this > mess exists for. In write_pagelist, we can find or create the pair page. It is indeed cow extent that makes things complicated by requiring to read from disk. If we drop cow support (which is required by rfc but I don't know of any server that supports it), we can just zero the extra pages and mark them uptodate. No extra read in or writeback required. That is doable IMHO. > That's why we really shouldn't even attempt to support > a a block size large than the page size, and that's also why the local > Linux filesystems strictly refuse to support it. If you want to hack > around it you will run into problems in either case. > > I also don't really see why a server would insist on this large block > size, there really isn't any major benefit in doing that today (aka the last 20 > years) now that we have extent based filesystems. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c index cbb1797..6c1a421 100644 --- a/fs/nfs/blocklayout/blocklayout.c +++ b/fs/nfs/blocklayout/blocklayout.c @@ -1115,6 +1115,12 @@ bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh) dprintk("%s Server did not return blksize\n", __func__); return -EINVAL; } + if (server->pnfs_blksize > PAGE_SIZE) { + printk(KERN_ERR "%s: pNFS blksize %d not supported.\n", + __func__, server->pnfs_blksize); + return -EINVAL; + } + b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_NOFS); if (!b_mt_id) { status = -ENOMEM;
The Linux VM subsystem can't support block sizes larger than page size for block based filesystems very well. While this can be hacked around to some extent for simple filesystems the read-modify-write cycles required for pnfs block invalid extents are extremly deadlock prone when operating on multiple pages. Reject this case early on instead of pretending to support it (badly). Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/nfs/blocklayout/blocklayout.c | 6 ++++++ 1 file changed, 6 insertions(+)