Message ID | 1238088937-10994-5-git-send-email-jlayton@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote: > We already flush all the dirty pages for an inode before doing > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if > we change the mode so that the file becomes read-only then we may not > be able to write data to it after a reconnect. Do we have any idea of how badly this would hurt performance? Changing the mode or owner may be rare enough ... but I don't have any data to prove that. Changing the file's mode on the fly while writing could cause problems for file systems other than cifs too, not sure we want to hurt performance for a corner case of a bad application. I don't this posix specifies that chmod does an implied fsync
On Thu, 26 Mar 2009 15:10:49 -0500 Steve French <smfrench@gmail.com> wrote: > On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote: > > We already flush all the dirty pages for an inode before doing > > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if > > we change the mode so that the file becomes read-only then we may not > > be able to write data to it after a reconnect. > > Do we have any idea of how badly this would hurt performance? > > Changing the mode or owner may be rare enough ... but I don't have any > data to prove that. > > Changing the file's mode on the fly while writing could cause problems > for file systems other than cifs too, not sure we want to hurt > performance for a corner case of a bad application. I don't this > posix specifies that chmod does an implied fsync > (cc'ing Linus since he brought this to our attention and might have input) No, posix doesn't specify that chmod does an implied fsync, but if that's what's required to ensure that we don't break writeback across a reconnect then I think it's the right thing to do. We can try to optimize it away on some setattr cases, but they're probably not worth it. The best analogue is probably NFS, which also flushes dirty data on all setattr calls. Granted there are differences there (due to the connectionless nature of older NFS protocols), but it's a pretty similar situation to CIFS if you account for the possibility of a reconnect between the setattr and writeback. I don't think we can make a real generalization about performance impact of this change. It will heavily depend on workload. My personal feeling (just a guess, really) is that setattr calls are pretty rare in general in comparison to reads and writes. There are obviously workloads where this won't be the case but it's hard to imagine them as a majority.
On Thu, 26 Mar 2009, Jeff Layton wrote: > On Thu, 26 Mar 2009 15:10:49 -0500 > Steve French <smfrench@gmail.com> wrote: > > > On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote: > > > We already flush all the dirty pages for an inode before doing > > > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if > > > we change the mode so that the file becomes read-only then we may not > > > be able to write data to it after a reconnect. > > > > Do we have any idea of how badly this would hurt performance? > > > > Changing the mode or owner may be rare enough ... but I don't have any > > data to prove that. > > > > Changing the file's mode on the fly while writing could cause problems > > for file systems other than cifs too, not sure we want to hurt > > performance for a corner case of a bad application. I don't this > > posix specifies that chmod does an implied fsync > > > > (cc'ing Linus since he brought this to our attention and might have > input) > > No, posix doesn't specify that chmod does an implied fsync, but if > that's what's required to ensure that we don't break writeback across a > reconnect then I think it's the right thing to do. We can try to > optimize it away on some setattr cases, but they're probably not worth > it. Yes, I don't care about anything but obvious correctness. Right now, CIFS is _broken_. Real programs have seen real breakage. I don't see why Steve argues against the patch on some specious grounds of it "not being necessary", when it clearly is, and there clearly is a CIFS bug there. Performance is totally irrelevant. It doesn't matter at all, when the alternative is "lost data". Linus
On Thu, 26 Mar 2009 15:32:44 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, 26 Mar 2009, Jeff Layton wrote: > > On Thu, 26 Mar 2009 15:10:49 -0500 > > Steve French <smfrench@gmail.com> wrote: > > > > > On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote: > > > > We already flush all the dirty pages for an inode before doing > > > > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if > > > > we change the mode so that the file becomes read-only then we may not > > > > be able to write data to it after a reconnect. > > > > > > Do we have any idea of how badly this would hurt performance? > > > > > > Changing the mode or owner may be rare enough ... but I don't have any > > > data to prove that. > > > > > > Changing the file's mode on the fly while writing could cause problems > > > for file systems other than cifs too, not sure we want to hurt > > > performance for a corner case of a bad application. I don't this > > > posix specifies that chmod does an implied fsync > > > > > > > (cc'ing Linus since he brought this to our attention and might have > > input) > > > > No, posix doesn't specify that chmod does an implied fsync, but if > > that's what's required to ensure that we don't break writeback across a > > reconnect then I think it's the right thing to do. We can try to > > optimize it away on some setattr cases, but they're probably not worth > > it. > > Yes, I don't care about anything but obvious correctness. > > Right now, CIFS is _broken_. Real programs have seen real breakage. I > don't see why Steve argues against the patch on some specious grounds of > it "not being necessary", when it clearly is, and there clearly is a CIFS > bug there. > > Performance is totally irrelevant. It doesn't matter at all, when the > alternative is "lost data". > I suspect that there is another bug for the reporter that's consistently causing a reconnect of the socket after the fchmod. That problem could even be server-side (hard to know without a wire capture and/or some debug logs). The current code should work just fine as long as we don't incur a reconnect and have to reclaim the open files between the fchmod and the writes. The problem is that we can never 100% predict when we'll need to reconnect, so we have to be very careful about how we handle changes that might make us unable to reclaim the open. On the performance side, we can always go back and optimize the flush out of some of setattr calls when we *know* that they won't cause a problem, but the list of those cases is starting to look pretty small.
On Thu, 26 Mar 2009, Jeff Layton wrote: > > I suspect that there is another bug for the reporter that's consistently > causing a reconnect of the socket after the fchmod. That problem could > even be server-side (hard to know without a wire capture and/or some > debug logs). Sure. There probably is. But even if we were to fix that unrelated issue, you might still have totally random events that cause re-connects for non-buggy reasons (say, a real network problem), so regardless it's not correct to assume that reconnects cannot happen. > On the performance side, we can always go back and optimize the flush > out of some of setattr calls when we *know* that they won't cause a > problem, but the list of those cases is starting to look pretty small. Agreed. Especially as you end up having to flush at the close() anyway, and anybody who does some attribute change after writing dirty data is likely to close the file afterwards anyway. IOW, I suspect open write+write+write fchmod/fchown close is fairly normal, but _mixing_ things so that you see open write+write fchmod/fchown write+write close is rather less likely. So flushing the data probably doesn't even add any overhead. Linus
On Thu, Mar 26, 2009 at 7:07 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, 26 Mar 2009, Jeff Layton wrote: >> I suspect that there is another bug for the reporter that's consistently >> causing a reconnect of the socket after the fchmod. That problem could >> even be server-side (hard to know without a wire capture and/or some >> debug logs). > > Sure. There probably is. But even if we were to fix that unrelated issue, > you might still have totally random events that cause re-connects for > non-buggy reasons (say, a real network problem), so regardless it's not > correct to assume that reconnects cannot happen. There is probably still more that could be done, e.g. to flush data to the server more frequently in the background, that would help. >> On the performance side, we can always go back and optimize the flush >> out of some of setattr calls when we *know* that they won't cause a >> problem, but the list of those cases is starting to look pretty small. > > Agreed. Especially as you end up having to flush at the close() anyway, > and anybody who does some attribute change after writing dirty data is > likely to close the file afterwards OK. I will merge the change to flush on all setattr into cifs-2.6.git
On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote: > We already flush all the dirty pages for an inode before doing > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if > we change the mode so that the file becomes read-only then we may not > be able to write data to it after a reconnect. > > Fix this by just going back to flushing all the dirty data on any > setattr call. > + Â Â Â Â * BB: This should be smarter. Why bother flushing pages that > + Â Â Â Â * will be truncated anyway? Also, should we error out here if > + Â Â Â Â * the flush returns error? > + Â Â Â Â */ Merged the patch into cifs-2.6.git, but Jeff brings up an interesting point about what really happens when we are shrinking the file size (truncate) here ... is it possible that the vfs will flush pages that are going to be thrown away
On Thu, 26 Mar 2009 21:52:51 -0500 Steve French <smfrench@gmail.com> wrote: > On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote: > > We already flush all the dirty pages for an inode before doing > > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if > > we change the mode so that the file becomes read-only then we may not > > be able to write data to it after a reconnect. > > > > Fix this by just going back to flushing all the dirty data on any > > setattr call. > > + Â Â Â Â * BB: This should be smarter. Why bother flushing pages that > > + Â Â Â Â * will be truncated anyway? Also, should we error out here if > > + Â Â Â Â * the flush returns error? > > + Â Â Â Â */ > > Merged the patch into cifs-2.6.git, but Jeff brings up an interesting > point about what really happens when we are shrinking the file size > (truncate) here ... is it possible that the vfs will flush pages that > are going to be thrown away > I think so. If I follow the code correctly, cifs_set_file_size does: flush the data (filemap_write_and_wait) set the file size on the server (CIFSSMBSetFileSize, CIFSSMBSetEOF, ...) truncate the inode in memory (cifs_vmtruncate) Perhaps it makes sense to reorder those operations, but that might make it tough to handle errors. Another (better?) option would be to only flush the data that's within the new size of the file. A trivial optimization might be to cheat here and only skip the flushing if the file is going to be truncated to 0 length (that's probably the most common case anyway). Now that I look at this though, I have to wonder...does the existing code have a race? Is it possible for some writes to occur after we set the file size on the server but before we truncate the pages off of the inode?
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index 3525ea5..ed7f048 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -1793,20 +1793,21 @@ cifs_setattr_unix(struct dentry *direntry, struct iattr *attrs) goto out; } - if ((attrs->ia_valid & ATTR_MTIME) || (attrs->ia_valid & ATTR_SIZE)) { - /* - Flush data before changing file size or changing the last - write time of the file on the server. If the - flush returns error, store it to report later and continue. - BB: This should be smarter. Why bother flushing pages that - will be truncated anyway? Also, should we error out here if - the flush returns error? - */ - rc = filemap_write_and_wait(inode->i_mapping); - if (rc != 0) { - cifsInode->write_behind_rc = rc; - rc = 0; - } + /* + * Attempt to flush data before changing attributes. We need to do + * this for ATTR_SIZE and ATTR_MTIME for sure, and if we change the + * ownership or mode then we may also need to do this. Here, we take + * the safe way out and just do the flush on all setattr requests. If + * the flush returns error, store it to report later and continue. + * + * BB: This should be smarter. Why bother flushing pages that + * will be truncated anyway? Also, should we error out here if + * the flush returns error? + */ + rc = filemap_write_and_wait(inode->i_mapping); + if (rc != 0) { + cifsInode->write_behind_rc = rc; + rc = 0; } if (attrs->ia_valid & ATTR_SIZE) { @@ -1904,20 +1905,21 @@ cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs) return -ENOMEM; } - if ((attrs->ia_valid & ATTR_MTIME) || (attrs->ia_valid & ATTR_SIZE)) { - /* - Flush data before changing file size or changing the last - write time of the file on the server. If the - flush returns error, store it to report later and continue. - BB: This should be smarter. Why bother flushing pages that - will be truncated anyway? Also, should we error out here if - the flush returns error? - */ - rc = filemap_write_and_wait(inode->i_mapping); - if (rc != 0) { - cifsInode->write_behind_rc = rc; - rc = 0; - } + /* + * Attempt to flush data before changing attributes. We need to do + * this for ATTR_SIZE and ATTR_MTIME for sure, and if we change the + * ownership or mode then we may also need to do this. Here, we take + * the safe way out and just do the flush on all setattr requests. If + * the flush returns error, store it to report later and continue. + * + * BB: This should be smarter. Why bother flushing pages that + * will be truncated anyway? Also, should we error out here if + * the flush returns error? + */ + rc = filemap_write_and_wait(inode->i_mapping); + if (rc != 0) { + cifsInode->write_behind_rc = rc; + rc = 0; } if (attrs->ia_valid & ATTR_SIZE) {
We already flush all the dirty pages for an inode before doing ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if we change the mode so that the file becomes read-only then we may not be able to write data to it after a reconnect. Fix this by just going back to flushing all the dirty data on any setattr call. There are probably some cases that can be optimized out, but I'm not sure they're worthwhile and we need to consider them more carefully to make sure that we don't cause regressions if we have to reconnect before writeback occurs. Signed-off-by: Jeff Layton <jlayton@redhat.com> --- fs/cifs/inode.c | 58 ++++++++++++++++++++++++++++-------------------------- 1 files changed, 30 insertions(+), 28 deletions(-)