Message ID | 1434495193-31182-3-git-send-email-tj@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Jun 16, 2015 at 06:53:13PM -0400, Tejun Heo wrote: > For ordered and writeback data modes, all data IOs go through > ext4_io_submit. This patch adds cgroup writeback support by invoking > wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in > io_submit_add_bh(). Journal data which is written by jbd2 worker is > left alone by this patch and will always be written out from the root > cgroup. > > ext4_fill_super() is updated to set MS_CGROUPWB when data mode is > either ordered or writeback. In journaled data mode, most IOs become > synchronous through the journal and enabling cgroup writeback support > doesn't make much sense or difference. Journaled data mode is left > alone. > > Lightly tested with sequential data write workload. Behaves as > expected. > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Cc: "Theodore Ts'o" <tytso@mit.edu> > Cc: Andreas Dilger <adilger.kernel@dilger.ca> > Cc: linux-ext4@vger.kernel.org Thanks, applied. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote: > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB. > > > > Signed-off-by: Tejun Heo <tj@kernel.org> > > Cc: "Theodore Ts'o" <tytso@mit.edu> > > Cc: Andreas Dilger <adilger.kernel@dilger.ca> > > Cc: linux-ext4@vger.kernel.org > > Thanks, applied. Hi, this patch introduces a regression - a major one, I'd say. Symptoms: copy a bunch of file, run sync, then run 'reboot', and after you boot up the copied files are corrupted. So basically the user -visible symptom is that 'sync' does not work. I quite an effort to bisect it, but it led me to this patch. If I take the latest upstream (v4.3-rc2+), and revert this patch: 001e4a8 ext4: imlpement cgroup writeback support then the problem goes away - files are not corrupted after reboot. I use ext4 on top of a "bare" partition, no LVM or dm layers involved. I use Fedora 22 with all the latest package updates, and I only change the kernel there. The corruption seems to be that the start with a bunch of zeroes instead of the real data, but I did not check carefully, looked only at one file briefly. Artem. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2015-09-23 at 15:49 +0300, Artem Bityutskiy wrote: > On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote: > > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB. > > > > > > Signed-off-by: Tejun Heo <tj@kernel.org> > > > Cc: "Theodore Ts'o" <tytso@mit.edu> > > > Cc: Andreas Dilger <adilger.kernel@dilger.ca> > > > Cc: linux-ext4@vger.kernel.org > > > > Thanks, applied. > > Hi, this patch introduces a regression - a major one, I'd say. > > Symptoms: copy a bunch of file, run sync, then run 'reboot', and > after > you boot up the copied files are corrupted. So basically the user > -visible symptom is that 'sync' does not work. Just FYI, this is the issue I briefly reported last Fri: https://lkml.org/lkml/2015/9/18/640 Artem. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 23, 2015 at 04:50:53PM +0300, Artem Bityutskiy wrote: > On Wed, 2015-09-23 at 15:49 +0300, Artem Bityutskiy wrote: > > On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote: > > > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB. > > > > > > > > Signed-off-by: Tejun Heo <tj@kernel.org> > > > > Cc: "Theodore Ts'o" <tytso@mit.edu> > > > > Cc: Andreas Dilger <adilger.kernel@dilger.ca> > > > > Cc: linux-ext4@vger.kernel.org > > > > > > Thanks, applied. > > > > Hi, this patch introduces a regression - a major one, I'd say. > > > > Symptoms: copy a bunch of file, run sync, then run 'reboot', and > > after > > you boot up the copied files are corrupted. So basically the user > > -visible symptom is that 'sync' does not work. > > Just FYI, this is the issue I briefly reported last Fri: > > https://lkml.org/lkml/2015/9/18/640 Also note this performance regression reported by Dexuan Cui https://lkml.org/lkml/2015/9/23/333 I didn't notice these problems since I my userspace doesn't enable the writeback cgroup. (In fact I don't know how to do it using Debian Jessie.) Tejun, can you please take a look at this and give me a recommendation? I'm willing to wait a day or two while we try to fix the problem, but past that point I'd like to have a fix or revert this commit before Linus releases the next -rc release. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 23, 2015 at 03:49:12PM +0300, Artem Bityutskiy wrote: > On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote: > > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB. > > > > > > Signed-off-by: Tejun Heo <tj@kernel.org> > > > Cc: "Theodore Ts'o" <tytso@mit.edu> > > > Cc: Andreas Dilger <adilger.kernel@dilger.ca> > > > Cc: linux-ext4@vger.kernel.org > > > > Thanks, applied. > > Hi, this patch introduces a regression - a major one, I'd say. > > Symptoms: copy a bunch of file, run sync, then run 'reboot', and after > you boot up the copied files are corrupted. So basically the user > -visible symptom is that 'sync' does not work. Hi Artem, Are you doing a hard shutdown (reboot -nf)? If you're doing a friendly shutdown, is the FS unmounting cleanly? > > I quite an effort to bisect it, but it led me to this patch. I bet it was a long bisect. Trying to see if the same patch to btrfs has similar impacts. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 23, 2015 at 08:41:25PM +0300, Artem Bityutskiy wrote: > Hi > > $ sync > $ reboot If this is case, it should be possible to reproduce with: cp a bunch of stuff to /ext4 unmount /ext4 mount ext4 compare data If you're not getting a clean unmount of the test FS during the reboot, its a different test. Trying to reproduce here, so far its clean. Could you please double check for failed unmounts? Thanks, Chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Ted. On Wed, Sep 23, 2015 at 01:02:38PM -0400, Theodore Ts'o wrote: > Also note this performance regression reported by Dexuan Cui > > https://lkml.org/lkml/2015/9/23/333 > > I didn't notice these problems since I my userspace doesn't enable the > writeback cgroup. (In fact I don't know how to do it using Debian > Jessie.) The thing is I don't think they're either. > Tejun, can you please take a look at this and give me a > recommendation? I'm willing to wait a day or two while we try to fix > the problem, but past that point I'd like to have a fix or revert this > commit before Linus releases the next -rc release. Yeap, looking at them both. Will update as soon as I know more. Thanks.
Hello, Artem. On Wed, Sep 23, 2015 at 03:49:12PM +0300, Artem Bityutskiy wrote: > I use Fedora 22 with all the latest package updates, and I only change > the kernel there. What's your cgroup setup like? Are you trying out the unified hierarchy? Given that you didn't mention that, I'm guessing you aren't which makes me pretty head-scratchy because I don't see how the actual behavior would change depending on that flag. I'll dig more. Thanks.
Artem, Can you (or someone on the cgroups list, perhaps) give more details about how Fedora 22 sets up groups? Unfortunately apparently no one has gotten an official Fedora image for Google Compute Engine so it's a bit of a pain for me to reproduce the problem. (I suppose I could use AWS, but all of my test infrastructure uses GCE, and I'd really rather not have to install a Java Runtime on my laptop. :-) Or can you reproduce this problem on Debian Jessie? Thanks, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 23, 2015 at 02:09:34PM -0400, Tejun Heo wrote: > Hello, Artem. > > On Wed, Sep 23, 2015 at 03:49:12PM +0300, Artem Bityutskiy wrote: > > I use Fedora 22 with all the latest package updates, and I only change > > the kernel there. > > What's your cgroup setup like? Are you trying out the unified > hierarchy? Given that you didn't mention that, I'm guessing you > aren't which makes me pretty head-scratchy because I don't see how the > actual behavior would change depending on that flag. I'll dig more. Ugh... never mind. I spotted what I did wrong. The sync thing still needs figuring out (it still should have worked) but fixing the regression should be easy. Will update soon. Thanks.
Hello, Artem. On Wed, Sep 23, 2015 at 08:41:25PM +0300, Artem Bityutskiy wrote: > $ sync > $ reboot > > is exactly the sequence. Can you please test with 4.3-rc2 and see whether the issue is reproducible? The multi-wb wait logic rewrite was merged during rc1 and I'm wondering whether this is a bug in the old implementation. Thanks.
On Wed, Sep 23, 2015 at 9:24 PM, Theodore Ts'o <tytso@mit.edu> wrote: > Artem, > > Can you (or someone on the cgroups list, perhaps) give more details > about how Fedora 22 sets up groups? > > Unfortunately apparently no one has gotten an official Fedora image > for Google Compute Engine so it's a bit of a pain for me to reproduce > the problem. (I suppose I could use AWS, but all of my test > infrastructure uses GCE, and I'd really rather not have to install a > Java Runtime on my laptop. :-) [ My apologies for top posting and for sending HTML e-mails which do not get through vger. I am using gmail web interface, and just learned how to send plain text from here. So re-sending my longer answer. ] Hi Ted, Chris, Tejun, all, quick and probably messy reply before I go to sleep... I can give more information tomorrow. But one note - It would be helpful to get questions like "send us the output of this command" rather than "what are the cgroups you are in", because I am not fluent with cgroups. IOW, more specific questions are welcome. Some more about my setup. I have an number of testboxes, which are 1/2/4-socket servers. I compile the kernel for them on a separate worker box. Then I copy the kernel binary to /boot, and the modules to /lib/modules, then run 'sync' and then reboot to reboot to the new kernel. And vrey often many module files are corrupted. They won't load because of majic/crc mismatches. I copy stuff over scp. Well, this is not exactly scp, but rather a Python 'scp' module, which is based on the 'paramiko' module. But I think this should not matter. Anyway, may be there are some cgroups related with scp/ssh sessions or /lib/modules in Fedora 22? Also note, I tried to be careful during bisecting, I used 4 servers in parallel, and did 5 reboot tests on each of them. With this patch reverted all 4 boxes survive 5 reboots just fine. Without this patch reverted, each fail 1-3 reboots. And, by the way, I forgot this detail - I cut AC power off at the end, then put it back on after a 20 seconds delay. I mean, this is a clean reboot, but with power cut at the end. So the process is this: 1. I run 'sync' on the box remotely over ssh 2. I run 'reboot' on the box remotely over ssh, the ssh connection gets closed at this point 3. I ping the box, and keep doing this until it is stops echoing back 4. I wait several seconds, and then just cut the AC power off. The wall socket power is off. So if there was something in, say, SSD cache which was not synced, it is gone too. May be this patch reveals an existing issue. My setup has been stable with 4.2 and many previous kernels, and it only fails with 4.3-rcX, and my bisecting lead to this patch. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 23, 2015 at 10:47:16PM +0300, Artem Bityutskiy wrote: > And, by the way, I forgot this detail - I cut AC power off at the > end, then put it back on after a 20 seconds delay. I mean, this is a > clean reboot, but with power cut at the end. Is this reproducible without the power cut? And what model SSD are you using, and are you sure that it has Power Loss Protection (many SSD vendors use power loss protection as the price discrimination feature between consumer-grade SSD and enterprise-grade SSD's that cost $$$). If this problem wasn't showing up with 4.2, and is only failing with 4.3-rcX, then it might not be a hardware issue --- but it's also possible that there was a timing issue which was hiding a hardware problem. So for the purposes of debugging, removing the power cut from the set of variables is a useful thing to do. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2015-09-23 at 16:48 -0400, Theodore Ts'o wrote: > On Wed, Sep 23, 2015 at 10:47:16PM +0300, Artem Bityutskiy wrote: > > And, by the way, I forgot this detail - I cut AC power off at the > > end, then put it back on after a 20 seconds delay. I mean, this is > > a > > clean reboot, but with power cut at the end. > > Is this reproducible without the power cut? And what model SSD are > you using, and are you sure that it has Power Loss Protection (many > SSD vendors use power loss protection as the price discrimination > feature between consumer-grade SSD and enterprise-grade SSD's that > cost $$$). If this problem wasn't showing up with 4.2, and is only > failing with 4.3-rcX, then it might not be a hardware issue --- but > it's also possible that there was a timing issue which was hiding a > hardware problem. So for the purposes of debugging, removing the > power cut from the set of variables is a useful thing to do. Ted, you are right that the problem may be anywhere, but since Tejun's patch fixes it, I decided to not spend time on testing without AC power cuts. But I checked and on of my boxes uses an HDD, and it shows the same problem, which makes the theory of imperfect SSD firmware being involved less likely. Thanks, Artem. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 3f80cb2..c56ba7b 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -383,6 +383,7 @@ static int io_submit_init_bio(struct ext4_io_submit *io, bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES)); if (!bio) return -ENOMEM; + wbc_init_bio(io->io_wbc, bio); bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio->bi_bdev = bh->b_bdev; bio->bi_end_io = ext4_end_bio; @@ -411,6 +412,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io, ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh)); if (ret != bh->b_size) goto submit_and_retry; + wbc_account_io(io->io_wbc, page, bh->b_size); io->io_next_block++; return 0; } diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 56b8bb7..3ad1eb4 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3623,6 +3623,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) } if (test_opt(sb, DELALLOC)) clear_opt(sb, DELALLOC); + } else { + sb->s_iflags |= SB_I_CGROUPWB; } sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
For ordered and writeback data modes, all data IOs go through ext4_io_submit. This patch adds cgroup writeback support by invoking wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in io_submit_add_bh(). Journal data which is written by jbd2 worker is left alone by this patch and will always be written out from the root cgroup. ext4_fill_super() is updated to set MS_CGROUPWB when data mode is either ordered or writeback. In journaled data mode, most IOs become synchronous through the journal and enabling cgroup writeback support doesn't make much sense or difference. Journaled data mode is left alone. Lightly tested with sequential data write workload. Behaves as expected. v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org --- fs/ext4/page-io.c | 2 ++ fs/ext4/super.c | 2 ++ 2 files changed, 4 insertions(+)