From patchwork Wed Nov 20 18:24:19 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254611 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E1CE2109A for ; Wed, 20 Nov 2019 18:24:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B02692089F for ; Wed, 20 Nov 2019 18:24:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="nyuH4pxk" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727975AbfKTSYv (ORCPT ); Wed, 20 Nov 2019 13:24:51 -0500 Received: from mail-pg1-f194.google.com ([209.85.215.194]:42600 "EHLO mail-pg1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727915AbfKTSYv (ORCPT ); Wed, 20 Nov 2019 13:24:51 -0500 Received: by mail-pg1-f194.google.com with SMTP id q17so122157pgt.9 for ; Wed, 20 Nov 2019 10:24:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=yONqXE/ZzZ3UHCn/pHjqPw1n04MfzutYZUQfl0J7kV8=; b=nyuH4pxkmcthI3U1da1IboUnNNmwnt/Lz9Guaw3nuEMvn1s+FZN19a+paXm+bpCut7 8uH0ne3LzPElbqpjz1RSsSl1XvbB32r+MY0LY83J6uBx+Ie5Io/g8iL9v9fIk7NSClAB MxK6j6ZYX7MpzxaK5oFhYvEz7eLJ8r5HvWh53auVjIHBwcQA/WreqqgeMr7FHvUyjmcZ w/DYY++4wtJcov3AvKpCKinQUWDN6N32gTY5YYGzKx9yo1B8QAtGq3z8gdzEWCB1oVPL dQly2CL4WK+7Uq+cgUQH8mhKHP0blMKXmSb1rL4bTHIAn/y5E6Xkol9Lfum6p1bOidJl GfZA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to 
:references:mime-version:content-transfer-encoding; bh=yONqXE/ZzZ3UHCn/pHjqPw1n04MfzutYZUQfl0J7kV8=; b=GnvLAc9R4FML4dRjLAw3/Ddx2E0A+v6Oy/yOoMZzDQiK5CGcEMp+iQaeTJFPxWoZog XYv0qVeHAb5AZ5VLbLdx+BZgkUWmAH0Ujsti38CAaJ+hnajwhlKguW2sj/N4yoJ1jY8i OCKcXBo1HaObjP/2EW/bjqRQ4ICbOT3DgCoRCrORdiYg61QS4OCO+0jr+MoLPZ63NCba eDNENB4E/H4MIGh5Ba5Pb/Fg9CfoWxmT8OQv29zFtqDo0Yd+D0VRC09xv7WehIXlMX6b 1nPu7B0502UaqVQyCYRprGM7ps/14DqP0wxjXqQxAzGax5wCAy/ZPn1gtcBRRv9OMSfn 4MtA== X-Gm-Message-State: APjAAAUZA7FgxycDYLkXbhhmQfCGrtxN0C9KdQGCp2Cbn7o4T+75kwrF hzYKjwcBaw70SG0DKYBMB6zOh4omiqA= X-Google-Smtp-Source: APXvYqwPrCNhnnUwsu+TaAZ6UojdpAzc6BUtV4rd4hSw+NW1U4rd6BhqnbQU22NYeoexbrI2wBqJCA== X-Received: by 2002:aa7:9467:: with SMTP id t7mr5787937pfq.142.1574274289330; Wed, 20 Nov 2019 10:24:49 -0800 (PST) Received: from vader.thefacebook.com ([2620:10d:c090:180::1a46]) by smtp.gmail.com with ESMTPSA id q34sm7937866pjb.15.2019.11.20.10.24.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Nov 2019 10:24:48 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH man-pages v2] Document encoded I/O Date: Wed, 20 Nov 2019 10:24:19 -0800 Message-Id: <4d5bf2e4c2a22a6c195c79e0ae09a4475f1f9bdc.1574274173.git.osandov@fb.com> X-Mailer: git-send-email 2.24.0 In-Reply-To: References: MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval This adds a new page, encoded_io(7), providing an overview of encoded I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to reference it. 
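As a rough illustration of the calling convention the new page documents, the sketch below lays out the iovec array for an encoded write: iov[0] carries the struct encoded_iov metadata and the remaining entries carry the encoded bytes. It only builds the array and computes the byte count that pwritev2() is documented to return; it does not issue the syscall. The struct is a local copy of the layout proposed in this series (plain uint64_t stands in for __aligned_u64), since it is not in released headers, and build_encoded_iov() and encoded_payload_len() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Local copy of the layout proposed in this series (assumption: not yet
 * available from released kernel headers). */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/* The series defines ENCODED_IOV_SIZE_VER0 as 32. */
static_assert(sizeof(struct encoded_iov) == 32, "unexpected struct size");

/* Value implied by the enum order in the series: NONE = 0, ZLIB = 1. */
#define ENCODED_IOV_COMPRESSION_ZLIB 1

/* Hypothetical helper: iov[0] holds the metadata, iov[1] the encoded data. */
void build_encoded_iov(struct iovec iov[2], struct encoded_iov *meta,
		       void *data, size_t data_len)
{
	iov[0].iov_base = meta;
	iov[0].iov_len = sizeof(*meta);
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
}

/* Hypothetical helper: sum of iov[n].iov_len for 1 <= n < iovcnt, which is
 * what a successful encoded write is documented to return. */
size_t encoded_payload_len(const struct iovec *iov, int iovcnt)
{
	size_t total = 0;

	for (int n = 1; n < iovcnt; n++)
		total += iov[n].iov_len;
	return total;
}
```

With the array built this way, the call described in the page would be pwritev2(fd, iov, 2, offset, RWF_ENCODED) on a file opened with O_ALLOW_ENCODED, and a successful return would equal encoded_payload_len(iov, 2).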
Signed-off-by: Omar Sandoval
---
 man2/fcntl.2      |  10 +-
 man2/open.2       |  13 ++
 man2/readv.2      |  64 ++++++++++
 man7/encoded_io.7 | 308 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 394 insertions(+), 1 deletion(-)
 create mode 100644 man7/encoded_io.7

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index fce4f4c2b..a9a4c0776 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -222,8 +222,9 @@ On Linux, this command can change only the
 .BR O_ASYNC ,
 .BR O_DIRECT ,
 .BR O_NOATIME ,
+.BR O_NONBLOCK ,
 and
-.B O_NONBLOCK
+.B O_ALLOW_ENCODED
 flags.
 It is not possible to change the
 .BR O_DSYNC
@@ -1803,6 +1804,13 @@ Attempted to clear the flag on a file that has the append-only
 attribute set.
 .TP
 .B EPERM
+Attempted to set the
+.B O_ALLOW_ENCODED
+flag and the calling process did not have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
+.B EPERM
 .I cmd
 was
 .BR F_ADD_SEALS ,
diff --git a/man2/open.2 b/man2/open.2
index b0f485b41..a68576d31 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -421,6 +421,14 @@ was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.B O_ALLOW_ENCODED
+Open the file with encoded I/O permissions;
+see
+.BR encoded_io (7).
+The caller must have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -1168,6 +1176,11 @@ did not match the owner of the file and the caller was not privileged.
 The operation was prevented by a file seal; see
 .BR fcntl (2).
 .TP
+.B EPERM
+The
+.B O_ALLOW_ENCODED
+flag was specified, but the caller was not privileged.
+.TP
 .B EROFS
 .I pathname
 refers to a file on a read-only filesystem and write access was
diff --git a/man2/readv.2 b/man2/readv.2
index af27aa63e..8b5458023 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -265,6 +265,11 @@ the data is always appended to the end of the file.
 However, if the
 .I offset
 argument is \-1, the current file offset is updated.
+.TP
+.BR RWF_ENCODED " (since Linux 5.7)"
+Read or write encoded (e.g., compressed) data.
+See
+.BR encoded_io (7).
 .SH RETURN VALUE
 On success,
 .BR readv (),
@@ -284,6 +289,13 @@ than requested (see
 and
 .BR write (2)).
 .PP
+If
+.B
+RWF_ENCODED
+was specified in
+.IR flags ,
+then the return value is the number of encoded bytes.
+.PP
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
 The errors are as given for
@@ -314,6 +326,58 @@ is less than zero or greater than the permitted maximum.
 .TP
 .B EOPNOTSUPP
 An unknown flag is specified in \fIflags\fP.
+.TP
+.B EOPNOTSUPP
+.B RWF_ENCODED
+is specified in
+.I flags
+and the filesystem does not implement encoded I/O.
+.TP
+.B EPERM
+.B RWF_ENCODED
+is specified in
+.I flags
+and the file was not opened with the
+.B O_ALLOW_ENCODED
+flag.
+.PP
+.BR preadv2 ()
+can fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+is not large enough to return the encoding metadata.
+.TP
+.B ENOBUFS
+.B RWF_ENCODED
+is specified in
+.I flags
+and the buffers in
+.I iov
+are not big enough to return the encoded data.
+.PP
+.BR pwritev2 ()
+can fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+contains non-zero fields
+after the kernel's
+.IR "sizeof(struct\ encoded_iov)" .
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the alignment and/or size requirements are not met.
 .SH VERSIONS
 .BR preadv ()
 and
diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
new file mode 100644
index 000000000..7be264f6b
--- /dev/null
+++ b/man7/encoded_io.7
@@ -0,0 +1,308 @@
+.\" Copyright (c) 2019 by Omar Sandoval
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\"
+.TH ENCODED_IO 7 2019-10-14 "Linux" "Linux Programmer's Manual"
+.SH NAME
+encoded_io \- overview of encoded I/O
+.SH DESCRIPTION
+Several filesystems (e.g., Btrfs) support transparent encoding
+(e.g., compression, encryption) of data on disk:
+written data is encoded by the kernel before it is written to disk,
+and read data is decoded before being returned to the user.
+In some cases, it is useful to skip this encoding step.
+For example, the user may want to read the compressed contents of a file
+or write pre-compressed data directly to a file.
+This is referred to as "encoded I/O".
+.SS Encoded I/O API
+Encoded I/O is specified with the
+.B RWF_ENCODED
+flag to
+.BR preadv2 (2)
+and
+.BR pwritev2 (2).
+If
+.B RWF_ENCODED
+is specified, then
+.I iov[0].iov_base
+points to an
+.I
+encoded_iov
+structure, defined in
+.I <linux/fs.h>
+as:
+.PP
+.in +4n
+.EX
+struct encoded_iov {
+    __aligned_u64 len;
+    __aligned_u64 unencoded_len;
+    __aligned_u64 unencoded_offset;
+    __u32 compression;
+    __u32 encryption;
+};
+.EE
+.in
+.PP
+This may be extended in the future, so
+.I iov[0].iov_len
+must be set to
+.I "sizeof(struct\ encoded_iov)"
+for forward/backward compatibility.
+The remaining buffers contain the encoded data.
+.PP
+.I compression
+and
+.I encryption
+are the encoding fields.
+.I compression
+is one of
+.B ENCODED_IOV_COMPRESSION_NONE
+(zero),
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+or
+.BR ENCODED_IOV_COMPRESSION_ZSTD .
+.I encryption
+is currently always
+.B ENCODED_IOV_ENCRYPTION_NONE
+(zero).
+.PP
+.I unencoded_len
+is the length of the unencoded (i.e., decrypted and decompressed) data.
+.I unencoded_offset
+is the offset into the unencoded data where the data in the file begins
+(strictly less than
+.IR unencoded_len ).
+.I len
+is the length of the data in the file.
+.PP
+In most cases,
+.I len
+is equal to
+.I unencoded_len
+and
+.I unencoded_offset
+is zero.
+However, it may be necessary to refer to a subset of the unencoded data,
+usually because a read occurred in the middle of an encoded extent,
+because part of an extent was overwritten or deallocated in some
+way (e.g., with
+.BR write (2),
+.BR truncate (2),
+or
+.BR fallocate (2))
+or because part of an extent was added to the file (e.g., with
+.BR ioctl_ficlonerange (2)
+or
+.BR ioctl_fideduperange (2)).
+For example, if
+.I len
+is 300,
+.I unencoded_len
+is 1000,
+and
+.I unencoded_offset
+is 600,
+then the encoded data is 1000 bytes long when decoded,
+of which only the 300 bytes starting at offset 600 are used;
+the first 600 and last 100 bytes should be ignored.
+.PP
+Additionally,
+.I len
+may be greater than
+.I unencoded_len
+-
+.IR unencoded_offset ;
+in this case, the data in the file is longer than the unencoded data,
+and the difference is zero-filled.
+.PP
+If the unencoded data is actually longer than
+.IR unencoded_len ,
+then it is truncated;
+if it is shorter, then it is extended with zeroes.
+.PP
+For
+.BR pwritev2 (),
+the metadata should be specified in
+.IR iov[0] .
+If
+.I iov[0].iov_len
+is less than
+.I "sizeof(struct\ encoded_iov)"
+in the kernel,
+then any fields unknown to userspace are treated as if they were zero;
+if it is greater and any fields unknown to the kernel are non-zero,
+then this returns -1 and sets
+.I errno
+to
+.BR E2BIG .
+The encoded data should be passed in the remaining buffers.
+This returns the number of encoded bytes written (that is, the sum of
+.I iov[n].iov_len
+for 1 <=
+.I n
+<
+.IR iovcnt ;
+partial writes will not occur).
+If the
+.I offset
+argument to
+.BR pwritev2 ()
+is -1, then the file offset is incremented by
+.IR len .
+At least one encoding field must be non-zero.
+Note that the encoded data is not validated when it is written;
+if it is not valid (e.g., it cannot be decompressed),
+then a subsequent read may return an error.
+.PP
+For
+.BR preadv2 (),
+the metadata is returned in
+.IR iov[0] .
+If
+.I iov[0].iov_len
+is less than
+.I "sizeof(struct\ encoded_iov)"
+in the kernel and any fields unknown to userspace are non-zero,
+then this returns -1 and sets
+.I errno
+to
+.BR E2BIG ;
+if it is greater,
+then any fields unknown to the kernel are returned as zero.
+The encoded data is returned in the remaining buffers.
+If the provided buffers are not large enough to return an entire encoded
+extent,
+then this returns -1 and sets
+.I errno
+to
+.BR ENOBUFS .
+This returns the number of encoded bytes read.
+Note that a return value of zero does not indicate end of file;
+one should refer to
+.I len
+(for example, a hole in the file has a non-zero
+.I len
+but a zero return value).
+A
+.I len
+of zero indicates end of file.
+If the
+.I offset
+argument to
+.BR preadv2 ()
+is -1, then the file offset is incremented by
+.IR len .
+This will only return one encoded extent per call.
+This can also read data which is not encoded;
+all encoding fields will be zero in that case.
+.SS Security
+Encoded I/O creates the potential for some security issues:
+.IP * 3
+Encoded writes allow writing arbitrary data which the kernel will decode on
+a subsequent read. Decompression algorithms are complex and may have bugs
+which can be exploited by maliciously crafted data.
+.IP *
+Encoded reads may return data which is not logically present in the file
+(see the discussion of
+.I len
+vs.
+.I unencoded_len
+above).
+It may not be intended for this data to be readable.
+.PP
+Therefore, encoded I/O requires privilege.
+Namely, the
+.B RWF_ENCODED
+flag may only be used when the file was opened with the
+.B O_ALLOW_ENCODED
+flag to
+.BR open (2),
+which requires the
+.B CAP_SYS_ADMIN
+capability.
+.B O_ALLOW_ENCODED
+may be set and cleared with
+.BR fcntl (2).
+Note that it is not cleared on
+.BR fork (2)
+or
+.BR execve (2);
+one may wish to use
+.B O_CLOEXEC
+with
+.BR O_ALLOW_ENCODED .
+.SS Filesystem support
+Encoded I/O is supported on the following filesystems:
+.TP
+Btrfs (since Linux 5.7)
+.IP
+Btrfs supports encoded reads and writes of compressed data.
+The data is encoded as follows:
+.RS
+.IP * 3
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+then the encoded data is a single zlib stream.
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+then the encoded data is compressed page by page with LZO1X
+and wrapped in the format documented in the Linux kernel source file
+.IR fs/btrfs/lzo.c .
+.IP * +If +.I compression +is +.BR ENCODED_IOV_COMPRESSION_ZSTD , +then the encoded data is a single zstd frame compressed with the +.I windowLog +compression parameter set to no more than 17. +.RE +.IP +Additionally, there are some restrictions on +.BR pwritev2 (): +.RS +.IP * 3 +.I offset +(or the current file offset if +.I offset +is -1) must be aligned to the sector size of the filesystem. +.IP * +.I len +must be aligned to the sector size of the filesystem +unless the data ends at or beyond the current end of the file. +.IP * +.I unencoded_len +and the length of the encoded data must each be no more than 128 KiB. +This limit may increase in the future. +.IP * +The length of the encoded data must be less than or equal to +.IR unencoded_len . +.RE From patchwork Wed Nov 20 18:24:22 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254629 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E66F9109A for ; Wed, 20 Nov 2019 18:25:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id BE19D20878 for ; Wed, 20 Nov 2019 18:25:00 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="eK7TtBsT" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728453AbfKTSY4 (ORCPT ); Wed, 20 Nov 2019 13:24:56 -0500 Received: from mail-pl1-f196.google.com ([209.85.214.196]:39955 "EHLO mail-pl1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728391AbfKTSYz (ORCPT ); Wed, 20 Nov 2019 13:24:55 -0500 Received: by mail-pl1-f196.google.com with SMTP id f9so151004plr.7 for ; Wed, 20 Nov 2019 10:24:53 -0800 (PST) DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=o2+igEOnCXCoZoiAJpf5HHbPvlBNpJqwdgFjMnknAbY=; b=eK7TtBsTTOswuydlQGQCX2MZgOEiLqA+PiWtbN5oI92q3BbkVN+U1v9tQPxo4a+nA8 qXLloGoiSXBYGYBi1uoMZ9N5DZHRylT68PwWBNEaIUgs7+98Ke7UlxU8LxMTzTtQ4RVF 79dPyl/ktVGqSBYNPDrSk/g29ZW5hl/rqcLDAWdxzeA4jxFWBc4WYKg6O5ya1geXHC2B AmqWoym6YUG0n6A4ZdOF2sm3so8h4uvd0D+vOHQx54Tl5ln2OQLTkjyiq/sHLTm/rH8a AhVHmKQgVa/zVc+Ussn1SSXr0zkUT4E/hsONWHKxEURZP+UKb8GJP0BFqBaHehqb40jK bAHQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=o2+igEOnCXCoZoiAJpf5HHbPvlBNpJqwdgFjMnknAbY=; b=dKgwqEvj2zUtJRWTEtEOlQVi4S5DEzM8uwmva56S5F3vVGj5ZbEzdK2lAUYvRR8BoV BmEgYMyqxp+UMO24qg6CFfrTk51zr+nDQRbLZ/p5a1Jua/h86D+zZEEFnNWn+YQvKhiq H6uo6z0JS1I+NmRmENYNQFu+YcYSIgsGFNUeQrNKUcKRsw3Bf2iaxLgRtYgBMeS03xCp 1bMNv5fyn6LztGl+SUgS0qgQSYKeFXmCWgUDA0juqY20TJ7i5pASnA292DSa9RDuqMmf mEWJTkQ97wdjCm0kQ3yzydhkccUnOjPSXH95NGXKXLsgTtZVrlwwn4UmfquZYBDUbn9D ZxmQ== X-Gm-Message-State: APjAAAVWsKxaJ/WfZWSOsKgWi7GooItv2Boa3JHHRbcseIJ3ncCQ0zQJ 0Csg0r7v6OYhSFL5fYj3Sf6pMSB1pMg= X-Google-Smtp-Source: APXvYqz/nFa1LIb2dXpleNzDUkBRlPW/dFm2GFWU8U4h6SWYn5eD5w9bHRly15ufQHluyuVeanTRjA== X-Received: by 2002:a17:90a:f84:: with SMTP id 4mr5695411pjz.110.1574274292981; Wed, 20 Nov 2019 10:24:52 -0800 (PST) Received: from vader.thefacebook.com ([2620:10d:c090:180::1a46]) by smtp.gmail.com with ESMTPSA id q34sm7937866pjb.15.2019.11.20.10.24.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Nov 2019 10:24:52 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [RFC PATCH v3 02/12] fs: add 
O_ALLOW_ENCODED open flag Date: Wed, 20 Nov 2019 10:24:22 -0800 Message-Id: X-Mailer: git-send-email 2.24.0 In-Reply-To: References: MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval The upcoming RWF_ENCODED operation introduces some security concerns: 1. Compressed writes will pass arbitrary data to decompression algorithms in the kernel. 2. Compressed reads can leak truncated/hole punched data. Therefore, we need to require privilege for RWF_ENCODED. It's not possible to do the permissions checks at the time of the read or write because, e.g., io_uring submits IO from a worker thread. So, add an open flag which requires CAP_SYS_ADMIN. It can also be set and cleared with fcntl(). The flag is not cleared in any way on fork or exec; it should probably be used with O_CLOEXEC in most cases. Note that the usual issue that unknown open flags are ignored doesn't really matter for O_ALLOW_ENCODED; if the kernel doesn't support O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either. 
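The permission rule described above can be modeled in userspace roughly as follows. This is only a sketch of the condition in the fs/fcntl.c hunk of this patch, not kernel code: turning the flag on requires privilege, while keeping it set or clearing it does not. allow_encoded_setfl_check() is a hypothetical name, the capable parameter stands in for capable(CAP_SYS_ADMIN), and the flag value is the asm-generic one proposed in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* asm-generic value proposed in this series (assumption: not defined by
 * released headers). */
#ifndef O_ALLOW_ENCODED
#define O_ALLOW_ENCODED 040000000
#endif

/* fcntl_init() counts the bits in VALID_OPEN_FLAGS, so each per-arch
 * definition of the flag has to be a single bit. */
static_assert((O_ALLOW_ENCODED & (O_ALLOW_ENCODED - 1)) == 0,
	      "O_ALLOW_ENCODED must be a single bit");

/*
 * Hypothetical userspace model of the check added to setfl():
 * old_flags plays the role of filp->f_flags, new_flags the role of arg,
 * and capable the result of capable(CAP_SYS_ADMIN).
 */
int allow_encoded_setfl_check(unsigned int old_flags, unsigned int new_flags,
			      bool capable)
{
	bool turning_on = (new_flags & O_ALLOW_ENCODED) &&
			  !(old_flags & O_ALLOW_ENCODED);

	if (turning_on && !capable)
		return -EPERM;
	return 0;
}
```

One consequence of this shape is that an unprivileged process which inherited a descriptor with the flag set can pass it back to F_SETFL unchanged, or drop it, without hitting EPERM.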
Signed-off-by: Omar Sandoval --- arch/alpha/include/uapi/asm/fcntl.h | 1 + arch/parisc/include/uapi/asm/fcntl.h | 1 + arch/sparc/include/uapi/asm/fcntl.h | 1 + fs/fcntl.c | 10 ++++++++-- fs/namei.c | 4 ++++ include/linux/fcntl.h | 2 +- include/uapi/asm-generic/fcntl.h | 4 ++++ 7 files changed, 20 insertions(+), 3 deletions(-) diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h index 50bdc8e8a271..391e0d112e41 100644 --- a/arch/alpha/include/uapi/asm/fcntl.h +++ b/arch/alpha/include/uapi/asm/fcntl.h @@ -34,6 +34,7 @@ #define O_PATH 040000000 #define __O_TMPFILE 0100000000 +#define O_ALLOW_ENCODED 0200000000 #define F_GETLK 7 #define F_SETLK 8 diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h index 03ce20e5ad7d..1188b27002b3 100644 --- a/arch/parisc/include/uapi/asm/fcntl.h +++ b/arch/parisc/include/uapi/asm/fcntl.h @@ -22,6 +22,7 @@ #define O_PATH 020000000 #define __O_TMPFILE 040000000 +#define O_ALLOW_ENCODED 0100000000 #define F_GETLK64 8 #define F_SETLK64 9 diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h index 67dae75e5274..ac3e8c9cb32c 100644 --- a/arch/sparc/include/uapi/asm/fcntl.h +++ b/arch/sparc/include/uapi/asm/fcntl.h @@ -37,6 +37,7 @@ #define O_PATH 0x1000000 #define __O_TMPFILE 0x2000000 +#define O_ALLOW_ENCODED 0x8000000 #define F_GETOWN 5 /* for sockets. */ #define F_SETOWN 6 /* for sockets.
*/ diff --git a/fs/fcntl.c b/fs/fcntl.c index 3d40771e8e7c..407e663dff4a 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -30,7 +30,8 @@ #include #include -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME) +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \ + O_ALLOW_ENCODED) static int setfl(int fd, struct file * filp, unsigned long arg) { @@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg) if (!inode_owner_or_capable(inode)) return -EPERM; + /* O_ALLOW_ENCODED can only be set by superuser */ + if ((arg & O_ALLOW_ENCODED) && !(filp->f_flags & O_ALLOW_ENCODED) && + !capable(CAP_SYS_ADMIN)) + return -EPERM; + /* required for strict SunOS emulation */ if (O_NONBLOCK != O_NDELAY) if (arg & O_NDELAY) @@ -1031,7 +1037,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */ - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY)); diff --git a/fs/namei.c b/fs/namei.c index 671c3c1a3425..737d9f05b095 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2978,6 +2978,10 @@ static int may_open(const struct path *path, int acc_mode, int flag) if (flag & O_NOATIME && !inode_owner_or_capable(inode)) return -EPERM; + /* O_ALLOW_ENCODED can only be set by superuser */ + if ((flag & O_ALLOW_ENCODED) && !capable(CAP_SYS_ADMIN)) + return -EPERM; + return 0; } diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index d019df946cb2..0dc6fa93f7cb 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -9,7 +9,7 @@ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ - O_NOATIME | O_CLOEXEC | 
O_PATH | __O_TMPFILE) + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ALLOW_ENCODED) #ifndef force_o_largefile #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T)) diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 9dc0bf0c5a6e..75321c7a66ac 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -89,6 +89,10 @@ #define __O_TMPFILE 020000000 #endif +#ifndef O_ALLOW_ENCODED +#define O_ALLOW_ENCODED 040000000 +#endif + /* a horrid kludge trying to make sure that this will fail on old kernels */ #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) From patchwork Wed Nov 20 18:24:23 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254623 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 70A5F1871 for ; Wed, 20 Nov 2019 18:24:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3E42320878 for ; Wed, 20 Nov 2019 18:24:58 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="uGKzKYqm" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728482AbfKTSY5 (ORCPT ); Wed, 20 Nov 2019 13:24:57 -0500 Received: from mail-pg1-f193.google.com ([209.85.215.193]:37958 "EHLO mail-pg1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728391AbfKTSY5 (ORCPT ); Wed, 20 Nov 2019 13:24:57 -0500 Received: by mail-pg1-f193.google.com with SMTP id 15so134641pgh.5 for ; Wed, 20 Nov 2019 10:24:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; 
s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=8XzHxPi137x/pI2bsQL0zdjTY440X6mKNZblWdV6sp4=; b=uGKzKYqmJGH7hI2tDaniyhEGGulXqY53tmyyMJ6u1QC9wf4UWQdvv73v4TXPFgfP1d SLVzeT7GtMNTZTO7mklH1Fx0FgVvm/zP4VrpgsGoM4lXUVITZ2Xa6tw3+qLyCflrlMXe AOdfLUO9Ybns0MeFETognL7BEakGHn09bWHEQCaFga9wY2Rlz/CKn4qkg2/ckyUhWws2 gKMHeaGSL5VdH3Y+VBufRtLbm4zQvzekc2YS2eEcR2wNwefNGF8+abYgZEG9ajNmNw0y D6/5d7h46exwPpnJ6GfgDMDjW2171+M0MzHkd3o4M/lps18MzUT56BsuJvSZCDK0js7d Uuxw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=8XzHxPi137x/pI2bsQL0zdjTY440X6mKNZblWdV6sp4=; b=Cy7WokGO2n9wIcJ55PFCMIGo3CHKdR2dtXYjTftkEWtbfpCrxV3nu2qFKq6AWPyPUQ fZvr48zSX6PtvCsni7eWoe78OdbHEYKumichTex93Yv1MG5BuGI9xwhmOKZ4GgAcM9AB zF8m5QEzobrK6lt02GnGB9nAUrCyNsluHwh2O9R//QHcQO21wqgu4Z20hrIVATb64GZa J+/yTDIWHeVXG66jeqb82Xti/L5gMsN2sMlQmgHepl2vwClvdzekSauGhbNECsb/S0bw KKAUFaGwvgSSibx6LwJbilZxYBxQjyBz0KZQ4i0z7HJDicAmAvo+BWlTXvFLl7/PHGPg Jiwg== X-Gm-Message-State: APjAAAUC6BkY0BOGYW/eyC4ZPGL/dZIMC2LKGMfDFZyGeei5vPYO3kxu vkmoz0JPZlRmPW7k7+4U2M2kGGdP1ZQ= X-Google-Smtp-Source: APXvYqwFl2Y4PqQ4ZTgsWnt5DRDAiKjHqjz6UVVx2FEKN5ekCIRvh29PxvFoCftAn0iP1Qf5sUerLQ== X-Received: by 2002:a65:62d3:: with SMTP id m19mr4696325pgv.270.1574274294198; Wed, 20 Nov 2019 10:24:54 -0800 (PST) Received: from vader.thefacebook.com ([2620:10d:c090:180::1a46]) by smtp.gmail.com with ESMTPSA id q34sm7937866pjb.15.2019.11.20.10.24.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Nov 2019 10:24:53 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [RFC PATCH v3 03/12] fs: add RWF_ENCODED for reading/writing compressed data Date: Wed, 20 Nov 2019 10:24:23 
-0800 Message-Id: <07f9cc1969052e94818fa50019e7589d206d1d18.1574273658.git.osandov@fb.com> X-Mailer: git-send-email 2.24.0 In-Reply-To: References: MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval Btrfs supports transparent compression: data written by the user can be compressed when written to disk and decompressed when read back. However, we'd like to add an interface to write pre-compressed data directly to the filesystem, and the matching interface to read compressed data without decompressing it. This adds support for so-called "encoded I/O" via preadv2() and pwritev2(). A new RWF_ENCODED flag indicates that a read or write is "encoded". If this flag is set, iov[0].iov_base points to a struct encoded_iov which is used for metadata: namely, the compression algorithm, unencoded (i.e., decompressed) length, and what subrange of the unencoded data should be used (needed for truncated or hole-punched extents and when reading in the middle of an extent). For reads, the filesystem returns this information; for writes, the caller provides it to the filesystem. iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be used to extend the interface in the future a la copy_struct_from_user(). The remaining iovecs contain the encoded extent. This adds the VFS helpers for supporting encoded I/O and documentation for filesystem support.
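To make the subrange semantics concrete, here is a small sketch of how a consumer of an encoded read might split the len bytes described by a struct encoded_iov into bytes taken from the decoded extent and trailing zero fill. It assumes the rules stated in the man-page patch of this series (the used range starts at unencoded_offset, and anything past the end of the unencoded data is zero-filled). The struct is a local copy of the proposed layout, and struct encoded_range and split_encoded_range() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the layout proposed in this series; plain uint64_t stands
 * in for __aligned_u64. */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

static_assert(sizeof(struct encoded_iov) == 32, "unexpected struct size");

/* Hypothetical result type: how the len bytes in the file are made up. */
struct encoded_range {
	uint64_t from_data;	/* bytes taken from the decoded extent */
	uint64_t zero_fill;	/* trailing bytes to treat as zeroes */
};

/* Hypothetical helper implementing the len/unencoded_len/unencoded_offset
 * rules as described in the man page of this series. */
struct encoded_range split_encoded_range(const struct encoded_iov *e)
{
	struct encoded_range r;
	uint64_t avail = 0;

	/* Decoded bytes available at and after unencoded_offset. */
	if (e->unencoded_offset < e->unencoded_len)
		avail = e->unencoded_len - e->unencoded_offset;

	r.from_data = e->len < avail ? e->len : avail;
	r.zero_fill = e->len - r.from_data;
	return r;
}
```

For the man page's example (len 300, unencoded_len 1000, unencoded_offset 600) this yields 300 bytes from the decoded data and no zero fill; a hole reported with unencoded_len 0 comes out as zero fill only.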
Signed-off-by: Omar Sandoval
---
 Documentation/filesystems/encoded_io.rst |  79 +++++++++++
 Documentation/filesystems/index.rst      |   1 +
 include/linux/fs.h                       |  16 +++
 include/uapi/linux/fs.h                  |  33 ++++-
 mm/filemap.c                             | 165 +++++++++++++++++++++--
 5 files changed, 280 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/encoded_io.rst

diff --git a/Documentation/filesystems/encoded_io.rst b/Documentation/filesystems/encoded_io.rst
new file mode 100644
index 000000000000..3ed1ba6e34de
--- /dev/null
+++ b/Documentation/filesystems/encoded_io.rst
@@ -0,0 +1,79 @@
+===========
+Encoded I/O
+===========
+
+Encoded I/O is a mechanism for reading and writing encoded (e.g., compressed
+and/or encrypted) data directly from/to the filesystem. The userspace interface
+is thoroughly described in the :manpage:`encoded_io(7)` man page; this document
+describes the requirements for filesystem support.
+
+First of all, a filesystem supporting encoded I/O must indicate this by setting
+the ``FMODE_ENCODED_IO`` flag in its ``file_open`` file operation::
+
+    static int foo_file_open(struct inode *inode, struct file *filp)
+    {
+            ...
+            filp->f_mode |= FMODE_ENCODED_IO;
+            ...
+    }
+
+Encoded I/O goes through ``read_iter`` and ``write_iter``, designated by the
+``IOCB_ENCODED`` flag in ``kiocb->ki_flags``.
+
+Reads
+=====
+
+Encoded ``read_iter`` should:
+
+1. Call ``generic_encoded_read_checks()`` to validate the file and buffers
+   provided by userspace.
+2. Initialize the ``encoded_iov`` appropriately.
+3. Copy it to the user with ``copy_encoded_iov_to_iter()``.
+4. Copy the encoded data to the user.
+5. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
+6. Return the size of the encoded data read, not including the ``encoded_iov``.
+
+There are a few details to be aware of:
+
+* Encoded ``read_iter`` should support reading unencoded data if the extent is
+  not encoded.
+* If the buffers provided by the user are not large enough to contain an entire
+  encoded extent, then ``read_iter`` should return ``-ENOBUFS``. This is to
+  avoid confusing userspace with truncated data that cannot be properly
+  decoded.
+* Reads in the middle of an encoded extent can be returned by setting
+  ``encoded_iov->unencoded_offset`` to non-zero.
+* Truncated unencoded data (e.g., because the file does not end on a block
+  boundary) may be returned by setting ``encoded_iov->len`` to a value smaller
+  than ``encoded_iov->unencoded_len - encoded_iov->unencoded_offset``.
+* If ``encoded_iov->len`` is greater than ``encoded_iov->unencoded_len -
+  encoded_iov->unencoded_offset``, then the user will treat the remainder as
+  zero-filled. Therefore, holes may be returned by setting ``encoded_iov->len``
+  to the size of the hole, setting ``encoded_iov->unencoded_len`` to zero, and
+  copying no data to the user.
+
+Writes
+======
+
+Encoded ``write_iter`` should (in addition to the usual accounting/checks done
+by ``write_iter``):
+
+1. Call ``copy_encoded_iov_from_iter()`` to get and validate the
+   ``encoded_iov``.
+2. Call ``generic_encoded_write_checks()`` instead of
+   ``generic_write_checks()``.
+3. Check that the provided encoding in ``encoded_iov`` is supported.
+4. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
+5. Return the size of the encoded data written.
+
+Again, there are a few details:
+
+* Encoded ``write_iter`` doesn't need to support writing unencoded data.
+* ``write_iter`` should either write all of the encoded data or none of it; it
+  must not do partial writes.
+* ``write_iter`` doesn't need to validate the encoded data; a subsequent read
+  may return, e.g., ``-EIO`` if the data is not valid.
+* The user may lie about the unencoded size of the data; a subsequent read
+  should truncate or zero-extend the unencoded data rather than returning an
+  error.
+* Be careful of page cache coherency.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 2c3a9f761205..ac9a46fc6120 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -37,6 +37,7 @@ filesystem implementations. journalling fscrypt fsverity + encoded_io Filesystems =========== diff --git a/include/linux/fs.h b/include/linux/fs.h index e0d909d35763..b4f8949d3588 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* File does not contribute to nr_files count */ #define FMODE_NOACCOUNT ((__force fmode_t)0x20000000) +/* File supports encoded IO */ +#define FMODE_ENCODED_IO ((__force fmode_t)0x40000000) + /* * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector * that indicates that they should check the contents of the iovec are @@ -314,6 +317,7 @@ enum rw_hint { #define IOCB_SYNC (1 << 5) #define IOCB_WRITE (1 << 6) #define IOCB_NOWAIT (1 << 7) +#define IOCB_ENCODED (1 << 8) struct kiocb { struct file *ki_filp; @@ -3088,6 +3092,13 @@ extern int sb_min_blocksize(struct super_block *, int); extern int generic_file_mmap(struct file *, struct vm_area_struct *); extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *); +struct encoded_iov; +extern int generic_encoded_write_checks(struct kiocb *, + const struct encoded_iov *); +extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *); +extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *); +extern int copy_encoded_iov_to_iter(const struct encoded_iov *, + struct iov_iter *); extern int generic_remap_checks(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, loff_t *count, unsigned int remap_flags); @@ -3403,6 +3414,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) return -EOPNOTSUPP; ki->ki_flags 
|= IOCB_NOWAIT; } + if (flags & RWF_ENCODED) { + if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO)) + return -EOPNOTSUPP; + ki->ki_flags |= IOCB_ENCODED; + } if (flags & RWF_HIPRI) ki->ki_flags |= IOCB_HIPRI; if (flags & RWF_DSYNC) diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 379a612f8f1d..775d18bb0efc 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -284,6 +284,34 @@ struct fsxattr { typedef int __bitwise __kernel_rwf_t; +enum { + ENCODED_IOV_COMPRESSION_NONE, +#define ENCODED_IOV_COMPRESSION_NONE ENCODED_IOV_COMPRESSION_NONE + ENCODED_IOV_COMPRESSION_ZLIB, +#define ENCODED_IOV_COMPRESSION_ZLIB ENCODED_IOV_COMPRESSION_ZLIB + ENCODED_IOV_COMPRESSION_LZO, +#define ENCODED_IOV_COMPRESSION_LZO ENCODED_IOV_COMPRESSION_LZO + ENCODED_IOV_COMPRESSION_ZSTD, +#define ENCODED_IOV_COMPRESSION_ZSTD ENCODED_IOV_COMPRESSION_ZSTD + ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD, +}; + +enum { + ENCODED_IOV_ENCRYPTION_NONE, +#define ENCODED_IOV_ENCRYPTION_NONE ENCODED_IOV_ENCRYPTION_NONE + ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE, +}; + +struct encoded_iov { + __aligned_u64 len; + __aligned_u64 unencoded_len; + __aligned_u64 unencoded_offset; + __u32 compression; + __u32 encryption; +}; + +#define ENCODED_IOV_SIZE_VER0 32 + /* high priority request, poll if possible */ #define RWF_HIPRI ((__force __kernel_rwf_t)0x00000001) @@ -299,8 +327,11 @@ typedef int __bitwise __kernel_rwf_t; /* per-IO O_APPEND */ #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) +/* encoded (e.g., compressed and/or encrypted) IO */ +#define RWF_ENCODED ((__force __kernel_rwf_t)0x00000020) + /* mask of flags supported by the kernel */ #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ - RWF_APPEND) + RWF_APPEND | RWF_ENCODED) #endif /* _UAPI_LINUX_FS_H */ diff --git a/mm/filemap.c b/mm/filemap.c index 85b7d087eb45..78b8535b58dd 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2949,24 +2949,15 @@ static int 
generic_write_check_limits(struct file *file, loff_t pos, return 0; } -/* - * Performs necessary checks before doing a write - * - * Can adjust writing position or amount of bytes to write. - * Returns appropriate error code that caller should return or - * zero in case that write should be allowed. - */ -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - loff_t count; - int ret; if (IS_SWAPFILE(inode)) return -ETXTBSY; - if (!iov_iter_count(from)) + if (!*count) return 0; /* FIXME: this is for backwards compatibility with 2.4 */ @@ -2976,8 +2967,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT)) return -EINVAL; - count = iov_iter_count(from); - ret = generic_write_check_limits(file, iocb->ki_pos, &count); + return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count); +} + +/* + * Performs necessary checks before doing a write + * + * Can adjust writing position or amount of bytes to write. + * Returns a negative errno or the new number of bytes to write. + */ +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) +{ + loff_t count = iov_iter_count(from); + int ret; + + ret = generic_write_checks_common(iocb, &count); if (ret) return ret; @@ -2986,6 +2990,141 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) } EXPORT_SYMBOL(generic_write_checks); +/** + * generic_encoded_write_checks() - check an encoded write + * @iocb: I/O context. + * @encoded: Encoding metadata. + * + * This should be called by RWF_ENCODED write implementations rather than + * generic_write_checks(). Unlike generic_write_checks(), it returns -EFBIG + * instead of adjusting the size of the write. + * + * Return: 0 on success, -errno on error. 
+ */ +int generic_encoded_write_checks(struct kiocb *iocb, + const struct encoded_iov *encoded) +{ + loff_t count = encoded->len; + int ret; + + if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED)) + return -EPERM; + + ret = generic_write_checks_common(iocb, &count); + if (ret) + return ret; + + if (count != encoded->len) { + /* + * The write got truncated by generic_write_checks_common(). We + * can't do a partial encoded write. + */ + return -EFBIG; + } + return 0; +} +EXPORT_SYMBOL(generic_encoded_write_checks); + +/** + * copy_encoded_iov_from_iter() - copy a &struct encoded_iov from userspace + * @encoded: Returned encoding metadata. + * @from: Source iterator. + * + * This copies in the &struct encoded_iov and does some basic sanity checks. + * This should always be used rather than a plain copy_from_iter(), as it does + * the proper handling for backward- and forward-compatibility. + * + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if the + * copied structure contained non-zero fields that this kernel doesn't + * support, -EINVAL if the copied structure was invalid. 
+ */ +int copy_encoded_iov_from_iter(struct encoded_iov *encoded, + struct iov_iter *from) +{ + size_t usize; + int ret; + + usize = iov_iter_single_seg_count(from); + if (usize > PAGE_SIZE) + return -E2BIG; + if (usize < ENCODED_IOV_SIZE_VER0) + return -EINVAL; + ret = copy_struct_from_iter(encoded, sizeof(*encoded), from, usize); + if (ret) + return ret; + + if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE && + encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) + return -EINVAL; + if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES || + encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES) + return -EINVAL; + if (encoded->unencoded_len && + encoded->unencoded_offset >= encoded->unencoded_len) + return -EINVAL; + return 0; +} +EXPORT_SYMBOL(copy_encoded_iov_from_iter); + +/** + * generic_encoded_read_checks() - sanity check an RWF_ENCODED read + * @iocb: I/O context. + * @iter: Destination iterator for read. + * + * This should always be called by RWF_ENCODED read implementations before + * returning any data. + * + * Return: Number of bytes available to return encoded data in @iter on success, + * -EPERM if the file was not opened with O_ALLOW_ENCODED, -EINVAL if + * the size of the &struct encoded_iov iovec is invalid. + */ +ssize_t generic_encoded_read_checks(struct kiocb *iocb, struct iov_iter *iter) +{ + size_t usize; + + if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED)) + return -EPERM; + usize = iov_iter_single_seg_count(iter); + if (usize > PAGE_SIZE || usize < ENCODED_IOV_SIZE_VER0) + return -EINVAL; + return iov_iter_count(iter) - usize; +} +EXPORT_SYMBOL(generic_encoded_read_checks); + +/** + * copy_encoded_iov_to_iter() - copy a &struct encoded_iov to userspace + * @encoded: Encoding metadata to return. + * @to: Destination iterator. + * + * This should always be used by RWF_ENCODED read implementations rather than a + * plain copy_to_iter(), as it does the proper handling for backward- and + * forward-compatibility. 
The iterator must be sanity-checked with + * generic_encoded_read_checks() before this is called. + * + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if there + * were non-zero fields in @encoded that the user buffer could not + * accommodate. + */ +int copy_encoded_iov_to_iter(const struct encoded_iov *encoded, + struct iov_iter *to) +{ + size_t ksize = sizeof(*encoded); + size_t usize = iov_iter_single_seg_count(to); + size_t size = min(ksize, usize); + + /* We already sanity-checked usize in generic_encoded_read_checks(). */ + + if (usize < ksize && + memchr_inv((char *)encoded + usize, 0, ksize - usize)) + return -E2BIG; + if (copy_to_iter(encoded, size, to) != size || + (usize > ksize && + iov_iter_zero(usize - ksize, to) != usize - ksize)) + return -EFAULT; + return 0; +} +EXPORT_SYMBOL(copy_encoded_iov_to_iter); + /* * Performs necessary checks before doing a clone. * From patchwork Wed Nov 20 18:24:24 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254625 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 66B6A1390 for ; Wed, 20 Nov 2019 18:24:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3DC9C208A3 for ; Wed, 20 Nov 2019 18:24:59 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="F98d6NDq" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728483AbfKTSY6 (ORCPT ); Wed, 20 Nov 2019 13:24:58 -0500 Received: from mail-pg1-f195.google.com ([209.85.215.195]:33134 "EHLO mail-pg1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728459AbfKTSY5 (ORCPT ); Wed, 20 Nov 2019 
13:24:57 -0500
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc:
Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [RFC PATCH v3 04/12] btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappers Date: Wed, 20 Nov 2019 10:24:24 -0800 Message-Id: X-Mailer: git-send-email 2.24.0 In-Reply-To: References: MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval Currently, we have two wrappers for __btrfs_lookup_bio_sums(): btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and btrfs_lookup_bio_sums(), which is used everywhere else. The only difference is that the _dio variant looks up csums starting at the given offset instead of using the page index, which isn't actually direct I/O-specific. Let's clean up the signature and return value of __btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get rid of the trivial helpers. Signed-off-by: Omar Sandoval Reviewed-by: Nikolay Borisov --- fs/btrfs/compression.c | 4 ++-- fs/btrfs/ctree.h | 4 +--- fs/btrfs/file-item.c | 35 +++++++++++++++++------------------ fs/btrfs/inode.c | 6 +++--- 4 files changed, 23 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index b05b361e2062..4df6f0c58dc9 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -660,7 +660,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) { ret = btrfs_lookup_bio_sums(inode, comp_bio, - sums); + false, 0, sums); BUG_ON(ret); /* -ENOMEM */ } @@ -689,7 +689,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, BUG_ON(ret); /* -ENOMEM */ if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) { - ret = btrfs_lookup_bio_sums(inode, comp_bio, sums); + ret = btrfs_lookup_bio_sums(inode, comp_bio, false, 0, sums); BUG_ON(ret); /* -ENOMEM */ } diff --git a/fs/btrfs/ctree.h 
b/fs/btrfs/ctree.h index fe2b8765d9e6..4bc40bf49b0e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2787,9 +2787,7 @@ struct btrfs_dio_private; int btrfs_del_csums(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, u64 bytenr, u64 len); blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, - u8 *dst); -blk_status_t btrfs_lookup_bio_sums_dio(struct inode *inode, struct bio *bio, - u64 logical_offset); + bool at_offset, u64 offset, u8 *dst); int btrfs_insert_file_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 objectid, u64 pos, diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 1a599f50837b..a87c40502267 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -148,8 +148,21 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans, return ret; } -static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, - u64 logical_offset, u8 *dst, int dio) +/** + * btrfs_lookup_bio_sums - Look up checksums for a bio. + * @inode: inode that the bio is for. + * @bio: bio embedded in btrfs_io_bio. + * @at_offset: If true, look up checksums for the extent at @c offset. + * If false, use the page offsets from the bio. + * @offset: If @at_offset is true, offset in file to look up checksums for. + * Ignored otherwise. + * @dst: Buffer of size btrfs_super_csum_size() used to return checksum. If + * NULL, the checksum is returned in btrfs_io_bio(bio)->csum instead. + * + * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise. 
+ */ +blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, + bool at_offset, u64 offset, u8 *dst) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct bio_vec bvec; @@ -159,7 +172,6 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; struct btrfs_path *path; u8 *csum; - u64 offset = 0; u64 item_start_offset = 0; u64 item_last_offset = 0; u64 disk_bytenr; @@ -205,15 +217,13 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio } disk_bytenr = (u64)bio->bi_iter.bi_sector << 9; - if (dio) - offset = logical_offset; bio_for_each_segment(bvec, bio, iter) { page_bytes_left = bvec.bv_len; if (count) goto next; - if (!dio) + if (!at_offset) offset = page_offset(bvec.bv_page) + bvec.bv_offset; count = btrfs_find_ordered_sum(inode, offset, disk_bytenr, csum, nblocks); @@ -285,18 +295,7 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio WARN_ON_ONCE(count); btrfs_free_path(path); - return 0; -} - -blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, - u8 *dst) -{ - return __btrfs_lookup_bio_sums(inode, bio, 0, dst, 0); -} - -blk_status_t btrfs_lookup_bio_sums_dio(struct inode *inode, struct bio *bio, u64 offset) -{ - return __btrfs_lookup_bio_sums(inode, bio, offset, NULL, 1); + return BLK_STS_OK; } int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 015910079e73..ad5bffb24199 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2090,7 +2090,7 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio, bio_flags); goto out; } else if (!skip_sum) { - ret = btrfs_lookup_bio_sums(inode, bio, NULL); + ret = btrfs_lookup_bio_sums(inode, bio, false, 0, NULL); if (ret) goto out; } @@ -8332,8 +8332,8 @@ static inline blk_status_t btrfs_lookup_and_bind_dio_csum(struct inode *inode, 
* contention. */ if (dip->logical_offset == file_offset) { - ret = btrfs_lookup_bio_sums_dio(inode, dip->orig_bio, - file_offset); + ret = btrfs_lookup_bio_sums(inode, dip->orig_bio, true, + file_offset, NULL); if (ret) return ret; } From patchwork Wed Nov 20 18:24:25 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254633 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2E89314DB for ; Wed, 20 Nov 2019 18:25:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 0DAB3206DA for ; Wed, 20 Nov 2019 18:25:03 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="pa4rOzYV" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728539AbfKTSZC (ORCPT ); Wed, 20 Nov 2019 13:25:02 -0500 Received: from mail-pf1-f195.google.com ([209.85.210.195]:34748 "EHLO mail-pf1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728496AbfKTSY6 (ORCPT ); Wed, 20 Nov 2019 13:24:58 -0500 Received: by mail-pf1-f195.google.com with SMTP id n13so153418pff.1 for ; Wed, 20 Nov 2019 10:24:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=UITnFvX5GeaGsArn5VbTQFbb6fm1+utex+H31BrX7lo=; b=pa4rOzYVpCl1ER8VDX3Dpc8SFMEJYr/HjJcgJGnRllnNNBz7ogCa52/eZATRyF3bFe J5S4B3dUOID0x0xQHW5r5bYurYacDKknQYpZUWjJCgDXyW8G81M37fzPXj4zzscKPdbm DlGlKF6MmC9BjpyyKUg+2JMstIlbKzJq70HdwYyMRukIg0SOcreh3IzP4kDooJGleJkK mi37w6UwJ2ZOtnwseDOGewkPPl2psssSbRhxK3invITC4WAfB9zkCQZjyLpaRALxJpoz 
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [RFC PATCH v3 05/12] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio()
Date: Wed, 20 Nov 2019 10:24:25 -0800

btrfs_csum_one_bio() loops over each sector in the bio while keeping a cursor of its current logical position in the file in order to look up the ordered extent to add the checksums to.
However, this doesn't make much sense for compressed extents, as a sector on disk does not correspond to a sector of decompressed file data. It happens to work because 1) the compressed bio always covers one ordered extent and 2) the size of the bio is always less than the size of the ordered extent. However, the second point will not always be true for encoded writes. Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that it can assume that the bio only covers one ordered extent. Since we're already changing the signature, let's make contig bool instead of int, too. Signed-off-by: Omar Sandoval --- fs/btrfs/compression.c | 5 +++-- fs/btrfs/ctree.h | 2 +- fs/btrfs/file-item.c | 19 +++++++++++-------- fs/btrfs/inode.c | 8 ++++---- 4 files changed, 19 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 4df6f0c58dc9..05b6e404a291 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -374,7 +374,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start, BUG_ON(ret); /* -ENOMEM */ if (!skip_sum) { - ret = btrfs_csum_one_bio(inode, bio, start, 1); + ret = btrfs_csum_one_bio(inode, bio, start, + true, true); BUG_ON(ret); /* -ENOMEM */ } @@ -405,7 +406,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start, BUG_ON(ret); /* -ENOMEM */ if (!skip_sum) { - ret = btrfs_csum_one_bio(inode, bio, start, 1); + ret = btrfs_csum_one_bio(inode, bio, start, true, true); BUG_ON(ret); /* -ENOMEM */ } diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 4bc40bf49b0e..c32741879088 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2802,7 +2802,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_ordered_sum *sums); blk_status_t btrfs_csum_one_bio(struct inode *inode, struct bio *bio, - u64 file_start, int contig); + u64 file_start, bool contig, bool one_ordered); int btrfs_lookup_csums_range(struct btrfs_root 
*root, u64 start, u64 end, struct list_head *list, int search_commit); void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode, diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index a87c40502267..c95772949b00 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -423,13 +423,14 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, * @inode: Owner of the data inside the bio * @bio: Contains the data to be checksummed * @file_start: offset in file this bio begins to describe - * @contig: Boolean. If true/1 means all bio vecs in this bio are - * contiguous and they begin at @file_start in the file. False/0 - * means this bio can contains potentially discontigous bio vecs - * so the logical offset of each should be calculated separately. + * @contig: If true, all bio vecs in @bio are contiguous and they begin at + * @file_start in the file. If false, @bio may contain + * discontiguous bio vecs, so the logical offset of each should be + * calculated separately (@file_start is ignored). + * @one_ordered: If true, @bio only refers to one ordered extent.
*/ blk_status_t btrfs_csum_one_bio(struct inode *inode, struct bio *bio, - u64 file_start, int contig) + u64 file_start, bool contig, bool one_ordered) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); @@ -482,8 +483,9 @@ blk_status_t btrfs_csum_one_bio(struct inode *inode, struct bio *bio, - 1); for (i = 0; i < nr_sectors; i++) { - if (offset >= ordered->file_offset + ordered->len || - offset < ordered->file_offset) { + if (!one_ordered && + (offset >= ordered->file_offset + ordered->len || + offset < ordered->file_offset)) { unsigned long bytes_left; sums->len = this_sum_bytes; @@ -515,7 +517,8 @@ blk_status_t btrfs_csum_one_bio(struct inode *inode, struct bio *bio, kunmap_atomic(data); crypto_shash_final(shash, (char *)(sums->sums + index)); index += csum_size; - offset += fs_info->sectorsize; + if (!one_ordered) + offset += fs_info->sectorsize; this_sum_bytes += fs_info->sectorsize; total_bytes += fs_info->sectorsize; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ad5bffb24199..4c1ed6dddfd8 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2039,7 +2039,7 @@ static blk_status_t btrfs_submit_bio_start(void *private_data, struct bio *bio, struct inode *inode = private_data; blk_status_t ret = 0; - ret = btrfs_csum_one_bio(inode, bio, 0, 0); + ret = btrfs_csum_one_bio(inode, bio, 0, false, false); BUG_ON(ret); /* -ENOMEM */ return 0; } @@ -2104,7 +2104,7 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio, 0, inode, btrfs_submit_bio_start); goto out; } else if (!skip_sum) { - ret = btrfs_csum_one_bio(inode, bio, 0, 0); + ret = btrfs_csum_one_bio(inode, bio, 0, false, false); if (ret) goto out; } @@ -8272,7 +8272,7 @@ static blk_status_t btrfs_submit_bio_start_direct_io(void *private_data, { struct inode *inode = private_data; blk_status_t ret; - ret = btrfs_csum_one_bio(inode, bio, offset, 1); + ret = btrfs_csum_one_bio(inode, bio, offset, true, false); 
BUG_ON(ret); /* -ENOMEM */ return 0; } @@ -8379,7 +8379,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio, * If we aren't doing async submit, calculate the csum of the * bio now. */ - ret = btrfs_csum_one_bio(inode, bio, file_offset, 1); + ret = btrfs_csum_one_bio(inode, bio, file_offset, true, false); if (ret) goto err; } else { From patchwork Wed Nov 20 18:24:26 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254631 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BC7E9109A for ; Wed, 20 Nov 2019 18:25:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 771E3208C0 for ; Wed, 20 Nov 2019 18:25:02 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="VOfTPAcr" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728531AbfKTSZB (ORCPT ); Wed, 20 Nov 2019 13:25:01 -0500 Received: from mail-pg1-f195.google.com ([209.85.215.195]:42614 "EHLO mail-pg1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728459AbfKTSZA (ORCPT ); Wed, 20 Nov 2019 13:25:00 -0500 Received: by mail-pg1-f195.google.com with SMTP id q17so122369pgt.9 for ; Wed, 20 Nov 2019 10:24:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=8S0n7UBjrH5Tk6ePskIrfFRcf0GpPS7vFCDTAfm/wkg=; b=VOfTPAcrnHu37GUZRw0cAG2EjoYkYIaUi3XOlZr9DPHhOXXPsRJdEb1GC/p6grcx60 huYC+DyoMHHdQfcnGDcXvihrSrodx5sEpU88XDUiqYq+qFbenzPHpy43ybuyBrfYpNtO 
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [RFC PATCH v3 06/12] btrfs: remove dead snapshot-aware defrag code
Date: Wed, 20 Nov 2019 10:24:26 -0800

Snapshot-aware defrag has been disabled since commit 8101c8dbf624 ("Btrfs:
disable snapshot aware defrag for now") almost 6 years ago. Let's remove the dead code. If someone is up to the task of bringing it back, they can dig it up from git. This is logically a revert of commit 38c227d87c49 ("Btrfs: snapshot-aware defrag") except that now we have to clear the EXTENT_DEFRAG bit to avoid need_force_cow() returning true forever. Signed-off-by: Omar Sandoval Reviewed-by: Nikolay Borisov --- fs/btrfs/inode.c | 695 +---------------------------------------------- 1 file changed, 11 insertions(+), 684 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4c1ed6dddfd8..707b4d86409f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -44,7 +44,6 @@ #include "locking.h" #include "free-space-cache.h" #include "inode-map.h" -#include "backref.h" #include "props.h" #include "qgroup.h" #include "delalloc-space.h" @@ -2353,649 +2352,6 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, return ret; } -/* snapshot-aware defrag */ -struct sa_defrag_extent_backref { - struct rb_node node; - struct old_sa_defrag_extent *old; - u64 root_id; - u64 inum; - u64 file_pos; - u64 extent_offset; - u64 num_bytes; - u64 generation; -}; - -struct old_sa_defrag_extent { - struct list_head list; - struct new_sa_defrag_extent *new; - - u64 extent_offset; - u64 bytenr; - u64 offset; - u64 len; - int count; -}; - -struct new_sa_defrag_extent { - struct rb_root root; - struct list_head head; - struct btrfs_path *path; - struct inode *inode; - u64 file_pos; - u64 len; - u64 bytenr; - u64 disk_len; - u8 compress_type; -}; - -static int backref_comp(struct sa_defrag_extent_backref *b1, - struct sa_defrag_extent_backref *b2) -{ - if (b1->root_id < b2->root_id) - return -1; - else if (b1->root_id > b2->root_id) - return 1; - - if (b1->inum < b2->inum) - return -1; - else if (b1->inum > b2->inum) - return 1; - - if (b1->file_pos < b2->file_pos) - return -1; - else if (b1->file_pos > b2->file_pos) - return 1; - - /* - * 
[------------------------------] ===> (a range of space) - * |<--->| |<---->| =============> (fs/file tree A) - * |<---------------------------->| ===> (fs/file tree B) - * - * A range of space can refer to two file extents in one tree while - * refer to only one file extent in another tree. - * - * So we may process a disk offset more than one time(two extents in A) - * and locate at the same extent(one extent in B), then insert two same - * backrefs(both refer to the extent in B). - */ - return 0; -} - -static void backref_insert(struct rb_root *root, - struct sa_defrag_extent_backref *backref) -{ - struct rb_node **p = &root->rb_node; - struct rb_node *parent = NULL; - struct sa_defrag_extent_backref *entry; - int ret; - - while (*p) { - parent = *p; - entry = rb_entry(parent, struct sa_defrag_extent_backref, node); - - ret = backref_comp(backref, entry); - if (ret < 0) - p = &(*p)->rb_left; - else - p = &(*p)->rb_right; - } - - rb_link_node(&backref->node, parent, p); - rb_insert_color(&backref->node, root); -} - -/* - * Note the backref might has changed, and in this case we just return 0. 
- */ -static noinline int record_one_backref(u64 inum, u64 offset, u64 root_id, - void *ctx) -{ - struct btrfs_file_extent_item *extent; - struct old_sa_defrag_extent *old = ctx; - struct new_sa_defrag_extent *new = old->new; - struct btrfs_path *path = new->path; - struct btrfs_key key; - struct btrfs_root *root; - struct sa_defrag_extent_backref *backref; - struct extent_buffer *leaf; - struct inode *inode = new->inode; - struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); - int slot; - int ret; - u64 extent_offset; - u64 num_bytes; - - if (BTRFS_I(inode)->root->root_key.objectid == root_id && - inum == btrfs_ino(BTRFS_I(inode))) - return 0; - - key.objectid = root_id; - key.type = BTRFS_ROOT_ITEM_KEY; - key.offset = (u64)-1; - - root = btrfs_read_fs_root_no_name(fs_info, &key); - if (IS_ERR(root)) { - if (PTR_ERR(root) == -ENOENT) - return 0; - WARN_ON(1); - btrfs_debug(fs_info, "inum=%llu, offset=%llu, root_id=%llu", - inum, offset, root_id); - return PTR_ERR(root); - } - - key.objectid = inum; - key.type = BTRFS_EXTENT_DATA_KEY; - if (offset > (u64)-1 << 32) - key.offset = 0; - else - key.offset = offset; - - ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); - if (WARN_ON(ret < 0)) - return ret; - ret = 0; - - while (1) { - cond_resched(); - - leaf = path->nodes[0]; - slot = path->slots[0]; - - if (slot >= btrfs_header_nritems(leaf)) { - ret = btrfs_next_leaf(root, path); - if (ret < 0) { - goto out; - } else if (ret > 0) { - ret = 0; - goto out; - } - continue; - } - - path->slots[0]++; - - btrfs_item_key_to_cpu(leaf, &key, slot); - - if (key.objectid > inum) - goto out; - - if (key.objectid < inum || key.type != BTRFS_EXTENT_DATA_KEY) - continue; - - extent = btrfs_item_ptr(leaf, slot, - struct btrfs_file_extent_item); - - if (btrfs_file_extent_disk_bytenr(leaf, extent) != old->bytenr) - continue; - - /* - * 'offset' refers to the exact key.offset, - * NOT the 'offset' field in btrfs_extent_data_ref, ie. - * (key.offset - extent_offset). 
- */ - if (key.offset != offset) - continue; - - extent_offset = btrfs_file_extent_offset(leaf, extent); - num_bytes = btrfs_file_extent_num_bytes(leaf, extent); - - if (extent_offset >= old->extent_offset + old->offset + - old->len || extent_offset + num_bytes <= - old->extent_offset + old->offset) - continue; - break; - } - - backref = kmalloc(sizeof(*backref), GFP_NOFS); - if (!backref) { - ret = -ENOENT; - goto out; - } - - backref->root_id = root_id; - backref->inum = inum; - backref->file_pos = offset; - backref->num_bytes = num_bytes; - backref->extent_offset = extent_offset; - backref->generation = btrfs_file_extent_generation(leaf, extent); - backref->old = old; - backref_insert(&new->root, backref); - old->count++; -out: - btrfs_release_path(path); - WARN_ON(ret); - return ret; -} - -static noinline bool record_extent_backrefs(struct btrfs_path *path, - struct new_sa_defrag_extent *new) -{ - struct btrfs_fs_info *fs_info = btrfs_sb(new->inode->i_sb); - struct old_sa_defrag_extent *old, *tmp; - int ret; - - new->path = path; - - list_for_each_entry_safe(old, tmp, &new->head, list) { - ret = iterate_inodes_from_logical(old->bytenr + - old->extent_offset, fs_info, - path, record_one_backref, - old, false); - if (ret < 0 && ret != -ENOENT) - return false; - - /* no backref to be processed for this extent */ - if (!old->count) { - list_del(&old->list); - kfree(old); - } - } - - if (list_empty(&new->head)) - return false; - - return true; -} - -static int relink_is_mergable(struct extent_buffer *leaf, - struct btrfs_file_extent_item *fi, - struct new_sa_defrag_extent *new) -{ - if (btrfs_file_extent_disk_bytenr(leaf, fi) != new->bytenr) - return 0; - - if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG) - return 0; - - if (btrfs_file_extent_compression(leaf, fi) != new->compress_type) - return 0; - - if (btrfs_file_extent_encryption(leaf, fi) || - btrfs_file_extent_other_encoding(leaf, fi)) - return 0; - - return 1; -} - -/* - * Note the backref 
might has changed, and in this case we just return 0. - */ -static noinline int relink_extent_backref(struct btrfs_path *path, - struct sa_defrag_extent_backref *prev, - struct sa_defrag_extent_backref *backref) -{ - struct btrfs_file_extent_item *extent; - struct btrfs_file_extent_item *item; - struct btrfs_ordered_extent *ordered; - struct btrfs_trans_handle *trans; - struct btrfs_ref ref = { 0 }; - struct btrfs_root *root; - struct btrfs_key key; - struct extent_buffer *leaf; - struct old_sa_defrag_extent *old = backref->old; - struct new_sa_defrag_extent *new = old->new; - struct btrfs_fs_info *fs_info = btrfs_sb(new->inode->i_sb); - struct inode *inode; - struct extent_state *cached = NULL; - int ret = 0; - u64 start; - u64 len; - u64 lock_start; - u64 lock_end; - bool merge = false; - int index; - - if (prev && prev->root_id == backref->root_id && - prev->inum == backref->inum && - prev->file_pos + prev->num_bytes == backref->file_pos) - merge = true; - - /* step 1: get root */ - key.objectid = backref->root_id; - key.type = BTRFS_ROOT_ITEM_KEY; - key.offset = (u64)-1; - - index = srcu_read_lock(&fs_info->subvol_srcu); - - root = btrfs_read_fs_root_no_name(fs_info, &key); - if (IS_ERR(root)) { - srcu_read_unlock(&fs_info->subvol_srcu, index); - if (PTR_ERR(root) == -ENOENT) - return 0; - return PTR_ERR(root); - } - - if (btrfs_root_readonly(root)) { - srcu_read_unlock(&fs_info->subvol_srcu, index); - return 0; - } - - /* step 2: get inode */ - key.objectid = backref->inum; - key.type = BTRFS_INODE_ITEM_KEY; - key.offset = 0; - - inode = btrfs_iget(fs_info->sb, &key, root, NULL); - if (IS_ERR(inode)) { - srcu_read_unlock(&fs_info->subvol_srcu, index); - return 0; - } - - srcu_read_unlock(&fs_info->subvol_srcu, index); - - /* step 3: relink backref */ - lock_start = backref->file_pos; - lock_end = backref->file_pos + backref->num_bytes - 1; - lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, lock_end, - &cached); - - ordered = 
btrfs_lookup_first_ordered_extent(inode, lock_end); - if (ordered) { - btrfs_put_ordered_extent(ordered); - goto out_unlock; - } - - trans = btrfs_join_transaction(root); - if (IS_ERR(trans)) { - ret = PTR_ERR(trans); - goto out_unlock; - } - - key.objectid = backref->inum; - key.type = BTRFS_EXTENT_DATA_KEY; - key.offset = backref->file_pos; - - ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); - if (ret < 0) { - goto out_free_path; - } else if (ret > 0) { - ret = 0; - goto out_free_path; - } - - extent = btrfs_item_ptr(path->nodes[0], path->slots[0], - struct btrfs_file_extent_item); - - if (btrfs_file_extent_generation(path->nodes[0], extent) != - backref->generation) - goto out_free_path; - - btrfs_release_path(path); - - start = backref->file_pos; - if (backref->extent_offset < old->extent_offset + old->offset) - start += old->extent_offset + old->offset - - backref->extent_offset; - - len = min(backref->extent_offset + backref->num_bytes, - old->extent_offset + old->offset + old->len); - len -= max(backref->extent_offset, old->extent_offset + old->offset); - - ret = btrfs_drop_extents(trans, root, inode, start, - start + len, 1); - if (ret) - goto out_free_path; -again: - key.objectid = btrfs_ino(BTRFS_I(inode)); - key.type = BTRFS_EXTENT_DATA_KEY; - key.offset = start; - - path->leave_spinning = 1; - if (merge) { - struct btrfs_file_extent_item *fi; - u64 extent_len; - struct btrfs_key found_key; - - ret = btrfs_search_slot(trans, root, &key, path, 0, 1); - if (ret < 0) - goto out_free_path; - - path->slots[0]--; - leaf = path->nodes[0]; - btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]); - - fi = btrfs_item_ptr(leaf, path->slots[0], - struct btrfs_file_extent_item); - extent_len = btrfs_file_extent_num_bytes(leaf, fi); - - if (extent_len + found_key.offset == start && - relink_is_mergable(leaf, fi, new)) { - btrfs_set_file_extent_num_bytes(leaf, fi, - extent_len + len); - btrfs_mark_buffer_dirty(leaf); - inode_add_bytes(inode, len); - - ret = 1; 
- goto out_free_path; - } else { - merge = false; - btrfs_release_path(path); - goto again; - } - } - - ret = btrfs_insert_empty_item(trans, root, path, &key, - sizeof(*extent)); - if (ret) { - btrfs_abort_transaction(trans, ret); - goto out_free_path; - } - - leaf = path->nodes[0]; - item = btrfs_item_ptr(leaf, path->slots[0], - struct btrfs_file_extent_item); - btrfs_set_file_extent_disk_bytenr(leaf, item, new->bytenr); - btrfs_set_file_extent_disk_num_bytes(leaf, item, new->disk_len); - btrfs_set_file_extent_offset(leaf, item, start - new->file_pos); - btrfs_set_file_extent_num_bytes(leaf, item, len); - btrfs_set_file_extent_ram_bytes(leaf, item, new->len); - btrfs_set_file_extent_generation(leaf, item, trans->transid); - btrfs_set_file_extent_type(leaf, item, BTRFS_FILE_EXTENT_REG); - btrfs_set_file_extent_compression(leaf, item, new->compress_type); - btrfs_set_file_extent_encryption(leaf, item, 0); - btrfs_set_file_extent_other_encoding(leaf, item, 0); - - btrfs_mark_buffer_dirty(leaf); - inode_add_bytes(inode, len); - btrfs_release_path(path); - - btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF, new->bytenr, - new->disk_len, 0); - btrfs_init_data_ref(&ref, backref->root_id, backref->inum, - new->file_pos); /* start - extent_offset */ - ret = btrfs_inc_extent_ref(trans, &ref); - if (ret) { - btrfs_abort_transaction(trans, ret); - goto out_free_path; - } - - ret = 1; -out_free_path: - btrfs_release_path(path); - path->leave_spinning = 0; - btrfs_end_transaction(trans); -out_unlock: - unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, lock_end, - &cached); - iput(inode); - return ret; -} - -static void free_sa_defrag_extent(struct new_sa_defrag_extent *new) -{ - struct old_sa_defrag_extent *old, *tmp; - - if (!new) - return; - - list_for_each_entry_safe(old, tmp, &new->head, list) { - kfree(old); - } - kfree(new); -} - -static void relink_file_extents(struct new_sa_defrag_extent *new) -{ - struct btrfs_fs_info *fs_info = btrfs_sb(new->inode->i_sb); - 
struct btrfs_path *path; - struct sa_defrag_extent_backref *backref; - struct sa_defrag_extent_backref *prev = NULL; - struct rb_node *node; - int ret; - - path = btrfs_alloc_path(); - if (!path) - return; - - if (!record_extent_backrefs(path, new)) { - btrfs_free_path(path); - goto out; - } - btrfs_release_path(path); - - while (1) { - node = rb_first(&new->root); - if (!node) - break; - rb_erase(node, &new->root); - - backref = rb_entry(node, struct sa_defrag_extent_backref, node); - - ret = relink_extent_backref(path, prev, backref); - WARN_ON(ret < 0); - - kfree(prev); - - if (ret == 1) - prev = backref; - else - prev = NULL; - cond_resched(); - } - kfree(prev); - - btrfs_free_path(path); -out: - free_sa_defrag_extent(new); - - atomic_dec(&fs_info->defrag_running); - wake_up(&fs_info->transaction_wait); -} - -static struct new_sa_defrag_extent * -record_old_file_extents(struct inode *inode, - struct btrfs_ordered_extent *ordered) -{ - struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); - struct btrfs_root *root = BTRFS_I(inode)->root; - struct btrfs_path *path; - struct btrfs_key key; - struct old_sa_defrag_extent *old; - struct new_sa_defrag_extent *new; - int ret; - - new = kmalloc(sizeof(*new), GFP_NOFS); - if (!new) - return NULL; - - new->inode = inode; - new->file_pos = ordered->file_offset; - new->len = ordered->len; - new->bytenr = ordered->start; - new->disk_len = ordered->disk_len; - new->compress_type = ordered->compress_type; - new->root = RB_ROOT; - INIT_LIST_HEAD(&new->head); - - path = btrfs_alloc_path(); - if (!path) - goto out_kfree; - - key.objectid = btrfs_ino(BTRFS_I(inode)); - key.type = BTRFS_EXTENT_DATA_KEY; - key.offset = new->file_pos; - - ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); - if (ret < 0) - goto out_free_path; - if (ret > 0 && path->slots[0] > 0) - path->slots[0]--; - - /* find out all the old extents for the file range */ - while (1) { - struct btrfs_file_extent_item *extent; - struct extent_buffer *l; - int 
slot; - u64 num_bytes; - u64 offset; - u64 end; - u64 disk_bytenr; - u64 extent_offset; - - l = path->nodes[0]; - slot = path->slots[0]; - - if (slot >= btrfs_header_nritems(l)) { - ret = btrfs_next_leaf(root, path); - if (ret < 0) - goto out_free_path; - else if (ret > 0) - break; - continue; - } - - btrfs_item_key_to_cpu(l, &key, slot); - - if (key.objectid != btrfs_ino(BTRFS_I(inode))) - break; - if (key.type != BTRFS_EXTENT_DATA_KEY) - break; - if (key.offset >= new->file_pos + new->len) - break; - - extent = btrfs_item_ptr(l, slot, struct btrfs_file_extent_item); - - num_bytes = btrfs_file_extent_num_bytes(l, extent); - if (key.offset + num_bytes < new->file_pos) - goto next; - - disk_bytenr = btrfs_file_extent_disk_bytenr(l, extent); - if (!disk_bytenr) - goto next; - - extent_offset = btrfs_file_extent_offset(l, extent); - - old = kmalloc(sizeof(*old), GFP_NOFS); - if (!old) - goto out_free_path; - - offset = max(new->file_pos, key.offset); - end = min(new->file_pos + new->len, key.offset + num_bytes); - - old->bytenr = disk_bytenr; - old->extent_offset = extent_offset; - old->offset = offset - key.offset; - old->len = end - offset; - old->new = new; - old->count = 0; - list_add_tail(&old->list, &new->head); -next: - path->slots[0]++; - cond_resched(); - } - - btrfs_free_path(path); - atomic_inc(&fs_info->defrag_running); - - return new; - -out_free_path: - btrfs_free_path(path); -out_kfree: - free_sa_defrag_extent(new); - return NULL; -} - static void btrfs_release_delalloc_bytes(struct btrfs_fs_info *fs_info, u64 start, u64 len) { @@ -3023,7 +2379,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) struct btrfs_trans_handle *trans = NULL; struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; struct extent_state *cached_state = NULL; - struct new_sa_defrag_extent *new = NULL; int compress_type = 0; int ret = 0; u64 logical_len = ordered_extent->len; @@ -3032,6 +2387,7 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent) bool range_locked = false; bool clear_new_delalloc_bytes = false; bool clear_reserved_extent = true; + unsigned int clear_bits; if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) && @@ -3090,20 +2446,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) ordered_extent->file_offset + ordered_extent->len - 1, &cached_state); - ret = test_range_bit(io_tree, ordered_extent->file_offset, - ordered_extent->file_offset + ordered_extent->len - 1, - EXTENT_DEFRAG, 0, cached_state); - if (ret) { - u64 last_snapshot = btrfs_root_last_snapshot(&root->root_item); - if (0 && last_snapshot >= BTRFS_I(inode)->generation) - /* the inode is shared */ - new = record_old_file_extents(inode, ordered_extent); - - clear_extent_bit(io_tree, ordered_extent->file_offset, - ordered_extent->file_offset + ordered_extent->len - 1, - EXTENT_DEFRAG, 0, 0, &cached_state); - } - if (nolock) trans = btrfs_join_transaction_nolock(root); else @@ -3164,21 +2506,16 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) } ret = 0; out: - if (range_locked || clear_new_delalloc_bytes) { - unsigned int clear_bits = 0; - - if (range_locked) - clear_bits |= EXTENT_LOCKED; - if (clear_new_delalloc_bytes) - clear_bits |= EXTENT_DELALLOC_NEW; - clear_extent_bit(&BTRFS_I(inode)->io_tree, - ordered_extent->file_offset, - ordered_extent->file_offset + - ordered_extent->len - 1, - clear_bits, - (clear_bits & EXTENT_LOCKED) ? 1 : 0, - 0, &cached_state); - } + clear_bits = EXTENT_DEFRAG; + if (range_locked) + clear_bits |= EXTENT_LOCKED; + if (clear_new_delalloc_bytes) + clear_bits |= EXTENT_DELALLOC_NEW; + clear_extent_bit(&BTRFS_I(inode)->io_tree, + ordered_extent->file_offset, + ordered_extent->file_offset + ordered_extent->len - 1, + clear_bits, (clear_bits & EXTENT_LOCKED) ? 
1 : 0, 0,
+			 &cached_state);
 
 	if (trans)
 		btrfs_end_transaction(trans);
@@ -3222,16 +2559,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	 */
 	btrfs_remove_ordered_extent(inode, ordered_extent);
 
-	/* for snapshot-aware defrag */
-	if (new) {
-		if (ret) {
-			free_sa_defrag_extent(new);
-			atomic_dec(&fs_info->defrag_running);
-		} else {
-			relink_file_extents(new);
-		}
-	}
-
 	/* once for us */
 	btrfs_put_ordered_extent(ordered_extent);
 	/* once for the tree */

From patchwork Wed Nov 20 18:24:27 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 11254665
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [RFC PATCH v3 07/12] btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item
Date: Wed, 20 Nov 2019 10:24:27 -0800
Message-Id: <693ec6051dcda761616283cc3adb5260be0de85f.1574273658.git.osandov@fb.com>
X-Mailer: git-send-email 2.24.0
In-Reply-To:
References:
MIME-Version: 1.0
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Omar Sandoval

ordered->start, ordered->len, and ordered->disk_len correspond to
fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively.
It's confusing to translate between the two naming schemes. Since a
btrfs_ordered_extent is basically a pending btrfs_file_extent_item,
let's make the former use the naming from the latter. Note that I didn't
touch the names in tracepoints just in case there are scripts depending
on the current naming.

Signed-off-by: Omar Sandoval
---
 fs/btrfs/file-item.c         |  2 +-
 fs/btrfs/file.c              |  6 ++--
 fs/btrfs/inode.c             | 67 ++++++++++++++++------------------
 fs/btrfs/ordered-data.c      | 69 ++++++++++++++++++------------------
 fs/btrfs/ordered-data.h      | 26 +++++++-------
 fs/btrfs/relocation.c        |  5 +--
 include/trace/events/btrfs.h |  6 ++--
 7 files changed, 89 insertions(+), 92 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index c95772949b00..aae674b37bed 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -484,7 +484,7 @@ blk_status_t btrfs_csum_one_bio(struct inode *inode, struct bio *bio,
 
 	for (i = 0; i < nr_sectors; i++) {
 		if (!one_ordered &&
-		    (offset >= ordered->file_offset + ordered->len ||
+		    (offset >= ordered->file_offset + ordered->num_bytes ||
 		     offset < ordered->file_offset)) {
 			unsigned long bytes_left;
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 435a502a3226..34c1a2284e03 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1503,7 +1503,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
 		ordered = btrfs_lookup_ordered_range(inode, start_pos,
 						     last_pos - start_pos + 1);
 		if (ordered &&
-		    ordered->file_offset + ordered->len > start_pos &&
+		    ordered->file_offset + ordered->num_bytes > start_pos &&
 		    ordered->file_offset <= last_pos) {
 			unlock_extent_cached(&inode->io_tree, start_pos,
 					     last_pos, cached_state);
@@
-2428,7 +2428,7 @@ static int btrfs_punch_hole_lock_range(struct inode *inode, * we need to try again. */ if ((!ordered || - (ordered->file_offset + ordered->len <= lockstart || + (ordered->file_offset + ordered->num_bytes <= lockstart || ordered->file_offset > lockend)) && !filemap_range_has_page(inode->i_mapping, lockstart, lockend)) { @@ -3250,7 +3250,7 @@ static long btrfs_fallocate(struct file *file, int mode, ordered = btrfs_lookup_first_ordered_extent(inode, locked_end); if (ordered && - ordered->file_offset + ordered->len > alloc_start && + ordered->file_offset + ordered->num_bytes > alloc_start && ordered->file_offset < alloc_end) { btrfs_put_ordered_extent(ordered); unlock_extent_cached(&BTRFS_I(inode)->io_tree, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 707b4d86409f..62d6aaccc202 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2379,9 +2379,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) struct btrfs_trans_handle *trans = NULL; struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; struct extent_state *cached_state = NULL; + u64 start, end; int compress_type = 0; int ret = 0; - u64 logical_len = ordered_extent->len; + u64 logical_len = ordered_extent->num_bytes; bool nolock; bool truncated = false; bool range_locked = false; @@ -2389,6 +2390,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) bool clear_reserved_extent = true; unsigned int clear_bits; + start = ordered_extent->file_offset; + end = start + ordered_extent->num_bytes - 1; + if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags)) @@ -2401,10 +2405,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) goto out; } - btrfs_free_io_failure_record(BTRFS_I(inode), - ordered_extent->file_offset, - ordered_extent->file_offset + - ordered_extent->len - 
1); + btrfs_free_io_failure_record(BTRFS_I(inode), start, end); if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) { truncated = true; @@ -2422,8 +2423,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) * space for NOCOW range. * As NOCOW won't cause a new delayed ref, just free the space */ - btrfs_qgroup_free_data(inode, NULL, ordered_extent->file_offset, - ordered_extent->len); + btrfs_qgroup_free_data(inode, NULL, start, + ordered_extent->num_bytes); btrfs_ordered_update_i_size(inode, 0, ordered_extent); if (nolock) trans = btrfs_join_transaction_nolock(root); @@ -2442,9 +2443,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) } range_locked = true; - lock_extent_bits(io_tree, ordered_extent->file_offset, - ordered_extent->file_offset + ordered_extent->len - 1, - &cached_state); + lock_extent_bits(io_tree, start, end, &cached_state); if (nolock) trans = btrfs_join_transaction_nolock(root); @@ -2462,31 +2461,30 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) compress_type = ordered_extent->compress_type; if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) { BUG_ON(compress_type); - btrfs_qgroup_free_data(inode, NULL, ordered_extent->file_offset, - ordered_extent->len); + btrfs_qgroup_free_data(inode, NULL, start, + ordered_extent->num_bytes); ret = btrfs_mark_extent_written(trans, BTRFS_I(inode), ordered_extent->file_offset, ordered_extent->file_offset + logical_len); } else { BUG_ON(root == fs_info->tree_root); - ret = insert_reserved_file_extent(trans, inode, - ordered_extent->file_offset, - ordered_extent->start, - ordered_extent->disk_len, + ret = insert_reserved_file_extent(trans, inode, start, + ordered_extent->disk_bytenr, + ordered_extent->disk_num_bytes, logical_len, logical_len, compress_type, 0, 0, BTRFS_FILE_EXTENT_REG); if (!ret) { clear_reserved_extent = false; btrfs_release_delalloc_bytes(fs_info, - ordered_extent->start, - 
ordered_extent->disk_len); + ordered_extent->disk_bytenr, + ordered_extent->disk_num_bytes); } } unpin_extent_cache(&BTRFS_I(inode)->extent_tree, - ordered_extent->file_offset, ordered_extent->len, - trans->transid); + ordered_extent->file_offset, + ordered_extent->num_bytes, trans->transid); if (ret < 0) { btrfs_abort_transaction(trans, ret); goto out; @@ -2511,27 +2509,23 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) clear_bits |= EXTENT_LOCKED; if (clear_new_delalloc_bytes) clear_bits |= EXTENT_DELALLOC_NEW; - clear_extent_bit(&BTRFS_I(inode)->io_tree, - ordered_extent->file_offset, - ordered_extent->file_offset + ordered_extent->len - 1, - clear_bits, (clear_bits & EXTENT_LOCKED) ? 1 : 0, 0, + clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, clear_bits, + (clear_bits & EXTENT_LOCKED) ? 1 : 0, 0, &cached_state); if (trans) btrfs_end_transaction(trans); if (ret || truncated) { - u64 start, end; + u64 unwritten_start = start; if (truncated) - start = ordered_extent->file_offset + logical_len; - else - start = ordered_extent->file_offset; - end = ordered_extent->file_offset + ordered_extent->len - 1; - clear_extent_uptodate(io_tree, start, end, NULL); + unwritten_start += logical_len; + clear_extent_uptodate(io_tree, unwritten_start, end, NULL); /* Drop the cache for the part of the extent we didn't write. 
*/ - btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0); + btrfs_drop_extent_cache(BTRFS_I(inode), unwritten_start, end, + 0); /* * If the ordered extent had an IOERR or something else went @@ -2548,11 +2542,11 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) btrfs_free_reserved_extent(fs_info, - ordered_extent->start, - ordered_extent->disk_len, 1); + ordered_extent->disk_bytenr, + ordered_extent->disk_num_bytes, + 1); } - /* * This needs to be done to make sure anybody waiting knows we are done * updating everything for this ordered extent. @@ -8173,7 +8167,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start, page_end - start + 1); if (ordered) { - end = min(page_end, ordered->file_offset + ordered->len - 1); + end = min(page_end, + ordered->file_offset + ordered->num_bytes - 1); /* * IO on this page will never be started, so we need * to account for any ordered extents now @@ -8702,7 +8697,7 @@ void btrfs_destroy_inode(struct inode *inode) else { btrfs_err(fs_info, "found ordered extent %llu %llu on inode cleanup", - ordered->file_offset, ordered->len); + ordered->file_offset, ordered->num_bytes); btrfs_remove_ordered_extent(inode, ordered); btrfs_put_ordered_extent(ordered); btrfs_put_ordered_extent(ordered); diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 24b6c72b9a59..94e2485006ab 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -20,9 +20,9 @@ static struct kmem_cache *btrfs_ordered_extent_cache; static u64 entry_end(struct btrfs_ordered_extent *entry) { - if (entry->file_offset + entry->len < entry->file_offset) + if (entry->file_offset + entry->num_bytes < entry->file_offset) return (u64)-1; - return entry->file_offset + entry->len; + return entry->file_offset + entry->num_bytes; 
} /* returns NULL if the insertion worked, or it returns the node it did find @@ -120,7 +120,7 @@ static struct rb_node *__tree_search(struct rb_root *root, u64 file_offset, static int offset_in_entry(struct btrfs_ordered_extent *entry, u64 file_offset) { if (file_offset < entry->file_offset || - entry->file_offset + entry->len <= file_offset) + entry->file_offset + entry->num_bytes <= file_offset) return 0; return 1; } @@ -129,7 +129,7 @@ static int range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset, u64 len) { if (file_offset + len <= entry->file_offset || - entry->file_offset + entry->len <= file_offset) + entry->file_offset + entry->num_bytes <= file_offset) return 0; return 1; } @@ -161,19 +161,14 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, } /* allocate and add a new ordered_extent into the per-inode tree. - * file_offset is the logical offset in the file - * - * start is the disk block number of an extent already reserved in the - * extent allocation tree - * - * len is the length of the extent * * The tree is given a single reference on the ordered extent that was * inserted. 
*/ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, - int type, int dio, int compress_type) + u64 disk_bytenr, u64 num_bytes, + u64 disk_num_bytes, int type, int dio, + int compress_type) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_root *root = BTRFS_I(inode)->root; @@ -187,10 +182,10 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, return -ENOMEM; entry->file_offset = file_offset; - entry->start = start; - entry->len = len; - entry->disk_len = disk_len; - entry->bytes_left = len; + entry->disk_bytenr = disk_bytenr; + entry->num_bytes = num_bytes; + entry->disk_num_bytes = disk_num_bytes; + entry->bytes_left = num_bytes; entry->inode = igrab(inode); entry->compress_type = compress_type; entry->truncated_len = (u64)-1; @@ -198,7 +193,7 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, set_bit(type, &entry->flags); if (dio) { - percpu_counter_add_batch(&fs_info->dio_bytes, len, + percpu_counter_add_batch(&fs_info->dio_bytes, num_bytes, fs_info->delalloc_batch); set_bit(BTRFS_ORDERED_DIRECT, &entry->flags); } @@ -247,27 +242,30 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, } int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, int type) + u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, + int type) { - return __btrfs_add_ordered_extent(inode, file_offset, start, len, - disk_len, type, 0, + return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, + num_bytes, disk_num_bytes, type, 0, BTRFS_COMPRESS_NONE); } int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, int type) + u64 disk_bytenr, u64 num_bytes, + u64 disk_num_bytes, int type) { - return __btrfs_add_ordered_extent(inode, file_offset, start, len, - disk_len, type, 1, + return __btrfs_add_ordered_extent(inode, 
file_offset, disk_bytenr, + num_bytes, disk_num_bytes, type, 1, BTRFS_COMPRESS_NONE); } int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, - int type, int compress_type) + u64 disk_bytenr, u64 num_bytes, + u64 disk_num_bytes, int type, + int compress_type) { - return __btrfs_add_ordered_extent(inode, file_offset, start, len, - disk_len, type, 0, + return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, + num_bytes, disk_num_bytes, type, 0, compress_type); } @@ -328,8 +326,8 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode, } dec_start = max(*file_offset, entry->file_offset); - dec_end = min(*file_offset + io_size, entry->file_offset + - entry->len); + dec_end = min(*file_offset + io_size, + entry->file_offset + entry->num_bytes); *file_offset = dec_end; if (dec_start > dec_end) { btrfs_crit(fs_info, "bad ordering dec_start %llu end %llu", @@ -471,10 +469,11 @@ void btrfs_remove_ordered_extent(struct inode *inode, btrfs_mod_outstanding_extents(btrfs_inode, -1); spin_unlock(&btrfs_inode->lock); if (root != fs_info->tree_root) - btrfs_delalloc_release_metadata(btrfs_inode, entry->len, false); + btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes, + false); if (test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) - percpu_counter_add_batch(&fs_info->dio_bytes, -entry->len, + percpu_counter_add_batch(&fs_info->dio_bytes, -entry->num_bytes, fs_info->delalloc_batch); tree = &btrfs_inode->ordered_tree; @@ -534,8 +533,8 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr, ordered = list_first_entry(&splice, struct btrfs_ordered_extent, root_extent_list); - if (range_end <= ordered->start || - ordered->start + ordered->disk_len <= range_start) { + if (range_end <= ordered->disk_bytenr || + ordered->disk_bytenr + ordered->disk_num_bytes <= range_start) { list_move_tail(&ordered->root_extent_list, &skipped); cond_resched_lock(&root->ordered_extent_lock); continue; @@ -624,7 
+623,7 @@ void btrfs_start_ordered_extent(struct inode *inode, int wait) { u64 start = entry->file_offset; - u64 end = start + entry->len - 1; + u64 end = start + entry->num_bytes - 1; trace_btrfs_ordered_extent_start(inode, entry); @@ -685,7 +684,7 @@ int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len) btrfs_put_ordered_extent(ordered); break; } - if (ordered->file_offset + ordered->len <= start) { + if (ordered->file_offset + ordered->num_bytes <= start) { btrfs_put_ordered_extent(ordered); break; } diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 5204171ea962..4a3dd80e776c 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -67,14 +67,13 @@ struct btrfs_ordered_extent { /* logical offset in the file */ u64 file_offset; - /* disk byte number */ - u64 start; - - /* ram length of the extent in bytes */ - u64 len; - - /* extent length on disk */ - u64 disk_len; + /* + * These fields directly correspond to the same fields in + * btrfs_file_extent_item. 
+ */ + u64 disk_bytenr; + u64 num_bytes; + u64 disk_num_bytes; /* number of bytes that still need writing */ u64 bytes_left; @@ -161,12 +160,15 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode, u64 *file_offset, u64 io_size, int uptodate); int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, int type); + u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, + int type); int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, int type); + u64 disk_bytenr, u64 num_bytes, + u64 disk_num_bytes, int type); int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, - int type, int compress_type); + u64 disk_bytenr, u64 num_bytes, + u64 disk_num_bytes, int type, + int compress_type); void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct inode *inode, diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 5cd42b66818c..e3cec29813ee 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -4617,7 +4617,7 @@ int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len) LIST_HEAD(list); ordered = btrfs_lookup_ordered_extent(inode, file_pos); - BUG_ON(ordered->file_offset != file_pos || ordered->len != len); + BUG_ON(ordered->file_offset != file_pos || ordered->num_bytes != len); disk_bytenr = file_pos + BTRFS_I(inode)->index_cnt; ret = btrfs_lookup_csums_range(fs_info->csum_root, disk_bytenr, @@ -4641,7 +4641,8 @@ int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len) * disk_len vs real len like with real inodes since it's all * disk length. 
*/ - new_bytenr = ordered->start + (sums->bytenr - disk_bytenr); + new_bytenr = (ordered->disk_bytenr + + (sums->bytenr - disk_bytenr)); sums->bytenr = new_bytenr; btrfs_add_ordered_sum(ordered, sums); diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 75ae1899452b..3a0f172cfc8f 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -496,9 +496,9 @@ DECLARE_EVENT_CLASS(btrfs__ordered_extent, TP_fast_assign_btrfs(btrfs_sb(inode->i_sb), __entry->ino = btrfs_ino(BTRFS_I(inode)); __entry->file_offset = ordered->file_offset; - __entry->start = ordered->start; - __entry->len = ordered->len; - __entry->disk_len = ordered->disk_len; + __entry->start = ordered->disk_bytenr; + __entry->len = ordered->num_bytes; + __entry->disk_len = ordered->disk_num_bytes; __entry->bytes_left = ordered->bytes_left; __entry->flags = ordered->flags; __entry->compress_type = ordered->compress_type; From patchwork Wed Nov 20 18:24:28 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254659 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 091C0109A for ; Wed, 20 Nov 2019 18:25:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id CB79020878 for ; Wed, 20 Nov 2019 18:25:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="oB+yGGy2" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728345AbfKTSZO (ORCPT ); Wed, 20 Nov 2019 13:25:14 -0500 Received: from mail-pj1-f66.google.com ([209.85.216.66]:46255 "EHLO mail-pj1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728560AbfKTSZE (ORCPT ); 
Wed, 20 Nov 2019 13:25:04 -0500 Received: by mail-pj1-f66.google.com with SMTP id a16so177071pjs.13 for ; Wed, 20 Nov 2019 10:25:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=A8x/o2LWQt4iJLuED+x+O4dWDi2aR7BaTDIRFGy3mvM=; b=oB+yGGy2zS8ViWxIY9kpkDVYR1BTlOqop40AV24/aGvmNCZvx5OPriTpffRnEdI9GA TuStgpqIZJz8x3Ir97Ed7S9YUiksCa9sxLauRaqUOChY5krVHuL4tGYKHxz8KMx8KPsp MAsXUu0Lzacz0rV8cxn5jcsWGlRp3n0sR8jbi9n9oczbN6wbZGtSoxwr7SiBoXRJqZKo uNbTD8jAsZ99HKqp9d6FbM7hZ2MOFHwAMYXMa/cgMTzWtFAleDeQqG82ntOt38YQ6Yw+ sUcF5s+1w3tpx1puEY5ZN5dr9dTVNwr6Z/XHbtkMCFhJzVDyWvqhiUnGG/fh+QhHAN23 ILqQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=A8x/o2LWQt4iJLuED+x+O4dWDi2aR7BaTDIRFGy3mvM=; b=MdIABkrpbonUTlHZHZs85nEOOLSN2gJe1e2Ycl2Sy50qHMansLnwIvJ3EPLuwPREq3 1td0act1vhVVtqBT4+1D8B0PiILgjIiXl0Tb+OiLOVL1sSTxC0/fAHBr2/w2K0XQVNyb 7vPSebK1alvJ7EPWRIxS0ktKpar+hOcAXM6hh7zUYHpvdNJAWV/Pn43UTGBS5xxUufE/ gYALCDklEK2YEC2QQPX9OD+P1bePnBWMeMnts4BT+ZBrtsJBc5t5DvD1VLzHv8e0sfEU eowxNuYU1JOz44bupJMxZIfi1J1ZY6xY4Wq8vQWgeW5t6Sx5HJIYv7K1e7DNRPGZOPhU xDEw== X-Gm-Message-State: APjAAAWZB7iMhtwI87xUGLuAx0qiw8tiqob84V5Y4qcoRmTr0RtRETl/ jJLEiMK18/vD65zA+3ARtiyvC5xgsSA= X-Google-Smtp-Source: APXvYqzWFwY+eb+z+hdL/dI2K+EGKA2r57grmCm/iS7yMbmUklvBD77mgchggB2vWiEqx8zvIMYykg== X-Received: by 2002:a17:902:7402:: with SMTP id g2mr4401755pll.6.1574274301549; Wed, 20 Nov 2019 10:25:01 -0800 (PST) Received: from vader.thefacebook.com ([2620:10d:c090:180::1a46]) by smtp.gmail.com with ESMTPSA id q34sm7937866pjb.15.2019.11.20.10.25.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Nov 2019 10:25:00 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, 
linux-btrfs@vger.kernel.org Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [RFC PATCH v3 08/12] btrfs: add ram_bytes and offset to btrfs_ordered_extent Date: Wed, 20 Nov 2019 10:24:28 -0800 Message-Id: X-Mailer: git-send-email 2.24.0 In-Reply-To: References: MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval Currently, we only create ordered extents when ram_bytes == num_bytes and offset == 0. However, RWF_ENCODED writes may create extents which only refer to a subset of the full unencoded extent, so we need to plumb these fields through the ordered extent infrastructure and pass them down to insert_reserved_file_extent(). Since we're changing the btrfs_add_ordered_extent* signature, let's get rid of the trivial wrappers and add a kernel-doc. Signed-off-by: Omar Sandoval Reviewed-by: Nikolay Borisov --- fs/btrfs/inode.c | 65 +++++++++++++++++++++++------------------ fs/btrfs/ordered-data.c | 65 +++++++++++++++-------------------------- fs/btrfs/ordered-data.h | 16 ++++------ 3 files changed, 67 insertions(+), 79 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 62d6aaccc202..d53580ad2c46 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -846,13 +846,12 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) goto out_free_reserve; free_extent_map(em); - ret = btrfs_add_ordered_extent_compress(inode, - async_extent->start, - ins.objectid, - async_extent->ram_size, - ins.offset, - BTRFS_ORDERED_COMPRESSED, - async_extent->compress_type); + ret = btrfs_add_ordered_extent(inode, async_extent->start, + async_extent->ram_size, + async_extent->ram_size, + ins.objectid, ins.offset, 0, + 1 << BTRFS_ORDERED_COMPRESSED, + async_extent->compress_type); if (ret) { btrfs_drop_extent_cache(BTRFS_I(inode), async_extent->start, @@ -1046,8 +1045,9 @@ static 
noinline int cow_file_range(struct inode *inode, } free_extent_map(em); - ret = btrfs_add_ordered_extent(inode, start, ins.objectid, - ram_size, cur_alloc_size, 0); + ret = btrfs_add_ordered_extent(inode, start, ram_size, ram_size, + ins.objectid, cur_alloc_size, 0, + 0, BTRFS_COMPRESS_NONE); if (ret) goto out_drop_extent_cache; @@ -1584,10 +1584,11 @@ static noinline int run_delalloc_nocow(struct inode *inode, goto error; } free_extent_map(em); - ret = btrfs_add_ordered_extent(inode, cur_offset, - disk_bytenr, num_bytes, - num_bytes, - BTRFS_ORDERED_PREALLOC); + ret = btrfs_add_ordered_extent(inode, + cur_offset, num_bytes, num_bytes, + disk_bytenr, num_bytes, 0, + 1 << BTRFS_ORDERED_PREALLOC, + BTRFS_COMPRESS_NONE); if (ret) { btrfs_drop_extent_cache(BTRFS_I(inode), cur_offset, @@ -1597,9 +1598,11 @@ static noinline int run_delalloc_nocow(struct inode *inode, } } else { ret = btrfs_add_ordered_extent(inode, cur_offset, + num_bytes, num_bytes, disk_bytenr, num_bytes, - num_bytes, - BTRFS_ORDERED_NOCOW); + 0, + 1 << BTRFS_ORDERED_NOCOW, + BTRFS_COMPRESS_NONE); if (ret) goto error; } @@ -2269,7 +2272,7 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end) static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, struct inode *inode, u64 file_pos, u64 disk_bytenr, u64 disk_num_bytes, - u64 num_bytes, u64 ram_bytes, + u64 offset, u64 num_bytes, u64 ram_bytes, u8 compression, u8 encryption, u16 other_encoding, int extent_type) { @@ -2319,7 +2322,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, btrfs_set_file_extent_type(leaf, fi, extent_type); btrfs_set_file_extent_disk_bytenr(leaf, fi, disk_bytenr); btrfs_set_file_extent_disk_num_bytes(leaf, fi, disk_num_bytes); - btrfs_set_file_extent_offset(leaf, fi, 0); + btrfs_set_file_extent_offset(leaf, fi, offset); btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); btrfs_set_file_extent_ram_bytes(leaf, fi, ram_bytes); btrfs_set_file_extent_compression(leaf, fi, 
compression); @@ -2345,7 +2348,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, qg_released = ret; ret = btrfs_alloc_reserved_file_extent(trans, root, btrfs_ino(BTRFS_I(inode)), - file_pos, qg_released, &ins); + file_pos - offset, qg_released, + &ins); out: btrfs_free_path(path); @@ -2382,7 +2386,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) u64 start, end; int compress_type = 0; int ret = 0; - u64 logical_len = ordered_extent->num_bytes; + u64 num_bytes = ordered_extent->num_bytes; + u64 ram_bytes = ordered_extent->ram_bytes; bool nolock; bool truncated = false; bool range_locked = false; @@ -2409,9 +2414,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) { truncated = true; - logical_len = ordered_extent->truncated_len; + num_bytes = ram_bytes = ordered_extent->truncated_len; /* Truncated the entire extent, don't bother adding */ - if (!logical_len) + if (!num_bytes) goto out; } @@ -2466,13 +2471,14 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) ret = btrfs_mark_extent_written(trans, BTRFS_I(inode), ordered_extent->file_offset, ordered_extent->file_offset + - logical_len); + num_bytes); } else { BUG_ON(root == fs_info->tree_root); ret = insert_reserved_file_extent(trans, inode, start, ordered_extent->disk_bytenr, ordered_extent->disk_num_bytes, - logical_len, logical_len, + ordered_extent->offset, + num_bytes, ram_bytes, compress_type, 0, 0, BTRFS_FILE_EXTENT_REG); if (!ret) { @@ -2520,7 +2526,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) u64 unwritten_start = start; if (truncated) - unwritten_start += logical_len; + unwritten_start += num_bytes; clear_extent_uptodate(io_tree, unwritten_start, end, NULL); /* Drop the cache for the part of the extent we didn't write. 
*/ @@ -2537,7 +2543,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) * errored out then we don't need to do this as the accounting * has already been done. */ - if ((ret || !logical_len) && + if ((ret || !num_bytes) && clear_reserved_extent && !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) @@ -6609,8 +6615,11 @@ static struct extent_map *btrfs_create_dio_extent(struct inode *inode, if (IS_ERR(em)) goto out; } - ret = btrfs_add_ordered_extent_dio(inode, start, block_start, - len, block_len, type); + ret = btrfs_add_ordered_extent(inode, start, len, len, block_start, + block_len, 0, + (1 << type) | + (1 << BTRFS_ORDERED_DIRECT), + BTRFS_COMPRESS_NONE); if (ret) { if (em) { free_extent_map(em); @@ -9743,7 +9752,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode, last_alloc = ins.offset; ret = insert_reserved_file_extent(trans, inode, cur_offset, ins.objectid, - ins.offset, ins.offset, + ins.offset, 0, ins.offset, ins.offset, 0, 0, 0, BTRFS_FILE_EXTENT_PREALLOC); if (ret) { diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 94e2485006ab..3c6edc307657 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -160,15 +160,27 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, return ret; } -/* allocate and add a new ordered_extent into the per-inode tree. +/** + * btrfs_add_ordered_extent - Add an ordered extent to the per-inode tree. + * @inode: inode that this extent is for. + * @file_offset: Logical offset in file where the extent starts. + * @num_bytes: Logical length of extent in file. + * @ram_bytes: Full length of unencoded data. + * @disk_bytenr: Offset of extent on disk. + * @disk_num_bytes: Size of extent on disk. + * @offset: Offset into unencoded data where file data starts. + * @flags: Flags specifying type of extent (1 << BTRFS_ORDERED_*). 
+ * @compress_type: Compression algorithm used for data. * - * The tree is given a single reference on the ordered extent that was - * inserted. + * Most of these parameters correspond to &struct btrfs_file_extent_item. The + * tree is given a single reference on the ordered extent that was inserted. + * + * Return: 0 or -ENOMEM. */ -static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type, int dio, - int compress_type) +int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, int flags, + int compress_type) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_root *root = BTRFS_I(inode)->root; @@ -182,20 +194,19 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, return -ENOMEM; entry->file_offset = file_offset; - entry->disk_bytenr = disk_bytenr; entry->num_bytes = num_bytes; + entry->ram_bytes = ram_bytes; + entry->disk_bytenr = disk_bytenr; entry->disk_num_bytes = disk_num_bytes; + entry->offset = offset; entry->bytes_left = num_bytes; entry->inode = igrab(inode); entry->compress_type = compress_type; entry->truncated_len = (u64)-1; - if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) - set_bit(type, &entry->flags); - - if (dio) { + entry->flags = flags; + if (flags & (1 << BTRFS_ORDERED_DIRECT)) { percpu_counter_add_batch(&fs_info->dio_bytes, num_bytes, fs_info->delalloc_batch); - set_bit(BTRFS_ORDERED_DIRECT, &entry->flags); } /* one ref for the tree */ @@ -241,34 +252,6 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, return 0; } -int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, - int type) -{ - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, type, 0, - BTRFS_COMPRESS_NONE); 
-} - -int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type) -{ - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, type, 1, - BTRFS_COMPRESS_NONE); -} - -int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type, - int compress_type) -{ - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, type, 0, - compress_type); -} - /* * Add a struct btrfs_ordered_sum into the list of checksums to be inserted * when an ordered extent is finished. If the list covers more than one diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 4a3dd80e776c..a038bda16fdf 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -71,9 +71,11 @@ struct btrfs_ordered_extent { * These fields directly correspond to the same fields in * btrfs_file_extent_item. 
*/ - u64 disk_bytenr; u64 num_bytes; + u64 ram_bytes; + u64 disk_bytenr; u64 disk_num_bytes; + u64 offset; /* number of bytes that still need writing */ u64 bytes_left; @@ -160,15 +162,9 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode, u64 *file_offset, u64 io_size, int uptodate); int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, - int type); -int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type); -int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type, - int compress_type); + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, int flags, + int compress_type); void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct inode *inode, From patchwork Wed Nov 20 18:24:29 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254643 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id F125E109A for ; Wed, 20 Nov 2019 18:25:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id C8F212088F for ; Wed, 20 Nov 2019 18:25:08 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="nts3UtoR" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728611AbfKTSZH (ORCPT ); Wed, 20 Nov 2019 13:25:07 -0500 Received: from mail-pl1-f193.google.com ([209.85.214.193]:45217 "EHLO 
mail-pl1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728516AbfKTSZG (ORCPT ); Wed, 20 Nov 2019 13:25:06 -0500 Received: by mail-pl1-f193.google.com with SMTP id w7so137489plz.12 for ; Wed, 20 Nov 2019 10:25:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=dktaVG6NUYwKd02u9LQ9D8V9u7xDbqFt2b9bnH5B8nA=; b=nts3UtoRpLg1FN4AgAEDqhILGDQsrkR/8hrokDg2BsSHzIx2WT8OGtv0vxOQYvQeCj VHZUC2rJLBxbJDsR7cNexkQZJ59H5ln0K+Uvpi1E9ngs/DCkkvIvHUVL60x/W78KBqS6 4pWqIC9SvxDmDHUFUC/G1Yf0lGlqrMOMrypUCyZBEMVA19Xs3w6RbfimvgrzyanX1+Jz LBE2UqWJIIme5ElD5vAvXXC4glVjg5vCF6IbStMrGDnkVWcvqQK5iXk3IP45ckU7atYZ Q585VPlxeasHVyfvrd+PShnOHkTyczN1mbOvquoLAs4zFkN7pTlN1ZidvQphTeOdulkm jV+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=dktaVG6NUYwKd02u9LQ9D8V9u7xDbqFt2b9bnH5B8nA=; b=H8hXKyI+TNGvJn1AM7CQoAgPulHUjkG+CsgDNyA4jvlF7m3DZjPtAb9XhmSXqekYw5 wdKRKHHt6baAYcaREeYRgj5t3ncJLmJf91YQabo0B3CGwSj1q7XkZT7JpKPv7xHItLoQ j+bry0QnKQctFq2iYueEfuRfFTo3/UrHsjq/lD93u3TGUIn2ZuXIBOsRYzTPYd7sygIq 1ZwS0avjWzw1sGEoDatteiRhdCOEwOiRg5WEzLjffBXICXHxPiWed78JTbHfF7RUDddS RHWv15KQYJ6krOiw0m2Wudf5cWzpCAWLG6xrijALCfFuX9fTdpBlPvN10ekNOITVcxAo tNgQ== X-Gm-Message-State: APjAAAXrpJ8Mpl5MY35c0PnzNEqf9rqIbobRfgGTYHumCOPW+BNxkhNA Wmz1HxzG/4K3soeAGqd00fllow1f8us= X-Google-Smtp-Source: APXvYqxJDL9K0/2qfRjcaAu01VnU5dCKBy/twAcCzTnsoxMEJ32DQF4D5DRh+eBns61AJ52iCECmuw== X-Received: by 2002:a17:90a:b28c:: with SMTP id c12mr5728261pjr.22.1574274302874; Wed, 20 Nov 2019 10:25:02 -0800 (PST) Received: from vader.thefacebook.com ([2620:10d:c090:180::1a46]) by smtp.gmail.com with ESMTPSA id q34sm7937866pjb.15.2019.11.20.10.25.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 
bits=256/256); Wed, 20 Nov 2019 10:25:02 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [RFC PATCH v3 09/12] btrfs: support different disk extent size for delalloc Date: Wed, 20 Nov 2019 10:24:29 -0800 Message-Id: X-Mailer: git-send-email 2.24.0 In-Reply-To: References: MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval Currently, we always reserve the same extent size in the file and extent size on disk for delalloc because the former is the worst case for the latter. For RWF_ENCODED writes, we know the exact size of the extent on disk, which may be less than or greater than (for bookends) the size in the file. Add a disk_num_bytes parameter to btrfs_delalloc_reserve_metadata() so that we can reserve the correct amount of csum bytes. Additionally, make btrfs_free_reserved_data_space_noquota() take a number of bytes instead of a range, as it refers to the extent size on disk, not in the file. No functional change. 
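
To illustrate why the csum reservation should follow the on-disk size rather than the size in the file, here is a rough userspace sketch. It is not code from this series: align_up() and csum_bytes_needed() are hypothetical helpers, and the 4096-byte sector size and 4-byte (crc32c-sized) checksum are assumptions made for the example. The kernel reserves whole metadata leaves via btrfs_csum_bytes_to_leaves(); this only counts raw checksum bytes.

```c
#include <stdint.h>

/*
 * Round x up to a multiple of a (a must be a power of two), in the
 * spirit of the kernel's ALIGN() macro. Hypothetical userspace helper.
 */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Raw checksum bytes needed for an extent occupying disk_num_bytes on
 * disk, assuming one checksum of csum_size bytes per sector. Both
 * sectorsize and csum_size are caller-supplied assumptions.
 */
static uint64_t csum_bytes_needed(uint64_t disk_num_bytes,
				  uint32_t sectorsize, uint32_t csum_size)
{
	return align_up(disk_num_bytes, sectorsize) / sectorsize * csum_size;
}
```

Under those assumptions, an extent covering 131072 bytes in the file needs 128 checksum bytes if sized by its file length, but only 16 if it occupies 12289 bytes on disk (rounded up to 16384). Every caller touched by this patch passes the same value for num_bytes and disk_num_bytes, so the two only diverge once RWF_ENCODED writes are added.
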
Signed-off-by: Omar Sandoval Reviewed-by: Nikolay Borisov --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/delalloc-space.c | 38 +++++++++++++++++--------------------- fs/btrfs/delalloc-space.h | 4 ++-- fs/btrfs/file.c | 3 ++- fs/btrfs/inode.c | 7 ++----- fs/btrfs/relocation.c | 4 ++-- 6 files changed, 27 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c32741879088..f9ac05d1ca60 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2489,7 +2489,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv); void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes); -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + u64 disk_num_bytes); u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo); int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info, u64 start, u64 end); diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c index db9f2c58eb4a..720b246772fb 100644 --- a/fs/btrfs/delalloc-space.c +++ b/fs/btrfs/delalloc-space.c @@ -153,34 +153,28 @@ int btrfs_check_data_free_space(struct inode *inode, /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */ ret = btrfs_qgroup_reserve_data(inode, reserved, start, len); if (ret < 0) - btrfs_free_reserved_data_space_noquota(inode, start, len); + btrfs_free_reserved_data_space_noquota(fs_info, len); else ret = 0; return ret; } /* - * Called if we need to clear a data reservation for this inode - * Normally in a error case. + * Called if we need to clear a data reservation, normally in an error case. * * This one will *NOT* use accurate qgroup reserved space API, just for case * which we can't sleep and is sure it won't affect qgroup reserved space. * Like clear_bit_hook(). 
*/ -void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, - u64 len) +void btrfs_free_reserved_data_space_noquota(struct btrfs_fs_info *fs_info, + u64 num_bytes) { - struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_space_info *data_sinfo; - /* Make sure the range is aligned to sectorsize */ - len = round_up(start + len, fs_info->sectorsize) - - round_down(start, fs_info->sectorsize); - start = round_down(start, fs_info->sectorsize); - + num_bytes = ALIGN(num_bytes, fs_info->sectorsize); data_sinfo = fs_info->data_sinfo; spin_lock(&data_sinfo->lock); - btrfs_space_info_update_bytes_may_use(fs_info, data_sinfo, -len); + btrfs_space_info_update_bytes_may_use(fs_info, data_sinfo, -num_bytes); spin_unlock(&data_sinfo->lock); } @@ -201,7 +195,7 @@ void btrfs_free_reserved_data_space(struct inode *inode, round_down(start, root->fs_info->sectorsize); start = round_down(start, root->fs_info->sectorsize); - btrfs_free_reserved_data_space_noquota(inode, start, len); + btrfs_free_reserved_data_space_noquota(root->fs_info, len); btrfs_qgroup_free_data(inode, reserved, start, len); } @@ -280,11 +274,11 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info, } static void calc_inode_reservations(struct btrfs_fs_info *fs_info, - u64 num_bytes, u64 *meta_reserve, - u64 *qgroup_reserve) + u64 num_bytes, u64 disk_num_bytes, + u64 *meta_reserve, u64 *qgroup_reserve) { u64 nr_extents = count_max_extents(num_bytes); - u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes); + u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes); u64 inode_update = btrfs_calc_metadata_size(fs_info, 1); *meta_reserve = btrfs_calc_insert_metadata_size(fs_info, @@ -298,7 +292,8 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info, *qgroup_reserve = nr_extents * fs_info->nodesize; } -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) +int 
btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + u64 disk_num_bytes) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; @@ -333,6 +328,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) mutex_lock(&inode->delalloc_mutex); num_bytes = ALIGN(num_bytes, fs_info->sectorsize); + disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize); /* * We always want to do it this way, every other way is wrong and ends @@ -344,8 +340,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) * everything out and try again, which is bad. This way we just * over-reserve slightly, and clean up the mess when we are done. */ - calc_inode_reservations(fs_info, num_bytes, &meta_reserve, - &qgroup_reserve); + calc_inode_reservations(fs_info, num_bytes, disk_num_bytes, + &meta_reserve, &qgroup_reserve); ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true); if (ret) goto out_fail; @@ -362,7 +358,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) spin_lock(&inode->lock); nr_extents = count_max_extents(num_bytes); btrfs_mod_outstanding_extents(inode, nr_extents); - inode->csum_bytes += num_bytes; + inode->csum_bytes += disk_num_bytes; btrfs_calculate_inode_block_rsv_size(fs_info, inode); spin_unlock(&inode->lock); @@ -474,7 +470,7 @@ int btrfs_delalloc_reserve_space(struct inode *inode, ret = btrfs_check_data_free_space(inode, reserved, start, len); if (ret < 0) return ret; - ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len); + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len); if (ret < 0) btrfs_free_reserved_data_space(inode, *reserved, start, len); return ret; diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h index 54466fbd7075..f847f0a80409 100644 --- a/fs/btrfs/delalloc-space.h +++ b/fs/btrfs/delalloc-space.h @@ -13,8 +13,8 @@ void btrfs_free_reserved_data_space(struct inode 
*inode, void btrfs_delalloc_release_space(struct inode *inode, struct extent_changeset *reserved, u64 start, u64 len, bool qgroup_free); -void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, - u64 len); +void btrfs_free_reserved_data_space_noquota(struct btrfs_fs_info *fs_info, + u64 num_bytes); void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes, bool qgroup_free); int btrfs_delalloc_reserve_space(struct inode *inode, diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 34c1a2284e03..bc7ee7c4180e 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1669,7 +1669,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, WARN_ON(reserve_bytes == 0); ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), - reserve_bytes); + reserve_bytes, + reserve_bytes); if (ret) { if (!only_release_metadata) btrfs_free_reserved_data_space(inode, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d53580ad2c46..a8bc193c99ca 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1963,9 +1963,7 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode, if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID && do_list && !(state->state & EXTENT_NORESERVE) && (*bits & EXTENT_CLEAR_DATA_RESV)) - btrfs_free_reserved_data_space_noquota( - &inode->vfs_inode, - state->start, len); + btrfs_free_reserved_data_space_noquota(fs_info, len); percpu_counter_add_batch(&fs_info->delalloc_bytes, -len, fs_info->delalloc_batch); @@ -7025,8 +7023,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, * use the existing or preallocated extent, so does not * need to adjust btrfs_space_info's bytes_may_use. 
*/ - btrfs_free_reserved_data_space_noquota(inode, start, - len); + btrfs_free_reserved_data_space_noquota(fs_info, len); goto skip_cow; } } diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index e3cec29813ee..af61e07b5094 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -3262,8 +3262,8 @@ static int relocate_file_extent_cluster(struct inode *inode, index = (cluster->start - offset) >> PAGE_SHIFT; last_index = (cluster->end - offset) >> PAGE_SHIFT; while (index <= last_index) { - ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), - PAGE_SIZE); + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE, + PAGE_SIZE); if (ret) goto out; From patchwork Wed Nov 20 18:24:30 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254657 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 469191390 for ; Wed, 20 Nov 2019 18:25:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 1E5F820878 for ; Wed, 20 Nov 2019 18:25:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="gISvoFhL" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728600AbfKTSZN (ORCPT ); Wed, 20 Nov 2019 13:25:13 -0500 Received: from mail-pj1-f65.google.com ([209.85.216.65]:43987 "EHLO mail-pj1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728588AbfKTSZF (ORCPT ); Wed, 20 Nov 2019 13:25:05 -0500 Received: by mail-pj1-f65.google.com with SMTP id a10so182291pju.10 for ; Wed, 20 Nov 2019 10:25:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; 
h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Hx2ggmYufWUwQe/y6dbiVyuZiKCwDqeZoan1ILrR/wY=; b=gISvoFhLkoU0nX05caTC+mX+VZCPU3QZZZro+1IxyzqJIvjmJRXLsGiuT+rvGhCuvF f85L5ICjWdLMz3A+KUn77BHYpBDXqqUX1QWgTFYkqkYG62Lu0rIU6Jndv4qE64jdNqsh CVOOur+R/aCmGz1PIQ7gR+qoxOMkiQV51I4GcU5gNm1oAh67td9GNDhJIuYdJhFmBefu r9F0fjgftA7X4ObVxPROHNvcyziFhu3JnBQMr6A8YXXgZJO1IROMgK0QEIjiRI5DIk4+ J6/4SlwSkfbiooQ0rIEIxdnbcMDosqtrtL2EMkbzOJHm7v8QMesJ6FSFdLfdg8y0UPkz jpXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Hx2ggmYufWUwQe/y6dbiVyuZiKCwDqeZoan1ILrR/wY=; b=BUhHyFcIjEYSfFKx4eH48yodnNyZi8Rsgkx6eB2sBmhXr5T7ylvGZ0/DzJRL8XSxCu RnW/bpQc7H1vFD3ceqeKbPyjw8BCc4fghji2RBTLZTNn/GXrZuvgZU1RgtZgsrx8mxwr gmBqTX975MV4GkvdzvkFGyluHm2VSDKMZXY7FZAC0uH/dBh46Xw1hpHDGZOrLT1lqCV6 ONpm1WvVjTnnkHpCJRb0h+JBOR7dRn0ldwa1Vs169UfrGCCPMyoFtwVO48XaJhF1IyQs 1e3ziZQt57SBMz6KhRORX3OBPfqTB/QYbd08is6iO0CM4MbA5LB2EtyPh/G2bH6jQc1s /+HA== X-Gm-Message-State: APjAAAXuJJuklqP1peC3uGfduMyt1tvtYaxyKr2mh+gUEljth0pTWPgr 0MHm7+OQ2JWoDNQp5hgsugaIOLsuw30= X-Google-Smtp-Source: APXvYqyIilZels0XXWzmRRKfaMT2W7apxfsdhtcgHJLUylvWJXzogWkYHPqpYHgC7d5j2tYFieZRIw== X-Received: by 2002:a17:902:6903:: with SMTP id j3mr4088245plk.231.1574274304139; Wed, 20 Nov 2019 10:25:04 -0800 (PST) Received: from vader.thefacebook.com ([2620:10d:c090:180::1a46]) by smtp.gmail.com with ESMTPSA id q34sm7937866pjb.15.2019.11.20.10.25.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Nov 2019 10:25:03 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [RFC PATCH v3 10/12] btrfs: optionally extend i_size in cow_file_range_inline() Date: Wed, 20 Nov 2019 10:24:30 -0800 
Message-Id: <4b1b60021a955e9cf9c120d9b03bd6e58e8ab7d9.1574273658.git.osandov@fb.com>
X-Mailer: git-send-email 2.24.0
In-Reply-To:
References:
MIME-Version: 1.0
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Omar Sandoval

Currently, an inline extent is always created after i_size is extended
from btrfs_dirty_pages(). However, for encoded writes, we only want to
update i_size after we successfully created the inline extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size. Since the start parameter is always passed
as 0, get rid of it and simplify the logic in these two functions. While
we're here, let's document the requirements for creating an inline
extent.

Signed-off-by: Omar Sandoval
---
 fs/btrfs/inode.c | 94 +++++++++++++++++++++++-------------------------
 1 file changed, 44 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a8bc193c99ca..c1c37549155e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -165,9 +165,10 @@ static int btrfs_init_inode_security(struct btrfs_trans_handle *trans,
 static int insert_inline_extent(struct btrfs_trans_handle *trans,
 				struct btrfs_path *path, int extent_inserted,
 				struct btrfs_root *root, struct inode *inode,
-				u64 start, size_t size, size_t compressed_size,
+				size_t size, size_t compressed_size,
 				int compress_type,
-				struct page **compressed_pages)
+				struct page **compressed_pages,
+				bool update_i_size)
 {
 	struct extent_buffer *leaf;
 	struct page *page = NULL;
@@ -176,7 +177,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_file_extent_item *ei;
 	int ret;
 	size_t cur_size = size;
-	unsigned long offset;
+	u64 i_size;
 
 	ASSERT((compressed_size > 0 && compressed_pages) ||
 	       (compressed_size == 0 && !compressed_pages));
@@ -191,7 +192,7 @@ static int insert_inline_extent(struct
btrfs_trans_handle *trans, size_t datasize; key.objectid = btrfs_ino(BTRFS_I(inode)); - key.offset = start; + key.offset = 0; key.type = BTRFS_EXTENT_DATA_KEY; datasize = btrfs_file_extent_calc_inline_size(cur_size); @@ -230,12 +231,10 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, btrfs_set_file_extent_compression(leaf, ei, compress_type); } else { - page = find_get_page(inode->i_mapping, - start >> PAGE_SHIFT); + page = find_get_page(inode->i_mapping, 0); btrfs_set_file_extent_compression(leaf, ei, 0); kaddr = kmap_atomic(page); - offset = offset_in_page(start); - write_extent_buffer(leaf, kaddr + offset, ptr, size); + write_extent_buffer(leaf, kaddr, ptr, size); kunmap_atomic(kaddr); put_page(page); } @@ -251,7 +250,12 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, * before we unlock the pages. Otherwise we * could end up racing with unlink. */ - BTRFS_I(inode)->disk_i_size = inode->i_size; + i_size = i_size_read(inode); + if (update_i_size && size > i_size) { + i_size_write(inode, size); + i_size = size; + } + BTRFS_I(inode)->disk_i_size = i_size; ret = btrfs_update_inode(trans, root, inode); fail: @@ -264,36 +268,31 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, * does the checks required to make sure the data is small enough * to fit as an inline extent. 
*/ -static noinline int cow_file_range_inline(struct inode *inode, u64 start, - u64 end, size_t compressed_size, +static noinline int cow_file_range_inline(struct inode *inode, u64 size, + size_t compressed_size, int compress_type, - struct page **compressed_pages) + struct page **compressed_pages, + bool update_i_size) { struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_trans_handle *trans; - u64 isize = i_size_read(inode); - u64 actual_end = min(end + 1, isize); - u64 inline_len = actual_end - start; - u64 aligned_end = ALIGN(end, fs_info->sectorsize); - u64 data_len = inline_len; + u64 data_len = compressed_size ? compressed_size : size; int ret; struct btrfs_path *path; int extent_inserted = 0; u32 extent_item_size; - if (compressed_size) - data_len = compressed_size; - - if (start > 0 || - actual_end > fs_info->sectorsize || + /* + * We can create an inline extent if it ends at or beyond the current + * i_size, is no larger than a sector (decompressed), and the (possibly + * compressed) data fits in a leaf and the configured maximum inline + * size. 
+ */ + if (size < i_size_read(inode) || size > fs_info->sectorsize || data_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info) || - (!compressed_size && - (actual_end & (fs_info->sectorsize - 1)) == 0) || - end + 1 < isize || - data_len > fs_info->max_inline) { + data_len > fs_info->max_inline) return 1; - } path = btrfs_alloc_path(); if (!path) @@ -306,27 +305,18 @@ static noinline int cow_file_range_inline(struct inode *inode, u64 start, } trans->block_rsv = &BTRFS_I(inode)->block_rsv; - if (compressed_size && compressed_pages) - extent_item_size = btrfs_file_extent_calc_inline_size( - compressed_size); - else - extent_item_size = btrfs_file_extent_calc_inline_size( - inline_len); - - ret = __btrfs_drop_extents(trans, root, inode, path, - start, aligned_end, NULL, - 1, 1, extent_item_size, &extent_inserted); + extent_item_size = btrfs_file_extent_calc_inline_size(data_len); + ret = __btrfs_drop_extents(trans, root, inode, path, 0, + fs_info->sectorsize, NULL, 1, 1, + extent_item_size, &extent_inserted); if (ret) { btrfs_abort_transaction(trans, ret); goto out; } - if (isize > actual_end) - inline_len = min_t(u64, isize, actual_end); - ret = insert_inline_extent(trans, path, extent_inserted, - root, inode, start, - inline_len, compressed_size, - compress_type, compressed_pages); + ret = insert_inline_extent(trans, path, extent_inserted, root, inode, + size, compressed_size, compress_type, + compressed_pages, update_i_size); if (ret && ret != -ENOSPC) { btrfs_abort_transaction(trans, ret); goto out; @@ -336,7 +326,7 @@ static noinline int cow_file_range_inline(struct inode *inode, u64 start, } set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &BTRFS_I(inode)->runtime_flags); - btrfs_drop_extent_cache(BTRFS_I(inode), start, aligned_end - 1, 0); + btrfs_drop_extent_cache(BTRFS_I(inode), 0, fs_info->sectorsize - 1, 0); out: /* * Don't forget to free the reserved space, as for inlined extent @@ -605,13 +595,15 @@ static noinline int compress_file_range(struct async_chunk *async_chunk) /* we 
didn't compress the entire range, try * to make an uncompressed inline extent. */ - ret = cow_file_range_inline(inode, start, end, 0, - BTRFS_COMPRESS_NONE, NULL); + ret = cow_file_range_inline(inode, actual_end, 0, + BTRFS_COMPRESS_NONE, NULL, + false); } else { /* try making a compressed inline extent */ - ret = cow_file_range_inline(inode, start, end, + ret = cow_file_range_inline(inode, actual_end, total_compressed, - compress_type, pages); + compress_type, pages, + false); } if (ret <= 0) { unsigned long clear_flags = EXTENT_DELALLOC | @@ -991,9 +983,11 @@ static noinline int cow_file_range(struct inode *inode, inode_should_defrag(BTRFS_I(inode), start, end, num_bytes, SZ_64K); if (start == 0) { + u64 actual_end = min_t(u64, i_size_read(inode), end + 1); + /* lets try to make an inline extent */ - ret = cow_file_range_inline(inode, start, end, 0, - BTRFS_COMPRESS_NONE, NULL); + ret = cow_file_range_inline(inode, actual_end, 0, + BTRFS_COMPRESS_NONE, NULL, false); if (ret == 0) { /* * We use DO_ACCOUNTING here because we need the From patchwork Wed Nov 20 18:24:31 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11254647 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6623F14DB for ; Wed, 20 Nov 2019 18:25:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 345CC20878 for ; Wed, 20 Nov 2019 18:25:10 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="O1h6ueGp" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728644AbfKTSZJ (ORCPT ); Wed, 20 Nov 2019 13:25:09 -0500 Received: from mail-pg1-f193.google.com ([209.85.215.193]:33145 
"EHLO mail-pg1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728624AbfKTSZI (ORCPT ); Wed, 20 Nov 2019 13:25:08 -0500 Received: by mail-pg1-f193.google.com with SMTP id h27so150328pgn.0 for ; Wed, 20 Nov 2019 10:25:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=xkjOA0p/FE+Xdl9b4yb6V+JTBBWjjijfcwafYlvdz+8=; b=O1h6ueGp+TOJZ0A65SkJzWG//3YKlyyHoNYsAAayGwWiI8wmuIicRlC3XEjckexmxc yI8JYp4rVvTtfCTvhD3WJOsSm7Lu/DsFwL2/vAIThx5HR0vkHDG/ctahjD8I9vVdgPNQ BIcn5WFZSlgBjXV6xQ7Dan/Fo4soMmN/MkBeTlLSQ/4E+UAaKxzRgqo7lxjKqHlmlQmN RDgB0rYQA6Ty4SBXE5i0uZTr8uGQBrrrLCUetBJEMbaSlhnrSTmf8GuaRq63hgNaLF6V ZfxN0QKKYrBoMmmc47x0XHrAsGiJAq2gynRt3je27sAD0NtG4DXlw1CeTppJWGFku1Zj nkjw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=xkjOA0p/FE+Xdl9b4yb6V+JTBBWjjijfcwafYlvdz+8=; b=IxVvK+ljKqKQVo9J4aVhvLL0yMz7pdjebFQkYYqWF9E7AICQAth49Fqk3unWVoYay5 kCvqGvFhk+Yz/eQ6ZIMTXYeW1FVOliUTirPS33xF0ZdC1jYS85m5UQOPBC0W49fOQyq/ 80+7VKhmfWLSYhe+yRv7E4TPOLogQlQFJKK74DWDcvvjiBXF1YL870+xhvkPlC/5dN5L 4TDMAqkqYLobYKy0o5uVtZMzMzS/dEL5LuZm8mLYs8uazDlPdgVyfA5arBy5ew7YydSH 58brsZf5xXFZJpVSpzkXIYzLHJb/0PUOCIZEt8mhtK7rHSGIAE6ID3Qf/GmWkj5EyfQg TtyA== X-Gm-Message-State: APjAAAUcRgiJakXMMpvCRXMeXkWSpvyntJtbwSnkM1oN0PKpKaZW1RjQ MiD0G+z6ObaueiY2myYR4qSmN/zhGjs= X-Google-Smtp-Source: APXvYqwf0X6AQ1rqufwrA8PUWSOL4qocHuM+oADVB0f1hhUBzJc+GZwiFIyvGpJZwH3k+7EWZSN8uA== X-Received: by 2002:aa7:9f08:: with SMTP id g8mr5787315pfr.59.1574274305506; Wed, 20 Nov 2019 10:25:05 -0800 (PST) Received: from vader.thefacebook.com ([2620:10d:c090:180::1a46]) by smtp.gmail.com with ESMTPSA id q34sm7937866pjb.15.2019.11.20.10.25.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 
bits=256/256); Wed, 20 Nov 2019 10:25:04 -0800 (PST)
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [RFC PATCH v3 11/12] btrfs: implement RWF_ENCODED reads
Date: Wed, 20 Nov 2019 10:24:31 -0800
Message-Id: <0d0ca118dca269fb3e198fc8ffc0e536cba2be15.1574273658.git.osandov@fb.com>
X-Mailer: git-send-email 2.24.0
In-Reply-To:
References:
MIME-Version: 1.0
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Omar Sandoval

There are 4 main cases:

1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we indicate the size of the extent
   starting from the read position; we don't need to copy zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
   from disk.
4. Regular, compressed extents: we read the entire compressed extent
   from disk and indicate what subset of the decompressed extent is in
   the file.

This initial implementation simplifies a few things that can be improved
in the future:

- We hold the inode lock during the operation.
- Cases 1, 3, and 4 allocate temporary memory to read into before
  copying out to userspace.
- Cases 3 and 4 do not implement repair yet.
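
For readers who want to see how the four cases differ in what is reported back, here is a rough userspace model. It is only a sketch, not the kernel code: struct extent_model, struct read_plan, enum extent_kind and plan_encoded_read() are hypothetical names made up for illustration. It treats inline extents as uncompressed and ignores the locked-range cap that the real code applies to regular extents.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace model of an extent; not a kernel structure. */
enum extent_kind { EXT_INLINE, EXT_HOLE, EXT_REGULAR, EXT_COMPRESSED };

struct extent_model {
	enum extent_kind kind;
	uint64_t start;      /* file offset where the extent begins */
	uint64_t ram_bytes;  /* decompressed length of the extent */
	uint64_t disk_bytes; /* on-disk (possibly compressed) length */
};

/* What the read would report; field names loosely follow the patch. */
struct read_plan {
	uint64_t len;              /* file bytes the result represents */
	uint64_t unencoded_len;    /* decompressed length of returned data */
	uint64_t unencoded_offset; /* where len starts in the decompressed data */
	uint64_t copy_bytes;       /* bytes actually copied to the caller */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Fill *plan for a read at pos into a buffer of count bytes.
 * Returns 0 on success or -ENOBUFS if a compressed extent does not fit.
 * The caller must pass a pos inside the extent and below i_size.
 */
static int plan_encoded_read(const struct extent_model *ext, uint64_t pos,
			     uint64_t i_size, uint64_t count,
			     struct read_plan *plan)
{
	/* Never report data beyond EOF, even if the extent extends past it. */
	uint64_t end = min_u64(ext->start + ext->ram_bytes, i_size);

	assert(pos >= ext->start && pos < end);
	plan->len = end - pos;
	plan->unencoded_len = 0;
	plan->unencoded_offset = 0;
	plan->copy_bytes = 0;

	switch (ext->kind) {
	case EXT_HOLE:
		/* Case 2: report the size only; no zeroes are copied. */
		return 0;
	case EXT_COMPRESSED:
		/* Case 4: the whole compressed extent must fit in the buffer. */
		if (ext->disk_bytes > count)
			return -ENOBUFS;
		plan->copy_bytes = ext->disk_bytes;
		plan->unencoded_len = ext->ram_bytes;
		plan->unencoded_offset = pos - ext->start;
		return 0;
	case EXT_INLINE:
	case EXT_REGULAR:
		/* Cases 1 and 3 (uncompressed): a short read is acceptable. */
		plan->len = min_u64(plan->len, count);
		plan->unencoded_len = plan->len;
		plan->copy_bytes = plan->len;
		return 0;
	}
	return -EINVAL;
}
```

The point of the model is the asymmetry between cases 3 and 4: an uncompressed read can be shortened to fit the buffer, while a compressed extent is all-or-nothing and fails with ENOBUFS when the buffer is too small.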
Signed-off-by: Omar Sandoval --- fs/btrfs/ctree.h | 2 + fs/btrfs/file.c | 12 +- fs/btrfs/inode.c | 454 +++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 467 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index f9ac05d1ca60..3be72a6e022e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2901,6 +2901,8 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page, int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end); void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start, u64 end, int uptodate); +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter); + extern const struct dentry_operations btrfs_dentry_operations; /* ioctl.c */ diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index bc7ee7c4180e..5425200092c2 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -390,6 +390,16 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info) return 0; } +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter) +{ + if (iocb->ki_flags & IOCB_ENCODED) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + return btrfs_encoded_read(iocb, iter); + } + return generic_file_read_iter(iocb, iter); +} + /* simple helper to fault in pages and copy. This should go away * and be replaced with calls into generic code. 
*/ @@ -3455,7 +3465,7 @@ static int btrfs_file_open(struct inode *inode, struct file *filp) const struct file_operations btrfs_file_operations = { .llseek = btrfs_file_llseek, - .read_iter = generic_file_read_iter, + .read_iter = btrfs_file_read_iter, .splice_read = generic_file_splice_read, .write_iter = btrfs_file_write_iter, .mmap = btrfs_file_mmap, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c1c37549155e..698e24aa8b21 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9943,6 +9943,460 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end) } } +static int encoded_iov_compression_from_btrfs(struct encoded_iov *encoded, + unsigned int compress_type) +{ + switch (compress_type) { + case BTRFS_COMPRESS_NONE: + encoded->compression = ENCODED_IOV_COMPRESSION_NONE; + break; + case BTRFS_COMPRESS_ZLIB: + encoded->compression = ENCODED_IOV_COMPRESSION_ZLIB; + break; + case BTRFS_COMPRESS_LZO: + encoded->compression = ENCODED_IOV_COMPRESSION_LZO; + break; + case BTRFS_COMPRESS_ZSTD: + encoded->compression = ENCODED_IOV_COMPRESSION_ZSTD; + break; + default: + return -EIO; + } + return 0; +} + +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb, + struct iov_iter *iter, u64 start, + u64 lockend, + struct extent_state **cached_state, + u64 extent_start, size_t count, + struct encoded_iov *encoded, + bool *unlocked) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_file_extent_item *item; + u64 ram_bytes; + unsigned long ptr; + void *tmp; + ssize_t ret; + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path, + btrfs_ino(BTRFS_I(inode)), extent_start, + 0); + if (ret) { + if (ret > 0) { + /* The extent item disappeared? 
*/ + ret = -EIO; + } + goto out; + } + leaf = path->nodes[0]; + item = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_file_extent_item); + + ram_bytes = btrfs_file_extent_ram_bytes(leaf, item); + ptr = btrfs_file_extent_inline_start(item); + + encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) - + iocb->ki_pos); + ret = encoded_iov_compression_from_btrfs(encoded, + btrfs_file_extent_compression(leaf, item)); + if (ret) + goto out; + if (encoded->compression) { + size_t inline_size; + + inline_size = btrfs_file_extent_inline_item_len(leaf, + btrfs_item_nr(path->slots[0])); + if (inline_size > count) { + ret = -ENOBUFS; + goto out; + } + count = inline_size; + encoded->unencoded_len = ram_bytes; + encoded->unencoded_offset = iocb->ki_pos - extent_start; + } else { + encoded->len = encoded->unencoded_len = count = + min_t(u64, count, encoded->len); + ptr += iocb->ki_pos - extent_start; + } + + tmp = kmalloc(count, GFP_NOFS); + if (!tmp) { + ret = -ENOMEM; + goto out; + } + read_extent_buffer(leaf, tmp, ptr, count); + btrfs_free_path(path); + path = NULL; + unlock_extent_cached(io_tree, start, lockend, cached_state); + inode_unlock(inode); + *unlocked = true; + + ret = copy_encoded_iov_to_iter(encoded, iter); + if (ret) + goto out_free; + if (copy_to_iter(tmp, count, iter) == count) + ret = count; + else + ret = -EFAULT; +out_free: + kfree(tmp); +out: + btrfs_free_path(path); + return ret; +} + +struct btrfs_encoded_read_private { + struct inode *inode; + wait_queue_head_t wait; + atomic_t pending; + bool uptodate; + bool skip_csum; +}; + +static bool btrfs_encoded_read_check_csums(struct btrfs_io_bio *io_bio) +{ + struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private; + struct inode *inode = priv->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + u32 sectorsize = fs_info->sectorsize; + struct bio_vec *bvec; + struct bvec_iter_all iter_all; + u64 offset = 0; + + if (priv->skip_csum) + return true; + 
bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) { + unsigned int i, nr_sectors, pgoff; + + nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len); + pgoff = bvec->bv_offset; + for (i = 0; i < nr_sectors; i++) { + int csum_pos; + + csum_pos = BTRFS_BYTES_TO_BLKS(fs_info, offset); + if (__readpage_endio_check(inode, io_bio, csum_pos, + bvec->bv_page, pgoff, + io_bio->logical + offset, + sectorsize)) + return false; + offset += sectorsize; + pgoff += sectorsize; + } + } + return true; +} + +static void btrfs_encoded_read_endio(struct bio *bio) +{ + struct btrfs_encoded_read_private *priv = bio->bi_private; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + + if (bio->bi_status || !btrfs_encoded_read_check_csums(io_bio)) + priv->uptodate = false; + if (!atomic_dec_return(&priv->pending)) + wake_up(&priv->wait); + btrfs_io_bio_free_csum(io_bio); + bio_put(bio); +} + +static bool btrfs_submit_encoded_read(struct bio *bio) +{ + struct btrfs_encoded_read_private *priv = bio->bi_private; + struct inode *inode = priv->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + blk_status_t status; + + atomic_inc(&priv->pending); + + if (!priv->skip_csum) { + status = btrfs_lookup_bio_sums(inode, bio, true, + btrfs_io_bio(bio)->logical, + NULL); + if (status) + goto out; + } + + status = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA); + if (status) + goto out; + + status = btrfs_map_bio(fs_info, bio, 0, 0); +out: + if (status) { + bio->bi_status = status; + bio_endio(bio); + return false; + } + return true; +} + +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, + struct iov_iter *iter, + u64 start, u64 lockend, + struct extent_state **cached_state, + struct block_device *bdev, + u64 offset, u64 disk_io_size, + size_t count, + const struct encoded_iov *encoded, + bool *unlocked) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct btrfs_encoded_read_private priv = { + 
.inode = inode, + .wait = __WAIT_QUEUE_HEAD_INITIALIZER(priv.wait), + .pending = ATOMIC_INIT(1), + .uptodate = true, + .skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM, + }; + struct page **pages; + unsigned long nr_pages, i; + struct bio *bio = NULL; + u64 cur; + size_t page_offset; + ssize_t ret; + + nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE); + pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); + if (!pages) + return -ENOMEM; + for (i = 0; i < nr_pages; i++) { + pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + if (!pages[i]) { + ret = -ENOMEM; + goto out; + } + } + + i = 0; + cur = 0; + while (cur < disk_io_size) { + size_t bytes = min_t(u64, disk_io_size - cur, + PAGE_SIZE); + + if (!bio) { + bio = btrfs_bio_alloc(offset + cur); + bio_set_dev(bio, bdev); + bio->bi_end_io = btrfs_encoded_read_endio; + bio->bi_private = &priv; + bio->bi_opf = REQ_OP_READ; + btrfs_io_bio(bio)->logical = start + cur; + } + + if (bio_add_page(bio, pages[i], bytes, 0) < bytes) { + bool success; + + success = btrfs_submit_encoded_read(bio); + bio = NULL; + if (!success) + break; + continue; + } + i++; + cur += bytes; + } + + if (bio) + btrfs_submit_encoded_read(bio); + if (atomic_dec_return(&priv.pending)) + wait_event(priv.wait, !atomic_read(&priv.pending)); + if (!priv.uptodate) { + ret = -EIO; + goto out; + } + + unlock_extent_cached(io_tree, start, lockend, cached_state); + inode_unlock(inode); + *unlocked = true; + + ret = copy_encoded_iov_to_iter(encoded, iter); + if (ret) + goto out; + if (encoded->compression) { + i = 0; + page_offset = 0; + } else { + i = (iocb->ki_pos - start) >> PAGE_SHIFT; + page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1); + } + cur = 0; + while (cur < count) { + size_t bytes = min_t(size_t, count - cur, + PAGE_SIZE - page_offset); + + if (copy_page_to_iter(pages[i], page_offset, bytes, + iter) != bytes) { + ret = -EFAULT; + goto out; + } + i++; + cur += bytes; + page_offset = 0; + } + ret = count; +out: + for (i = 0; i < 
nr_pages; i++) { + if (pages[i]) + put_page(pages[i]); + } + kfree(pages); + return ret; +} + +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + ssize_t ret; + size_t count; + struct block_device *em_bdev; + u64 start, lockend, offset, disk_io_size; + struct extent_state *cached_state = NULL; + struct extent_map *em; + struct encoded_iov encoded = {}; + bool unlocked = false; + + ret = generic_encoded_read_checks(iocb, iter); + if (ret < 0) + return ret; + if (ret == 0) + return copy_encoded_iov_to_iter(&encoded, iter); + count = ret; + + file_accessed(iocb->ki_filp); + + inode_lock_shared(inode); + + if (iocb->ki_pos >= inode->i_size) { + inode_unlock_shared(inode); + return copy_encoded_iov_to_iter(&encoded, iter); + } + start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize); + /* + * We don't know how long the extent containing iocb->ki_pos is, but if + * it's compressed we know that it won't be longer than this. + */ + lockend = start + BTRFS_MAX_UNCOMPRESSED - 1; + + for (;;) { + struct btrfs_ordered_extent *ordered; + + ret = btrfs_wait_ordered_range(inode, start, + lockend - start + 1); + if (ret) + goto out_unlock_inode; + lock_extent_bits(io_tree, start, lockend, &cached_state); + ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start, + lockend - start + 1); + if (!ordered) + break; + btrfs_put_ordered_extent(ordered); + unlock_extent_cached(io_tree, start, lockend, &cached_state); + cond_resched(); + } + + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, + lockend - start + 1, 0); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out_unlock_extent; + } + em_bdev = em->bdev; + + if (em->block_start == EXTENT_MAP_INLINE) { + u64 extent_start = em->start; + + /* + * For inline extents we get everything we need out of the + * extent item. 
+ */ + free_extent_map(em); + em = NULL; + ret = btrfs_encoded_read_inline(iocb, iter, start, lockend, + &cached_state, extent_start, + count, &encoded, &unlocked); + goto out; + } + + /* + * We only want to return up to EOF even if the extent extends beyond + * that. + */ + encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) - + iocb->ki_pos); + if (em->block_start == EXTENT_MAP_HOLE || + test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) { + offset = EXTENT_MAP_HOLE; + } else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { + offset = em->block_start; + /* + * Bail if the buffer isn't large enough to return the whole + * compressed extent. + */ + if (em->block_len > count) { + ret = -ENOBUFS; + goto out_em; + } + disk_io_size = count = em->block_len; + encoded.unencoded_len = em->ram_bytes; + encoded.unencoded_offset = iocb->ki_pos - em->orig_start; + ret = encoded_iov_compression_from_btrfs(&encoded, + em->compress_type); + if (ret) + goto out_em; + } else { + offset = em->block_start + (start - em->start); + if (encoded.len > count) + encoded.len = count; + /* + * Don't read beyond what we locked. This also limits the page + * allocations that we'll do. 
+ */ + disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start; + encoded.len = encoded.unencoded_len = count = + start + disk_io_size - iocb->ki_pos; + disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize); + } + free_extent_map(em); + em = NULL; + + if (offset == EXTENT_MAP_HOLE) { + unlock_extent_cached(io_tree, start, lockend, &cached_state); + inode_unlock_shared(inode); + unlocked = true; + ret = copy_encoded_iov_to_iter(&encoded, iter); + } else { + ret = btrfs_encoded_read_regular(iocb, iter, start, lockend, + &cached_state, em_bdev, offset, + disk_io_size, count, &encoded, + &unlocked); + } + +out: + if (ret >= 0) + iocb->ki_pos += encoded.len; +out_em: + free_extent_map(em); +out_unlock_extent: + if (!unlocked) + unlock_extent_cached(io_tree, start, lockend, &cached_state); +out_unlock_inode: + if (!unlocked) + inode_unlock_shared(inode); + return ret; +} + #ifdef CONFIG_SWAP /* * Add an entry indicating a block group or device which is pinned by a

From patchwork Wed Nov 20 18:24:32 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 11254655
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [RFC PATCH v3 12/12] btrfs: implement RWF_ENCODED writes
Date: Wed, 20 Nov 2019 10:24:32 -0800
Message-Id: <3252074cdd88bfaea491d3f22a75099970b314b5.1574273658.git.osandov@fb.com>
X-Mailer: git-send-email 2.24.0
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Omar Sandoval

The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.

Now that read and write are implemented, this also sets the
FMODE_ENCODED_IO flag in btrfs_file_open().
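Before any of the locking and reservation work, btrfs_encoded_write() in the diff below rejects requests whose sizes or alignment it cannot represent, and then derives the extent sizes by rounding up to the sector size. The following is a minimal user-space sketch of those checks and that arithmetic, not the kernel code itself. The sketch_* names, the struct layouts, and the 128 KiB limits are assumptions made for illustration; only the conditions themselves are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed limits for this sketch (128 KiB each); not taken from this patch. */
#define SKETCH_MAX_UNCOMPRESSED (128u * 1024u)
#define SKETCH_MAX_COMPRESSED   (128u * 1024u)

/* Hypothetical stand-in for the three encoded_iov fields the checks use. */
struct sketch_encoded {
	uint64_t len;              /* bytes of the file the write covers */
	uint64_t unencoded_len;    /* size of the data once decompressed */
	uint64_t unencoded_offset; /* offset of the file data in that buffer */
};

/* Hypothetical container for the sizes derived after validation. */
struct sketch_sizes {
	uint64_t num_bytes;      /* file range, rounded up to a sector */
	uint64_t ram_bytes;      /* decompressed size, rounded up to a sector */
	uint64_t disk_num_bytes; /* compressed size, rounded up to a sector */
	uint64_t nr_pages;       /* pages needed to hold disk_num_bytes */
};

/* Both helpers require a power-of-two alignment, as a sector size is. */
static bool sketch_is_aligned(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x & (a - 1)) == 0;
}

static uint64_t sketch_align_up(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Mirror the argument checks from the patch. orig_count is the number of
 * compressed bytes supplied by the caller; pos is the file offset.
 * Returns 0 if the request is acceptable, -EINVAL otherwise.
 */
static int sketch_check_encoded_write(const struct sketch_encoded *e,
				      uint64_t pos, uint64_t i_size,
				      uint64_t orig_count, uint64_t sectorsize)
{
	/* The extent size must be sane. */
	if (e->unencoded_len > SKETCH_MAX_UNCOMPRESSED ||
	    orig_count > SKETCH_MAX_COMPRESSED || orig_count == 0)
		return -EINVAL;
	/* Compressed data must be at least one byte smaller. */
	if (orig_count >= e->unencoded_len)
		return -EINVAL;
	/* The extent must start on a sector boundary. */
	if (!sketch_is_aligned(pos, sectorsize))
		return -EINVAL;
	/* An unaligned end is allowed only at or beyond i_size. */
	if (pos + e->len < i_size &&
	    !sketch_is_aligned(pos + e->len, sectorsize))
		return -EINVAL;
	/* The offset into the unencoded data must be sector-aligned. */
	if (!sketch_is_aligned(e->unencoded_offset, sectorsize))
		return -EINVAL;
	return 0;
}

/* Derive the rounded sizes the same way the patch does. */
static struct sketch_sizes sketch_compute_sizes(const struct sketch_encoded *e,
						uint64_t orig_count,
						uint64_t sectorsize,
						uint64_t page_size)
{
	struct sketch_sizes s;

	s.num_bytes = sketch_align_up(e->len, sectorsize);
	s.ram_bytes = sketch_align_up(e->unencoded_len, sectorsize);
	s.disk_num_bytes = sketch_align_up(orig_count, sectorsize);
	s.nr_pages = (s.disk_num_bytes + page_size - 1) / page_size;
	return s;
}
```

For example, with 4096-byte sectors and pages, a write of 1000 compressed bytes at offset 0 of an empty file with len = 5000 and unencoded_len = 8192 passes the checks, since the unaligned end lies beyond i_size. It yields num_bytes = 8192, ram_bytes = 8192, disk_num_bytes = 4096 and nr_pages = 1. The same request against a 1 MiB file fails, because the write would then end inside the file at an unaligned offset.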
Signed-off-by: Omar Sandoval --- fs/btrfs/compression.c | 6 +- fs/btrfs/compression.h | 5 +- fs/btrfs/ctree.h | 2 + fs/btrfs/file.c | 40 +++++-- fs/btrfs/inode.c | 243 +++++++++++++++++++++++++++++++++++++++- fs/btrfs/ordered-data.c | 12 +- fs/btrfs/ordered-data.h | 2 + 7 files changed, 293 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 05b6e404a291..ae24e8c5ea34 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio) bio->bi_status == BLK_STS_OK); cb->compressed_pages[0]->mapping = NULL; - end_compressed_writeback(inode, cb); + if (cb->writeback) + end_compressed_writeback(inode, cb); /* note, our inode could be gone now */ /* @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start, unsigned long compressed_len, struct page **compressed_pages, unsigned long nr_pages, - unsigned int write_flags) + unsigned int write_flags, bool writeback) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct bio *bio = NULL; @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start, cb->mirror_num = 0; cb->compressed_pages = compressed_pages; cb->compressed_len = compressed_len; + cb->writeback = writeback; cb->orig_bio = NULL; cb->nr_pages = nr_pages; diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index 4cb8be9ff88b..d4176384ec15 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -47,6 +47,9 @@ struct compressed_bio { /* the compression algorithm for this bio */ int compress_type; + /* Whether this is a write for writeback. 
*/ + bool writeback; + /* number of compressed pages in the array */ unsigned long nr_pages; @@ -93,7 +96,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start, unsigned long compressed_len, struct page **compressed_pages, unsigned long nr_pages, - unsigned int write_flags); + unsigned int write_flags, bool writeback); blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, int mirror_num, unsigned long bio_flags); diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 3be72a6e022e..3c020dbe894a 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2902,6 +2902,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end); void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start, u64 end, int uptodate); ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter); +ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from, + struct encoded_iov *encoded); extern const struct dentry_operations btrfs_dentry_operations; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 5425200092c2..16d8df66378f 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1893,8 +1893,7 @@ static void update_time_for_write(struct inode *inode) inode_inc_iversion(inode); } -static ssize_t btrfs_file_write_iter(struct kiocb *iocb, - struct iov_iter *from) +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from) { struct file *file = iocb->ki_filp; struct inode *inode = file_inode(file); @@ -1904,30 +1903,51 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, u64 end_pos; ssize_t num_written = 0; const bool sync = iocb->ki_flags & IOCB_DSYNC; + struct encoded_iov encoded; ssize_t err; loff_t pos; size_t count; loff_t oldsize; int clean_page = 0; - if (!(iocb->ki_flags & IOCB_DIRECT) && - (iocb->ki_flags & IOCB_NOWAIT)) + if ((iocb->ki_flags & IOCB_NOWAIT) && + (!(iocb->ki_flags & IOCB_DIRECT) || + (iocb->ki_flags & IOCB_ENCODED))) return -EOPNOTSUPP; + if 
(iocb->ki_flags & IOCB_ENCODED) { + err = copy_encoded_iov_from_iter(&encoded, from); + if (err) + return err; + } + if (!inode_trylock(inode)) { if (iocb->ki_flags & IOCB_NOWAIT) return -EAGAIN; inode_lock(inode); } - err = generic_write_checks(iocb, from); - if (err <= 0) { + if (iocb->ki_flags & IOCB_ENCODED) { + err = generic_encoded_write_checks(iocb, &encoded); + if (err) { + inode_unlock(inode); + return err; + } + count = encoded.len; + } else { + err = generic_write_checks(iocb, from); + if (err < 0) { + inode_unlock(inode); + return err; + } + count = iov_iter_count(from); + } + if (count == 0) { inode_unlock(inode); return err; } pos = iocb->ki_pos; - count = iov_iter_count(from); if (iocb->ki_flags & IOCB_NOWAIT) { /* * We will allocate space in case nodatacow is not set, @@ -1986,7 +2006,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, if (sync) atomic_inc(&BTRFS_I(inode)->sync_writers); - if (iocb->ki_flags & IOCB_DIRECT) { + if (iocb->ki_flags & IOCB_ENCODED) { + num_written = btrfs_encoded_write(iocb, from, &encoded); + } else if (iocb->ki_flags & IOCB_DIRECT) { num_written = __btrfs_direct_write(iocb, from); } else { num_written = btrfs_buffered_write(iocb, from); @@ -3459,7 +3481,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence) static int btrfs_file_open(struct inode *inode, struct file *filp) { - filp->f_mode |= FMODE_NOWAIT; + filp->f_mode |= FMODE_NOWAIT | FMODE_ENCODED_IO; return generic_file_open(inode, filp); } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 698e24aa8b21..b9b410af6d0d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -868,7 +868,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) ins.objectid, ins.offset, async_extent->pages, async_extent->nr_pages, - async_chunk->write_flags)) { + async_chunk->write_flags, true)) { struct page *p = async_extent->pages[0]; const u64 start = async_extent->start; const u64 end = start + 
async_extent->ram_size - 1; @@ -2392,7 +2392,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) && - !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags)) + !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags) && + !test_bit(BTRFS_ORDERED_ENCODED, &ordered_extent->flags)) clear_new_delalloc_bytes = true; nolock = btrfs_is_free_space_inode(BTRFS_I(inode)); @@ -10397,6 +10398,244 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter) return ret; } +ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from, + struct encoded_iov *encoded) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_root *root = BTRFS_I(inode)->root; + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct extent_changeset *data_reserved = NULL; + struct extent_state *cached_state = NULL; + int compression; + size_t orig_count; + u64 start, end; + u64 num_bytes, ram_bytes, disk_num_bytes; + unsigned long nr_pages, i; + struct page **pages; + struct btrfs_key ins; + bool extent_reserved = false; + struct extent_map *em; + ssize_t ret; + + switch (encoded->compression) { + case ENCODED_IOV_COMPRESSION_ZLIB: + compression = BTRFS_COMPRESS_ZLIB; + break; + case ENCODED_IOV_COMPRESSION_LZO: + compression = BTRFS_COMPRESS_LZO; + break; + case ENCODED_IOV_COMPRESSION_ZSTD: + compression = BTRFS_COMPRESS_ZSTD; + break; + default: + return -EINVAL; + } + if (encoded->encryption != ENCODED_IOV_ENCRYPTION_NONE) + return -EINVAL; + + orig_count = iov_iter_count(from); + + /* The extent size must be sane. */ + if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED || + orig_count > BTRFS_MAX_COMPRESSED || orig_count == 0) + return -EINVAL; + + /* + * The compressed data must be smaller than the decompressed data. 
+ * + * It's of course possible for data to compress to larger or the same + * size, but the buffered I/O path falls back to no compression for such + * data, and we don't want to break any assumptions by creating these + * extents. + * + * Note that this is less strict than the current check we have that the + * compressed data must be at least one sector smaller than the + * decompressed data. We only want to enforce the weaker requirement + * from old kernels that it is at least one byte smaller. + */ + if (orig_count >= encoded->unencoded_len) + return -EINVAL; + + /* The extent must start on a sector boundary. */ + start = iocb->ki_pos; + if (!IS_ALIGNED(start, fs_info->sectorsize)) + return -EINVAL; + + /* + * The extent must end on a sector boundary. However, we allow a write + * which ends at or extends i_size to have an unaligned length; we round + * up the extent size and set i_size to the unaligned end. + */ + if (start + encoded->len < inode->i_size && + !IS_ALIGNED(start + encoded->len, fs_info->sectorsize)) + return -EINVAL; + + /* Finally, the offset in the unencoded data must be sector-aligned. */ + if (!IS_ALIGNED(encoded->unencoded_offset, fs_info->sectorsize)) + return -EINVAL; + + num_bytes = ALIGN(encoded->len, fs_info->sectorsize); + ram_bytes = ALIGN(encoded->unencoded_len, fs_info->sectorsize); + end = start + num_bytes - 1; + + /* + * If the extent cannot be inline, the compressed data on disk must be + * sector-aligned. For convenience, we extend it with zeroes if it + * isn't. 
+ */ + disk_num_bytes = ALIGN(orig_count, fs_info->sectorsize); + nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE); + pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL_ACCOUNT); + if (!pages) + return -ENOMEM; + for (i = 0; i < nr_pages; i++) { + size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from)); + char *kaddr; + + pages[i] = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM); + if (!pages[i]) { + ret = -ENOMEM; + goto out_pages; + } + kaddr = kmap(pages[i]); + if (copy_from_iter(kaddr, bytes, from) != bytes) { + kunmap(pages[i]); + ret = -EFAULT; + goto out_pages; + } + if (bytes < PAGE_SIZE) + memset(kaddr + bytes, 0, PAGE_SIZE - bytes); + kunmap(pages[i]); + } + + for (;;) { + struct btrfs_ordered_extent *ordered; + + ret = btrfs_wait_ordered_range(inode, start, num_bytes); + if (ret) + goto out_pages; + ret = invalidate_inode_pages2_range(inode->i_mapping, + start >> PAGE_SHIFT, + end >> PAGE_SHIFT); + if (ret) + goto out_pages; + lock_extent_bits(io_tree, start, end, &cached_state); + ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start, + num_bytes); + if (!ordered && + !filemap_range_has_page(inode->i_mapping, start, end)) + break; + if (ordered) + btrfs_put_ordered_extent(ordered); + unlock_extent_cached(io_tree, start, end, &cached_state); + cond_resched(); + } + + ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), disk_num_bytes); + if (ret) + goto out_unlock; + ret = btrfs_qgroup_reserve_data(inode, &data_reserved, start, + num_bytes); + if (ret) + goto out_free_data_space; + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), num_bytes, + disk_num_bytes); + if (ret) + goto out_qgroup_free_data; + + /* Try an inline extent first. 
*/ + if (start == 0 && encoded->unencoded_len == encoded->len && + encoded->unencoded_offset == 0) { + ret = cow_file_range_inline(inode, encoded->len, orig_count, + compression, pages, true); + if (ret <= 0) { + if (ret == 0) + ret = orig_count; + goto out_delalloc_release; + } + } + + ret = btrfs_reserve_extent(root, disk_num_bytes, disk_num_bytes, + disk_num_bytes, 0, 0, &ins, 1, 1); + if (ret) + goto out_delalloc_release; + extent_reserved = true; + + em = create_io_em(inode, start, num_bytes, + start - encoded->unencoded_offset, ins.objectid, + ins.offset, ins.offset, ram_bytes, compression, + BTRFS_ORDERED_COMPRESSED); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out_free_reserved; + } + free_extent_map(em); + + ret = btrfs_add_ordered_extent(inode, start, num_bytes, ram_bytes, + ins.objectid, ins.offset, + encoded->unencoded_offset, + (1 << BTRFS_ORDERED_ENCODED) | + (1 << BTRFS_ORDERED_COMPRESSED), + compression); + if (ret) { + btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0); + goto out_free_reserved; + } + btrfs_dec_block_group_reservations(fs_info, ins.objectid); + + if (start + encoded->len > inode->i_size) + i_size_write(inode, start + encoded->len); + + unlock_extent_cached(io_tree, start, end, &cached_state); + + btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes); + + if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid, + ins.offset, pages, nr_pages, 0, + false)) { + struct page *page = pages[0]; + + page->mapping = inode->i_mapping; + btrfs_writepage_endio_finish_ordered(page, start, end, 0); + page->mapping = NULL; + ret = -EIO; + goto out_pages; + } + ret = orig_count; + goto out; + +out_free_reserved: + btrfs_dec_block_group_reservations(fs_info, ins.objectid); + btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1); +out_delalloc_release: + btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes); + btrfs_delalloc_release_metadata(BTRFS_I(inode), disk_num_bytes, + ret < 0); +out_qgroup_free_data: 
+ if (ret < 0) + btrfs_qgroup_free_data(inode, data_reserved, start, num_bytes); +out_free_data_space: + /* + * If btrfs_reserve_extent() succeeded, then we already decremented + * bytes_may_use. + */ + if (!extent_reserved) + btrfs_free_reserved_data_space_noquota(fs_info, disk_num_bytes); +out_unlock: + unlock_extent_cached(io_tree, start, end, &cached_state); +out_pages: + for (i = 0; i < nr_pages; i++) { + if (pages[i]) + put_page(pages[i]); + } + kvfree(pages); +out: + if (ret >= 0) + iocb->ki_pos += encoded->len; + return ret; +} + #ifdef CONFIG_SWAP /* * Add an entry indicating a block group or device which is pinned by a diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 3c6edc307657..1e52105c4c1a 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -451,9 +451,15 @@ void btrfs_remove_ordered_extent(struct inode *inode, spin_lock(&btrfs_inode->lock); btrfs_mod_outstanding_extents(btrfs_inode, -1); spin_unlock(&btrfs_inode->lock); - if (root != fs_info->tree_root) - btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes, - false); + if (root != fs_info->tree_root) { + u64 release; + + if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags)) + release = entry->disk_num_bytes; + else + release = entry->num_bytes; + btrfs_delalloc_release_metadata(btrfs_inode, release, false); + } if (test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) percpu_counter_add_batch(&fs_info->dio_bytes, -entry->num_bytes, diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index a038bda16fdf..0079ce49bc5e 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -61,6 +61,8 @@ enum { BTRFS_ORDERED_TRUNCATED, /* Regular IO for COW */ BTRFS_ORDERED_REGULAR, + /* RWF_ENCODED I/O */ + BTRFS_ORDERED_ENCODED, }; struct btrfs_ordered_extent {