From patchwork Tue Oct 15 18:42:37 2019
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 11191489
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc: Dave Chinner, Jann Horn, linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH man-pages] Document encoded I/O
Date: Tue, 15 Oct 2019 11:42:37 -0700

This adds a new page, rwf_encoded(7), providing an overview of encoded I/O
and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to reference it.
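
The relationship between len, unencoded_len, and unencoded_offset that
rwf_encoded(7) documents can be sketched in userspace as follows. This is
an illustrative sketch only and is not part of the patch: struct
encoded_iov_sketch merely mirrors the field layout of the proposed struct
encoded_iov, and struct encoded_range and encoded_range_of() are
hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the proposed struct encoded_iov (assumed layout). */
struct encoded_iov_sketch {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/* Hypothetical helper type: how one decoded extent maps onto file data. */
struct encoded_range {
	uint64_t skip_head;	/* decoded bytes to discard before the file data */
	uint64_t used;		/* decoded bytes that actually appear in the file */
	uint64_t skip_tail;	/* decoded bytes to discard after the file data */
	uint64_t zero_fill;	/* zero bytes appended when len exceeds the decoded data */
};

/*
 * Returns 0 and fills *out, or -1 if unencoded_offset is not strictly
 * less than unencoded_len (the same condition the page states).
 */
static int encoded_range_of(const struct encoded_iov_sketch *e,
			    struct encoded_range *out)
{
	uint64_t avail;

	if (e->unencoded_offset >= e->unencoded_len)
		return -1;

	/* Decoded bytes available from unencoded_offset to the end. */
	avail = e->unencoded_len - e->unencoded_offset;

	out->skip_head = e->unencoded_offset;
	out->used = e->len < avail ? e->len : avail;
	out->skip_tail = avail - out->used;
	out->zero_fill = e->len - out->used;

	/* The file range is always exactly len bytes long. */
	assert(out->used + out->zero_fill == e->len);
	return 0;
}
```

With len = 300, unencoded_len = 1000, and unencoded_offset = 600 (the
example used in the page), this yields 600 bytes skipped at the head, 300
bytes used, and 100 bytes skipped at the tail. Raising len to 500 leaves
400 decoded bytes used followed by 100 bytes of zero fill.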

Signed-off-by: Omar Sandoval
---
 man2/fcntl.2       |  10 +-
 man2/open.2        |  13 ++
 man2/readv.2       |  46 +++++++
 man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 365 insertions(+), 1 deletion(-)
 create mode 100644 man7/rwf_encoded.7

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index fce4f4c2b..76fe9cc6f 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -222,8 +222,9 @@ On Linux, this command can change only the
 .BR O_ASYNC ,
 .BR O_DIRECT ,
 .BR O_NOATIME ,
+.BR O_NONBLOCK ,
 and
-.B O_NONBLOCK
+.B O_ENCODED
 flags.
 It is not possible to change the
 .BR O_DSYNC
@@ -1803,6 +1804,13 @@ Attempted to clear the
 flag on a file that has the append-only attribute set.
 .TP
 .B EPERM
+Attempted to set the
+.B O_ENCODED
+flag and the calling process did not have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
+.B EPERM
 .I cmd
 was
 .BR F_ADD_SEALS ,
diff --git a/man2/open.2 b/man2/open.2
index b0f485b41..cdd3c549c 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -421,6 +421,14 @@ was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.B O_ENCODED
+Open the file with encoded I/O permissions;
+see
+.BR rwf_encoded (7).
+The caller must have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -1168,6 +1176,11 @@ did not match the owner of the file and the caller was not privileged.
 The operation was prevented by a file seal; see
 .BR fcntl (2).
 .TP
+.B EPERM
+The
+.B O_ENCODED
+flag was specified, but the caller was not privileged.
+.TP
 .B EROFS
 .I pathname
 refers to a file on a read-only filesystem and write access was
diff --git a/man2/readv.2 b/man2/readv.2
index af27aa63e..aa60b980a 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -265,6 +265,11 @@ the data is always appended to the end of the file.
 However, if the
 .I offset
 argument is \-1, the current file offset is updated.
+.TP
+.BR RWF_ENCODED " (since Linux 5.6)"
+Read or write encoded (e.g., compressed) data.
+See
+.BR rwf_encoded (7).
 .SH RETURN VALUE
 On success,
 .BR readv (),
@@ -284,6 +289,13 @@ than requested (see
 and
 .BR write (2)).
 .PP
+If
+.B
+RWF_ENCODED
+was specified in
+.IR flags ,
+then the return value is the number of encoded bytes.
+.PP
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
 The errors are as given for
@@ -314,6 +326,40 @@ is less than zero or greater than the permitted maximum.
 .TP
 .B EOPNOTSUPP
 An unknown flag is specified in \fIflags\fP.
+.TP
+.B EOPNOTSUPP
+.B RWF_ENCODED
+is specified in
+.I flags
+and the filesystem does not implement encoded I/O.
+.TP
+.B EPERM
+.B RWF_ENCODED
+is specified in
+.I flags
+and the file was not opened with the
+.B O_ENCODED
+flag.
+.PP
+.BR preadv2 ()
+can fail for the following reasons:
+.TP
+.B EFBIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and buffers in
+.I iov
+were not big enough to return the encoded data.
+.PP
+.BR pwritev2 ()
+can fail for the following reasons:
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the alignment and/or size requirements are not met.
 .SH VERSIONS
 .BR preadv ()
 and
diff --git a/man7/rwf_encoded.7 b/man7/rwf_encoded.7
new file mode 100644
index 000000000..90f5292e2
--- /dev/null
+++ b/man7/rwf_encoded.7
@@ -0,0 +1,297 @@
+.\" Copyright (c) 2019 by Omar Sandoval
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\"
+.TH RWF_ENCODED 7 2019-10-14 "Linux" "Linux Programmer's Manual"
+.SH NAME
+rwf_encoded \- overview of encoded I/O
+.SH DESCRIPTION
+Several filesystems (e.g., Btrfs) support transparent encoding
+(e.g., compression, encryption) of data on disk:
+written data is encoded by the kernel before it is written to disk,
+and read data is decoded before being returned to the user.
+In some cases, it is useful to skip this encoding step.
+For example, the user may want to read the compressed contents of a file
+or write pre-compressed data directly to a file.
+This is referred to as "encoded I/O".
+.SS Encoded I/O API
+Encoded I/O is specified with the
+.B RWF_ENCODED
+flag to
+.BR preadv2 (2)
+and
+.BR pwritev2 (2).
+If
+.B RWF_ENCODED
+is specified, then
+.I iov[0].iov_base
+points to an
+.I
+encoded_iov
+structure, defined in
+.I
+<linux/fs.h>
+as:
+.PP
+.in +4n
+.EX
+struct encoded_iov {
+    __u64 len;
+    __u64 unencoded_len;
+    __u64 unencoded_offset;
+    __u32 compression;
+    __u32 encryption;
+
+};
+.EE
+.in
+.PP
+.I iov[0].iov_len
+must be set to
+.IR "sizeof(struct\ encoded_iov)" .
+The remaining buffers contain the encoded data.
+.PP
+.I compression
+and
+.I encryption
+are the encoding fields.
+.I compression
+is one of
+.B ENCODED_IOV_COMPRESSION_NONE
+(zero),
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+or
+.BR ENCODED_IOV_COMPRESSION_ZSTD .
+.I encryption
+is currently always
+.B ENCODED_IOV_ENCRYPTION_NONE
+(zero).
+.PP
+.I unencoded_len
+is the length of the unencoded (i.e., decrypted and decompressed) data.
+.I unencoded_offset
+is the offset into the unencoded data where the data in the file begins
+(strictly less than
+.IR unencoded_len ).
+.I len
+is the length of the data in the file.
+.PP
+In most cases,
+.I len
+is equal to
+.I unencoded_len
+and
+.I unencoded_offset
+is zero.
+However, it may be necessary to refer to a subset of the unencoded data,
+usually because a read occurred in the middle of an encoded extent,
+because part of an extent was overwritten or deallocated in some
+way (e.g., with
+.BR write (2),
+.BR truncate (2),
+or
+.BR fallocate (2))
+or because part of an extent was added to the file (e.g., with
+.BR ioctl_ficlonerange (2)
+or
+.BR ioctl_fideduperange (2)).
+For example, if
+.I len
+is 300,
+.I unencoded_len
+is 1000,
+and
+.I unencoded_offset
+is 600,
+then the encoded data is 1000 bytes long when decoded,
+of which only the 300 bytes starting at offset 600 are used;
+the first 600 and last 100 bytes should be ignored.
+.PP
+Additionally,
+.I len
+may be greater than
+.I unencoded_len
+-
+.IR unencoded_offset;
+in this case, the data in the file is longer than the unencoded data,
+and the difference is zero-filled.
+.PP
+If the unencoded data is actually longer than
+.IR unencoded_len ,
+then it is truncated;
+if it is shorter, then it is extended with zeroes.
+.PP
+For
+.BR pwritev2 (),
+the metadata should be specified in
+.IR iov[0] ,
+and the encoded data should be passed in the remaining buffers.
+This returns the number of encoded bytes written (that is, the sum of
+.I iov[n].iov_len
+for 1 <=
+.I n
+<
+.IR iovcnt ;
+partial writes will not occur).
+If the
+.I offset
+argument to
+.BR pwritev2 ()
+is -1, then the file offset is incremented by
+.IR len .
+At least one encoding field must be non-zero.
+Note that the encoded data is not validated when it is written;
+if it is not valid (e.g., it cannot be decompressed),
+then a subsequent read may result in an error.
+.PP
+For
+.BR preadv2 (),
+the metadata is returned in
+.IR iov[0] ,
+and the encoded data is returned in the remaining buffers.
+This returns the number of encoded bytes read.
+Note that a return value of zero does not indicate end of file;
+one should refer to
+.I len
+(for example, a hole in the file has a non-zero
+.I len
+but a zero return value).
+A
+.I len
+of zero indicates end of file.
+If the
+.I offset
+argument to
+.BR preadv2 ()
+is -1, then the file offset is incremented by
+.IR len .
+If the provided buffers are not large enough to return an entire encoded
+extent,
+then this returns -1 and sets
+.I errno
+to
+.BR EFBIG .
+This will only return one encoded extent per call.
+This can also read data which is not encoded;
+all encoding fields will be zero in that case.
+.SS Security
+Encoded I/O creates the potential for some security issues:
+.IP * 3
+Encoded writes allow writing arbitrary data which the kernel will decode on
+a subsequent read. Decompression algorithms are complex and may have bugs
+which can be exploited by malicious data.
+.IP *
+Encoded reads may return data which is not logically present in the file
+(see the discussion of
+.I len
+vs.
+.I unencoded_len
+above).
+It may not be intended for this data to be readable.
+.PP
+Therefore, encoded I/O requires privilege.
+Namely, the
+.B RWF_ENCODED
+flag may only be used when the file was opened with the
+.B O_ENCODED
+flag to
+.BR open (2),
+which requires the
+.B CAP_SYS_ADMIN
+capability.
+.B O_ENCODED
+may be set and cleared with
+.BR fcntl (2).
+Note that it is not cleared on
+.BR fork (2)
+or
+.BR execve (2);
+one may wish to use
+.B O_CLOEXEC
+with
+.BR O_ENCODED .
+.SS Filesystem support
+Encoded I/O is supported on the following filesystems:
+.TP
+Btrfs (since Linux 5.6)
+.IP
+Btrfs supports encoded reads and writes of compressed data.
+The data is encoded as follows:
+.RS
+.IP * 3
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+then the encoded data is a single zlib stream.
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+then the encoded data is compressed page by page with LZO1X
+and wrapped in the format described in the Linux kernel source file
+.IR fs/btrfs/lzo.c .
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZSTD ,
+then the encoded data is a single zstd frame compressed with the
+.I windowLog
+compression parameter set to no more than 17.
+.RE
+.IP
+Additionally, there are some restrictions on
+.BR pwritev2 ():
+.RS
+.IP * 3
+.I offset
+(or the current file offset if
+.I offset
+is -1) must be aligned to the sector size of the filesystem.
+.IP *
+.I len
+must be aligned to the sector size of the filesystem
+unless the data ends at or beyond the current end of the file.
+.IP *
+.I unencoded_len
+and the length of the encoded data must each be no more than 128 KiB.
+This limit may increase in the future.
+.IP *
+The length of the encoded data rounded up to the nearest sector must be
+less than
+.I unencoded_len
+rounded up to the nearest sector.
+.IP *
+Referring to a subset of unencoded data is not yet implemented; i.e.,
+.I len
+must equal
+.I unencoded_len
+and
+.I unencoded_offset
+must be zero.
+.IP *
+Writing compressed inline extents is not yet implemented.
+.RE

From patchwork Tue Oct 15 18:42:40 2019
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 11191511
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc: Dave Chinner, Jann Horn, linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
Date: Tue, 15 Oct 2019 11:42:40 -0700
Message-Id: <7f98cf5409cf2b583cd5b3451fc739fd3428873b.1571164762.git.osandov@fb.com>

Btrfs supports transparent compression: data written by the user can be
compressed when written to disk and decompressed when read back.
However, we'd like to add an interface to write pre-compressed data
directly to the filesystem, and the matching interface to read
compressed data without decompressing it.

This adds support for so-called "encoded I/O" via preadv2() and
pwritev2().
A new RWF_ENCODED flag indicates that a read or write is "encoded".
If this flag is set, iov[0].iov_base points to a struct encoded_iov
which is used for metadata: namely, the compression algorithm, unencoded
(i.e., decompressed) length, and what subrange of the unencoded data
should be used (needed for truncated or hole-punched extents and when
reading in the middle of an extent). For reads, the filesystem returns
this information; for writes, the caller provides it to the filesystem.
iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
used to extend the interface in the future. The remaining iovecs contain
the encoded extent.

Filesystems must indicate that they support encoded writes by setting
FMODE_ENCODED_IO in ->file_open().

Signed-off-by: Omar Sandoval
---
 include/linux/fs.h      | 14 +++++++
 include/uapi/linux/fs.h | 26 ++++++++++++-
 mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
 3 files changed, 108 insertions(+), 14 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0d909d35763..54681f21e05e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
+/* File supports encoded IO */
+#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
@@ -314,6 +317,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_ENCODED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
+struct encoded_iov;
+extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
+extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
+extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
+				struct iov_iter *);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
 				loff_t *count, unsigned int remap_flags);
@@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 			return -EOPNOTSUPP;
 		ki->ki_flags |= IOCB_NOWAIT;
 	}
+	if (flags & RWF_ENCODED) {
+		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
+			return -EOPNOTSUPP;
+		ki->ki_flags |= IOCB_ENCODED;
+	}
 	if (flags & RWF_HIPRI)
 		ki->ki_flags |= IOCB_HIPRI;
 	if (flags & RWF_DSYNC)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..ed92a8a257cb 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -284,6 +284,27 @@ struct fsxattr {
 
 typedef int __bitwise __kernel_rwf_t;
 
+enum {
+	ENCODED_IOV_COMPRESSION_NONE,
+	ENCODED_IOV_COMPRESSION_ZLIB,
+	ENCODED_IOV_COMPRESSION_LZO,
+	ENCODED_IOV_COMPRESSION_ZSTD,
+	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
+};
+
+enum {
+	ENCODED_IOV_ENCRYPTION_NONE,
+	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
+};
+
+struct encoded_iov {
+	__u64 len;
+	__u64 unencoded_len;
+	__u64 unencoded_offset;
+	__u32 compression;
+	__u32 encryption;
+};
+
 /* high priority request, poll if possible */
 #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
 
@@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* encoded (e.g., compressed or encrypted) IO */
+#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ENCODED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index 1146fcfa3215..d2e6d9caf353 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
 	return 0;
 }
 
-/*
- * Performs necessary checks before doing a write
- *
- * Can adjust writing position or amount of bytes to write.
- * Returns appropriate error code that caller should return or
- * zero in case that write should be allowed.
- */
-inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
-	loff_t count;
-	int ret;
 
 	if (IS_SWAPFILE(inode))
 		return -ETXTBSY;
 
-	if (!iov_iter_count(from))
+	if (!*count)
 		return 0;
 
 	/* FIXME: this is for backwards compatibility with 2.4 */
@@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
 		return -EINVAL;
 
-	count = iov_iter_count(from);
-	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
+	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
+}
+
+/*
+ * Performs necessary checks before doing a write
+ *
+ * Can adjust writing position or amount of bytes to write.
+ * Returns a negative errno or the new number of bytes to write.
+ */
+inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	loff_t count = iov_iter_count(from);
+	int ret;
+
+	ret = generic_write_checks_common(iocb, &count);
 	if (ret)
 		return ret;
 
@@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int generic_encoded_write_checks(struct kiocb *iocb,
+				 struct encoded_iov *encoded)
+{
+	loff_t count = encoded->unencoded_len;
+	int ret;
+
+	ret = generic_write_checks_common(iocb, &count);
+	if (ret)
+		return ret;
+
+	if (count != encoded->unencoded_len) {
+		/*
+		 * The write got truncated by generic_write_checks_common(). We
+		 * can't do a partial encoded write.
+		 */
+		return -EFBIG;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(generic_encoded_write_checks);
+
+ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
+{
+	if (!(iocb->ki_filp->f_flags & O_ENCODED))
+		return -EPERM;
+	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
+		return -EINVAL;
+	return iov_iter_count(iter) - sizeof(struct encoded_iov);
+}
+EXPORT_SYMBOL(check_encoded_read);
+
+int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
+			 struct iov_iter *from)
+{
+	if (!(iocb->ki_filp->f_flags & O_ENCODED))
+		return -EPERM;
+	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
+		return -EINVAL;
+	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
+		return -EFAULT;
+	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
+	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
+		return -EINVAL;
+	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
+	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
+		return -EINVAL;
+	if (encoded->unencoded_offset >= encoded->unencoded_len)
+		return -EINVAL;
+	return 0;
+}
+EXPORT_SYMBOL(import_encoded_write);
+
 /*
  * Performs necessary checks before doing a clone.
  *

From patchwork Tue Oct 15 18:42:41 2019
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 11191515
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Cc: Dave Chinner, Jann Horn, linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio()
Date: Tue, 15 Oct 2019 11:42:41 -0700
Message-Id: <01fdb646d7572f7d0d123937835db5c605e25a5e.1571164762.git.osandov@fb.com>

This isn't actually dio-specific; it just looks up the csums starting at
the given offset instead of using the page index. Rename it to
btrfs_lookup_bio_sums_at_offset() and add the dst parameter. We might
even want to expose __btrfs_lookup_bio_sums() as the public API instead
of having two trivial wrappers, but I'll leave that for another day.

Signed-off-by: Omar Sandoval
---
 fs/btrfs/ctree.h     |  5 +++--
 fs/btrfs/file-item.c | 18 +++++++++---------
 fs/btrfs/inode.c     |  4 ++--
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 19d669d12ca1..71552b2ca340 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2791,8 +2791,9 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		    struct btrfs_fs_info *fs_info, u64 bytenr, u64 len);
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 				   u8 *dst);
-blk_status_t btrfs_lookup_bio_sums_dio(struct inode *inode, struct bio *bio,
-				       u64 logical_offset);
+blk_status_t btrfs_lookup_bio_sums_at_offset(struct inode *inode,
+					     struct bio *bio, u64 offset,
+					     u8 *dst);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 1a599f50837b..d98f06fc2978 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -148,8 +148,9 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
-				   u64 logical_offset, u8 *dst, int dio)
+static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode,
+					    struct bio *bio,
+					    bool at_offset, u64 offset, u8 *dst)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio_vec bvec;
@@ -159,7 +160,6 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_path *path;
 	u8 *csum;
-	u64 offset = 0;
 	u64 item_start_offset = 0;
 	u64 item_last_offset = 0;
 	u64 disk_bytenr;
@@ -205,15 +205,13 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
 	}
 
 	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
-	if (dio)
-		offset = logical_offset;
 
 	bio_for_each_segment(bvec, bio, iter) {
 		page_bytes_left = bvec.bv_len;
 		if (count)
 			goto next;
 
-		if (!dio)
+		if (!at_offset)
 			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
 		count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
 					       csum, nblocks);
@@ -291,12 +289,14 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 				   u8 *dst)
 {
-	return __btrfs_lookup_bio_sums(inode, bio, 0, dst, 0);
+	return __btrfs_lookup_bio_sums(inode, bio, false, 0, dst);
 }
 
-blk_status_t btrfs_lookup_bio_sums_dio(struct inode *inode, struct bio *bio, u64 offset)
+blk_status_t btrfs_lookup_bio_sums_at_offset(struct inode *inode,
+					     struct bio *bio, u64 offset,
+					     u8 *dst)
 {
-	return __btrfs_lookup_bio_sums(inode, bio, offset, NULL, 1);
+	return __btrfs_lookup_bio_sums(inode, bio, true, offset, dst);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0f2754eaa05b..8bce46122ef7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8319,8 +8319,8 @@ static inline blk_status_t btrfs_lookup_and_bind_dio_csum(struct inode *inode,
 	 * contention.
*/ if (dip->logical_offset == file_offset) { - ret = btrfs_lookup_bio_sums_dio(inode, dip->orig_bio, - file_offset); + ret = btrfs_lookup_bio_sums_at_offset(inode, dip->orig_bio, + file_offset, NULL); if (ret) return ret; } From patchwork Tue Oct 15 18:42:42 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11191503 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 44256912 for ; Tue, 15 Oct 2019 18:43:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 1015F20873 for ; Tue, 15 Oct 2019 18:43:13 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="UOyI7e/y" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389253AbfJOSnL (ORCPT ); Tue, 15 Oct 2019 14:43:11 -0400 Received: from mail-pf1-f193.google.com ([209.85.210.193]:37993 "EHLO mail-pf1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389248AbfJOSnK (ORCPT ); Tue, 15 Oct 2019 14:43:10 -0400 Received: by mail-pf1-f193.google.com with SMTP id h195so13034213pfe.5 for ; Tue, 15 Oct 2019 11:43:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=nZLb2b0RCY7vgqstSQB8MBGrqVqeAFbGAfT3ja1UqaE=; b=UOyI7e/ypYxPrLB8Zk/fVQLxcKYbWvSCZojN6gcbranfCajXKZ10J10Ivx9vo4mKud DHdWAfMSC5+L+vqfMf0O5Lwu2VTgM7qpc2Oqhgpx0GFa4c71cFduoV36W4lgBjKZoh/Z rnPZvVeCaoDVOWOVwCTvvcWkbjwYT2aCQJmFyqJgY4Wlejha/N9fbXJdxuuygEaGnysn Bo5lpX9c8J/7jAof4zYWDAsDwfGzQXsN2gAOdkng6fI/1gdJwvI1CVK8R5wiLuE+gZiy 
D9Hj6qHSv6rZOvu7QfUPOhQxH0QY0SyWIzkF3XnO55q3N7xHeI9zY14Ct/e5KsuXpaX2 1XBw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=nZLb2b0RCY7vgqstSQB8MBGrqVqeAFbGAfT3ja1UqaE=; b=rKSYHKHmeNaRMy/hcdksip23enZT4n8UAGLP2XYzK2pjJ9itqT8Cq1BomHChatgTGb 9zw9VMaB+IX/tB35ENIzeYonhJh/9lGySTpyOYcSuXLuqoeMtKSa/BD0UybnfB3Is6Wa RbxpSPeMS8t8P5Yxg02Y3GzD9s3282uJJKuzgxXTW2Zcx5Udv/9F7erXtWEabVevLrBf 3lSdSC1JDnAN8uxoCzGBd7awAMNPNF+QUJwEvsJPXk/wH9r9IbLo+TEDVO18AzL1bk1z mUignJ1pjjeorXtTKijUCkyUiyrc2pkq5eFW5e2lITJ4gSRUR5Z3Sx2j1zSAAIBIhf3z 0Uww== X-Gm-Message-State: APjAAAWiPdv9+CjwV51uulQiFcXyqrXrPkujkBU+dBusdNKiXhhF/z8B ziw4kVljmP7TWzndPRlpZU0Y+x4Qbyo= X-Google-Smtp-Source: APXvYqwCDgtP8XXNWpoB+tpyk6j82dFhTvvvAL9XwAdITjx7TvKLrOWF3LU0LO6N/bKnTqfd2dDlUg== X-Received: by 2002:aa7:9157:: with SMTP id 23mr40795567pfi.61.1571164988222; Tue, 15 Oct 2019 11:43:08 -0700 (PDT) Received: from vader.thefacebook.com ([2620:10d:c090:200::2:3e5e]) by smtp.gmail.com with ESMTPSA id z3sm40396pjd.25.2019.10.15.11.43.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 15 Oct 2019 11:43:07 -0700 (PDT) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org Cc: Dave Chinner , Jann Horn , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads Date: Tue, 15 Oct 2019 11:42:42 -0700 Message-Id: <338d3b28dd31249053620b83e6ff190ad965fadc.1571164762.git.osandov@fb.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: References: MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval There are 4 main cases: 1. Inline extents: we copy the data straight out of the extent buffer. 2. 
Hole/preallocated extents: we indicate the size of the extent starting from the read position; we don't need to copy zeroes. 3. Regular, uncompressed extents: we read the sectors we need directly from disk. 4. Regular, compressed extents: we read the entire compressed extent from disk and indicate what subset of the decompressed extent is in the file. This initial implementation simplifies a few things that can be improved in the future: - We hold the inode lock during the operation. - Cases 1, 3, and 4 allocate temporary memory to read into before copying out to userspace. - Cases 3 and 4 do not implement repair yet. Signed-off-by: Omar Sandoval --- fs/btrfs/ctree.h | 2 + fs/btrfs/file.c | 12 +- fs/btrfs/inode.c | 462 +++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 475 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 71552b2ca340..3b2aa1c7218c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2906,6 +2906,8 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page, int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end); void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start, u64 end, int uptodate); +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter); + extern const struct dentry_operations btrfs_dentry_operations; /* ioctl.c */ diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 27e5b269e729..51740cee39fc 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -390,6 +390,16 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info) return 0; } +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter) +{ + if (iocb->ki_flags & IOCB_ENCODED) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + return btrfs_encoded_read(iocb, iter); + } + return generic_file_read_iter(iocb, iter); +} + /* simple helper to fault in pages and copy. This should go away * and be replaced with calls into generic code. 
*/ @@ -3457,7 +3467,7 @@ static int btrfs_file_open(struct inode *inode, struct file *filp) const struct file_operations btrfs_file_operations = { .llseek = btrfs_file_llseek, - .read_iter = generic_file_read_iter, + .read_iter = btrfs_file_read_iter, .splice_read = generic_file_splice_read, .write_iter = btrfs_file_write_iter, .mmap = btrfs_file_mmap, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 8bce46122ef7..174d0738d2c9 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -10593,6 +10593,468 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end) } } +static int encoded_iov_compression_from_btrfs(struct encoded_iov *encoded, + unsigned int compress_type) +{ + switch (compress_type) { + case BTRFS_COMPRESS_NONE: + encoded->compression = ENCODED_IOV_COMPRESSION_NONE; + break; + case BTRFS_COMPRESS_ZLIB: + encoded->compression = ENCODED_IOV_COMPRESSION_ZLIB; + break; + case BTRFS_COMPRESS_LZO: + encoded->compression = ENCODED_IOV_COMPRESSION_LZO; + break; + case BTRFS_COMPRESS_ZSTD: + encoded->compression = ENCODED_IOV_COMPRESSION_ZSTD; + break; + default: + return -EIO; + } + return 0; +} + +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb, + struct iov_iter *iter, u64 start, + u64 lockend, + struct extent_state **cached_state, + u64 extent_start, size_t count, + struct encoded_iov *encoded, + bool *unlocked) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_file_extent_item *item; + u64 ram_bytes; + unsigned long ptr; + void *tmp; + ssize_t ret; + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path, + btrfs_ino(BTRFS_I(inode)), extent_start, + 0); + if (ret) { + if (ret > 0) { + /* The extent item disappeared? 
*/ + ret = -EIO; + } + goto out; + } + leaf = path->nodes[0]; + item = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_file_extent_item); + + ram_bytes = btrfs_file_extent_ram_bytes(leaf, item); + ptr = btrfs_file_extent_inline_start(item); + + encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) - + iocb->ki_pos); + ret = encoded_iov_compression_from_btrfs(encoded, + btrfs_file_extent_compression(leaf, item)); + if (ret) + goto out; + if (encoded->compression) { + size_t inline_size; + + inline_size = btrfs_file_extent_inline_item_len(leaf, + btrfs_item_nr(path->slots[0])); + if (inline_size > count) { + ret = -EFBIG; + goto out; + } + count = inline_size; + encoded->unencoded_len = ram_bytes; + encoded->unencoded_offset = iocb->ki_pos - extent_start; + } else { + encoded->len = encoded->unencoded_len = count = + min_t(u64, count, encoded->len); + ptr += iocb->ki_pos - extent_start; + } + + tmp = kmalloc(count, GFP_NOFS); + if (!tmp) { + ret = -ENOMEM; + goto out; + } + read_extent_buffer(leaf, tmp, ptr, count); + btrfs_free_path(path); + path = NULL; + unlock_extent_cached(io_tree, start, lockend, cached_state); + inode_unlock(inode); + *unlocked = true; + if (copy_to_iter(encoded, sizeof(*encoded), iter) == sizeof(*encoded) && + copy_to_iter(tmp, count, iter) == count) + ret = count; + else + ret = -EFAULT; + kfree(tmp); + +out: + btrfs_free_path(path); + return ret; +} + +struct btrfs_encoded_read_private { + struct inode *inode; + wait_queue_head_t wait; + atomic_t pending; + bool uptodate; + bool skip_csum; +}; + +static bool btrfs_encoded_read_check_csums(struct btrfs_io_bio *io_bio) +{ + struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private; + struct inode *inode = priv->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + u32 sectorsize = fs_info->sectorsize; + struct bio_vec *bvec; + struct bvec_iter_all iter_all; + u64 offset = 0; + + if (priv->skip_csum) + return true; + bio_for_each_segment_all(bvec, 
&io_bio->bio, iter_all) { + unsigned int i, nr_sectors, pgoff; + + nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len); + pgoff = bvec->bv_offset; + for (i = 0; i < nr_sectors; i++) { + int csum_pos; + + csum_pos = BTRFS_BYTES_TO_BLKS(fs_info, offset); + if (__readpage_endio_check(inode, io_bio, csum_pos, + bvec->bv_page, pgoff, + io_bio->logical + offset, + sectorsize)) + return false; + offset += sectorsize; + pgoff += sectorsize; + } + } + return true; +} + +static void btrfs_encoded_read_endio(struct bio *bio) +{ + struct btrfs_encoded_read_private *priv = bio->bi_private; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + + if (bio->bi_status || !btrfs_encoded_read_check_csums(io_bio)) + priv->uptodate = false; + if (!atomic_dec_return(&priv->pending)) + wake_up(&priv->wait); + btrfs_io_bio_free_csum(io_bio); + bio_put(bio); +} + +static bool btrfs_submit_encoded_read(struct bio *bio) +{ + struct btrfs_encoded_read_private *priv = bio->bi_private; + struct inode *inode = priv->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + blk_status_t status; + + atomic_inc(&priv->pending); + + if (!priv->skip_csum) { + status = btrfs_lookup_bio_sums_at_offset(inode, bio, + btrfs_io_bio(bio)->logical, + NULL); + if (status) + goto out; + } + + status = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA); + if (status) + goto out; + + status = btrfs_map_bio(fs_info, bio, 0, 0); +out: + if (status) { + bio->bi_status = status; + bio_endio(bio); + return false; + } + return true; +} + +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, + struct iov_iter *iter, + u64 start, u64 lockend, + struct extent_state **cached_state, + struct block_device *bdev, + u64 offset, u64 disk_io_size, + size_t count, + const struct encoded_iov *encoded, + bool *unlocked) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct btrfs_encoded_read_private priv = { + .inode = inode, + .wait = 
__WAIT_QUEUE_HEAD_INITIALIZER(priv.wait), + .pending = ATOMIC_INIT(1), + .uptodate = true, + .skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM, + }; + struct page **pages; + unsigned long nr_pages, i; + struct bio *bio = NULL; + u64 cur; + size_t page_offset; + ssize_t ret; + + nr_pages = (disk_io_size + PAGE_SIZE - 1) >> PAGE_SHIFT; + pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); + if (!pages) + return -ENOMEM; + for (i = 0; i < nr_pages; i++) { + pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + if (!pages[i]) { + ret = -ENOMEM; + goto out; + } + } + + i = 0; + cur = 0; + while (cur < disk_io_size) { + size_t bytes = min_t(u64, disk_io_size - cur, + PAGE_SIZE); + + if (!bio) { + bio = btrfs_bio_alloc(offset + cur); + bio_set_dev(bio, bdev); + bio->bi_end_io = btrfs_encoded_read_endio; + bio->bi_private = &priv; + bio->bi_opf = REQ_OP_READ; + btrfs_io_bio(bio)->logical = start + cur; + } + + if (bio_add_page(bio, pages[i], bytes, 0) < bytes) { + bool success; + + success = btrfs_submit_encoded_read(bio); + bio = NULL; + if (!success) + break; + continue; + } + i++; + cur += bytes; + } + + if (bio) + btrfs_submit_encoded_read(bio); + if (atomic_dec_return(&priv.pending)) + wait_event(priv.wait, !atomic_read(&priv.pending)); + if (!priv.uptodate) { + ret = -EIO; + goto out; + } + + unlock_extent_cached(io_tree, start, lockend, cached_state); + inode_unlock(inode); + *unlocked = true; + + if (copy_to_iter(encoded, sizeof(*encoded), iter) != sizeof(*encoded)) { + ret = -EFAULT; + goto out; + } + if (encoded->compression) { + i = 0; + page_offset = 0; + } else { + i = (iocb->ki_pos - start) >> PAGE_SHIFT; + page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1); + } + cur = 0; + while (cur < count) { + size_t bytes = min_t(size_t, count - cur, + PAGE_SIZE - page_offset); + + if (copy_page_to_iter(pages[i], page_offset, bytes, + iter) != bytes) { + ret = -EFAULT; + goto out; + } + i++; + cur += bytes; + page_offset = 0; + } + ret = count; +out: + 
for (i = 0; i < nr_pages; i++) { + if (pages[i]) + put_page(pages[i]); + } + kfree(pages); + return ret; +} + +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + ssize_t ret; + size_t count; + struct block_device *em_bdev; + u64 start, lockend, offset, disk_io_size; + struct extent_state *cached_state = NULL; + struct extent_map *em; + struct encoded_iov encoded = {}; + bool unlocked = false; + + ret = check_encoded_read(iocb, iter); + if (ret < 0) + return ret; + if (ret == 0) { +empty: + if (copy_to_iter(&encoded, sizeof(encoded), iter) == + sizeof(encoded)) + return 0; + else + return -EFAULT; + } + count = ret; + + file_accessed(iocb->ki_filp); + + inode_lock(inode); + + if (iocb->ki_pos >= inode->i_size) { + inode_unlock(inode); + goto empty; + } + start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize); + /* + * We don't know how long the extent containing iocb->ki_pos is, but if + * it's compressed we know that it won't be longer than this. 
+ */ + lockend = start + BTRFS_MAX_UNCOMPRESSED - 1; + + for (;;) { + struct btrfs_ordered_extent *ordered; + + ret = btrfs_wait_ordered_range(inode, start, + lockend - start + 1); + if (ret) + goto out_unlock_inode; + lock_extent_bits(io_tree, start, lockend, &cached_state); + ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start, + lockend - start + 1); + if (!ordered) + break; + btrfs_put_ordered_extent(ordered); + unlock_extent_cached(io_tree, start, lockend, &cached_state); + cond_resched(); + } + + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, + lockend - start + 1, 0); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out_unlock_extent; + } + em_bdev = em->bdev; + + if (em->block_start == EXTENT_MAP_INLINE) { + u64 extent_start = em->start; + + /* + * For inline extents we get everything we need out of the + * extent item. + */ + free_extent_map(em); + em = NULL; + ret = btrfs_encoded_read_inline(iocb, iter, start, lockend, + &cached_state, extent_start, + count, &encoded, &unlocked); + goto out; + } + + /* + * We only want to return up to EOF even if the extent extends beyond + * that. + */ + encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) - + iocb->ki_pos); + if (em->block_start == EXTENT_MAP_HOLE || + test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) { + offset = EXTENT_MAP_HOLE; + } else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { + offset = em->block_start; + /* + * Bail if the buffer isn't large enough to return the whole + * compressed extent. + */ + if (em->block_len > count) { + ret = -EFBIG; + goto out_em; + } + disk_io_size = count = em->block_len; + encoded.unencoded_len = em->ram_bytes; + encoded.unencoded_offset = iocb->ki_pos - em->orig_start; + ret = encoded_iov_compression_from_btrfs(&encoded, + em->compress_type); + if (ret) + goto out_em; + } else { + offset = em->block_start + (start - em->start); + if (encoded.len > count) + encoded.len = count; + /* + * Don't read beyond what we locked. 
This also limits the page + * allocations that we'll do. + */ + disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start; + encoded.len = encoded.unencoded_len = count = + start + disk_io_size - iocb->ki_pos; + disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize); + } + free_extent_map(em); + em = NULL; + + if (offset == EXTENT_MAP_HOLE) { + unlock_extent_cached(io_tree, start, lockend, &cached_state); + inode_unlock(inode); + unlocked = true; + if (copy_to_iter(&encoded, sizeof(encoded), iter) == + sizeof(encoded)) + ret = 0; + else + ret = -EFAULT; + } else { + ret = btrfs_encoded_read_regular(iocb, iter, start, lockend, + &cached_state, em_bdev, offset, + disk_io_size, count, &encoded, + &unlocked); + } + +out: + if (ret >= 0) + iocb->ki_pos += encoded.len; +out_em: + free_extent_map(em); +out_unlock_extent: + if (!unlocked) + unlock_extent_cached(io_tree, start, lockend, &cached_state); +out_unlock_inode: + if (!unlocked) + inode_unlock(inode); + return ret; +} + #ifdef CONFIG_SWAP /* * Add an entry indicating a block group or device which is pinned by a From patchwork Tue Oct 15 18:42:43 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 11191505 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BF88515AB for ; Tue, 15 Oct 2019 18:43:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 8BBE820873 for ; Tue, 15 Oct 2019 18:43:13 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=osandov-com.20150623.gappssmtp.com header.i=@osandov-com.20150623.gappssmtp.com header.b="zN2+N8js" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389256AbfJOSnM (ORCPT ); Tue, 15 Oct 2019 14:43:12 -0400 Received: from 
mail-pl1-f193.google.com ([209.85.214.193]:44741 "EHLO mail-pl1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727784AbfJOSnL (ORCPT ); Tue, 15 Oct 2019 14:43:11 -0400 Received: by mail-pl1-f193.google.com with SMTP id q15so9987153pll.11 for ; Tue, 15 Oct 2019 11:43:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=2P4yK7ExLp5umtVjGj04165cdN5E9l7hwsUvPQgbWHc=; b=zN2+N8js0beCW4Vj2hQN8LEC23pgf7H2OmdYrYNfaabjotPpQS0uSkT3EhzOfimCxg pcerRmoRUadz2yhmCHU8mL6OKfD+Wm3mLLbskaIRQeoBGaj4r95kLd2t0FucJDuGgjan Yxw2JuW5lU8yjnpEZ+3L5oURi6FIIxfqU8TAgtm8rVcL7TK3LR6G6gXztnWwoEUqPaF0 Zuz7/Ejld1cEvDWeFvWyI5wjlDSwJ+u7K/8qaWEN+zvSp5wXQrDjPK2Y5S0zjBsNvLZ/ 0niSZ45eP4uIlEh996/ohtaj/zWNf7D74bMVU7LOAD/yI/acV8Wub5UPI/3vB2cP/+OG Yh+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=2P4yK7ExLp5umtVjGj04165cdN5E9l7hwsUvPQgbWHc=; b=FxeSrljTKKSJOGs01ociLM58Iys5cZj6xinLA2L/z9XvkG9Z67TQFk8CeI9cWaFdD8 rCUIB8+wcws/4lFXY1yQlYnC0hDiGsmwDc7sKOruz7lLQuNfxXNwXC56eHOxDN0k4Ysm Rg8TxBG7/5WdE3qZ25WYy2FMpyvUzG9xEQbNQD/oDN1gscXMk4m1CTffq5b6uJokIfB7 VMjWAocRUnHe5wlL/yoxZUIdbfBt463uQB9BEF+mZFIZjsHeMeuu4Etnosmar5glJ4k9 +t/mb4Do2tEI9JPifeyPyVQ5BPfos/HNGVEaDi6JGnYNDSIDYYRfXqRWtaJCVK0PLwyS sHvg== X-Gm-Message-State: APjAAAWyVnCfkdKF6CL7QPHONkDFZgMySJTzacTlMmv6kmc1hrIoi6Ur vfX0Ah9qF5eRaDI7JUZPVj0TqfipYPc= X-Google-Smtp-Source: APXvYqxkHh8h3sSzm9pE95/SiBuhGn2lle0uaAmCiDVzwHIt/vUphw6mmRvl8dfu25K9gYBKhOb38Q== X-Received: by 2002:a17:902:8216:: with SMTP id x22mr37862515pln.232.1571164989392; Tue, 15 Oct 2019 11:43:09 -0700 (PDT) Received: from vader.thefacebook.com ([2620:10d:c090:200::2:3e5e]) by smtp.gmail.com with ESMTPSA id z3sm40396pjd.25.2019.10.15.11.43.08 
(version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 15 Oct 2019 11:43:08 -0700 (PDT) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org Cc: Dave Chinner , Jann Horn , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes Date: Tue, 15 Oct 2019 11:42:43 -0700 Message-Id: <904de93d9bbe630aff7f725fd587810c6eb48344.1571164762.git.osandov@fb.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: References: MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval The implementation resembles direct I/O: we have to flush any ordered extents, invalidate the page cache, and do the io tree/delalloc/extent map/ordered extent dance. From there, we can reuse the compression code with a minor modification to distinguish the write from writeback. Now that read and write are implemented, this also sets the FMODE_ENCODED_IO flag in btrfs_file_open(). 
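A minimal userspace model of that modification, assuming a boolean on the compressed write that records whether it came from page writeback. The struct and function names here are hypothetical, not the kernel's. The completion handler runs the common teardown either way, but only ends page writeback for writes that started there, since an encoded write presumably has no page cache pages under writeback:

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for struct compressed_bio. */
struct compressed_write {
	bool writeback;		/* true if submitted by page writeback */
	int writeback_ended;	/* times page writeback was ended */
	int completed;		/* times the common completion ran */
};

/* Stand-in for end_compressed_writeback(): only counts calls here. */
static void end_page_writeback_model(struct compressed_write *cw)
{
	cw->writeback_ended++;
}

/*
 * Models the shape of end_compressed_bio_write() after the change: the
 * common completion always runs, and the page-writeback teardown is
 * skipped when the write did not originate from writeback.
 */
static void end_compressed_write_model(struct compressed_write *cw)
{
	cw->completed++;
	if (cw->writeback)
		end_page_writeback_model(cw);
}
```

A write submitted with writeback set to true ends page writeback once on completion; one submitted with false (the encoded write path) completes without touching it.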
Signed-off-by: Omar Sandoval
---
 fs/btrfs/compression.c |   6 +-
 fs/btrfs/compression.h |   5 +-
 fs/btrfs/ctree.h       |   2 +
 fs/btrfs/file.c        |  40 +++++++--
 fs/btrfs/inode.c       | 197 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 237 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b05b361e2062..6632dd8d2e4d 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
 			bio->bi_status == BLK_STS_OK);
 	cb->compressed_pages[0]->mapping = NULL;
 
-	end_compressed_writeback(inode, cb);
+	if (cb->writeback)
+		end_compressed_writeback(inode, cb);
 	/* note, our inode could be gone now */
 
 	/*
@@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				 unsigned long compressed_len,
 				 struct page **compressed_pages,
 				 unsigned long nr_pages,
-				 unsigned int write_flags)
+				 unsigned int write_flags, bool writeback)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio *bio = NULL;
@@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	cb->mirror_num = 0;
 	cb->compressed_pages = compressed_pages;
 	cb->compressed_len = compressed_len;
+	cb->writeback = writeback;
 	cb->orig_bio = NULL;
 	cb->nr_pages = nr_pages;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 4cb8be9ff88b..d4176384ec15 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -47,6 +47,9 @@ struct compressed_bio {
 	/* the compression algorithm for this bio */
 	int compress_type;
 
+	/* Whether this is a write for writeback. */
+	bool writeback;
+
 	/* number of compressed pages in the array */
 	unsigned long nr_pages;
 
@@ -93,7 +96,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				  unsigned long compressed_len,
 				  struct page **compressed_pages,
 				  unsigned long nr_pages,
-				  unsigned int write_flags);
+				  unsigned int write_flags, bool writeback);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3b2aa1c7218c..9e1719e82cc8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2907,6 +2907,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
+ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			    struct encoded_iov *encoded);
 
 extern const struct dentry_operations btrfs_dentry_operations;
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 51740cee39fc..8de6ac9b4b9c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1893,8 +1893,7 @@ static void update_time_for_write(struct inode *inode)
 		inode_inc_iversion(inode);
 }
 
-static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
@@ -1904,14 +1903,22 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	u64 end_pos;
 	ssize_t num_written = 0;
 	const bool sync = iocb->ki_flags & IOCB_DSYNC;
+	struct encoded_iov encoded;
 	ssize_t err;
 	loff_t pos;
 	size_t count;
 	loff_t oldsize;
 	int clean_page = 0;
 
-	if (!(iocb->ki_flags & IOCB_DIRECT) &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		err = import_encoded_write(iocb, &encoded, from);
+		if (err)
+			return err;
+	}
+
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+	    (!(iocb->ki_flags & IOCB_DIRECT) ||
+	     (iocb->ki_flags & IOCB_ENCODED)))
 		return -EOPNOTSUPP;
 
 	if (!inode_trylock(inode)) {
@@ -1920,14 +1927,27 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		inode_lock(inode);
 	}
 
-	err = generic_write_checks(iocb, from);
-	if (err <= 0) {
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		err = generic_encoded_write_checks(iocb, &encoded);
+		if (err) {
+			inode_unlock(inode);
+			return err;
+		}
+		count = encoded.len;
+	} else {
+		err = generic_write_checks(iocb, from);
+		if (err < 0) {
+			inode_unlock(inode);
+			return err;
+		}
+		count = iov_iter_count(from);
+	}
+	if (count == 0) {
 		inode_unlock(inode);
 		return err;
 	}
 
 	pos = iocb->ki_pos;
-	count = iov_iter_count(from);
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		/*
 		 * We will allocate space in case nodatacow is not set,
@@ -1986,7 +2006,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT) {
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		num_written = btrfs_encoded_write(iocb, from, &encoded);
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
@@ -3461,7 +3483,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_ENCODED_IO;
 	return generic_file_open(inode, filp);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 174d0738d2c9..bcc5a2bed22b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				    ins.objectid,
 				    ins.offset, async_extent->pages,
 				    async_extent->nr_pages,
-				    async_chunk->write_flags)) {
+				    async_chunk->write_flags, true)) {
 			struct page *p = async_extent->pages[0];
 			const u64 start = async_extent->start;
 			const u64 end = start + async_extent->ram_size - 1;
@@ -11055,6 +11055,201 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
+ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			    struct encoded_iov *encoded)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_state *cached_state = NULL;
+	int compression;
+	size_t orig_count;
+	u64 disk_num_bytes, num_bytes;
+	u64 start, end;
+	unsigned long nr_pages, i;
+	struct page **pages;
+	struct btrfs_key ins;
+	struct extent_map *em;
+	ssize_t ret;
+
+	switch (encoded->compression) {
+	case ENCODED_IOV_COMPRESSION_ZLIB:
+		compression = BTRFS_COMPRESS_ZLIB;
+		break;
+	case ENCODED_IOV_COMPRESSION_LZO:
+		compression = BTRFS_COMPRESS_LZO;
+		break;
+	case ENCODED_IOV_COMPRESSION_ZSTD:
+		compression = BTRFS_COMPRESS_ZSTD;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	disk_num_bytes = orig_count = iov_iter_count(from);
+
+	/* For now, it's too hard to support bookend extents. */
+	if (encoded->unencoded_len != encoded->len ||
+	    encoded->unencoded_offset != 0)
+		return -EINVAL;
+
+	/* The extent size must be sane. */
+	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
+	    disk_num_bytes > BTRFS_MAX_COMPRESSED || disk_num_bytes == 0)
+		return -EINVAL;
+
+	/*
+	 * The compressed data on disk must be sector-aligned. For convenience,
+	 * we extend it with zeroes if it isn't.
+	 */
+	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
+
+	/*
+	 * The extent in the file must also be sector-aligned. However, we allow
+	 * a write which ends at or extends i_size to have an unaligned length;
+	 * we round up the extent size and set i_size to the given length.
+	 */
+	start = iocb->ki_pos;
+	if (!IS_ALIGNED(start, fs_info->sectorsize))
+		return -EINVAL;
+	if (start + encoded->len >= inode->i_size) {
+		num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
+	} else {
+		num_bytes = encoded->len;
+		if (!IS_ALIGNED(num_bytes, fs_info->sectorsize))
+			return -EINVAL;
+	}
+
+	/*
+	 * It's valid to have compressed data which is larger than or the same
+	 * size as the decompressed data. However, for buffered I/O, we fall
+	 * back to writing the decompressed data if compression didn't shrink
+	 * it. So, for now, let's not allow creating such extents.
+	 *
+	 * Note that for now this also implicitly prevents writing data that
+	 * would fit in an inline extent.
+	 */
+	if (disk_num_bytes >= num_bytes)
+		return -EINVAL;
+
+	end = start + num_bytes - 1;
+
+	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_USER);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
+		char *kaddr;
+
+		pages[i] = alloc_page(GFP_HIGHUSER);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out_pages;
+		}
+		kaddr = kmap(pages[i]);
+		if (copy_from_iter(kaddr, bytes, from) != bytes) {
+			kunmap(pages[i]);
+			ret = -EFAULT;
+			goto out_pages;
+		}
+		if (bytes < PAGE_SIZE)
+			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
+		kunmap(pages[i]);
+	}
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
+		if (ret)
+			goto out_pages;
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+						    start >> PAGE_SHIFT,
+						    end >> PAGE_SHIFT);
+		if (ret)
+			goto out_pages;
+		lock_extent_bits(io_tree, start, end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     end - start + 1);
+		if (!ordered &&
+		    !filemap_range_has_page(inode->i_mapping, start, end))
+			break;
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, end, &cached_state);
+		cond_resched();
+	}
+
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
+					   num_bytes);
+	if (ret)
+		goto out_unlock;
+
+	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
+				   disk_num_bytes, 0, 0, &ins, 1, 1);
+	if (ret)
+		goto out_delalloc_release;
+
+	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
+			  ins.offset, ins.offset, num_bytes, compression,
+			  BTRFS_ORDERED_COMPRESSED);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_free_reserve;
+	}
+	free_extent_map(em);
+
+	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
+						num_bytes, ins.offset,
+						BTRFS_ORDERED_COMPRESSED,
+						compression);
+	if (ret) {
+		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		goto out_free_reserve;
+	}
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+
+	if (start + encoded->len > inode->i_size)
+		i_size_write(inode, start + encoded->len);
+
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
+
+	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
+					  ins.offset, pages, nr_pages, 0,
+					  false)) {
+		struct page *page = pages[0];
+
+		page->mapping = inode->i_mapping;
+		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
+		page->mapping = NULL;
+		ret = -EIO;
+		goto out_pages;
+	}
+	iocb->ki_pos += encoded->len;
+	return orig_count;
+
+out_free_reserve:
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+out_delalloc_release:
+	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
+				     true);
+out_unlock:
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+out_pages:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			put_page(pages[i]);
+	}
+	kvfree(pages);
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a