From patchwork Fri Jul 21 17:48:52 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff Layton X-Patchwork-Id: 9857467 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id C041C601C0 for ; Fri, 21 Jul 2017 17:49:10 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AF83A1FFF9 for ; Fri, 21 Jul 2017 17:49:10 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A3676285E1; Fri, 21 Jul 2017 17:49:10 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 196EC1FFF9 for ; Fri, 21 Jul 2017 17:49:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752811AbdGURsz (ORCPT ); Fri, 21 Jul 2017 13:48:55 -0400 Received: from mail.kernel.org ([198.145.29.99]:33968 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750849AbdGURsz (ORCPT ); Fri, 21 Jul 2017 13:48:55 -0400 Received: from tleilax.poochiereds.net (cpe-45-37-196-243.nc.res.rr.com [45.37.196.243]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 03C3E219A9; Fri, 21 Jul 2017 17:48:53 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 03C3E219A9 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=jlayton@kernel.org From: Jeff Layton To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Cc: dhowells@redhat.com Subject: [PATCH] Documentation: add some docs for errseq_t Date: Fri, 21 Jul 2017 13:48:52 -0400 Message-Id: <20170721174852.6006-1-jlayton@kernel.org> X-Mailer: git-send-email 2.13.3 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Jeff Layton ...and fix up a few comments in the code. Signed-off-by: Jeff Layton --- Documentation/errseq.rst | 149 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/errseq.h | 5 +- include/linux/fs.h | 2 - 3 files changed, 152 insertions(+), 4 deletions(-) create mode 100644 Documentation/errseq.rst diff --git a/Documentation/errseq.rst b/Documentation/errseq.rst new file mode 100644 index 000000000000..4c29bd5afbc5 --- /dev/null +++ b/Documentation/errseq.rst @@ -0,0 +1,149 @@ +The errseq_t datatype +===================== +An errseq_t is a way of recording errors in one place, and allowing any +number of "subscribers" to tell whether it has changed since a previous +point where it was sampled. + +The initial use case for this is tracking errors for file +synchronization syscalls (fsync, fdatasync, msync and sync_file_range), +but it may be usable in other situations. + +It's implemented as an unsigned 32-bit value. The low order bits are +designated to hold an error code (between 1 and MAX_ERRNO). The upper bits +are used as a counter. This is done with atomics instead of locking so that +these functions can be called from any context. + +Note that there is a risk of collisions if new errors are being recorded +frequently, since we have so few bits to use as a counter. + +To mitigate this, the bit between the error value and counter is used as +a flag to tell whether the value has been sampled since a new value was +recorded. That allows us to avoid bumping the counter if no one has +sampled it since the last time an error was recorded. + +Thus we end up with a value that looks something like this:: + + bit: 31..13 12 11..0 + +-----------------+----+----------------+ + | counter | SF | errno | + +-----------------+----+----------------+ + +The general idea is for "watchers" to sample an errseq_t value and keep +it as a running cursor. That value can later be used to tell whether +any new errors have occurred since that sampling was done, and atomically +record the state at the time that it was checked. This allows us to +record errors in one place, and then have a number of "watchers" that +can tell whether the value has changed since they last checked it. + +A new errseq_t should always be zeroed out. An errseq_t value of all zeroes +is the special (but common) case where there has never been an error. An all +zero value thus serves as the "epoch" if one wishes to know whether there +has ever been an error set since it was first initialized. + +API usage +========= +Let me tell you a story about a worker drone. Now, he's a good worker +overall, but the company is a little...management heavy. He has to +report to 77 supervisors today, and tomorrow the "big boss" is coming in +from out of town and he's sure to test the poor fellow too. + +They're all handing him work to do -- so much he can't keep track of who +handed him what, but that's not really a big problem. The supervisors +just want to know when he's finished all of the work they've handed him so +far and whether he made any mistakes since they last asked. + +He might have made the mistake on work they didn't actually hand him, +but he can't keep track of things at that level of detail, all he can +remember is the most recent mistake that he made. + +Here's our worker_drone representation:: + + struct worker_drone { + errseq_t wd_err; /* for recording errors */ + }; + +Every day, the worker_drone starts out with a blank slate:: + + struct worker_drone wd; + + wd.wd_err = (errseq_t)0; + +The supervisors come in and get an initial read for the day. They +don't care about anything that happened before their watch begins:: + + struct supervisor { + errseq_t s_wd_err; /* private "cursor" for wd_err */ + spinlock_t s_wd_err_lock; /* protects s_wd_err */ + } + + struct supervisor su; + + su.s_wd_err = errseq_sample(&wd.wd_err); + spin_lock_init(&su.s_wd_err_lock); + +Now they start handing him tasks to do. Every few minutes they ask him to +finish up all of the work they've handed him so far. Then they ask him +whether he made any mistakes on any of it:: + + spin_lock(&su.su_wd_err_lock); + err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); + spin_unlock(&su.su_wd_err_lock); + +Up to this point, that just keeps returning 0. + +Now, the owners of this company are quite miserly and have given him +substandard equipment with which to do his job. Occasionally it +glitches and he makes a mistake. He sighs a heavy sigh, and marks it +down:: + + errseq_set(&wd.wd_err, -EIO); + +...and then gets back to work. The supervisors eventually poll again +and they each get the error when they next check. Subsequent calls will +return 0, until another error is recorded, at which point it's reported +to each of them once. + +Note that the supervisors can't tell how many mistakes he made, only +whether one was made since they last checked, and the latest value +recorded. + +Occasionally the big boss comes in for a spot check and asks the worker +to do a one-off job for him. He's not really watching the worker +full-time like the supervisors, but he does need to know whether a +mistake occurred while his job was processing. + +He can just sample the current errseq_t in the worker, and then use that +to tell whether an error has occurred later:: + + errseq_t since = errseq_sample(&wd.wd_err); + /* submit some work and wait for it to complete */ + err = errseq_check(&wd.wd_err, since); + +Since he's just going to discard "since" after that point, he doesn't +need to advance it here. He also doesn't need any locking since it's +not usable by anyone else. + +Serializing errseq_t cursor updates +=================================== +Note that the errseq_t API does not protect the errseq_t cursor during a +check_and_advance_operation. Only the canonical error code is handled +atomically. In a situation where more than one task might be using the +same errseq_t cursor at the same time, it's important to serialize +updates to that cursor. + +If that's not done, then it's possible for the cursor to go backward +in which case the same error could be reported more than once. + +Because of this, it's often advantageous to first do an errseq_check to +see if anything has changed, and only later do an +errseq_check_and_advance after taking the lock. e.g.:: + + if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) { + /* su.s_wd_err is protected by s_wd_err_lock */ + spin_lock(&su.s_wd_err_lock); + err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); + spin_unlock(&su.s_wd_err_lock); + } + +That avoids the spinlock in the common case where nothing has changed +since the last time it was checked. diff --git a/include/linux/errseq.h b/include/linux/errseq.h index 784f0860527b..f746bd8fe4d0 100644 --- a/include/linux/errseq.h +++ b/include/linux/errseq.h @@ -1,8 +1,9 @@ +/* + * See Documentation/errseq.rst and lib/errseq.c + */ #ifndef _LINUX_ERRSEQ_H #define _LINUX_ERRSEQ_H -/* See lib/errseq.c for more info */ - typedef u32 errseq_t; errseq_t errseq_set(errseq_t *eseq, int err); diff --git a/include/linux/fs.h b/include/linux/fs.h index fb615e1eb1d4..a2fc6a029db6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2573,8 +2573,6 @@ extern int __must_check file_write_and_wait_range(struct file *file, * When a writeback error occurs, most filesystems will want to call * filemap_set_wb_err to record the error in the mapping so that it will be * automatically reported whenever fsync is called on the file. - * - * FIXME: mention FS_* flag here? */ static inline void filemap_set_wb_err(struct address_space *mapping, int err) {