
io_uring: add support for barrier fsync

Message ID 7c7276e4-8ffa-495a-6abf-926a58ee899e@kernel.dk (mailing list archive)

Commit Message

Jens Axboe April 9, 2019, 4:27 p.m. UTC
It's quite a common use case to issue a bunch of writes, then an fsync
or fdatasync once they complete. Since io_uring doesn't guarantee any
type of ordering, the application must track the issued writes and wait
to issue the fsync until they have all completed.

Add an IORING_FSYNC_BARRIER flag so the application doesn't have to do
this tracking manually. If this flag is set on the fsync request, it
won't be issued until all previously submitted IO has completed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
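
As an illustration of the intended use, here is a minimal sketch that
queues a few writes and a barrier fsync in a single submission. It
assumes liburing's io_uring_prep_writev()/io_uring_prep_fsync() helpers
and the IORING_FSYNC_BARRIER flag proposed by this patch; error
handling is omitted:

#include <liburing.h>
#include <sys/uio.h>

/*
 * Queue 'nr' writes followed by a barrier fsync.  With the flag set,
 * the kernel defers the fsync until the preceding writes (and any
 * other previously submitted IO) have completed; the writes themselves
 * are not ordered against each other.
 */
static int write_then_sync(struct io_uring *ring, int fd,
			   struct iovec *iovs, off_t *offs, int nr)
{
	struct io_uring_sqe *sqe;
	int i;

	for (i = 0; i < nr; i++) {
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_writev(sqe, fd, &iovs[i], 1, offs[i]);
	}

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_fsync(sqe, fd, IORING_FSYNC_BARRIER);

	return io_uring_submit(ring);
}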

Comments

Christoph Hellwig April 9, 2019, 6:17 p.m. UTC | #1
On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
> It's a quite common use case to issue a bunch of writes, then an fsync
> or fdatasync when they complete. Since io_uring doesn't guarantee any
> type of ordering, the application must track issued writes and wait
> with the fsync issue until they have completed.
> 
> Add an IORING_FSYNC_BARRIER flag that helps with this so the application
> doesn't have to do this manually. If this flag is set for the fsync
> request, we won't issue it until pending IO has already completed.

I think we need a much more detailed explanation of the semantics,
preferably in man page format.

Barrier at least in Linux traditionally means all previously submitted
requests have finished and no new ones are started until the
barrier request finishes, which is very heavy handed.  Is that what
this is supposed to do?  If not what are the exact guarantees vs
ordering and or barrier semantics?
Jens Axboe April 9, 2019, 6:23 p.m. UTC | #2
On 4/9/19 12:17 PM, Christoph Hellwig wrote:
> On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
>> It's a quite common use case to issue a bunch of writes, then an fsync
>> or fdatasync when they complete. Since io_uring doesn't guarantee any
>> type of ordering, the application must track issued writes and wait
>> with the fsync issue until they have completed.
>>
>> Add an IORING_FSYNC_BARRIER flag that helps with this so the application
>> doesn't have to do this manually. If this flag is set for the fsync
>> request, we won't issue it until pending IO has already completed.
> 
> I think we need a much more detailed explanation of the semantics,
> preferably in man page format.
> 
> Barrier at least in Linux traditionally means all previously submitted
> requests have finished and no new ones are started until the
> barrier request finishes, which is very heavy handed.  Is that what
> this is supposed to do?  If not what are the exact guarantees vs
> ordering and or barrier semantics?

The patch description isn't that great, and maybe the naming isn't that
intuitive either. The way it's implemented, the fsync will NOT be issued
until previously issued IOs have completed. That means both reads and
writes, since there's no way to wait for just one type. In terms of
semantics, any previously submitted writes will have completed before
this fsync is issued. The barrier fsync has no ordering with respect to
future writes; no ordering is implied there. Hence, for:

W1, W2, W3, FSYNC_W_BARRIER, W4, W5

W1..3 will have completed on the hardware side before we start
FSYNC_W_BARRIER. We don't hold off issuing W4..5 until after the fsync
completes; no ordering is provided there.
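
For comparison, without the flag the application has to reap the write
completions itself before it can submit the fsync. A rough sketch of
that, assuming liburing's completion helpers:

#include <liburing.h>

/*
 * Manual ordering without IORING_FSYNC_BARRIER: wait for the CQEs of
 * the previously submitted writes, then submit the fsync on its own.
 */
static int sync_after_writes(struct io_uring *ring, int fd, int nr_writes)
{
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	int i;

	for (i = 0; i < nr_writes; i++) {
		if (io_uring_wait_cqe(ring, &cqe))
			return -1;
		io_uring_cqe_seen(ring, cqe);
	}

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_fsync(sqe, fd, 0);
	return io_uring_submit(ring);
}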
Chris Mason April 9, 2019, 6:42 p.m. UTC | #3
On 9 Apr 2019, at 14:23, Jens Axboe wrote:

> On 4/9/19 12:17 PM, Christoph Hellwig wrote:
>> On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
>>> It's a quite common use case to issue a bunch of writes, then an 
>>> fsync
>>> or fdatasync when they complete. Since io_uring doesn't guarantee 
>>> any
>>> type of ordering, the application must track issued writes and wait
>>> with the fsync issue until they have completed.
>>>
>>> Add an IORING_FSYNC_BARRIER flag that helps with this so the 
>>> application
>>> doesn't have to do this manually. If this flag is set for the fsync
>>> request, we won't issue it until pending IO has already completed.
>>
>> I think we need a much more detailed explanation of the semantics,
>> preferably in man page format.
>>
>> Barrier at least in Linux traditionally means all previously 
>> submitted
>> requests have finished and no new ones are started until the
>> barrier request finishes, which is very heavy handed.  Is that what
>> this is supposed to do?  If not what are the exact guarantees vs
>> ordering and or barrier semantics?
>
> The patch description isn't that great, and maybe the naming isn't 
> that
> intuitive either. The way it's implemented, the fsync will NOT be 
> issued
> until previously issued IOs have completed. That means both reads and
> writes, since there's no way to wait for just one.  In terms of
> semantics, any previously submitted writes will have completed before
> this fsync is issued. The barrier fsync has no ordering wrt future
> writes, no ordering is implied there. Hence:
>
> W1, W2, W3, FSYNC_W_BARRIER, W4, W5
>
> W1..3 will have been completed by the hardware side before we start
> FSYNC_W_BARRIER. We don't wait with issuing W4..5 until after the 
> fsync
> completes, no ordering is provided there.

Looking at the patch, why is fsync special?  Seems like you could add 
this ordering bit to any write?

While you're here, do you want to add a way to do FUA writes and cache 
flushes?  Basically the rest of what userland would need to build its 
own write-back-cache-safe implementation.

-chris
Jens Axboe April 9, 2019, 6:46 p.m. UTC | #4
On 4/9/19 12:42 PM, Chris Mason wrote:
> On 9 Apr 2019, at 14:23, Jens Axboe wrote:
> 
>> On 4/9/19 12:17 PM, Christoph Hellwig wrote:
>>> On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
>>>> It's a quite common use case to issue a bunch of writes, then an 
>>>> fsync
>>>> or fdatasync when they complete. Since io_uring doesn't guarantee 
>>>> any
>>>> type of ordering, the application must track issued writes and wait
>>>> with the fsync issue until they have completed.
>>>>
>>>> Add an IORING_FSYNC_BARRIER flag that helps with this so the 
>>>> application
>>>> doesn't have to do this manually. If this flag is set for the fsync
>>>> request, we won't issue it until pending IO has already completed.
>>>
>>> I think we need a much more detailed explanation of the semantics,
>>> preferably in man page format.
>>>
>>> Barrier at least in Linux traditionally means all previously 
>>> submitted
>>> requests have finished and no new ones are started until the
>>> barrier request finishes, which is very heavy handed.  Is that what
>>> this is supposed to do?  If not what are the exact guarantees vs
>>> ordering and or barrier semantics?
>>
>> The patch description isn't that great, and maybe the naming isn't 
>> that
>> intuitive either. The way it's implemented, the fsync will NOT be 
>> issued
>> until previously issued IOs have completed. That means both reads and
>> writes, since there's no way to wait for just one.  In terms of
>> semantics, any previously submitted writes will have completed before
>> this fsync is issued. The barrier fsync has no ordering wrt future
>> writes, no ordering is implied there. Hence:
>>
>> W1, W2, W3, FSYNC_W_BARRIER, W4, W5
>>
>> W1..3 will have been completed by the hardware side before we start
>> FSYNC_W_BARRIER. We don't wait with issuing W4..5 until after the 
>> fsync
>> completes, no ordering is provided there.
> 
> Looking at the patch, why is fsync special?  Seems like you could add 
> this ordering bit to any write?

It's really not; the exact same technique could be used on any type of
command to impose ordering. My initial idea was to have an explicit
barrier/ordering command, but I didn't think separating it from an
actual command would be needed or useful.

> While you're here, do you want to add a way to FUA/cache flush?  
> Basically the rest of what user land would need to make their own 
> write-back-cache-safe implementation.

FUA would be a WRITEV/WRITE_FIXED flag; that should be trivially doable.

In terms of a cache flush, that's very heavy-handed (and
storage-oriented). What applications would want or need to do an
explicit whole-device flush?
Chris Mason April 9, 2019, 6:56 p.m. UTC | #5
On 9 Apr 2019, at 14:46, Jens Axboe wrote:

> On 4/9/19 12:42 PM, Chris Mason wrote:

>> Looking at the patch, why is fsync special?  Seems like you could add
>> this ordering bit to any write?
>
> It's really not, the exact same technique could be used on any type of
> command to imply ordering. My initial idea was to have an explicit
> barrier/ordering command, but I didn't think that separating it from 
> an
> actual command would be needed/useful.

You might want to order discards?  Commit a transaction to free some 
blocks, then discard those blocks?

>
>> While you're here, do you want to add a way to FUA/cache flush?
>> Basically the rest of what user land would need to make their own
>> write-back-cache-safe implementation.
>
> FUA would be a WRITEV/WRITE_FIXED flag, that should be trivially 
> doable.
>
> In terms of cache flush, that's very heavy handed (and storage
> oriented).  What applications would want/need to do an explicit whole
> device flush?

Basically, if someone is writing a userland filesystem, they'd want to 
issue cache flushes for the same reasons the kernel filesystems do.


-chris
Dave Chinner April 11, 2019, 11:05 a.m. UTC | #6
On Tue, Apr 09, 2019 at 12:46:15PM -0600, Jens Axboe wrote:
> On 4/9/19 12:42 PM, Chris Mason wrote:
> > On 9 Apr 2019, at 14:23, Jens Axboe wrote:
> > 
> >> On 4/9/19 12:17 PM, Christoph Hellwig wrote:
> >>> On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
> >>>> It's a quite common use case to issue a bunch of writes, then an 
> >>>> fsync
> >>>> or fdatasync when they complete. Since io_uring doesn't guarantee 
> >>>> any
> >>>> type of ordering, the application must track issued writes and wait
> >>>> with the fsync issue until they have completed.
> >>>>
> >>>> Add an IORING_FSYNC_BARRIER flag that helps with this so the 
> >>>> application
> >>>> doesn't have to do this manually. If this flag is set for the fsync
> >>>> request, we won't issue it until pending IO has already completed.
> >>>
> >>> I think we need a much more detailed explanation of the semantics,
> >>> preferably in man page format.
> >>>
> >>> Barrier at least in Linux traditionally means all previously 
> >>> submitted
> >>> requests have finished and no new ones are started until the
> >>> barrier request finishes, which is very heavy handed.  Is that what
> >>> this is supposed to do?  If not what are the exact guarantees vs
> >>> ordering and or barrier semantics?
> >>
> >> The patch description isn't that great, and maybe the naming isn't 
> >> that
> >> intuitive either. The way it's implemented, the fsync will NOT be 
> >> issued
> >> until previously issued IOs have completed. That means both reads and
> >> writes, since there's no way to wait for just one.  In terms of
> >> semantics, any previously submitted writes will have completed before
> >> this fsync is issued. The barrier fsync has no ordering wrt future
> >> writes, no ordering is implied there. Hence:
> >>
> >> W1, W2, W3, FSYNC_W_BARRIER, W4, W5
> >>
> >> W1..3 will have been completed by the hardware side before we start
> >> FSYNC_W_BARRIER. We don't wait with issuing W4..5 until after the 
> >> fsync
> >> completes, no ordering is provided there.
> > 
> > Looking at the patch, why is fsync special?  Seems like you could add 
> > this ordering bit to any write?
> 
> It's really not, the exact same technique could be used on any type of
> command to imply ordering. My initial idea was to have an explicit
> barrier/ordering command, but I didn't think that separating it from an
> actual command would be needed/useful.
> 
> > While you're here, do you want to add a way to FUA/cache flush?  
> > Basically the rest of what user land would need to make their own 
> > write-back-cache-safe implementation.
> 
> FUA would be a WRITEV/WRITE_FIXED flag, that should be trivially doable.

We already have plumbing to make pwritev2 and AIO issue FUA writes
via the RWF_DSYNC flag through the fs/iomap.c direct IO path. FUA is
only valid if the file does not have dirty metadata (e.g. because of
block allocation), and that requires the filesystem block mapping to
tell the IO path whether FUA can be used. Otherwise a journal flush is
also required to make the data stable, and there's no point in doing
a FUA write for the data in that case...

Cheers,

Dave.
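
For reference, the pwritev2()/RWF_DSYNC path Dave mentions looks
roughly like this from userspace. This is a sketch only; whether the
kernel can turn it into a FUA write depends on the filesystem and on
the file having no dirty metadata:

#define _GNU_SOURCE
#include <sys/uio.h>

/*
 * Per-write data sync via RWF_DSYNC.  On the fs/iomap.c direct IO
 * path this can be issued as a FUA write when no metadata needs to
 * be made stable; otherwise it behaves like a write plus fdatasync
 * for that write.
 */
static ssize_t dsync_write(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, off, RWF_DSYNC);
}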

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 07d6ef195d05..08f1e5766554 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -172,6 +172,7 @@  struct io_ring_ctx {
 		 */
 		struct list_head	poll_list;
 		struct list_head	cancel_list;
+		struct list_head	fsync_list;
 	} ____cacheline_aligned_in_smp;
 
 	struct async_list	pending_async[2];
@@ -202,6 +203,11 @@  struct io_poll_iocb {
 	struct wait_queue_entry		wait;
 };
 
+struct io_fsync_iocb {
+	struct file			*file;
+	unsigned			sequence;
+};
+
 /*
  * NOTE! Each of the iocb union members has the file pointer
  * as the first entry in their struct definition. So you can
@@ -213,6 +219,7 @@  struct io_kiocb {
 		struct file		*file;
 		struct kiocb		rw;
 		struct io_poll_iocb	poll;
+		struct io_fsync_iocb	fsync;
 	};
 
 	struct sqe_submit	submit;
@@ -255,6 +262,8 @@  struct io_submit_state {
 	unsigned int		ios_left;
 };
 
+static void io_sq_wq_submit_work(struct work_struct *work);
+
 static struct kmem_cache *req_cachep;
 
 static const struct file_operations io_uring_fops;
@@ -306,10 +315,32 @@  static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->cancel_list);
+	INIT_LIST_HEAD(&ctx->fsync_list);
 	return ctx;
 }
 
-static void io_commit_cqring(struct io_ring_ctx *ctx)
+static inline bool io_sequence_defer(struct io_ring_ctx *ctx, unsigned seq)
+{
+	return seq > ctx->cached_cq_tail + ctx->sq_ring->dropped;
+}
+
+static struct io_kiocb *io_get_ready_fsync(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	if (list_empty(&ctx->fsync_list))
+		return NULL;
+
+	req = list_first_entry(&ctx->fsync_list, struct io_kiocb, list);
+	if (!io_sequence_defer(ctx, req->fsync.sequence)) {
+		list_del_init(&req->list);
+		return req;
+	}
+
+	return NULL;
+}
+
+static void __io_commit_cqring(struct io_ring_ctx *ctx)
 {
 	struct io_cq_ring *ring = ctx->cq_ring;
 
@@ -330,6 +361,16 @@  static void io_commit_cqring(struct io_ring_ctx *ctx)
 	}
 }
 
+static void io_commit_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	__io_commit_cqring(ctx);
+
+	while ((req = io_get_ready_fsync(ctx)) != NULL)
+		queue_work(ctx->sqo_wq, &req->work);
+}
+
 static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
 {
 	struct io_cq_ring *ring = ctx->cq_ring;
@@ -1073,9 +1114,39 @@  static int io_nop(struct io_kiocb *req, u64 user_data)
 	return 0;
 }
 
-static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+static int io_fsync_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_uring_sqe *sqe_copy;
+
+	if (!io_sequence_defer(ctx, req->fsync.sequence))
+		return 0;
+
+	sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
+	if (!sqe_copy)
+		return -EAGAIN;
+
+	spin_lock_irq(&ctx->completion_lock);
+	if (!io_sequence_defer(ctx, req->fsync.sequence)) {
+		spin_unlock_irq(&ctx->completion_lock);
+		kfree(sqe_copy);
+		return 0;
+	}
+
+	memcpy(sqe_copy, sqe, sizeof(*sqe_copy));
+	req->submit.sqe = sqe_copy;
+
+	INIT_WORK(&req->work, io_sq_wq_submit_work);
+	list_add_tail(&req->list, &ctx->fsync_list);
+	spin_unlock_irq(&ctx->completion_lock);
+	return -EIOCBQUEUED;
+}
+
+static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			 unsigned fsync_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
+	int ret = 0;
 
 	if (!req->file)
 		return -EBADF;
@@ -1088,8 +1159,13 @@  static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
+	if (fsync_flags & IORING_FSYNC_BARRIER) {
+		req->fsync.sequence = ctx->cached_sq_head - 1;
+		ret = io_fsync_defer(req, sqe);
+	}
+
 	req->flags |= REQ_F_PREPPED;
-	return 0;
+	return ret;
 }
 
 static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
@@ -1102,12 +1178,15 @@  static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	int ret;
 
 	fsync_flags = READ_ONCE(sqe->fsync_flags);
-	if (unlikely(fsync_flags & ~IORING_FSYNC_DATASYNC))
+	if (unlikely(fsync_flags & ~(IORING_FSYNC_DATASYNC|IORING_FSYNC_BARRIER)))
 		return -EINVAL;
 
-	ret = io_prep_fsync(req, sqe);
-	if (ret)
+	ret = io_prep_fsync(req, sqe, fsync_flags);
+	if (ret) {
+		if (ret == -EIOCBQUEUED)
+			return 0;
 		return ret;
+	}
 
 	/* fsync always requires a blocking context */
 	if (force_nonblock)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e23408692118..57b8f4d57af6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -59,6 +59,7 @@  struct io_uring_sqe {
  * sqe->fsync_flags
  */
 #define IORING_FSYNC_DATASYNC	(1U << 0)
+#define IORING_FSYNC_BARRIER	(1U << 1)
 
 /*
  * IO completion data structure (Completion Queue Entry)
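
In summary, the mechanism the patch uses: io_prep_fsync() records the
SQ head at prep time in req->fsync.sequence, io_fsync_defer() parks the
request on fsync_list while earlier IO is still outstanding, and
io_commit_cqring() requeues it once the CQ tail has caught up. A
condensed model of the deferral test (illustrative only, not kernel
code):

/*
 * Condensed model of io_sequence_defer(): the barrier fsync remembers
 * the SQ head it was submitted at, and stays deferred until the CQ
 * tail plus the dropped-SQE count has caught up, i.e. until everything
 * submitted ahead of it has completed.
 */
static inline int fsync_still_deferred(unsigned int fsync_seq,
				       unsigned int cq_tail,
				       unsigned int sq_dropped)
{
	return fsync_seq > cq_tail + sq_dropped;
}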