[RFC,00/17] zuf: ZUFS Zero-copy User-mode FileSystem

Message ID: 20190219115136.29952-1-boaz@plexistor.com
Boaz Harrosh Feb. 19, 2019, 11:51 a.m. UTC
From: Boaz Harrosh <boazh@netapp.com>

In this patchset I would like to present the ZUFS filesystem and the
Kernel part of its code.

The Kernel code presented here can be found at:
	https://github.com/NetApp/zufs-zuf

And the User-mode Server + example FSs here:
	https://github.com/NetApp/zufs-zus

ZUFS - stands for Zero-copy User-mode FS
* It is geared towards true end-to-end zero copy of both data and metadata.
* It is geared towards very *low latency*, very high CPU locality, and
  lock-less parallelism.
* Synchronous operations (for low latency)
* NUMA awareness

Short description:
  ZUFS is a from-scratch implementation of a filesystem-in-user-space that
  tries to address the above goals. From the get-go it is aimed at
  pmem-based FSs, but it can easily support other types of FSs that can
  utilize the ~10x latency and parallelism improvements.
  The novelty of this project is that the interface is designed with a
  modern multi-core NUMA machine in mind, down to the ABI, so as to reach
  these goals.

Please see the first patch for the license of this project

Current status: There are a couple of trivial open-source filesystem
implementations and a full-blown proprietary implementation from NetApp.

Together with the Kernel module submitted here, the user-mode server and
the zusFS user-mode plugins have passed NetApp QA, including xfstests and
internal QA tests, and were released to customers as Maxdata 1.2.
So the code is very stable.

In the git repository above there is also a backport for RHEL 7.6,
including RPM packages for the Kernel and server components.
(Evaluation licenses of Maxdata 1.2 are also available for developers;
 please contact Amit Golander <Amit.Golander@netapp.com> if you need one.)

Just to get some points across: as I said, this project is all about
performance and low latency. Below are some results I have run
(wr_iops is IOPS, wr_bw is bandwidth in KiB/s, wr_lat is average write
latency in usec):

[fuse]
threads	wr_iops	wr_bw	wr_lat
1	33606	134424	26.53226
2	57056	228224	30.38476
3	73142	292571	35.75727
4	88667	354668	40.12783
5	102280	409122	42.13261
6	110122	440488	48.29697
7	116561	466245	53.98572
8	129134	516539	55.6134

[fuse-splice]
threads	wr_iops	wr_bw	wr_lat
1	39670	158682	21.8399
2	51100	204400	34.63294
3	62385	249542	39.28847
4	75220	300882	47.42344
5	84522	338088	52.97299
6	93042	372168	57.40804
7	97706	390825	63.04435
8	98034	392137	73.24263

[xfs-dax]
threads	wr_iops	wr_bw	wr_lat   
1	19449	77799	48.03282
2	37704	150819	37.2343
3	55415	221663	30.59375
4	72285	289142	26.08636
5	90348	361392	23.89037
6	103696	414787	22.38045
7	120638	482552	21.38869
8	134157	536630	21.1426

[Maxdata-1.2-zufs] [*1]
threads	wr_iops	wr_bw	wr_lat   
1	57506	230026	14.387113
2	98624	394498	16.790232
3	142276	569106	17.344622
4	187984	751936	17.527123
5	190304	761219	19.504314
6	221407	885628	20.862000
7	211579	846316	23.262040
8	246029	984116	24.630604

[*1]
  These good results are with an mm patch applied that introduces a
  VM_LOCAL_CPU flag, which keeps vm_zap_ptes from scheduling on all
  CPUs when creating a per-CPU VMA.
  That patch was not accepted by the Linux Kernel community and is not
  part of this patchset. (The patch is available for review on demand.)
  But a few weeks from now I will submit some incremental changes to the
  code which will bring the numbers back to the above, and even better
  for some benchmarks, without the mm patch.

I used an 8-way KVM/QEMU VM with 2 NUMA nodes, running fio with 4k
random writes, O_DIRECT | O_SYNC, to a DRAM-simulated pmem device
(memmap=! at grub). The FUSE FS was a memcpy-only 4k null-FS.
fio was then run with more and more threads (see the threads column)
to test for scalability.
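
For concreteness, the fio invocation was along these lines (a sketch
only; the ioengine, mount point, file size, and runtime here are
placeholders, not taken from the actual runs):

  fio --name=randwrite-test --filename=/mnt/zufs/fio-file --size=1G \
      --ioengine=psync --rw=randwrite --bs=4k --direct=1 --sync=1 \
      --numjobs=8 --group_reporting --time_based --runtime=60

with --numjobs swept from 1 to 8 to produce the threads column.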

We are still more than 2x slower than I would like
(compared to an in-kernel pmem-based FS).
But I believe I can shave off another 1-2 us by further optimizing
the app-to-server thread switch, by developing a new scheduler object
so as to avoid going through the scheduler altogether (and its locks)
when switching VMs.
(Currently we use a couple of wait_queue_head_t with wait_event() calls;
 see relay.h in the patches, and the sketch below.)
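
To give a rough idea, here is a minimal sketch of such a relay, assuming
only what is stated above (a pair of wait_queue_head_t with wait_event()
calls). The names and layout are illustrative, not the actual ones from
fs/zuf/relay.h, and memory ordering is elided for brevity:

#include <linux/wait.h>

struct relay_sketch {
	wait_queue_head_t fss_wq;	/* server (ZT) side sleeps here */
	bool fss_woken;
	wait_queue_head_t app_wq;	/* application side sleeps here */
	bool app_woken;
};

static inline void relay_sketch_init(struct relay_sketch *relay)
{
	init_waitqueue_head(&relay->fss_wq);
	init_waitqueue_head(&relay->app_wq);
	relay->fss_woken = false;
	relay->app_woken = false;
}

/* App thread: hand the operation to the ZT and wait for its reply */
static inline void relay_app_kick_and_wait(struct relay_sketch *relay)
{
	relay->fss_woken = true;
	wake_up(&relay->fss_wq);
	wait_event(relay->app_wq, relay->app_woken);
	relay->app_woken = false;
}

/* ZT: report completion to the app thread, then wait for the next op */
static inline void relay_fss_reply_and_wait(struct relay_sketch *relay)
{
	relay->app_woken = true;
	wake_up(&relay->app_wq);
	wait_event(relay->fss_wq, relay->fss_woken);
	relay->fss_woken = false;
}

The new scheduler object mentioned above would presumably replace exactly
these wake_up()/wait_event() pairs with a direct thread-to-thread switch.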

Please review, and ask any question, big or trivial. I would love to
iron out this code and submit it upstream.

Thank you for reading
Boaz

~~~~~~~~~~~~~~~~~~
Boaz Harrosh (17):
  fs: Add the ZUF filesystem to the build + License
  zuf: Preliminary Documentation
  zuf: zuf-rootfs
  zuf: zuf-core The ZTs
  zuf: Multy Devices
  zuf: mounting
  zuf: Namei and directory operations
  zuf: readdir operation
  zuf: symlink
  zuf: More file operation
  zuf: Write/Read implementation
  zuf: mmap & sync
  zuf: ioctl implementation
  zuf: xattr implementation
  zuf: ACL support
  zuf: Special IOCTL fadvise (TODO)
  zuf: Support for dynamic-debug of zusFSs

 Documentation/filesystems/zufs.txt |  351 ++++++++
 fs/Kconfig                         |    1 +
 fs/Makefile                        |    1 +
 fs/zuf/Kconfig                     |   23 +
 fs/zuf/Makefile                    |   23 +
 fs/zuf/_extern.h                   |  166 ++++
 fs/zuf/_pr.h                       |   62 ++
 fs/zuf/acl.c                       |  281 +++++++
 fs/zuf/directory.c                 |  167 ++++
 fs/zuf/file.c                      |  527 ++++++++++++
 fs/zuf/inode.c                     |  648 ++++++++++++++
 fs/zuf/ioctl.c                     |  306 +++++++
 fs/zuf/md.c                        |  761 +++++++++++++++++
 fs/zuf/md.h                        |  318 +++++++
 fs/zuf/md_def.h                    |  145 ++++
 fs/zuf/mmap.c                      |  336 ++++++++
 fs/zuf/module.c                    |   28 +
 fs/zuf/namei.c                     |  435 ++++++++++
 fs/zuf/relay.h                     |   88 ++
 fs/zuf/rw.c                        |  705 ++++++++++++++++
 fs/zuf/super.c                     |  771 +++++++++++++++++
 fs/zuf/symlink.c                   |   74 ++
 fs/zuf/t1.c                        |  138 +++
 fs/zuf/t2.c                        |  375 +++++++++
 fs/zuf/t2.h                        |   68 ++
 fs/zuf/xattr.c                     |  310 +++++++
 fs/zuf/zuf-core.c                  | 1257 ++++++++++++++++++++++++++++
 fs/zuf/zuf-root.c                  |  431 ++++++++++
 fs/zuf/zuf.h                       |  414 +++++++++
 fs/zuf/zus_api.h                   |  869 +++++++++++++++++++
 30 files changed, 10079 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/acl.c
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/inode.c
 create mode 100644 fs/zuf/ioctl.c
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/mmap.c
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/namei.c
 create mode 100644 fs/zuf/relay.h
 create mode 100644 fs/zuf/rw.c
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/symlink.c
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h
 create mode 100644 fs/zuf/xattr.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
 create mode 100644 fs/zuf/zus_api.h

Comments

Matthew Wilcox (Oracle) Feb. 19, 2019, 12:15 p.m. UTC | #1
On Tue, Feb 19, 2019 at 01:51:19PM +0200, Boaz Harrosh wrote:
> Please see the first patch for the license of this project
> 
> Current status: There are a couple of trivial open-source filesystem
> implementations and a full-blown proprietary implementation from NetApp.

I regard this patchset as being an attempt to avoid your obligations
under the GPL.  As such, I will not be reviewing this code and I oppose
its inclusion.

Boaz Harrosh Feb. 19, 2019, 7:15 p.m. UTC | #2
On 19/02/19 14:15, Matthew Wilcox wrote:
> On Tue, Feb 19, 2019 at 01:51:19PM +0200, Boaz harrosh wrote:
>> Please see first patch for License of this project
>>
>> Current status: There are a couple of trivial open-source filesystem
>> implementations and a full blown proprietary implementation from Netapp.
> 
> I regard this patchset as being an attempt to avoid your obligations
> under the GPL.  As such, I will not be reviewing this code and I oppose
> its inclusion.
> 

Dearest Matthew

One day we'll sit over a beer and you'll explain it to me. I do trust
your opinion, but I do not understand.

Specifically, the above "full-blown proprietary implementation from NetApp"
does not break the GPL at all. Parts of it are written in languages alien
to the Kernel, and parts use user-mode libs and code IP that cannot live
in the Kernel. At the beginning we had code to inject the FS into the
application of choice via ld.so, so that only selected apps, like a DB,
would have a view of the filesystem. But you can imagine what a nightmare
that is for IT. Being POSIX under the Kernel means so much less reinventing
the wheel, say: backup, disaster recovery, cloud ....

Now actually, if you look at the submitted code you will see that we use
very, very little of the Kernel. For comparison, FUSE actually uses the
Kernel much more heavily: page-cache, Kernel reclaimers, smart write-back,
the lot. In ZUFS we take the uppermost interfaces and send them downstream
as-is. Wherever there is depth of stack, we take the topmost level and push
it to the server as-is, completely synchronously to the app threads.
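
As an illustration of that "push the top-most level down as-is" idea, a
write could be relayed roughly like below. This is a sketch only: the
descriptor layout, the opcode, and the dispatch function are made-up
placeholders, not the actual API from the patches.

#include <linux/fs.h>

enum zufs_sketch_opcode { ZUFS_SKETCH_OP_WRITE = 1 };

struct zufs_sketch_op {		/* descriptor handed to the server */
	enum zufs_sketch_opcode opcode;
	loff_t filepos;
	size_t len;
};

/* Hand an operation to the server thread (ZT) bound to this CPU and
 * wait synchronously for its reply (body omitted in this sketch). */
static ssize_t zuf_sketch_dispatch_op(struct file *file,
				      struct zufs_sketch_op *op,
				      const char __user *app_buf);

static ssize_t zuf_sketch_write(struct file *file, const char __user *buf,
				size_t len, loff_t *pos)
{
	struct zufs_sketch_op op = {
		.opcode  = ZUFS_SKETCH_OP_WRITE,
		.filepos = *pos,
		.len     = len,
	};

	/* The app's buffer is made visible to the server through a
	 * per-CPU mapping, so the server consumes it directly -- no
	 * data copy -- while this thread waits for the reply. */
	return zuf_sketch_dispatch_op(file, &op, buf);
}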

The only real novelty in this project, and something completely new in
this submission, is the new RPC we invented here, which utilizes per-CPU
techniques to achieve a kind of performance never seen before between two
processes.

You are a Kernel contributor; you have IP in the Kernel. Your opinion is
very important to me and to NetApp. Please point me to the areas where you
feel I have stepped on your IP and have not respected the GPL, and I will
want very much to fix it.

Or maybe my sin is that I am too successful? Is the GPL guarded by speed?
I mean, FUSE is already committing all these sins, as are other subsystems
that bridge Kernel functionality to user-mode. There are user-mode
"drivers" all over the place. But they are all so slow that a serious FS
or server needs to sit in the Kernel. With ZUFS we can now delegate to
user-mode: the Kernel becomes a micro-kernel, a very fast bridge, and
moves out of the way, creating space for serious servers to sit in
userland.

To summarize: I take your statement very seriously. Please state what
service of the GPLed Kernel I am exposing and circumventing, and I will
want to fix it ASAP. My thought, and my philosophy, was to take the POSIX
interfaces as high as possible and shove them into userland, in an RPC
manner I invented that is very fast. If there are areas where I am not
doing so, please show me.

Best regards
Boaz