Message ID: 20190926020725.19601-1-boazh@netapp.com
Series:     zuf: ZUFS Zero-copy User-mode FileSystem
On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:

> Performance:
> A simple fio direct 4k random write test with an incrementing number
> of threads.
>
> [fuse]
> threads  wr_iops  wr_bw   wr_lat
> 1        33606    134424  26.53226
> 2        57056    228224  30.38476
> 4        88667    354668  40.12783
> 7        116561   466245  53.98572
> 8        129134   516539  55.6134
>
> [fuse-splice]
> threads  wr_iops  wr_bw   wr_lat
> 1        39670    158682  21.8399
> 2        51100    204400  34.63294
> 4        75220    300882  47.42344
> 7        97706    390825  63.04435
> 8        98034    392137  73.24263
>
> [xfs-dax]
> threads  wr_iops  wr_bw   wr_lat

Data missing.

> [Maxdata-1.5-zufs]
> threads  wr_iops  wr_bw      wr_lat
> 1        1041802  260,450    3.623
> 2        1983997  495,999    3.808
> 4        3829456  957,364    3.959
> 7        4501154  1,125,288  5.895330
> 8        4400698  1,100,174  6.922174

Just a heads up that I have achieved similar results with a prototype
using the unmodified fuse protocol. This prototype was built with ideas
taken from zufs (percpu/lockless, mmaped dev, single syscall per op).
I found a big scheduler scalability bottleneck that is caused by the
update of mm->cpu_bitmap at context switch. This can be worked around
by using shared memory instead of shared page tables, which is a bit of
a pain, but it does prove the point. I thought about fixing the
cpu_bitmap cacheline pingpong, but didn't really get anywhere.

Are you interested in comparing zufs with the scalable fuse prototype?
If so, I'll push the code into a public repo with some instructions.

Thanks,
Miklos
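The cacheline pingpong mentioned above can be reproduced outside the kernel with a minimal user-space sketch (the analogy to mm->cpu_bitmap is only approximate, and the code is not from either prototype): a handful of threads repeatedly setting and clearing their own bit in one shared word, roughly the way every CPU running a thread of the same mm updates that mm's CPU bitmap on each context switch.

/*
 * Self-contained sketch, not kernel code: each thread toggles "its" bit in
 * one shared word, so the single shared cacheline bounces between cores and
 * total throughput collapses as threads are added.
 * Build (assumption): cc -O2 -pthread pingpong.c -o pingpong
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 5000000UL

static atomic_ulong shared_bitmap;      /* stand-in for mm->cpu_bitmap */

static void *worker(void *arg)
{
    unsigned long bit = 1UL << ((long)arg & 63);

    for (unsigned long i = 0; i < ITERS; i++) {
        atomic_fetch_or(&shared_bitmap, bit);    /* "switched in"  */
        atomic_fetch_and(&shared_bitmap, ~bit);  /* "switched out" */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 8;
    pthread_t tid[64];
    struct timespec t0, t1;

    if (nthreads < 1 || nthreads > 64)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads: %.1f M bitmap updates/sec total\n",
           nthreads, nthreads * 2.0 * ITERS / secs / 1e6);
    return 0;
}

Running it with 1, 8 and 64 threads on a large NUMA box should show the total update rate dropping rather than scaling, which is the behaviour the shared-memory workaround (separate page tables, hence separate cpu_bitmaps) avoids.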
Hi Miklos,

> Just a heads up that I have achieved similar results with a prototype
> using the unmodified fuse protocol. This prototype was built with ideas
> taken from zufs (percpu/lockless, mmaped dev, single syscall per op).
> I found a big scheduler scalability bottleneck that is caused by the
> update of mm->cpu_bitmap at context switch. This can be worked around
> by using shared memory instead of shared page tables, which is a bit of
> a pain, but it does prove the point. I thought about fixing the
> cpu_bitmap cacheline pingpong, but didn't really get anywhere.
>
> Are you interested in comparing zufs with the scalable fuse prototype?
> If so, I'll push the code into a public repo with some instructions.

I would be happy to help here (review, lightly test and debug). I have
wanted to give the ioctl-threads method a try for some time already,
but never got around to it yet.

Thanks,
Bernd
On 26/09/2019 10:11, Miklos Szeredi wrote:
> On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:
> <>
>> [xfs-dax]
>> threads  wr_iops  wr_bw  wr_lat
>
> Data missing.
>

Ooops, sorry, will send today.

>> [Maxdata-1.5-zufs]
>> threads  wr_iops  wr_bw      wr_lat
>> 1        1041802  260,450    3.623
>> 2        1983997  495,999    3.808
>> 4        3829456  957,364    3.959
>> 7        4501154  1,125,288  5.895330
>> 8        4400698  1,100,174  6.922174
>
> Just a heads up that I have achieved similar results with a prototype
> using the unmodified fuse protocol. This prototype was built with ideas
> taken from zufs (percpu/lockless, mmaped dev, single syscall per op).
> I found a big scheduler scalability bottleneck that is caused by the
> update of mm->cpu_bitmap at context switch. This can be worked around
> by using shared memory instead of shared page tables, which is a bit of
> a pain, but it does prove the point. I thought about fixing the
> cpu_bitmap cacheline pingpong, but didn't really get anywhere.
>
> Are you interested in comparing zufs with the scalable fuse prototype?
> If so, I'll push the code into a public repo with some instructions,
>

Yes, please do send it. I will give it a good run.
What fuseFS do you use in usermode?

> Thanks,
> Miklos
>

Thank you Miklos for looking,
Boaz
On 26/09/2019 05:40, Matt Benjamin wrote:
> per discussion 2 weeks ago--is there a git repo or something that I can clone?
>
> Matt
>

Please look in the cover letter; there is a git tree address to clone here:

[v02]
The patches submitted are at:
    git https://github.com/NetApp/zufs-zuf upstream-v02

Also the same for zus (the server in user mode) + infra:
    git https://github.com/NetApp/zufs-zus upstream

Please look in the 3rd patch:
    [PATCH 03/16] zuf: Preliminary Documentation

There are instructions there on what to clone, how to compile and
install, and how to use the scripts in do-zu to run a system.

I would love a good review of this documentation as well; I'm sure it's
wrong and missing things. I have used it for so long that I'm already
blind to it.

Please bug me day and night with any questions.

Thanks,
Boaz
>> Are you interested in comparing zufs with the scalable fuse prototype?
>> If so, I'll push the code into a public repo with some instructions,
>>
>
> Yes, please do send it. I will give it a good run.
> What fuseFS do you use in usermode?

For a start, passthrough should do, modified to skip all data. That is
what I am doing to measure fuse bandwidth. It also shouldn't be too
difficult to add an in-memory tree for dentries and inodes, to be able
to measure without tmpfs overhead.

Bernd
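A rough sketch of that "skip all data" idea, assuming the libfuse 3 high-level API (this is not Bernd's actual benchmark filesystem; names and sizes are made up): a single-file FS whose read and write handlers acknowledge the full request without touching the payload, so a benchmark run against it measures the fuse request path rather than any backing store.

/*
 * Sketch only. One file ("/data"); readdir is omitted, so address the file
 * by name, e.g. fio --filename=/mnt/data.
 * Build (assumption): cc -Wall nullfs.c `pkg-config fuse3 --cflags --libs`
 */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *file_path = "/data";

static int null_getattr(const char *path, struct stat *st,
                        struct fuse_file_info *fi)
{
    (void)fi;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, file_path) == 0) {
        st->st_mode = S_IFREG | 0666;
        st->st_nlink = 1;
        st->st_size = (off_t)1 << 40;   /* pretend to be huge */
    } else {
        return -ENOENT;
    }
    return 0;
}

static int null_open(const char *path, struct fuse_file_info *fi)
{
    (void)fi;
    return strcmp(path, file_path) == 0 ? 0 : -ENOENT;
}

/* Claim the whole write succeeded without looking at the data. */
static int null_write(const char *path, const char *buf, size_t size,
                      off_t off, struct fuse_file_info *fi)
{
    (void)path; (void)buf; (void)off; (void)fi;
    return (int)size;
}

/* Hand back zeroed data of the requested length. */
static int null_read(const char *path, char *buf, size_t size,
                     off_t off, struct fuse_file_info *fi)
{
    (void)path; (void)off; (void)fi;
    memset(buf, 0, size);
    return (int)size;
}

static const struct fuse_operations null_ops = {
    .getattr = null_getattr,
    .open    = null_open,
    .write   = null_write,
    .read    = null_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &null_ops, NULL);
}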
On 26/09/2019 15:12, Bernd Schubert wrote:
>>> Are you interested in comparing zufs with the scalable fuse prototype?
>>> If so, I'll push the code into a public repo with some instructions,
>>>
>>
>> Yes, please do send it. I will give it a good run.
>> What fuseFS do you use in usermode?
>
> For a start, passthrough should do, modified to skip all data.

Skipping all data is not good for me, because it hides the page faults
and the actual memory bandwidth. What I do instead is either memcpy a
single preallocated block to all the blocks in the IO, and/or set a
defined pattern where each ulong in the file contains its offset as
data. This gives me true results.

> That is
> what I am doing to measure fuse bandwidth. It also shouldn't be too
> difficult to add an in-memory tree for dentries and inodes, to be able
> to measure without tmpfs overhead.
>

Thanks, that is very helpful; I will use this.

Boaz

> Bernd
>
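A minimal sketch of the offset-pattern trick described above (the helper names are illustrative, not from zufs): every unsigned long in the buffer carries its own absolute file offset, which forces real copies through the whole stack and lets any corruption be detected on read-back.

/*
 * Illustrative helpers: fill a write buffer so that each unsigned long holds
 * its own absolute file offset, and verify the pattern on read-back.
 * Assumes buffers and offsets are sizeof(unsigned long) aligned.
 */
#include <stddef.h>
#include <stdio.h>

static void fill_offset_pattern(void *buf, size_t len, unsigned long file_off)
{
    unsigned long *p = buf;

    for (size_t i = 0; i < len / sizeof(*p); i++)
        p[i] = file_off + i * sizeof(*p);
}

static int check_offset_pattern(const void *buf, size_t len,
                                unsigned long file_off)
{
    const unsigned long *p = buf;

    for (size_t i = 0; i < len / sizeof(*p); i++) {
        unsigned long want = file_off + i * sizeof(*p);

        if (p[i] != want) {
            fprintf(stderr, "corruption at offset %lu: got %lx want %lx\n",
                    file_off + i * sizeof(*p), p[i], want);
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    unsigned long block[4096 / sizeof(unsigned long)];

    /* Pretend this 4k block will be written at file offset 8192. */
    fill_offset_pattern(block, sizeof(block), 8192);
    return check_offset_pattern(block, sizeof(block), 8192) ? 1 : 0;
}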
On 26/09/2019 10:11, Miklos Szeredi wrote:
> On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:
>
> Just a heads up that I have achieved similar results with a prototype
> using the unmodified fuse protocol. This prototype was built with ideas
> taken from zufs (percpu/lockless, mmaped dev, single syscall per op).
> I found a big scheduler scalability bottleneck that is caused by the
> update of mm->cpu_bitmap at context switch. This can be worked around
> by using shared memory instead of shared page tables, which is a bit of
> a pain, but it does prove the point. I thought about fixing the
> cpu_bitmap cacheline pingpong, but didn't really get anywhere.
>

I'm not sure what the scalability bottleneck you are seeing above is.
With zufs I have very good scalability, almost flat up to the number of
CPUs, and/or the limit of the memory bandwidth if I'm accessing pmem.

I do have a bad scalability bottleneck if I use mmap of pages, caused by
the call to zap_vma_ptes. Which is why I invented the NIO way
(inspired by you).

Once you send me the git URL I will have a look at the code and see if
I can find any differences.

That said, I do believe that a new scheduler object that completely
bypasses the scheduler and just relinquishes its time slice to the
switched-to thread would cut another 0.5us off the single-thread
latency. (The 5th patch talks about that.)

> Are you interested in comparing zufs with the scalable fuse prototype?
> If so, I'll push the code into a public repo with some instructions,
>
> Thanks,
> Miklos
>

Miklos, would you please have some bandwidth to review my code? It would
make me very happy and calm. Your input is very valuable to me.

Thanks,
Boaz
On Thu, Sep 26, 2019 at 2:24 PM Boaz Harrosh <openosd@gmail.com> wrote:
>
> On 26/09/2019 15:12, Bernd Schubert wrote:
> >>> Are you interested in comparing zufs with the scalable fuse prototype?
> >>> If so, I'll push the code into a public repo with some instructions,
> >>>
> >>
> >> Yes please do send it. I will give it a good run.

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#fuse2

Enable:
    CONFIG_FUSE2_FS=y
    CONFIG_SAMPLE_FUSE2=y

> >> What fuseFS do you use in usermode?

It's the example loopback filesystem supplied in the git tree above.
I haven't converted libfuse yet to use the new features, so for now
this is the only way to try it.

Usage: linux/samples/fuse2/loraw -2 -p -t ~/mnt/fuse/

options:
    -d: debug
    -s: single threaded
    -b: FUSE_DEV_IOC_CLONE (v1)
    -p: use ioctl for device I/O (v2)
    -m: use "map read" transferring offset into file instead of actual data
    -1: use regular fuse
    -2: use experimental fuse2
    -t: use shared memory instead of threads

I tested with shmfs, and IIRC got about 4-8us latency, depending on the
hardware, type of operation, etc...

Let me know if something's not working properly (this is experimental
code).

Thanks,
Miklos
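For a crude cross-check of such a per-operation latency number, a plain clock_gettime() loop around pwrite() on a file in the mount is usually enough (this is not the measurement tool used in the thread; the path below is a placeholder, and the O_DIRECT flag can be dropped if the filesystem refuses it):

/*
 * Crude per-write latency probe.  Times N 4k O_DIRECT writes to one file
 * and prints the average latency in microseconds.
 * Build (assumption): cc -O2 lat.c -o lat
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define N  10000
#define BS 4096

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/fuse/data"; /* placeholder */
    struct timespec t0, t1;
    void *buf;
    int fd;

    if (posix_memalign(&buf, BS, BS))
        return 1;
    memset(buf, 0x55, BS);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        if (pwrite(fd, buf, BS, (off_t)i * BS) != BS) {
            perror("pwrite");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                 (t1.tv_nsec - t0.tv_nsec)) / 1e3;
    printf("avg write latency: %.2f us over %d writes\n", us / N, N);
    close(fd);
    return 0;
}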
On Thu, Sep 26, 2019 at 2:48 PM Boaz Harrosh <openosd@gmail.com> wrote:
>
> On 26/09/2019 10:11, Miklos Szeredi wrote:
> > I found a big scheduler scalability bottleneck that is caused by the
> > update of mm->cpu_bitmap at context switch. This can be worked around
> > by using shared memory instead of shared page tables, which is a bit of
> > a pain, but it does prove the point. I thought about fixing the
> > cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> >
>
> I'm not sure what the scalability bottleneck you are seeing above is.
> With zufs I have very good scalability, almost flat up to the number of
> CPUs, and/or the limit of the memory bandwidth if I'm accessing pmem.

This was *really* noticeable with NUMA and many CPUs (>64).

> Miklos, would you please have some bandwidth to review my code? It would
> make me very happy and calm. Your input is very valuable to me.

Sure, will look at the patches.

Thanks,
Miklos