[RFC,v2,00/12] Introduce the famfs shared-memory file system

Message ID: cover.1714409084.git.john@groves.net
Series: Introduce the famfs shared-memory file system

Message

John Groves April 29, 2024, 5:04 p.m. UTC
This patch set introduces famfs[1] - a special-purpose fs-dax file system
for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
CXL-specific in any way.

* Famfs creates a simple access method for storing and sharing data in
  sharable memory. The memory is exposed and accessed as memory-mappable
  dax files.
* Famfs supports multiple hosts mounting the same file system from the
  same memory (something existing fs-dax file systems don't do).
* A famfs file system can be created on a /dev/dax device in devdax mode,
  which rests on dax functionality added in patches 2-7 of this series.

The famfs kernel file system is part of the famfs framework; additional
components in user space[2] handle metadata and direct the famfs kernel
module to instantiate files that map to specific memory. The famfs user
space has documentation and a reasonably thorough test suite.
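
To illustrate the access model (a minimal sketch; the mount point and file
name are hypothetical, not part of this patch set), an app consumes a famfs
file the same way it would any memory-mappable file:

    /* Hypothetical famfs consumer; path and mount point are illustrative */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/famfs/dataset0";  /* hypothetical file */
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0) {
            perror(path);
            exit(1);
        }

        /* With fs-dax, MAP_SHARED maps the dax memory directly;
         * system-ram is not used as a page cache. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        printf("first byte: 0x%02x\n", (unsigned char)p[0]);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }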

The famfs kernel module never accesses the shared memory directly (either
data or metadata). Because of this, memory managed by the famfs framework
has a limited RAS "blast radius": failures should not be able to crash or
destabilize the kernel. Poison or timeouts in famfs memory can be expected
to kill apps via SIGBUS and cause mounts to be disabled due to memory
failure notifications.
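
For example, an app that wants to survive poison in mapped famfs memory can
catch the resulting SIGBUS (a generic POSIX sketch, not code from this
series; read_byte_guarded is a hypothetical name):

    #include <setjmp.h>
    #include <signal.h>

    static sigjmp_buf bus_jmp;

    static void bus_handler(int sig, siginfo_t *info, void *ucontext)
    {
        /* info->si_addr identifies the poisoned address */
        siglongjmp(bus_jmp, 1);
    }

    /* Returns 0 and stores the byte; returns -1 if the access raised SIGBUS */
    int read_byte_guarded(const volatile char *p, char *out)
    {
        struct sigaction sa = { 0 };

        sa.sa_sigaction = bus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        if (sigsetjmp(bus_jmp, 1))
            return -1;  /* memory failure surfaced as SIGBUS */
        *out = *p;
        return 0;
    }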

Famfs does not attempt to solve concurrency or coherency problems for apps,
although it does solve these problems in regard to its own data structures.
Apps may encounter hard concurrency problems, but there are use cases that
are eminently useful and uncomplicated from a concurrency perspective:
serial sharing is one (only one host at a time has access), and read-only
concurrent sharing is another (all hosts can read-cache without worry).

Contents:

* famfs kernel documentation [patch 1]. Note that evolving famfs user
  documentation is at [2]
* dev_dax_iomap patchset [patches 2-7] - This enables fs-dax to use the
  iomap interface via a character /dev/dax device (e.g. /dev/dax0.0). For
  historical reasons the iomap infrastructure was enabled only for
  /dev/pmem devices (which are dax block devices). As famfs is the first
  fs-dax file system that works on /dev/dax, this patch series fills in
  the bare minimum infrastructure to enable iomap API usage with /dev/dax
  (a simplified sketch appears below).
* famfs patchset [patches 8-12] - this introduces the kernel component of
  famfs.

Note that there is a developing consensus that /dev/dax requires
some fundamental re-factoring (e.g. [3]) that is related but outside the
scope of this series.
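
For context, the iomap contract a dax file system has to satisfy is small.
A simplified, hypothetical iomap_begin for a file with a single extent at a
known dax-device offset might look like the following (the famfs_sketch_*
names are illustrative assumptions; the real famfs code supports
multi-extent files):

    #include <linux/dax.h>
    #include <linux/fs.h>
    #include <linux/iomap.h>
    #include <linux/minmax.h>

    /* Hypothetical per-inode info: one extent of ext_len bytes at dax_ofs */
    struct famfs_sketch_inode {
        u64 dax_ofs;
        u64 ext_len;
        struct dax_device *dax_dev;
    };

    static int famfs_sketch_iomap_begin(struct inode *inode, loff_t pos,
                                        loff_t length, unsigned int flags,
                                        struct iomap *iomap,
                                        struct iomap *srcmap)
    {
        struct famfs_sketch_inode *fsi = inode->i_private;

        if (pos >= fsi->ext_len)
            return -ERANGE;

        iomap->offset  = pos;
        iomap->addr    = fsi->dax_ofs + pos; /* byte offset into the dax dev */
        iomap->length  = min_t(loff_t, length, fsi->ext_len - pos);
        iomap->type    = IOMAP_MAPPED;
        iomap->dax_dev = fsi->dax_dev;
        iomap->flags   = 0;
        return 0;
    }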

Some observations about using sharable memory

* It does not make sense to online sharable memory as system-ram.
  System-ram gets zeroed when it is onlined, so sharing is basically
  nonsense.
* It does not make sense to put struct page's in sharable memory, because
  those can't be shared. However, separately providing non-sharable
  capacity to be used for struct page's might be a sensible approach if the
  size of struct page array for sharable memory is too large to put in
  conventional system-ram (albeit with possible RAS implications).
* Sharable memory is pmem-like, in that a host is likely to connect in
  order to gain access to data that is already in the memory. Moreover,
  the power domain for shared memory is separate from that of the server.
  That said, famfs is not intended for persistent storage. It is
  intended for sharing data sets in memory during a time frame where the
  memory and the compute nodes are expected to remain operational - such
  as during a clustered data analytics job.

Could we do this with FUSE?

The key performance requirement for famfs is efficient handling of VMA
faults. This requires caching the complete dax extent lists for all active
files so faults can be handled without upcalls, which FUSE does not do.
It would probably be possible to put this capability into FUSE, but we
think that keeping famfs separate from FUSE is the simpler approach.
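
To make the fault path concrete, here is a hypothetical sketch (not the
actual famfs structures) of the cached extent list and the per-fault
lookup, which involves no upcall:

    #include <linux/types.h>

    /* Hypothetical cached mapping metadata; offsets/lengths in bytes */
    struct sketch_extent {
        u64 file_ofs;   /* offset within the file */
        u64 dax_ofs;    /* offset within the dax device */
        u64 len;
    };

    struct sketch_meta {
        int nextents;
        struct sketch_extent ext[];
    };

    /* Translate a file offset to a dax-device offset at fault time */
    static int sketch_file_to_dax(const struct sketch_meta *m, u64 file_ofs,
                                  u64 *dax_ofs_out)
    {
        int i;

        for (i = 0; i < m->nextents; i++) {
            const struct sketch_extent *e = &m->ext[i];

            if (file_ofs >= e->file_ofs &&
                file_ofs < e->file_ofs + e->len) {
                *dax_ofs_out = e->dax_ofs + (file_ofs - e->file_ofs);
                return 0;
            }
        }
        return -1;      /* no mapping: the fault would raise SIGBUS */
    }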

We will be discussing this topic at LSFMM 2024 [5] in a topic called "Famfs:
new userspace filesystem driver vs. improving FUSE/DAX" - but other famfs
related discussion will also be welcome!

This patch set is available as a branch at [6].

References

[1] https://lpc.events/event/17/contributions/1455/
[2] https://github.com/cxl-micron-reskit/famfs
[3] https://lore.kernel.org/all/166630293549.1017198.3833687373550679565.stgit@dwillia2-xfh.jf.intel.com/
[4] https://www.computeexpresslink.org/download-the-specification
[5] https://events.linuxfoundation.org/lsfmmbpf/program/schedule-at-a-glance/
[6] https://github.com/cxl-micron-reskit/famfs-linux/tree/famfs-v2


Changes since RFC v1:

* This patch series is a from-scratch refactor of the original. The code
  that maps a file to a dax device is almost identical, but a lot of
  cleanup has been done.
* The get_tree and backing device handling code has been ripped up and
  re-done (in the get-tree case, based on suggestions from Christian
  Brauner - thanks Christian; I hope I haven't done any new dumb stuff!)
  (Note this code has been extensively tested; after all known error cases
  famfs can be umounted and the module can be unloaded)
* Famfs now 'shuts down' if the dax device reports any memory errors;
  subsequent I/O and faults fail (SIGBUS in the fault case). Famfs detects
  memory errors via an iomap_ops->notify failure call from the devdax
  layer. This has been tested and appears to disable the famfs file system
  while leaving it able to unmount cleanly. (A hypothetical sketch of this
  pattern appears after this list.)
* Dropped fault counters
* Dropped support for symlinks within a famfs file system; we don't think
  supporting symlinks makes sense with famfs, and it has some undesirable
  side effects, so it's out.
* Dropped support for mknod within a famfs file system (other than regular
  files and directories)
* Famfs magic number moved to magic.h
* Famfs ioctl opcodes now documented in
  Documentation/userspace-api/ioctl/ioctl-number.rst
* Dodgy kerneldoc comments cleaned up or removed; hopefully none added...
* Kconfig formatting cleaned up
* Dropped /dev/pmem support. Prior patch series would mount on either
  /dev/pmem or /dev/dax devices. This is unnecessary complexity since
  /dev/pmem devices can be converted to /dev/dax. Famfs is, however, the
  first file system we know of that mounts from a character device.
* Famfs no longer does a filp_open() of the dax device. It finds the
  device by its dev_t and uses fs_dax_get() to effect exclusivity.
* Added a read-only module param famfs_kabi_version for checking that
  user space was compiled for the same ABI version
* The famfs kernel module (the code in fs/famfs plus the uapi file
  famfs_ioctl.h) dropped from 1030 lines of code in v1 to 760 in v2,
  according to "cloc".
* Fixed issues reported by the kernel test robot
* Many minor improvements in response to v1 code reviews
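
To illustrate the memory-error shutdown pattern described above: the dax
layer already provides a dax_holder_operations.notify_failure hook, and a
file system can react to it roughly as follows. This is a sketch under that
assumption (the famfs_sketch_* names are hypothetical), not the actual famfs
code in this series:

    #include <linux/dax.h>
    #include <linux/fs.h>
    #include <linux/printk.h>

    /* Hypothetical famfs superblock info with a one-way shutdown flag */
    struct famfs_sketch_sb {
        struct super_block *sb;
        bool dax_failed;    /* checked in the I/O and fault paths */
    };

    static int famfs_sketch_notify_failure(struct dax_device *dax_dev,
                                           u64 offset, u64 len, int mf_flags)
    {
        struct famfs_sketch_sb *fsb = dax_holder(dax_dev);

        /* Disable the fs: subsequent I/O and faults fail, but the
         * mount remains cleanly unmountable. */
        WRITE_ONCE(fsb->dax_failed, true);
        pr_err("famfs: memory failure at dax offset %llu len %llu\n",
               (unsigned long long)offset, (unsigned long long)len);
        return 0;
    }

    static const struct dax_holder_operations famfs_sketch_holder_ops = {
        .notify_failure = famfs_sketch_notify_failure,
    };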


John Groves (12):
  famfs: Introduce famfs documentation
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  famfs prep: Add fs/super.c:kill_char_super()
  famfs: module operations & fs_context
  famfs: Introduce inode_operations and super_operations
  famfs: Introduce file_operations read/write
  famfs: Introduce mmap and VM fault handling
  famfs: famfs_ioctl and core file-to-memory mapping logic & iomap_ops

 Documentation/filesystems/famfs.rst           | 135 ++++
 Documentation/filesystems/index.rst           |   1 +
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |  11 +
 drivers/dax/Kconfig                           |   6 +
 drivers/dax/bus.c                             | 144 ++++-
 drivers/dax/dax-private.h                     |   1 +
 drivers/dax/device.c                          |  38 +-
 drivers/dax/super.c                           |  33 +-
 fs/Kconfig                                    |   2 +
 fs/Makefile                                   |   1 +
 fs/famfs/Kconfig                              |  10 +
 fs/famfs/Makefile                             |   5 +
 fs/famfs/famfs_file.c                         | 605 ++++++++++++++++++
 fs/famfs/famfs_inode.c                        | 452 +++++++++++++
 fs/famfs/famfs_internal.h                     |  52 ++
 fs/namei.c                                    |   1 +
 fs/super.c                                    |   9 +
 include/linux/dax.h                           |   6 +
 include/linux/fs.h                            |   1 +
 include/uapi/linux/famfs_ioctl.h              |  61 ++
 include/uapi/linux/magic.h                    |   1 +
 22 files changed, 1547 insertions(+), 29 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/famfs/Kconfig
 create mode 100644 fs/famfs/Makefile
 create mode 100644 fs/famfs/famfs_file.c
 create mode 100644 fs/famfs/famfs_inode.c
 create mode 100644 fs/famfs/famfs_internal.h
 create mode 100644 include/uapi/linux/famfs_ioctl.h


base-commit: ed30a4a51bb196781c8058073ea720133a65596f

Comments

Matthew Wilcox (Oracle) April 29, 2024, 6:32 p.m. UTC | #1
On Mon, Apr 29, 2024 at 12:04:16PM -0500, John Groves wrote:
> This patch set introduces famfs[1] - a special-purpose fs-dax file system
> for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> CXL-specific in any way.
> 
> * Famfs creates a simple access method for storing and sharing data in
>   sharable memory. The memory is exposed and accessed as memory-mappable
>   dax files.
> * Famfs supports multiple hosts mounting the same file system from the
>   same memory (something existing fs-dax file systems don't do).

Yes, but we do already have two filesystems that support shared storage,
and are rather more advanced than famfs -- GFS2 and OCFS2.  What are
the pros and cons of improving either of those to support DAX rather
than starting again with a new filesystem?
Kent Overstreet April 29, 2024, 11:08 p.m. UTC | #2
On Mon, Apr 29, 2024 at 07:32:55PM +0100, Matthew Wilcox wrote:
> On Mon, Apr 29, 2024 at 12:04:16PM -0500, John Groves wrote:
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in any way.
> > 
> > * Famfs creates a simple access method for storing and sharing data in
> >   sharable memory. The memory is exposed and accessed as memory-mappable
> >   dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> >   same memory (something existing fs-dax file systems don't do).
> 
> Yes, but we do already have two filesystems that support shared storage,
> and are rather more advanced than famfs -- GFS2 and OCFS2.  What are
> the pros and cons of improving either of those to support DAX rather
> than starting again with a new filesystem?

I could see a shared memory filesystem as being a completely different
beast than a shared block storage filesystem - and I've never heard
anyone talking about gfs2 or ocfs2 as codebases we particularly liked.

This looks like it might not even be persistent? Does it survive a
reboot? If not, that means it'll be much smaller than a conventional
filesystem.

But yeah, a bit more on where this is headed would be nice.

Another concern is that every filesystem tends to be another huge
monolithic codebase without a lot of code sharing between them - how
much are we going to be adding in the end?

Can we start looking for more code sharing, more library code to factor
out?

Some description of the internal data structures would really help here.
John Groves April 30, 2024, 2:11 a.m. UTC | #3
On 24/04/29 07:32PM, Matthew Wilcox wrote:
> On Mon, Apr 29, 2024 at 12:04:16PM -0500, John Groves wrote:
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in any way.
> > 
> > * Famfs creates a simple access method for storing and sharing data in
> >   sharable memory. The memory is exposed and accessed as memory-mappable
> >   dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> >   same memory (something existing fs-dax file systems don't do).
> 
> Yes, but we do already have two filesystems that support shared storage,
> and are rather more advanced than famfs -- GFS2 and OCFS2.  What are
> the pros and cons of improving either of those to support DAX rather
> than starting again with a new filesystem?
> 

Thanks for paying attention to this Willy.

This is a fair question; I'll share some thoughts on the rationale, but it's
probably something that should be an ongoing dialog. We already have a LSFMM
session planned that will discuss whether the famfs functionality should be
merged into FUSE, but GFS2 and OCFS2 are also potential candidates.

(I've already seen Kent's reply and will get to that next)

I work for a memory company, and the motivation here is to make disaggregated
shared memory practically usable. Any approach that moves in that direction 
is goodness as far as we're concerned -- provided it doesn't insert years of 
delay. 

Some thoughts on famfs:

* Famfs is not, not, not a general purpose file system.
* One can think of famfs as a shared memory allocator where allocations can be
  accessed as files. For certain data analytics work flows (especially 
  involving Apache Arrow data frames) this is really powerful. Consumers of
  data frames commonly use mmap(MAP_SHARED), and can benefit from the memory
  de-duplication of shared memory and don't need any new abstractions.
* Famfs is not really a data storage tool. It's more of a shared-memory
  allocation tool that has the benefit of allocations being accessible
  (and memory-mappable) as files. So a lot of software can automatically use
  it.
* Famfs is oriented to dumping sharable data into files and then allowing a
  scale-out cluster to share it (often read-only) to access a single copy in
  shared memory.
* Although this audience probably already understands this, please forgive me
  for putting a fine point on it: memory mapping a famfs/fs-dax file does 
  not use system-ram as a cache - it directly accesses the memory associated 
  with a file. This would be true of all file systems with proper fs-dax
  support (of which there are not many; currently famfs is the only one
  that supports shared access to media/memory).

Some thoughts on shared-storage file systems:

* I'm no expert on GFS2 or OCFS2, but I've been around memory, file systems 
  and storage since well before the turn of the century...
* If you had brought up the existing fs-dax file systems, I would have pointed
  out that they use write-back metadata, which does not reconcile with shared
  access to media - but these file systems do handle that.
* The shared media file systems are still oriented to block devices that
  provide durable storage and page-oriented access. CXL DRAM is a character 
  dax (devdax) device and does not provide durable storage.
* fs-dax-style memory mapping for volatile cxl memory requires the 
  dev_dax_iomap portion of this patch set - or something similar. 
* A scale-out shared media file system presumably requires some commitment to
  configure and manage some complexity in a distributed environment; whether
  that should be mandatory for enablement of shared memory is worthy of
  discussion.
* Adding memory to the storage tier for GFS2/OCFS2 would add non-persistent
  media to the storage tier; whether this makes sense would be a topic that
  GFS2/OCFS2 developers/architects should get involved in if they're 
  interested.

Although disaggregated shared memory is not commercially available yet, famfs 
is being actively tested by multiple companies for several use cases and 
patterns with real and simulated shared memory. Demonstrations will start to
surface in the coming weeks & months.

Regards,
John
John Groves April 30, 2024, 2:24 a.m. UTC | #4
On 24/04/29 07:08PM, Kent Overstreet wrote:
> On Mon, Apr 29, 2024 at 07:32:55PM +0100, Matthew Wilcox wrote:
> > On Mon, Apr 29, 2024 at 12:04:16PM -0500, John Groves wrote:
> > > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > > CXL-specific in any way.
> > > 
> > > * Famfs creates a simple access method for storing and sharing data in
> > >   sharable memory. The memory is exposed and accessed as memory-mappable
> > >   dax files.
> > > * Famfs supports multiple hosts mounting the same file system from the
> > >   same memory (something existing fs-dax file systems don't do).
> > 
> > Yes, but we do already have two filesystems that support shared storage,
> > and are rather more advanced than famfs -- GFS2 and OCFS2.  What are
> > the pros and cons of improving either of those to support DAX rather
> > than starting again with a new filesystem?
> 
> I could see a shared memory filesystem as being a completely different
> beast than a shared block storage filesystem - and I've never heard
> anyone talking about gfs2 or ocfs2 as codebases we particularly liked.

Thanks for your attention on famfs, Kent.

I think of it as a completely different beast. See my reply to Willy re:
famfs being more of a memory allocator with the benefit of allocations 
being accessible (and memory-mappable) as files.

> 
> This looks like it might not even be persistent? Does it survive a
> reboot? If not, that means it'll be much smaller than a conventional
> filesystem.

Right; cxl memory *can* be persistent, but most of the future products
I'm aware of will not be persistent. Those of us who work at memory
companies have been educated in recent years as to the value (or
lack thereof) of persistence (see 3DX / Optane).

But since shared memory is probably on a separate power domain from
a server, it is likely to persist across reboots. But it still ain't
storage.

> 
> But yeah, a bit more on where this is headed would be nice.

The famfs user space repo has some good documentation as to the on-
media structure of famfs. Scroll down on [1] (the documentation from
the famfs user space repo). There is quite a bit of info in the docs
from that repo.

The other docs from the cover letter are also useful...

> 
> Another concern is that every filesystem tends to be another huge
> monolithic codebase without a lot of code sharing between them - how
> much are we going to be adding in the end?

A fair concern. Famfs is kinda fuse-like, in that the metadata handling
is mostly in user space. Famfs is currently <1 KLOC in the
kernel. That may grow, but it's not clear that there is a risk of
"huge monolithic". 

But it's something we should consider - and I'll be at LSFMM and 
happy to engage about this.

> 
> Can we start looking for more code sharing, more library code to factor
> out?
> 
> Some description of the internal data structures would really help here.


[1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md

Best regards,
John
Kent Overstreet April 30, 2024, 3:11 a.m. UTC | #5
On Mon, Apr 29, 2024 at 09:24:19PM -0500, John Groves wrote:
> On 24/04/29 07:08PM, Kent Overstreet wrote:
> > On Mon, Apr 29, 2024 at 07:32:55PM +0100, Matthew Wilcox wrote:
> > > On Mon, Apr 29, 2024 at 12:04:16PM -0500, John Groves wrote:
> > > > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > > > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > > > CXL-specific in any way.
> > > > 
> > > > * Famfs creates a simple access method for storing and sharing data in
> > > >   sharable memory. The memory is exposed and accessed as memory-mappable
> > > >   dax files.
> > > > * Famfs supports multiple hosts mounting the same file system from the
> > > >   same memory (something existing fs-dax file systems don't do).
> > > 
> > > Yes, but we do already have two filesystems that support shared storage,
> > > and are rather more advanced than famfs -- GFS2 and OCFS2.  What are
> > > the pros and cons of improving either of those to support DAX rather
> > > than starting again with a new filesystem?
> > 
> > I could see a shared memory filesystem as being a completely different
> > beast than a shared block storage filesystem - and I've never heard
> > anyone talking about gfs2 or ocfs2 as codebases we particularly liked.
> 
> Thanks for your attention on famfs, Kent.
> 
> I think of it as a completely different beast. See my reply to Willy re:
> famfs being more of a memory allocator with the benefit of allocations 
> being accessible (and memory-mappable) as files.

That's pretty much what I expected.

I would suggest talking to RDMA people; RDMA does similar things with
exposing address spaces across machine, and an "external" memory
allocator is a basic building block there as well - it'd be great if we
could get that turned into some clean library code.

GPU people as well, possibly.

> The famfs user space repo has some good documentation as to the on-
> media structure of famfs. Scroll down on [1] (the documentation from
> the famfs user space repo). There is quite a bit of info in the docs
> from that repo.

Ok, looking through that now.

So you've got a metadata log; that looks more like a conventional
filesystem than a conventional purely in-memory thing.

But you say it's a shared filesystem, and it doesn't say anything about
that. Inter-node locking?

Perhaps the ocfs2/gfs2 comparison is appropriate, after all.
Matthew Wilcox (Oracle) April 30, 2024, 9:01 p.m. UTC | #6
On Mon, Apr 29, 2024 at 09:11:52PM -0500, John Groves wrote:
> On 24/04/29 07:32PM, Matthew Wilcox wrote:
> > On Mon, Apr 29, 2024 at 12:04:16PM -0500, John Groves wrote:
> > > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > > CXL-specific in any way.
> > > 
> > > * Famfs creates a simple access method for storing and sharing data in
> > >   sharable memory. The memory is exposed and accessed as memory-mappable
> > >   dax files.
> > > * Famfs supports multiple hosts mounting the same file system from the
> > >   same memory (something existing fs-dax file systems don't do).
> > 
> > Yes, but we do already have two filesystems that support shared storage,
> > and are rather more advanced than famfs -- GFS2 and OCFS2.  What are
> > the pros and cons of improving either of those to support DAX rather
> > than starting again with a new filesystem?
> > 
> 
> Thanks for paying attention to this Willy.

Well, don't mistake this for an endorsement!  I remain convinced that
this is a science project, not a product.  I am hugely sceptical of
disaggregated systems, mostly because I've seen so many fail.  And they
rarely attempt to answer the "janitor tripped over the cable" problem,
the "we need to upgrade the firmware on the switch" problem, or a bunch
of other problems I've outlined in the past on this list.

So I am not supportive of any changes you want to make to the core kernel
to support this kind of adventure.  Play in your own sandbox all you
like, but not one line of code change in the core.  Unless it's something
generally beneficial, of course; you mentioned refactoring DAX and that
might be a good thing for everybody.

> * Famfs is not, not, not a general purpose file system.
> * One can think of famfs as a shared memory allocator where allocations can be
>   accessed as files. For certain data analytics work flows (especially 
>   involving Apache Arrow data frames) this is really powerful. Consumers of
>   data frames commonly use mmap(MAP_SHARED), and can benefit from the memory
>   de-duplication of shared memory and don't need any new abstractions.

... and are OK with the extra latency?

> * Famfs is not really a data storage tool. It's more of a shared-memory
>   allocation tool that has the benefit of allocations being accessible
>   (and memory-mappable) as files. So a lot of software can automatically use 
>   it.
> * Famfs is oriented to dumping sharable data into files and then allowing a
>   scale-out cluster to share it (often read-only) to access a single copy in
>   shared memory.

Depending on the exact workload, I can see this being more efficient
than replicating the data to each member of the cluster.  In other
workloads, it'll be a loss, of course.

> * I'm no expert on GFS2 or OCFS2, but I've been around memory, file systems 
>   and storage since well before the turn of the century...
> * If you had brought up the existing fs-dax file systems, I would have pointed
>   out that they use write-back metadata, which does not reconcile with shared
>   access to media - but these file systems do handle that.
> * The shared media file systems are still oriented to block devices that
>   provide durable storage and page-oriented access. CXL DRAM is a character 

I'd say "block oriented" rather than page oriented, but I agree.

>   dax (devdax) device and does not provide durable storage.
> * fs-dax-style memory mapping for volatile cxl memory requires the 
>   dev_dax_iomap portion of this patch set - or something similar. 
> * A scale-out shared media file system presumably requires some commitment to
>   configure and manage some complexity in a distributed environment; whether
>   that should be mandatory for enablement of shared memory is worthy of
>   discussion.
> * Adding memory to the storage tier for GFS2/OCFS2 would add non-persistent
>   media to the storage tier; whether this makes sense would be a topic that
>   GFS2/OCFS2 developers/architects should get involved in if they're 
>   interested.
> 
> Although disaggregated shared memory is not commercially available yet, famfs 
> is being actively tested by multiple companies for several use cases and 
> patterns with real and simulated shared memory. Demonstrations will start to
> surface in the coming weeks & months.

I guess we'll see.  SGI died for a reason.
John Groves May 1, 2024, 2:09 a.m. UTC | #7
On 24/04/29 11:11PM, Kent Overstreet wrote:
> On Mon, Apr 29, 2024 at 09:24:19PM -0500, John Groves wrote:
> > On 24/04/29 07:08PM, Kent Overstreet wrote:
> > > On Mon, Apr 29, 2024 at 07:32:55PM +0100, Matthew Wilcox wrote:
> > > > On Mon, Apr 29, 2024 at 12:04:16PM -0500, John Groves wrote:
> > > > > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > > > > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > > > > CXL-specific in any way.
> > > > > 
> > > > > * Famfs creates a simple access method for storing and sharing data in
> > > > >   sharable memory. The memory is exposed and accessed as memory-mappable
> > > > >   dax files.
> > > > > * Famfs supports multiple hosts mounting the same file system from the
> > > > >   same memory (something existing fs-dax file systems don't do).
> > > > 
> > > > Yes, but we do already have two filesystems that support shared storage,
> > > > and are rather more advanced than famfs -- GFS2 and OCFS2.  What are
> > > > the pros and cons of improving either of those to support DAX rather
> > > > than starting again with a new filesystem?
> > > 
> > > I could see a shared memory filesystem as being a completely different
> > > beast than a shared block storage filesystem - and I've never heard
> > > anyone talking about gfs2 or ocfs2 as codebases we particularly liked.
> > 
> > Thanks for your attention on famfs, Kent.
> > 
> > I think of it as a completely different beast. See my reply to Willy re:
> > famfs being more of a memory allocator with the benefit of allocations 
> > being accessible (and memory-mappable) as files.
> 
> That's pretty much what I expected.
> 
> I would suggest talking to RDMA people; RDMA does similar things with
> exposing address spaces across machine, and an "external" memory
> allocator is a basic building block there as well - it'd be great if we
> could get that turned into some clean library code.
> 
> GPU people as well, possibly.

Thanks for your attention Kent.

I'm on it. Part of the core idea behind famfs is that page-oriented data
movement can be avoided with actual shared memory. Yes, the memory is likely to 
be slower (either BW or latency or both) but it's cacheline access rather than 
full-page (or larger) retrieval, which is a win for some access patterns (and
not so for others).

Part of the issue is communicating the fact that shared access to cachelines
is possible.

There are some interesting possibilities with GPUs retrieving famfs files
(or portions thereof), but I have no insight as to the motivations of GPU 
vendors.

> 
> > The famfs user space repo has some good documentation as to the on-
> > media structure of famfs. Scroll down on [1] (the documentation from
> > the famfs user space repo). There is quite a bit of info in the docs
> > from that repo.
> 
> Ok, looking through that now.
> 
> So you've got a metadata log; that looks more like a conventional
> filesystem than a conventional purely in-memory thing.
> 
> But you say it's a shared filesystem, and it doesn't say anything about
> that. Inter-node locking?
> 
> Perhaps the ocfs2/gfs2 comparison is appropriate, after all.

Famfs is intended to be mounted by more than one host from the same in-memory
image. A metadata log is kinda the simplest approach to make that work (let me
know your thoughts if you disagree on that). When a client mounts, playing the
log from the shared memory brings that client mount into sync with the source
(the Master).

No inter-node locking is currently needed because only the node that created
the file system (the Master) can write the log. Famfs is not intended to be 
a general-purpose FS...

The famfs log is currently append-only, and I think of it as a "code-first"
implementation of a shared memory FS that gets the job done in something
approaching the simplest possible way.

If the approach evolves to full allocate-on-write, then moving to a file system
platform that handles that would make sense. If it remains (as I suspect will
make sense) a way to share collections of data sets, or indexes, or other 
data that is published and then consumed [all or mostly] read-only, this
simple approach may be long-term sufficient.
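
To make that concrete, here is a hypothetical shape for such a log (the
sketch_* structures are illustrative assumptions; the actual famfs on-media
format is documented in the user-space repo). Only the Master appends
entries; a client mount replays them to reconstruct the namespace:

    #include <stdint.h>

    #define SKETCH_MAX_EXTENTS 4

    /* One append-only entry: instantiate a file over specific extents */
    struct sketch_log_entry {
        uint64_t seq;           /* monotonically increasing sequence no. */
        uint32_t crc;           /* integrity check over the entry */
        char     path[256];     /* file to instantiate */
        uint64_t size;
        int      nextents;
        struct {
            uint64_t dax_ofs;
            uint64_t len;
        } ext[SKETCH_MAX_EXTENTS];
    };

    struct sketch_log {
        uint64_t nentries;      /* advanced only by the Master */
        struct sketch_log_entry entries[];
    };

    /* Client mount: play the shared-memory log from the beginning */
    static void sketch_replay(const struct sketch_log *log,
                              void (*instantiate)(const struct sketch_log_entry *))
    {
        for (uint64_t i = 0; i < log->nentries; i++)
            instantiate(&log->entries[i]);
    }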

Regards,
John